
Fix Replication Factor calculation for Tablets effective replication map. #21289

Open: wants to merge 10 commits into master from bugfix/20625_fix_tablets_read_replica_set_calculation

Conversation

@bitpathfinder (Contributor) commented Oct 25, 2024

The current implementation of the replication factor calculation does not take Tablets migration into account. Since a table can be represented by multiple Tablets, and during migration different Tablets can be in different states, there is no single state for the whole table. The previously used logic of taking the RF from the schema, which worked for VNodes, is therefore not valid while Tablets are in a migration state.

In this PR, the effective_replication_map::get_replication_factor() implementation was modified for Tablets. Instead of retrieving the replication factor from the schema, it now calculates the replication factor based on the actual replica set from the Tablets map. This change addresses potential inconsistencies between the schema and the state of individual tablets during migration.
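For illustration only, a minimal sketch of the idea of deriving the RF from the actual replica set rather than from the schema. The types and names below are hypothetical stand-ins, not the PR's code, which works with tablet_map and locator::topology:

    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins; the real code uses tablet_map, its replica sets
    // and locator::topology rather than these simplified types.
    struct replica { std::string host; };
    using replica_set = std::vector<replica>;

    // Count replicas per datacenter for one tablet's current replica set.
    std::map<std::string, size_t>
    replication_factor_by_dc(const replica_set& replicas,
                             const std::map<std::string, std::string>& dc_of_host) {
        std::map<std::string, size_t> rf;
        for (const auto& r : replicas) {
            auto it = dc_of_host.find(r.host);
            if (it != dc_of_host.end()) {
                ++rf[it->second];   // effective RF of that DC grows with each replica found
            }
        }
        return rf;
    }

    int main() {
        // A tablet mid-migration may momentarily have a replica set that disagrees
        // with the schema RF; counting the actual replicas reflects that state.
        replica_set rs{{"n1"}, {"n2"}, {"n3"}};
        std::map<std::string, std::string> dc{{"n1", "dc1"}, {"n2", "dc1"}, {"n3", "dc2"}};
        auto rf = replication_factor_by_dc(rs, dc);
        std::printf("dc1=%zu dc2=%zu\n", rf["dc1"], rf["dc2"]);   // dc1=2 dc2=1
    }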

Fixes #20625
Fixes #20282

Since this fixes an important issue with Tablets, we need backports to 6.0, 6.1, 6.2.

@scylladb-promoter (Contributor):

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Container Test
✅ - dtest with tablets
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 7 hr 44 min
  • Builder: i-07177465d1a75ff45 (m5ad.8xlarge)

@@ -921,7 +921,7 @@ future<> database::add_column_family(keyspace& ks, schema_ptr schema, column_fam
     auto&& rs = ks.get_replication_strategy();
     locator::effective_replication_map_ptr erm;
     if (auto pt_rs = rs.maybe_as_per_table()) {
-        erm = pt_rs->make_replication_map(schema->id(), _shared_token_metadata.get());
+        erm = co_await pt_rs->make_replication_map(schema->id(), _shared_token_metadata.get());
Contributor:

This adds a preemption point between mark_synced() and adding the table. If a request sneaks in between and sees the schema version as synced, it may fail unnecessarily due to the missing table.

We should move mark_synced() later, which should have been done in 8842bd8

@kbr-scylla (Contributor), Oct 29, 2024:

There was a recent bug that introduced a regression in this area due to adding a preemption point:
#19833
and later fix:
#20064
@bitpathfinder I recommend studying this case to check that we don't break some other isolation properties that the code relies on (in addition to what @tgrabiec said).

Generally, not being super careful when adding yields has caused us a lot of pain in the past. One investigation where an innocent yield in gossiper mode caused a super tough concurrency bug took me more than a month... and the fix was 3 LOC: a0b331b

You personally encountered a problem like this recently: 612a141

So please do quadruple-check this code and the area around it before adding a yield anywhere.

Contributor (Author):

After performing some benchmarks, I decided to remove the preemption point from build_dc_replication_factor_map, since with the standard testing parameters of 2048 tablets per table it takes about half a microsecond.

Contributor (Author):

It turns out my previous benchmark was incorrect, as it was measuring tables with 32 Tablets. I used perf-load-balancing to test the actual time, and for 131072 Tablets the replication factor map building time would be about 9.5 ms. So the preemption point should be kept.
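For context, a preemption point in such a build loop would look roughly like the following sketch, assuming Seastar's coroutine::maybe_yield() helper; the surrounding types are simplified stand-ins for the real tablet structures:

    #include <seastar/core/coroutine.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <cstddef>
    #include <vector>

    struct per_dc_rf_list {};   // stand-in for the per-DC RF list of one tablet

    // Build one RF list per tablet, yielding periodically so a table with
    // ~100k tablets does not stall the reactor for several milliseconds.
    seastar::future<std::vector<per_dc_rf_list>> build_rf_map(size_t tablet_count) {
        std::vector<per_dc_rf_list> result;
        result.reserve(tablet_count);
        for (size_t i = 0; i < tablet_count; ++i) {
            result.push_back(per_dc_rf_list{});          // real code counts replicas per DC here
            co_await seastar::coroutine::maybe_yield();  // preemption point
        }
        co_return result;
    }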

locator/abstract_replication_strategy.cc: resolved review thread
locator/tablets.cc: outdated, resolved review thread
absl::flat_hash_map<sstring, replication_factor_t> _datacenter_map;
// The `_tablet_to_dc_replication_factor_map` stores replication factors
// per data center for each Tablet.
std::vector<replication_factor_list> _tablet_to_dc_replication_factor_map;
Contributor:

Should be chunked_vector to avoid large contiguous allocations.

Contributor (Author):

Can you please elaborate on why it is bad to have a large allocation? Do you mean our allocator's storage might already be so fragmented that there would not be such a big chunk of memory (~5 MB for 100k tablets)?

Contributor:

Yes, the larger the allocation, the more likely it is to cause premature OOM. In practice, we will try to reclaim contiguous space back from the LSA region (memtables/cache), which occupies higher addresses, but that may still require a lot of work to evict data and move segments around, and may cause reactor stalls. We maintain a reserve of free standard allocator space which is supposed to amortize this cost, but if the request cannot be met due to fragmentation, it cannot make use of that reserve.

As a general rule, we avoid contiguous allocations larger than 128K (LSA segment size). 128K is a unit of memory reclamation from LSA to the standard allocator's pool.
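To make the 128K rule concrete, a back-of-the-envelope sketch; the per-entry size and tablet count are illustrative assumptions, and utils::chunked_vector is the container that splits storage into bounded chunks:

    #include <cstdio>

    // Illustrative arithmetic only; 128 KiB is the LSA segment size quoted above.
    int main() {
        constexpr unsigned long tablets = 100000;
        constexpr unsigned long bytes_per_entry = 8;     // e.g. a small per-DC RF list
        constexpr unsigned long max_chunk = 128 * 1024;  // keep each allocation below this

        unsigned long total = tablets * bytes_per_entry;           // one std::vector allocation
        unsigned long chunks = (total + max_chunk - 1) / max_chunk;

        std::printf("std::vector: one contiguous allocation of %lu bytes\n", total);
        std::printf("chunked_vector: ~%lu allocations of at most %lu bytes each\n", chunks, max_chunk);
    }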

Contributor (Author):

In the benchmark I see that the number of tablets per table is 2048. Is that a typical number for production, or is it just a testing number? Also, when I set it to 65k I get an error "Exiting on unhandled exception: std::bad_alloc".

Contributor (Author):

@tgrabiec So I suppose the max Tablets per table will be 8192, since otherwise our own tests can't be run. In that case the memory allocated by the _tablet_to_dc_replication_factor_map vector is 8192*8 = 65536 bytes (assuming there are 4 data centers). Do you still think it makes sense to switch to a chunked vector (which by default allocates in 128k chunks)?

Contributor (Author):

Will change to chunked vector.

Contributor:

> In the benchmark I see that the number of tablets per table is 2048. Is that a typical number for production or it's just a testing number? Also when I set it to 65k I get an error "Exiting on unhandled exception: std::bad_alloc".

Can you share the command line which fails?

We aim at 100k tablets per cluster.

Some tests may not be able to handle that in a single shard due to overhead of tablet replica. In a real cluster, that overhead would be spread among many shards.

Contributor (Author):

@tgrabiec I ran the following command:
./build/dev/scylla perf-tablets --tablets-per-table 65536
It fails with std::bad_alloc. In fact, everything higher than 8192 fails.

@tgrabiec (Contributor), Nov 6, 2024:

> @tgrabiec I run the following cmd ./build/dev/scylla perf-tablets --tablets-per-table 65536 It fails with std::bad_alloc. In fact everything higher than 8192 is failing.

The default number of tables is 100, so with --tablets-per-table 65536 you have 6.5 M tablets. That's beyond our target of 100k.

tablet_map takes 600 MB. It probably bad_allocs due to mutation object size when persisting. With -m8G it proceeds, but then it takes almost a minute to save:

INFO  2024-11-06 14:06:48,099 [shard 0:stat] testlog - Saved in 41656.312500 [ms]
INFO  2024-11-06 14:07:05,748 [shard 0:stat] testlog - Read in 17534.886719 [ms]

And we fail in read_tablet_mutations() due to:

ERROR 2024-11-06 14:07:08,778 [shard 0:main] seastar - Exiting on unhandled exception: std::runtime_error (Maximum amount of memory for building query results is exhausted, unpaged query cannot be finished)

That one is used in migration_manager.cc, in schema pull handler:

            if (options->group0_snapshot_transfer) {
                cm.emplace_back(co_await db::system_keyspace::get_group0_history(db));
                if (proxy.local().local_db().get_config().enable_tablets()) {
                    for (auto&& m: co_await replica::read_tablet_mutations(db)) {

We shouldn't use that path anymore for snapshot transfer, right @kbr-scylla ?

We use that path for group0 snapshot transfer.

Contributor:

I opened #21456

const sstring& datacenter = topo.get_datacenter(node.host);
const auto it = _datacenter_map.find(datacenter);
if (it == _datacenter_map.end()) {
    tablet_logger.debug("Could not find datacenter: {} for node: {} ", datacenter, node);
Contributor:

That's a rather serious error, because it will result in an inconsistency between the replication factor and the read replica set, which is exactly the problem this patch tries to fix. And if this one happens, it may be permanent. Probably deserves on_internal_error().
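A minimal sketch of what the suggested change could look like in the existing loop, assuming Seastar's on_internal_error(logger, message) helper and the existing tablet_logger; the message text is illustrative:

    // Instead of logging at debug level and skipping the replica:
    //     tablet_logger.debug("Could not find datacenter: {} for node: {}", datacenter, node);
    //     continue;
    // fail loudly, since a missing DC would leave the RF map inconsistent with the read replica set:
    if (it == _datacenter_map.end()) {
        on_internal_error(tablet_logger, format("Could not find datacenter: {} for node: {}", datacenter, node));
    }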

Contributor:

I think this comment also applies to the if (node_ptr == nullptr) { branch. If we can't find the node* it means in particular that we can't find the DC for this replica, and we become inconsistent.

Contributor (Author):

But sometimes this is the case, due to a short period of inconsistency between the Tablets and the schema.

Contributor:

But this code is performing a lookup in the locator::topology data structure. This data structure is completely independent from any schema definitions. It's based purely on topology state. It should never be inconsistent with tablets.

Am I missing something?

Contributor (Author):

I did not investigate much, but I was consistently getting these errors in the tests when the ERM was created before the node was initialized.

    continue;
}
const sstring& datacenter = topo.get_datacenter(node.host);
const auto it = _datacenter_map.find(datacenter);
@tgrabiec (Contributor), Oct 26, 2024:

Does erm construction time significantly increase because of this for a large tablet count (100k)?

If so, this could be optimized by keeping a dc index in topology.
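A rough sketch of that optimization, with hypothetical names (the real locator::topology API may differ): assign each datacenter a small integer id once, store it on the node, and index a vector in the hot loop instead of hashing the DC name per replica.

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical simplified model of topology keeping a per-node DC index.
    struct node {
        std::string dc;
        uint32_t dc_idx;     // assigned once when the node is added to topology
    };

    struct topology_sketch {
        std::unordered_map<std::string, uint32_t> dc_ids;  // name -> index, built once
        std::vector<node> nodes;

        uint32_t dc_index(const std::string& dc) {
            auto [it, inserted] = dc_ids.try_emplace(dc, static_cast<uint32_t>(dc_ids.size()));
            (void)inserted;
            return it->second;
        }

        void add_node(std::string dc) {
            uint32_t idx = dc_index(dc);
            nodes.push_back(node{std::move(dc), idx});
        }
    };

    // Per-tablet counting then becomes ++rf_per_dc[n.dc_idx]: integer indexing
    // only, with no string hashing in the hot loop.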

Contributor (Author):

Good idea. I suppose the time spent in the map will be proportional to the number of data centers and their name lengths, especially if the names are longer than the SSO limit.

locator/tablets.cc: resolved review threads (two outdated, one current)
auto rs = db.find_keyspace(table->schema()->ks_name()).get_replication_strategy_ptr();
locator::effective_replication_map_ptr erm;
if (auto pt_rs = rs->maybe_as_per_table()) {
    erm = pt_rs->make_replication_map(id, tmptr);
Contributor:

Same as above -- please check carefully if this yield won't cause problems

locator/abstract_replication_strategy.hh: outdated, resolved review thread
locator/tablets.cc: outdated, resolved review thread
// For each tablet index in the tablet map, it retrieves the current replica set using `get_endpoints_for_reading`
// and counts replicas per data center. Each data center's replication factor (RF) is saved under
// its index according to `_datacenter_map`.
seastar::future<> build_dc_replication_factor_map() {
Contributor:

nit: we usually omit seastar:: prefix for futures

Contributor (Author):

removed

locator/tablets.cc: outdated, resolved review thread
const sstring& datacenter = topo.get_datacenter(node.host);
const auto it = _datacenter_map.find(datacenter);
if (it == _datacenter_map.end()) {
    tablet_logger.debug("Could not find datacenter: {} for node: {} ", datacenter, node);
Contributor:

I think this comment also applies to the if (node_ptr == nullptr) { branch. If we can't find the node* it means in particular that we can't find the DC for this replica, and we become inconsistent.

locator/tablets.cc: outdated, resolved review threads
@@ -41,6 +41,11 @@ void everywhere_replication_strategy::validate_options(const gms::feature_servic
}

sstring everywhere_replication_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const {
    const auto& replication_strategy = erm.get_replication_strategy();
    if (replication_strategy.uses_tablets()) {
Contributor:

We're inside everywhere_replication_strategy code, so erm.get_replication_strategy() is this strategy, no? In which case this branch is dead code (only NTS can use tablets)

locator/network_topology_strategy.cc: outdated, resolved review thread
-    return make_ready_future<effective_replication_map_ptr>(
-            seastar::make_shared<tablet_effective_replication_map>(table, std::move(rs), std::move(tm), replication_factor));
+    auto erm = seastar::make_shared<tablet_effective_replication_map>(table, std::move(rs), std::move(tm), replication_factor);
+    co_await erm->build_dc_replication_factor_map();
Contributor:

Can we detect that there are no tablets in transition and not populate anything? If the array is not populated, we return values according to the RF.

Contributor:

IIRC this won't work, because when applying mixed_change or a snapshot there is a short period of time where the schema is updated but the topology is not updated yet. So we construct e_r_m based on the updated schema (e.g. RF reduced from 1 to 0) but the old topology (tablets are still not in a transition state). This state in practice holds for less than a second -- it is basically happening between these two co_awaits:

    [&] (mixed_change& chng) -> future<> {
        co_await _mm.merge_schema_from(netw::messaging_service::msg_addr(std::move(cmd.creator_addr)), std::move(chng.mutations));
        co_await _ss.topology_transition();

but it still happens, and if we allowed it, the bug would still be present.

Contributor:

If the goal is to reduce lookups and memory in the case of shared RF, we could compute the shared RF map from erm if we notice that all tablets have the same replica count.
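A sketch of that idea with hypothetical helper names, assuming the per-tablet per-DC counts are already computed: if every tablet's counts are identical, keep a single shared list and drop the per-tablet map.

    #include <optional>
    #include <vector>

    using rf_list = std::vector<unsigned>;   // replication factor per DC index

    // Returns the shared per-DC RF list if all tablets agree, otherwise nullopt,
    // in which case the full per-tablet map has to be kept.
    std::optional<rf_list> shared_rf(const std::vector<rf_list>& per_tablet) {
        if (per_tablet.empty()) {
            return std::nullopt;
        }
        for (const auto& rf : per_tablet) {
            if (rf != per_tablet.front()) {
                return std::nullopt;         // some tablet differs, e.g. during migration
            }
        }
        return per_tablet.front();
    }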

Contributor:

Either way we should probably fix mixed_change so it reloads topology and schema in an isolated fashion. (I believe @nuivall's work will address it.)

@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from c48a66b to bfadf49 on November 8, 2024 10:41
@bitpathfinder (Contributor, Author):

New version:

  • Address possible issues with added preemption point in make_replication_map
  • Rename replication_factor_t to replication_factor_index_t
  • Change _tablet_to_dc_replication_factor_map from std::vector to chunked_vector to avoid large memory allocations.
  • Remove unnecessary asserts in build_dc_replication_factor_map
  • Move build_dc_replication_factor_map and build_datacenter_map out of tablets ERM, and pass these containers as constructor parameters; make _datacenter_map and _tablet_to_dc_replication_factor_map in tablet_effective_replication_map immutable.
  • Fix some comments typos.
  • Add small_vector(size_t n, const T& value) to allow construction in place and make the small_vector API more consistent with std::vector (see the sketch after this list)
  • Change sanity_check_read_replicas to accept token parameter and use the new get_replication_factor functions in ERM.
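A minimal sketch of the semantics such a fill constructor provides, using a simplified stand-in container rather than Scylla's utils::small_vector:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Simplified stand-in (not Scylla's utils::small_vector) showing the added
    // fill constructor's behavior: n copies of value, as in std::vector.
    template <typename T>
    class mini_vector {
        std::vector<T> _storage;
    public:
        mini_vector() = default;
        mini_vector(std::size_t n, const T& value) : _storage(n, value) {}
        std::size_t size() const { return _storage.size(); }
        const T& operator[](std::size_t i) const { return _storage[i]; }
    };

    int main() {
        mini_vector<int> v(3, 7);                       // construct in place: three elements, all 7
        assert(v.size() == 3 && v[0] == 7 && v[2] == 7);
    }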

@scylladb-promoter (Contributor):

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Docker Test
✅ - dtest with tablets
✅ - dtest
✅ - dtest with topology changes
❌ - Unit Tests

Failed Tests (1/36333):

Build Details:

  • Duration: 8 hr 8 min
  • Builder: i-06849aeee3a7f0e22 (m5ad.8xlarge)

locator/tablets.cc: outdated, resolved review thread
@bitpathfinder (Contributor, Author):

> ❌ - Unit Tests

Looks like this issue is a use-after-free, like in issue #20772.

@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from bfadf49 to 0cc90d5 on November 11, 2024 11:07
@scylladb-promoter (Contributor):

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Docker Test
❌ - dtest with topology changes
❌ - dtest
❌ - dtest with tablets
❌ - Unit Tests

Failed Tests (935/36791):

Build Details:

  • Duration: 8 hr 12 min
  • Builder: i-00041ad359a60777e (m5ad.8xlarge)

@bitpathfinder added the status/blocked label on Dec 10, 2024.
@bitpathfinder (Contributor, Author):

Blocked on #21856

@bhalevy (Member) commented Dec 11, 2024:

Based on #21289 (comment), removing the release blocker label as well.

* so CDC generation service takes sharded<db::sys_dist_ks> and must check local_is_initialized()
* every time it accesses it (because it may have been stopped already), then take local_shared()
* which will prevent sys_dist_ks from being destroyed while the service operates on it.
*/
Contributor:

Interestingly, this comment was wrong before this change: sys_dist_ks was stopped after the CDC generation service.

Btw, how does setup_group0_if_exists() depend on cdc_generation_service? It's not used explicitly.

Contributor (Author):

This code in storage_service::topology_state_load(state_change_hint hint) requires it.

for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
        rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
>>        co_await _cdc_gens.local().handle_cdc_generation(gen_id);
        if (gen_id == _topology_state_machine._topology.committed_cdc_generations.back()) {
            co_await _sys_ks.local().update_cdc_generation_id(gen_id);
            rtlogger.debug("topology_state_load: the last committed CDC generation ID: {}", gen_id);
        }
    }

@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from 5d2cb3f to 1d9eb54 on December 23, 2024 15:05
@bitpathfinder (Contributor, Author):

New version: rebase and fix merge conflicts.

@scylladb-promoter (Contributor):

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Docker Test
✅ - dtest with topology changes
❌ - dtest
✅ - Offline-installer Artifact Tests
❌ - dtest with tablets
❌ - Unit Tests

Failed Tests (1054/37085):

Build Details:

  • Duration: 4 hr 3 min
  • Builder: spider7.cloudius-systems.com

Since the construction of the tablet_effective_replication_map will use `maybe_yield()` to avoid reactor stalls, `do_make_replication_map()` needs to be converted into a coroutine.

As part of introducing a new method for calculating the replication factor for Tablets, rename the existing `get_replication_factor` to `get_schema_replication_factor`, since it returns the replication factor from the replication strategy, whereas the new method will calculate the replication factor based on the current state of the Tablets.

1. Split the functionality of get_endpoints_for_reading into two functions:

- `get_endpoints_for_reading(const tablet_id tablet)` returns the current
   read replica host ID set from Tablets. This change is necessary as part
   of introducing new replication factor calculation functions.
- `get_endpoints_for_reading(const token& search_token)` uses the new function
   that takes a Tablet id.

2. Move `get_endpoints_for_reading`, `get_pending_endpoints`, and `get_replicas_for_write`
   logic for finding replicas in the Tablet map to the `tablet_map` class.
   This change is needed to build the replication factor map outside of the ERM
   and improves encapsulation of the `tablet_map` logic (a sketch of the split follows below).
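A simplified sketch of that split, with hypothetical stand-in types for tablet_map and the host ID set; the token overload resolves the tablet and delegates:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using host_id = uint64_t;                      // stand-in for locator::host_id
    using host_id_vector = std::vector<host_id>;   // stand-in for the replica set type
    struct tablet_id { size_t value; };
    struct token { int64_t value; };

    struct tablet_map_sketch {
        std::vector<host_id_vector> read_replicas;   // per-tablet read replica sets

        tablet_id get_tablet_id(token t) const {
            // Real code maps the token to a tablet via the tablet count; simplified here.
            return tablet_id{static_cast<size_t>(t.value) % read_replicas.size()};
        }

        // Overload 1: look up the read replica set directly by tablet id.
        const host_id_vector& get_endpoints_for_reading(tablet_id id) const {
            return read_replicas[id.value];
        }

        // Overload 2: resolve the token to a tablet id, then delegate.
        const host_id_vector& get_endpoints_for_reading(token t) const {
            return get_endpoints_for_reading(get_tablet_id(t));
        }
    };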
Implement a constructor that initializes `small_vector` with `n` elements, each set to a specified value.

To ensure ERM consistency, the topology initialization has been moved to occur before the creation of tablet-based ERMs during the initialization of non-system keyspaces. Since topology is initialized during the Group 0 snapshot loading process, the Group 0 initialization has been shifted to take place before `init_non_system_keyspaces`.

Add implementations for the new get_replication_factor functions in the effective replication map, which for Tablets calculate the replication factor based on actual replica counts from the Tablets map. This accounts for potential inconsistencies between the schema and Tablets during migration.

`topology::add_node`

In add_node the trace log was incorrectly printing the added node, since `nptr` was no longer valid after the move at line 185.

Replace `get_schema_replication_factor` with the new `get_replication_factor`, which, for tablets, returns the actual number of replicas while accounting for the transition state of Tablets.

Fixes scylladb#20625
@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from 93596b9 to eccd809 on January 3, 2025 14:10
@bitpathfinder (Contributor, Author):

New version:

  • Rebase and fix conflicts.

@scylladb-promoter (Contributor):

🔴 CI State: ABORTED

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - dtest with topology changes
✅ - dtest
❌ - dtest with tablets
❌ - Docker Test
✅ - Offline-installer Artifact Tests
❌ - Unit Tests

Failed Tests (1535/36346):

Build Details:

  • Duration: 12 hr
  • Builder: i-0dab6233f96d3b430 (m5ad.12xlarge)

Labels
area/tablets, backport/6.1 (should be backported to 6.1), backport/6.2 (should be backported to 6.2), P2 (High Priority), status/blocked