
Fix Replication Factor calculation for Tablets effective replication map. #21289

Open: wants to merge 10 commits into master from bugfix/20625_fix_tablets_read_replica_set_calculation

Conversation

@bitpathfinder (Contributor) commented Oct 25, 2024

The current implementation of the replication factor calculation does not take Tablets migration into account. Since a table can be represented by multiple Tablets, and during migration different Tablets can be in different states, there is no single state for the whole table. The previously used logic of taking the RF from the schema, which worked for VNodes, is therefore not valid while Tablets are in a migration state.

In this PR, the effective_replication_map::get_replication_factor() implementation was modified for Tablets. Instead of retrieving the replication factor from the schema, it now calculates the replication factor based on the actual replica set from the Tablets map. This change addresses potential inconsistencies between the schema and the state of individual tablets during migration.
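For illustration only, a minimal sketch of the idea of deriving the RF from the actual replica set rather than from the schema. The types and names below are hypothetical stand-ins, not the PR's code, which works with tablet_map and locator::topology:

    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins; the real code uses tablet_map, its replica sets
    // and locator::topology rather than these simplified types.
    struct replica { std::string host; };
    using replica_set = std::vector<replica>;

    // Count replicas per datacenter for one tablet's current replica set.
    std::map<std::string, size_t>
    replication_factor_by_dc(const replica_set& replicas,
                             const std::map<std::string, std::string>& dc_of_host) {
        std::map<std::string, size_t> rf;
        for (const auto& r : replicas) {
            auto it = dc_of_host.find(r.host);
            if (it != dc_of_host.end()) {
                ++rf[it->second];   // effective RF of that DC grows with each replica found
            }
        }
        return rf;
    }

    int main() {
        // A tablet mid-migration may momentarily have a replica set that disagrees
        // with the schema RF; counting the actual replicas reflects that state.
        replica_set rs{{"n1"}, {"n2"}, {"n3"}};
        std::map<std::string, std::string> dc{{"n1", "dc1"}, {"n2", "dc1"}, {"n3", "dc2"}};
        auto rf = replication_factor_by_dc(rs, dc);
        std::printf("dc1=%zu dc2=%zu\n", rf["dc1"], rf["dc2"]);   // dc1=2 dc2=1
    }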

Fixes #20625
Fixes #20282

Since this fixes an important issue with Tablets, we need backports to 6.0, 6.1, 6.2.

@scylladb-promoter (Contributor):

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Container Test
✅ - dtest with tablets
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 7 hr 44 min
  • Builder: i-07177465d1a75ff45 (m5ad.8xlarge)

@@ -921,7 +921,7 @@ future<> database::add_column_family(keyspace& ks, schema_ptr schema, column_fam
     auto&& rs = ks.get_replication_strategy();
     locator::effective_replication_map_ptr erm;
     if (auto pt_rs = rs.maybe_as_per_table()) {
-        erm = pt_rs->make_replication_map(schema->id(), _shared_token_metadata.get());
+        erm = co_await pt_rs->make_replication_map(schema->id(), _shared_token_metadata.get());
Contributor:

This adds a preemption point between mark_synced() and adding the table. If a request sneaks in between and sees the schema version as synced, it may fail unnecessarily due to the missing table.

We should move mark_synced() later, which should have been done in 8842bd8

@kbr-scylla (Contributor), Oct 29, 2024:

There was a recent bug that introduced a regression in this area due to adding a preemption point:
#19833
and later fix:
#20064
@bitpathfinder I recommend studying this case to check that we don't break some other isolation properties that the code relies on (in addition to what @tgrabiec said).

Generally, not being super careful when adding yields has caused us a lot of pain in the past. One investigation where an innocent yield in gossiper mode caused a super tough concurrency bug took me more than a month... and the fix was 3 LOC: a0b331b

You personally encountered a problem like this recently: 612a141

So please do quadruple-check this code and the area around it before adding a yield anywhere.

Contributor (Author):

After performing some benchmarks, I decided to remove the preemption point from build_dc_replication_factor_map, since with the standard testing parameters of 2048 tablets per table it takes about half a microsecond.

Contributor (Author):

It turns out my previous benchmark was incorrect, as it was measuring tables with 32 Tablets. I used perf-load-balancing to test the actual time, and for 131072 Tablets the replication factor map building time would be about 9.5 ms. So the preemption point should be kept.
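For context, a preemption point in such a build loop would look roughly like the following sketch, assuming Seastar's coroutine::maybe_yield() helper; the surrounding types are simplified stand-ins for the real tablet structures:

    #include <seastar/core/coroutine.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <cstddef>
    #include <vector>

    struct per_dc_rf_list {};   // stand-in for the per-DC RF list of one tablet

    // Build one RF list per tablet, yielding periodically so a table with
    // ~100k tablets does not stall the reactor for several milliseconds.
    seastar::future<std::vector<per_dc_rf_list>> build_rf_map(size_t tablet_count) {
        std::vector<per_dc_rf_list> result;
        result.reserve(tablet_count);
        for (size_t i = 0; i < tablet_count; ++i) {
            result.push_back(per_dc_rf_list{});          // real code counts replicas per DC here
            co_await seastar::coroutine::maybe_yield();  // preemption point
        }
        co_return result;
    }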

locator/abstract_replication_strategy.cc: resolved review thread
locator/tablets.cc: outdated, resolved review thread
absl::flat_hash_map<sstring, replication_factor_t> _datacenter_map;
// The `_tablet_to_dc_replication_factor_map` stores replication factors
// per data center for each Tablet.
std::vector<replication_factor_list> _tablet_to_dc_replication_factor_map;
Contributor:

Should be chunked_vector to avoid large contiguous allocations.

Contributor (Author):

Can you please elaborate on why it is bad to have a large allocation? Do you mean our allocator's storage might already be so fragmented that there would not be such a big chunk of memory (~5 MB for 100k tablets)?

Contributor:

Yes, the larger the allocation, the more likely it is to cause premature OOM. In practice, we will try to reclaim contiguous space back from the LSA region (memtables/cache), which occupies higher addresses, but that may still require a lot of work to evict data and move segments around, and may cause reactor stalls. We maintain a reserve of free standard allocator space which is supposed to amortize this cost, but if the request cannot be met due to fragmentation, it cannot make use of that reserve.

As a general rule, we avoid contiguous allocations larger than 128K (LSA segment size). 128K is a unit of memory reclamation from LSA to the standard allocator's pool.
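To make the 128K rule concrete, a back-of-the-envelope sketch; the per-entry size and tablet count are illustrative assumptions, and utils::chunked_vector is the container that splits storage into bounded chunks:

    #include <cstdio>

    // Illustrative arithmetic only; 128 KiB is the LSA segment size quoted above.
    int main() {
        constexpr unsigned long tablets = 100000;
        constexpr unsigned long bytes_per_entry = 8;     // e.g. a small per-DC RF list
        constexpr unsigned long max_chunk = 128 * 1024;  // keep each allocation below this

        unsigned long total = tablets * bytes_per_entry;           // one std::vector allocation
        unsigned long chunks = (total + max_chunk - 1) / max_chunk;

        std::printf("std::vector: one contiguous allocation of %lu bytes\n", total);
        std::printf("chunked_vector: ~%lu allocations of at most %lu bytes each\n", chunks, max_chunk);
    }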

Contributor (Author):

In the benchmark I see that the number of tablets per table is 2048. Is that a typical number for production, or is it just a testing number? Also, when I set it to 65k I get an error "Exiting on unhandled exception: std::bad_alloc".

Contributor (Author):

@tgrabiec So I suppose the max Tablets per table will be 8192, since otherwise our own tests can't be run. In that case the memory allocated by the _tablet_to_dc_replication_factor_map vector is 8192*8 = 65536 bytes (assuming there are 4 data centers). Do you still think it makes sense to switch to a chunked vector (which by default allocates in 128k chunks)?

Contributor (Author):

Will change to chunked vector.

Contributor:

> In the benchmark I see that the number of tablets per table is 2048. Is that a typical number for production or it's just a testing number? Also when I set it to 65k I get an error "Exiting on unhandled exception: std::bad_alloc".

Can you share the command line which fails?

We aim at 100k tablets per cluster.

Some tests may not be able to handle that in a single shard due to overhead of tablet replica. In a real cluster, that overhead would be spread among many shards.

Contributor (Author):

@tgrabiec I ran the following command:
./build/dev/scylla perf-tablets --tablets-per-table 65536
It fails with std::bad_alloc. In fact, everything higher than 8192 fails.

@tgrabiec (Contributor), Nov 6, 2024:

> @tgrabiec I run the following cmd ./build/dev/scylla perf-tablets --tablets-per-table 65536 It fails with std::bad_alloc. In fact everything higher than 8192 is failing.

The default number of tables is 100, so with --tablets-per-table 65536 you have 6.5 M tablets. That's beyond our target of 100k.

tablet_map takes 600 MB. It probably bad_allocs due to mutation object size when persisting. With -m8G it proceeds, but then it takes almost a minute to save:

INFO  2024-11-06 14:06:48,099 [shard 0:stat] testlog - Saved in 41656.312500 [ms]
INFO  2024-11-06 14:07:05,748 [shard 0:stat] testlog - Read in 17534.886719 [ms]

And we fail in read_tablet_mutations() due to:

ERROR 2024-11-06 14:07:08,778 [shard 0:main] seastar - Exiting on unhandled exception: std::runtime_error (Maximum amount of memory for building query results is exhausted, unpaged query cannot be finished)

That one is used in migration_manager.cc, in schema pull handler:

            if (options->group0_snapshot_transfer) {
                cm.emplace_back(co_await db::system_keyspace::get_group0_history(db));
                if (proxy.local().local_db().get_config().enable_tablets()) {
                    for (auto&& m: co_await replica::read_tablet_mutations(db)) {

We shouldn't use that path anymore for snapshot transfer, right @kbr-scylla ?

We use that path for group0 snapshot transfer.

Contributor:

I opened #21456

const sstring& datacenter = topo.get_datacenter(node.host);
const auto it = _datacenter_map.find(datacenter);
if (it == _datacenter_map.end()) {
    tablet_logger.debug("Could not find datacenter: {} for node: {} ", datacenter, node);
Contributor:

That's a rather serious error, because it will result in an inconsistency between the replication factor and the read replica set, which is exactly the problem this patch tries to fix. And if this one happens, it may be permanent. Probably deserves on_internal_error().
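A minimal sketch of what the suggested change could look like in the existing loop, assuming Seastar's on_internal_error(logger, message) helper and the existing tablet_logger; the message text is illustrative:

    // Instead of logging at debug level and skipping the replica:
    //     tablet_logger.debug("Could not find datacenter: {} for node: {}", datacenter, node);
    //     continue;
    // fail loudly, since a missing DC would leave the RF map inconsistent with the read replica set:
    if (it == _datacenter_map.end()) {
        on_internal_error(tablet_logger, format("Could not find datacenter: {} for node: {}", datacenter, node));
    }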

Contributor:

I think this comment also applies to the if (node_ptr == nullptr) { branch. If we can't find the node* it means in particular that we can't find the DC for this replica, and we become inconsistent.

Contributor (Author):

But sometimes this is the case, due to a short period of inconsistency between the Tablets and the schema.

Contributor:

But this code is performing a lookup in the locator::topology data structure. This data structure is completely independent from any schema definitions. It's based purely on topology state. It should never be inconsistent with tablets.

Am I missing something?

Contributor (Author):

I did not investigate much, but I was consistently getting these errors in the tests when the ERM was created before the node was initialized.

    continue;
}
const sstring& datacenter = topo.get_datacenter(node.host);
const auto it = _datacenter_map.find(datacenter);
@tgrabiec (Contributor), Oct 26, 2024:

Does erm construction time significantly increase because of this for a large tablet count (100k)?

If so, this could be optimized by keeping a dc index in topology.
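A rough sketch of that optimization, with hypothetical names (the real locator::topology API may differ): assign each datacenter a small integer id once, store it on the node, and index a vector in the hot loop instead of hashing the DC name per replica.

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical simplified model of topology keeping a per-node DC index.
    struct node {
        std::string dc;
        uint32_t dc_idx;     // assigned once when the node is added to topology
    };

    struct topology_sketch {
        std::unordered_map<std::string, uint32_t> dc_ids;  // name -> index, built once
        std::vector<node> nodes;

        uint32_t dc_index(const std::string& dc) {
            auto [it, inserted] = dc_ids.try_emplace(dc, static_cast<uint32_t>(dc_ids.size()));
            (void)inserted;
            return it->second;
        }

        void add_node(std::string dc) {
            uint32_t idx = dc_index(dc);
            nodes.push_back(node{std::move(dc), idx});
        }
    };

    // Per-tablet counting then becomes ++rf_per_dc[n.dc_idx]: integer indexing
    // only, with no string hashing in the hot loop.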

Contributor (Author):

Good idea. I suppose the time spent in the map will be proportional to the number of data centers and their name lengths, especially if the names are longer than the SSO limit.

locator/tablets.cc: resolved review threads (two outdated, one current)
auto rs = db.find_keyspace(table->schema()->ks_name()).get_replication_strategy_ptr();
locator::effective_replication_map_ptr erm;
if (auto pt_rs = rs->maybe_as_per_table()) {
    erm = pt_rs->make_replication_map(id, tmptr);
Contributor:

Same as above -- please check carefully if this yield won't cause problems

locator/abstract_replication_strategy.hh: outdated, resolved review thread
locator/tablets.cc: outdated, resolved review thread
// For each tablet index in the tablet map, it retrieves the current replica set using `get_endpoints_for_reading`
// and counts replicas per data center. Each data center's replication factor (RF) is saved under
// its index according to `_datacenter_map`.
seastar::future<> build_dc_replication_factor_map() {
Contributor:

nit: we usually omit seastar:: prefix for futures

Contributor (Author):

removed

locator/tablets.cc: outdated, resolved review thread
const sstring& datacenter = topo.get_datacenter(node.host);
const auto it = _datacenter_map.find(datacenter);
if (it == _datacenter_map.end()) {
    tablet_logger.debug("Could not find datacenter: {} for node: {} ", datacenter, node);
Contributor:

I think this comment also applies to the if (node_ptr == nullptr) { branch. If we can't find the node* it means in particular that we can't find the DC for this replica, and we become inconsistent.

locator/tablets.cc: outdated, resolved review threads
@@ -41,6 +41,11 @@ void everywhere_replication_strategy::validate_options(const gms::feature_servic
}

sstring everywhere_replication_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const {
    const auto& replication_strategy = erm.get_replication_strategy();
    if (replication_strategy.uses_tablets()) {
Contributor:

We're inside everywhere_replication_strategy code, so erm.get_replication_strategy() is this strategy, no? In which case this branch is dead code (only NTS can use tablets)

locator/network_topology_strategy.cc: outdated, resolved review thread
-    return make_ready_future<effective_replication_map_ptr>(
-            seastar::make_shared<tablet_effective_replication_map>(table, std::move(rs), std::move(tm), replication_factor));
+    auto erm = seastar::make_shared<tablet_effective_replication_map>(table, std::move(rs), std::move(tm), replication_factor);
+    co_await erm->build_dc_replication_factor_map();
Contributor:

Can we detect that there are no tablets in transition and not populate anything? If the array is not populated, we return values according to the RF.

Contributor:

IIRC this won't work, because when applying mixed_change or a snapshot there is a short period of time where the schema is updated but the topology is not updated yet. So we construct e_r_m based on the updated schema (e.g. RF reduced from 1 to 0) but the old topology (tablets are still not in a transition state). This state in practice holds for less than a second -- it is basically happening between these two co_awaits:

    [&] (mixed_change& chng) -> future<> {
        co_await _mm.merge_schema_from(netw::messaging_service::msg_addr(std::move(cmd.creator_addr)), std::move(chng.mutations));
        co_await _ss.topology_transition();

but it still happens, and if we allowed it, the bug would still be present.

Contributor:

If the goal is to reduce lookups and memory in the case of shared RF, we could compute the shared RF map from erm if we notice that all tablets have the same replica count.
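A sketch of that idea with hypothetical helper names, assuming the per-tablet per-DC counts are already computed: if every tablet's counts are identical, keep a single shared list and drop the per-tablet map.

    #include <optional>
    #include <vector>

    using rf_list = std::vector<unsigned>;   // replication factor per DC index

    // Returns the shared per-DC RF list if all tablets agree, otherwise nullopt,
    // in which case the full per-tablet map has to be kept.
    std::optional<rf_list> shared_rf(const std::vector<rf_list>& per_tablet) {
        if (per_tablet.empty()) {
            return std::nullopt;
        }
        for (const auto& rf : per_tablet) {
            if (rf != per_tablet.front()) {
                return std::nullopt;         // some tablet differs, e.g. during migration
            }
        }
        return per_tablet.front();
    }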

Contributor:

Either way we should probably fix mixed_change so it reloads topology and schema in an isolated fashion. (I believe @nuivall's work will address it.)

@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from c48a66b to bfadf49 on November 8, 2024 10:41
@bitpathfinder (Contributor, Author):

New version:

  • Address possible issues with added preemption point in make_replication_map
  • Rename replication_factor_t to replication_factor_index_t
  • Change _tablet_to_dc_replication_factor_map from std::vector to chunked_vector to avoid large memory allocations.
  • Remove unnecessary asserts in build_dc_replication_factor_map
  • Move build_dc_replication_factor_map and build_datacenter_map out of tablets ERM, and pass these containers as constructor parameters; make _datacenter_map and _tablet_to_dc_replication_factor_map in tablet_effective_replication_map immutable.
  • Fix some comments typos.
  • Add small_vector(size_t n, const T& value) to allow construction in place and make the small_vector API more consistent with std::vector (see the sketch after this list)
  • Change sanity_check_read_replicas to accept token parameter and use the new get_replication_factor functions in ERM.
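A minimal sketch of the semantics such a fill constructor provides, using a simplified stand-in container rather than Scylla's utils::small_vector:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Simplified stand-in (not Scylla's utils::small_vector) showing the added
    // fill constructor's behavior: n copies of value, as in std::vector.
    template <typename T>
    class mini_vector {
        std::vector<T> _storage;
    public:
        mini_vector() = default;
        mini_vector(std::size_t n, const T& value) : _storage(n, value) {}
        std::size_t size() const { return _storage.size(); }
        const T& operator[](std::size_t i) const { return _storage[i]; }
    };

    int main() {
        mini_vector<int> v(3, 7);                       // construct in place: three elements, all 7
        assert(v.size() == 3 && v[0] == 7 && v[2] == 7);
    }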

@scylladb-promoter (Contributor):

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Docker Test
✅ - dtest with tablets
✅ - dtest
✅ - dtest with topology changes
❌ - Unit Tests

Failed Tests (1/36333):

Build Details:

  • Duration: 8 hr 8 min
  • Builder: i-06849aeee3a7f0e22 (m5ad.8xlarge)

locator/tablets.cc: outdated, resolved review thread
@bitpathfinder (Contributor, Author):

> ❌ - Unit Tests

Looks like this issue is a use-after-free, like in issue #20772.

@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from bfadf49 to 0cc90d5 on November 11, 2024 11:07
@scylladb-promoter (Contributor):

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Docker Test
❌ - dtest with topology changes
❌ - dtest
❌ - dtest with tablets
❌ - Unit Tests

Failed Tests (935/36791):

Build Details:

  • Duration: 8 hr 12 min
  • Builder: i-00041ad359a60777e (m5ad.8xlarge)

@bitpathfinder added the status/blocked label on Dec 10, 2024.
@bitpathfinder (Contributor, Author):

Blocked on #21856

@bhalevy (Member) commented Dec 11, 2024:

Based on #21289 (comment), removing the release blocker label as well.

* so CDC generation service takes sharded<db::sys_dist_ks> and must check local_is_initialized()
* every time it accesses it (because it may have been stopped already), then take local_shared()
* which will prevent sys_dist_ks from being destroyed while the service operates on it.
*/
Contributor:

Interestingly, this comment was wrong before this change: sys_dist_ks was stopped after the CDC generation service.

Btw, how does setup_group0_if_exists() depend on cdc_generation_service? It's not used explicitly.

Contributor (Author):

This code in storage_service::topology_state_load(state_change_hint hint) requires it.

for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
        rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
>>        co_await _cdc_gens.local().handle_cdc_generation(gen_id);
        if (gen_id == _topology_state_machine._topology.committed_cdc_generations.back()) {
            co_await _sys_ks.local().update_cdc_generation_id(gen_id);
            rtlogger.debug("topology_state_load: the last committed CDC generation ID: {}", gen_id);
        }
    }

@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from 5d2cb3f to 1d9eb54 on December 23, 2024 15:05
@bitpathfinder (Contributor, Author):

New version: rebase and fix merge conflicts.

@scylladb-promoter (Contributor):

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - Docker Test
✅ - dtest with topology changes
❌ - dtest
✅ - Offline-installer Artifact Tests
❌ - dtest with tablets
❌ - Unit Tests

Failed Tests (1054/37085):

Build Details:

  • Duration: 4 hr 3 min
  • Builder: spider7.cloudius-systems.com

Since the construction of the tablet_effective_replication_map will use `maybe_yield()` to avoid reactor stalls, `do_make_replication_map()` needs to be converted into a coroutine.

As part of introducing a new method for calculating the replication factor for Tablets, rename the existing `get_replication_factor` to `get_schema_replication_factor`, since it returns the replication factor from the replication strategy, whereas the new method will calculate the replication factor based on the current state of the Tablets.

1. Split the functionality of get_endpoints_for_reading into two functions:

- `get_endpoints_for_reading(const tablet_id tablet)` returns the current
   read replica host ID set from Tablets. This change is necessary as part
   of introducing new replication factor calculation functions.
- `get_endpoints_for_reading(const token& search_token)` uses the new function
   that takes a Tablet id.

2. Move `get_endpoints_for_reading`, `get_pending_endpoints`, and `get_replicas_for_write`
   logic for finding replicas in the Tablet map to the `tablet_map` class.
   This change is needed to build the replication factor map outside of the ERM
   and improves encapsulation of the `tablet_map` logic (a sketch of the split follows below).
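A simplified sketch of that split, with hypothetical stand-in types for tablet_map and the host ID set; the token overload resolves the tablet and delegates:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using host_id = uint64_t;                      // stand-in for locator::host_id
    using host_id_vector = std::vector<host_id>;   // stand-in for the replica set type
    struct tablet_id { size_t value; };
    struct token { int64_t value; };

    struct tablet_map_sketch {
        std::vector<host_id_vector> read_replicas;   // per-tablet read replica sets

        tablet_id get_tablet_id(token t) const {
            // Real code maps the token to a tablet via the tablet count; simplified here.
            return tablet_id{static_cast<size_t>(t.value) % read_replicas.size()};
        }

        // Overload 1: look up the read replica set directly by tablet id.
        const host_id_vector& get_endpoints_for_reading(tablet_id id) const {
            return read_replicas[id.value];
        }

        // Overload 2: resolve the token to a tablet id, then delegate.
        const host_id_vector& get_endpoints_for_reading(token t) const {
            return get_endpoints_for_reading(get_tablet_id(t));
        }
    };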
Implement a constructor that initializes `small_vector` with `n` elements, each set to a specified value.

To ensure ERM consistency, the topology initialization has been moved to occur before the creation of tablet-based ERMs during the initialization of non-system keyspaces. Since topology is initialized during the Group 0 snapshot loading process, the Group 0 initialization has been shifted to take place before `init_non_system_keyspaces`.

Add implementations for the new get_replication_factor functions in the effective replication map, which for Tablets calculate the replication factor based on actual replica counts from the Tablets map. This accounts for potential inconsistencies between the schema and Tablets during migration.

`topology::add_node`

In add_node the trace log was incorrectly printing the added node, since `nptr` was no longer valid after the move at line 185.

Replace `get_schema_replication_factor` with the new `get_replication_factor`, which, for tablets, returns the actual number of replicas while accounting for the transition state of Tablets.

Fixes scylladb#20625
@bitpathfinder force-pushed the bugfix/20625_fix_tablets_read_replica_set_calculation branch from 93596b9 to eccd809 on January 3, 2025 14:10
@bitpathfinder (Contributor, Author):

New version:

  • Rebase and fix conflicts.

@scylladb-promoter (Contributor):

🔴 CI State: ABORTED

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/small_vector_test
🔹 topology_custom/test_change_replication_factor_1_to_0
✅ - dtest with topology changes
✅ - dtest
❌ - dtest with tablets
❌ - Docker Test
✅ - Offline-installer Artifact Tests
❌ - Unit Tests

Failed Tests (1535/36346):

Build Details:

  • Duration: 12 hr
  • Builder: i-0dab6233f96d3b430 (m5ad.12xlarge)

Labels
area/tablets, backport/6.1 (should be backported to 6.1), backport/6.2 (should be backported to 6.2), P2 (High Priority), status/blocked