Fix Replication Factor calculation for Tablets effective replication map. #21289
base: master
Conversation
🟢 CI State: SUCCESS ✅ Build. Build Details:
replica/database.cc
Outdated
@@ -921,7 +921,7 @@ future<> database::add_column_family(keyspace& ks, schema_ptr schema, column_fam
     auto&& rs = ks.get_replication_strategy();
     locator::effective_replication_map_ptr erm;
     if (auto pt_rs = rs.maybe_as_per_table()) {
-        erm = pt_rs->make_replication_map(schema->id(), _shared_token_metadata.get());
+        erm = co_await pt_rs->make_replication_map(schema->id(), _shared_token_metadata.get());
This adds a preemption point between mark_synced() and adding the table. If a request sneaks in between and sees the schema version as synced, it may fail unnecessarily due to the missing table.
We should move mark_synced() later, which should have been done in 8842bd8
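A minimal sketch of the ordering concern (hypothetical names, not the PR's actual code): once building the replication map can suspend, the schema version should be marked as synced only after the table is registered.

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>

struct add_table_sketch {
    bool table_added = false;
    bool version_synced = false;

    seastar::future<> add_column_family_sketch() {
        // Building the tablet ERM may now yield to the reactor.
        co_await seastar::coroutine::maybe_yield();
        // Register the table first...
        table_added = true;
        // ...and only then mark the schema version synced, so a concurrent
        // request can never observe "synced" while the table is still missing.
        version_synced = true;
    }
};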
There was a recent bug that introduced a regression in this area due to adding preemption point:
#19833
and later fix:
#20064
@bitpathfinder I recommend studying this case to check that we don't break some other isolation properties that the code relies on (in addition to what @tgrabiec said).
Generally, not being super careful when adding yields has caused us a lot of pain in the past. One investigation, where an innocent yield in gossiper mode caused a super tough concurrency bug, took me more than a month... and the fix was 3 LOC: a0b331b
You personally encountered a problem like this recently: 612a141
So please do quadruple-check this code and area around it before adding a yield anywhere
After performing some benchmarks, I decided to remove the preemption point from build_dc_replication_factor_map, since with the standard testing parameter of 2048 tablets per table it takes about half a microsecond.
It turns out my previous benchmark was incorrect, as it was measuring tables with 32 Tablets. I have used perf-load-balancing to test the actual time, and for 131072 Tablets the replication factor map building time is about 9.5 ms, so the preemption point should be kept.
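A minimal sketch of keeping the preemption point, assuming Seastar's coroutine helpers; the per-DC replica counting itself is only hinted at here:

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>

#include <cstdint>
#include <vector>

// Fill a per-tablet replication-factor table, yielding periodically so that
// ~9.5 ms of work for 131072 tablets does not stall the reactor.
seastar::future<> build_rf_map_sketch(std::vector<uint8_t>& rf_per_tablet, size_t tablet_count) {
    rf_per_tablet.resize(tablet_count);
    for (size_t tablet = 0; tablet < tablet_count; ++tablet) {
        rf_per_tablet[tablet] = 0; // placeholder: count replicas per DC here
        co_await seastar::coroutine::maybe_yield(); // preempt if the task quota is used up
    }
}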
locator/tablets.cc
Outdated
    absl::flat_hash_map<sstring, replication_factor_t> _datacenter_map;
    // The `_tablet_to_dc_replication_factor_map` stores replication factors
    // per data center for each Tablet.
    std::vector<replication_factor_list> _tablet_to_dc_replication_factor_map;
Should be chunked_vector to avoid large contiguous allocations.
Can you please elaborate on why a large allocation is bad? Do you mean our allocator's storage might already be so fragmented that there would not be such a big contiguous chunk of memory (~5 MB for 100k tablets)?
Yes, the larger the allocation, the more likely it is to cause a premature OOM. In practice, we will try to reclaim contiguous space back from the LSA region (memtables/cache), which occupies higher addresses, but that may still require a lot of work to evict data and move segments around, and may cause reactor stalls. We maintain a reserve of free standard-allocator space which is supposed to amortize this cost, but if the request cannot be met due to fragmentation, it cannot make use of that reserve.
As a general rule, we avoid contiguous allocations larger than 128K (the LSA segment size). 128K is the unit of memory reclamation from LSA to the standard allocator's pool.
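A minimal sketch of the suggested change, assuming Scylla's utils::chunked_vector (which by default caps each contiguous allocation at the 128K LSA segment size); the element type here is an illustrative stand-in:

#include "utils/chunked_vector.hh"

#include <cstdint>

// Illustrative stand-in for the per-tablet, per-DC replication factor entry.
struct rf_list_sketch {
    uint8_t per_dc[4];
};

struct erm_storage_sketch {
    // Before: std::vector<rf_list_sketch>, i.e. one contiguous allocation of
    // several MB for ~100k tablets, which fragmentation may make unsatisfiable.
    // After: allocations no larger than 128K, matching the LSA segment size.
    utils::chunked_vector<rf_list_sketch> tablet_to_dc_rf_map;
};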
In the benchmark I see that the number of tablets per table is 2048. Is that a typical number for production, or is it just a testing number? Also, when I set it to 65k I get an error: "Exiting on unhandled exception: std::bad_alloc".
@tgrabiec So I suppose the max Tablets per table will be 8192, since otherwise our own tests can't be run. In that case the memory allocated by the _tablet_to_dc_replication_factor_map vector is 8192*8 = 65536 bytes (counting 4 data centers). Do you still think it makes sense to switch to a chunked vector (which by default allocates in 128K chunks)?
Will change to chunked vector.
In the benchmark I see that the number of tablets per table is 2048. Is that a typical number for production, or is it just a testing number? Also, when I set it to 65k I get an error: "Exiting on unhandled exception: std::bad_alloc".
Can you share the command line which fails?
We aim at 100k tablets per cluster.
Some tests may not be able to handle that in a single shard due to the overhead of tablet replicas. In a real cluster, that overhead would be spread among many shards.
@tgrabiec I ran the following command:
./build/dev/scylla perf-tablets --tablets-per-table 65536
It fails with std::bad_alloc. In fact, everything higher than 8192 fails.
@tgrabiec I ran the following command:
./build/dev/scylla perf-tablets --tablets-per-table 65536
It fails with std::bad_alloc. In fact, everything higher than 8192 fails.
The default number of tables is 100, so with --tablets-per-table 65536 you have 6.5 M tablets. That's beyond our target of 100k.
tablet_map takes 600 MB. It probably bad_allocs due to mutation object size when persisting. With -m8G it proceeds, but then it takes almost a minute to save:
INFO 2024-11-06 14:06:48,099 [shard 0:stat] testlog - Saved in 41656.312500 [ms]
INFO 2024-11-06 14:07:05,748 [shard 0:stat] testlog - Read in 17534.886719 [ms]
And we fail in read_tablet_mutations() due to:
ERROR 2024-11-06 14:07:08,778 [shard 0:main] seastar - Exiting on unhandled exception: std::runtime_error (Maximum amount of memory for building query results is exhausted, unpaged query cannot be finished)
That one is used in migration_manager.cc, in schema pull handler:
if (options->group0_snapshot_transfer) {
cm.emplace_back(co_await db::system_keyspace::get_group0_history(db));
if (proxy.local().local_db().get_config().enable_tablets()) {
for (auto&& m: co_await replica::read_tablet_mutations(db)) {
We shouldn't use that path anymore for snapshot transfer, right @kbr-scylla ?
We use that path for group0 snapshot transfer.
I opened #21456
locator/tablets.cc
Outdated
            const sstring& datacenter = topo.get_datacenter(node.host);
            const auto it = _datacenter_map.find(datacenter);
            if (it == _datacenter_map.end()) {
                tablet_logger.debug("Could not find datacenter: {} for node: {} ", datacenter, node);
That's a rather serious error, because it will result in an inconsistency between the replication factor and the read replica set, i.e. exactly the problem this patch tries to fix. And this one, if it happens, may be permanent. It probably deserves on_internal_error().
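A hedged sketch of what that could look like (assumed helper shape, not the PR's final code), using Seastar's on_internal_error to make the inconsistency fatal instead of a debug log:

#include <seastar/util/log.hh>

#include <fmt/format.h>
#include <string>

static seastar::logger tablet_logger_sketch("tablets");

// Abort (or throw, depending on configuration) when a replica's DC is missing
// from the datacenter map, since RF and the read replica set would diverge.
void require_datacenter_known(bool found, const std::string& datacenter) {
    if (!found) {
        seastar::on_internal_error(tablet_logger_sketch,
            fmt::format("Could not find datacenter {} while building the tablet RF map", datacenter));
    }
}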
I think this comment also applies to the if (node_ptr == nullptr) branch. If we can't find the node* it means in particular that we can't find the DC for this replica, and we become inconsistent.
But sometimes this is the case, due to a short window of inconsistency between the Tablets and the schema.
But this code is performing a lookup in the locator::topology data structure. This data structure is completely independent of any schema definitions; it's based purely on topology state. It should never be inconsistent with tablets.
Am I missing something?
I did not investigate much, but I was consistently getting these errors in the tests when the ERM was created before the node was initialized.
locator/tablets.cc
Outdated
                continue;
            }
            const sstring& datacenter = topo.get_datacenter(node.host);
            const auto it = _datacenter_map.find(datacenter);
Does ERM construction time significantly increase because of this for a large tablet count (100k)?
If so, this could be optimized by keeping a DC index in the topology.
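A sketch of the suggested optimization, with hypothetical names: resolve each datacenter name to a small integer index once, cache that index per node, and make the hot loop over tablet replicas a plain array increment instead of a string hash lookup.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assigned once, when a datacenter is first seen in the topology.
struct dc_index_table {
    std::unordered_map<std::string, uint8_t> index_by_name;

    uint8_t get_or_add(const std::string& dc) {
        auto [it, inserted] = index_by_name.try_emplace(dc, static_cast<uint8_t>(index_by_name.size()));
        return it->second;
    }
};

// Each node caches its DC index, so counting replicas per DC for a tablet is
// a single vector increment.
struct node_sketch {
    uint8_t dc_index;
};

void count_replica(std::vector<uint8_t>& rf_per_dc, const node_sketch& node) {
    ++rf_per_dc[node.dc_index];
}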
Good idea. I suppose the time spent in the map will be proportional to the number of data centers and their name lengths, especially if the names are longer than the SSO limit.
    auto rs = db.find_keyspace(table->schema()->ks_name()).get_replication_strategy_ptr();
    locator::effective_replication_map_ptr erm;
    if (auto pt_rs = rs->maybe_as_per_table()) {
        erm = pt_rs->make_replication_map(id, tmptr);
Same as above -- please check carefully if this yield won't cause problems
locator/tablets.cc
Outdated
    // For each tablet index in the tablet map, it retrieves the current replica set using `get_endpoints_for_reading`
    // and counts replicas per data center. Each data center's replication factor (RF) is saved under
    // its index according to `_datacenter_map`.
    seastar::future<> build_dc_replication_factor_map() {
nit: we usually omit the seastar:: prefix for futures
removed
@@ -41,6 +41,11 @@ void everywhere_replication_strategy::validate_options(const gms::feature_servic
 }

+sstring everywhere_replication_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const {
+    const auto& replication_strategy = erm.get_replication_strategy();
+    if (replication_strategy.uses_tablets()) {
We're inside everywhere_replication_strategy code, so erm.get_replication_strategy() is this strategy, no? In which case this branch is dead code (only NTS can use tablets).
locator/tablets.cc
Outdated
-    return make_ready_future<effective_replication_map_ptr>(
-            seastar::make_shared<tablet_effective_replication_map>(table, std::move(rs), std::move(tm), replication_factor));
+    auto erm = seastar::make_shared<tablet_effective_replication_map>(table, std::move(rs), std::move(tm), replication_factor);
+    co_await erm->build_dc_replication_factor_map();
Can we detect that there are no tablets in transition and not populate anything? If the array is not populated, we return values according to the RF.
IIRC this won't work, because when applying mixed_change or a snapshot there is a short period of time where the schema is updated but topology is not updated yet. So we construct the e_r_m based on the updated schema (e.g. RF reduced from 1 to 0) but old topology (tablets are still not in a transition state). This state in practice holds for less than a second -- it is basically happening between these two co_awaits:
[&] (mixed_change& chng) -> future<> {
co_await _mm.merge_schema_from(netw::messaging_service::msg_addr(std::move(cmd.creator_addr)), std::move(chng.mutations));
co_await _ss.topology_transition();
but it's still happening and if we allowed it, the bug is still present.
If the goal is to reduce lookups and memory in the case of shared RF, we could compute the shared RF map from erm if we notice that all tablets have the same replica count.
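A hedged sketch of that idea (hypothetical helper, not the PR's code): after building the per-tablet map, collapse it to a single shared entry when every tablet has the same per-DC replica counts.

#include <cstdint>
#include <optional>
#include <vector>

using rf_per_dc = std::vector<uint8_t>;

// Returns the shared per-DC replica counts if all tablets agree, so the caller
// can drop the per-tablet table and keep one entry instead.
std::optional<rf_per_dc> try_collapse_to_shared_rf(const std::vector<rf_per_dc>& per_tablet) {
    if (per_tablet.empty()) {
        return std::nullopt;
    }
    for (const auto& rf : per_tablet) {
        if (rf != per_tablet.front()) {
            return std::nullopt; // some tablet differs, e.g. it is in transition
        }
    }
    return per_tablet.front();
}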
Either way we should probably fix mixed_change so it reloads topology and schema in an isolated fashion. (I believe @nuivall's work will address it.)
c48a66b to bfadf49
New version:
🔴 CI State: FAILURE ✅ Build. Failed Tests (1/36333). Build Details:
Looks like this issue is a use-after-free, like in issue #20772.
bfadf49 to 0cc90d5
🔴 CI State: FAILURE ✅ Build. Failed Tests (935/36791). Build Details:
Blocked on #21856
Based on #21289 (comment), removing the
 * so CDC generation service takes sharded<db::sys_dist_ks> and must check local_is_initialized()
 * every time it accesses it (because it may have been stopped already), then take local_shared()
 * which will prevent sys_dist_ks from being destroyed while the service operates on it.
 */
Interestingly, this comment was wrong before this change: sys_dist_ks was stopped after the CDC generation service.
Btw, how does setup_group0_if_exists() depend on cdc_generation_service? It's not used explicitly.
This code in storage_service::topology_state_load(state_change_hint hint) requires it.
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
>> co_await _cdc_gens.local().handle_cdc_generation(gen_id);
if (gen_id == _topology_state_machine._topology.committed_cdc_generations.back()) {
co_await _sys_ks.local().update_cdc_generation_id(gen_id);
rtlogger.debug("topology_state_load: the last committed CDC generation ID: {}", gen_id);
}
}
5d2cb3f to 1d9eb54
New version: rebase and fix merge conflicts.
1d9eb54 to 93596b9
🔴 CI State: FAILURE ✅ Build. Failed Tests (1054/37085). Build Details:
Since the construction of the tablet_effective_replication_map will use `maybe_yield()` to avoid reactor stalls, `do_make_replication_map()` needs to be converted into a coroutine.
As part of introducing a new method for calculating the replication factor for Tablets, rename the existing `get_replication_factor` to `get_schema_replication_factor`, since it returns the replication factor from the replication strategy, whereas the new method will calculate the replication factor based on the current state of the Tablets.
1. Split the functionality of `get_endpoints_for_reading` into two functions:
   - `get_endpoints_for_reading(const tablet_id tablet)` returns the current read replica host ID set from Tablets. This change is necessary as part of introducing the new replication factor calculation functions.
   - `get_endpoints_for_reading(const token& search_token)` uses the new function that takes a Tablet id.
2. Move `get_endpoints_for_reading`, `get_pending_endpoints`, and `get_replicas_for_write` logic for finding replicas in the Tablet map to the `tablet_map` class. This change is needed to build the replication factor map outside of the ERM and improves encapsulation of the `tablet_map` logic.
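A simplified sketch of this split, with hypothetical stand-in types (the real tablet_id, token, and replica-set types live under locator/ and differ in detail):

#include <cstddef>
#include <cstdint>
#include <vector>

struct tablet_id_sketch { size_t value; };              // stand-in for locator::tablet_id
using host_id_sketch = uint64_t;                        // stand-in for locator::host_id
using replica_set_sketch = std::vector<host_id_sketch>;

struct tablet_map_sketch {
    std::vector<replica_set_sketch> read_replicas;      // indexed by tablet id

    // New primitive: read replicas of one tablet, also usable by the RF map builder.
    const replica_set_sketch& get_endpoints_for_reading(tablet_id_sketch id) const {
        return read_replicas.at(id.value);
    }

    // Token-based variant: resolve the owning tablet, then delegate.
    const replica_set_sketch& get_endpoints_for_reading(int64_t search_token) const {
        return get_endpoints_for_reading(tablet_of(search_token));
    }

    // Placeholder mapping; the real code derives the tablet from the token.
    tablet_id_sketch tablet_of(int64_t token) const {
        return tablet_id_sketch{static_cast<size_t>(static_cast<uint64_t>(token)) % read_replicas.size()};
    }
};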
Implement a constructor that initializes `small_vector` with `n` elements, each set to a specified value.
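A minimal usage sketch, assuming the new constructor mirrors std::vector's fill constructor and that this is Scylla's utils::small_vector:

#include "utils/small_vector.hh"

#include <cstdint>

void example() {
    // Three data centers, each starting with a replication factor of 1,
    // kept in the inline storage (no heap allocation for up to 4 entries).
    utils::small_vector<uint8_t, 4> rf_per_dc(3, uint8_t(1));
}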
To ensure ERM consistency, the topology initialization has been moved to occur before the creation of tablet-based ERMs during the initialization of non-system keyspaces. Since topology is initialized during the Group 0 snapshot loading process, the Group 0 initialization has been shifted to take place before `init_non_system_keyspaces`.
Add implementations for the new get_replication_factor functions in the effective replication map that, for Tablets, calculate the replication factor based on actual replica counts from the Tablets map. This accounts for potential inconsistencies between the schema and Tablets during migration.
`topology::add_node`: in add_node the trace log was incorrectly printing the added node, since `nptr` was no longer valid after the move at line 185.
Replace `get_schema_replication_factor` with the new `get_replication_factor`, which, for tablets, returns the actual number of replicas while accounting for the transition state of Tablets. Fixes scylladb#20625
93596b9 to eccd809
New version:
🔴 CI State: ABORTED ✅ Build. Failed Tests (1535/36346). Build Details:
The current implementation of the replication factor calculation does not take Tablets migration into account. Since a table can be represented by multiple Tablets, during migration different Tablets can be in different states, so there cannot be a single state for the table; the previously used logic of taking the RF from the schema, which worked for vnodes, is not valid while tablets are migrating.
In this PR, the effective_replication_map::get_replication_factor() implementation was modified for Tablets. Instead of retrieving the replication factor from the schema, it now calculates it based on the actual replica set from the Tablets map. This change addresses potential inconsistencies between the schema and the state of individual tablets during migration.
Fixes #20625
Fixes #20282
Since this fixes an important issue with Tablets, we need backports to 6.0, 6.1, 6.2.