Sort by proximity: shuffle equal-distance replicas #21958
🔴 CI State: FAILURE ❌ - Build
service/storage_proxy.cc
Outdated
if (topology.can_sort_by_proximity()) {
    topology.do_sort_by_proximity(my_id, ids);
} else {
    // FIXME: before dynamic snitch is implement put local address (if present) at the beginning
- // FIXME: before dynamic snitch is implement put local address (if present) at the beginning
+ // FIXME: before dynamic snitch is implemented put local address (if present) at the beginning
Fixed in v2 (2489444)
d1e35b0 to 2489444
In v2 (2489444):
using clock_type = std::chrono::high_resolution_clock;

// Called in a seastar thread
void test_sort_by_proximity(clock_type::duration duration) {
Perhaps we could leverage facilities like PERF_TEST_F() provided by Seastar?
For instance, see https://github.com/scylladb/scylladb/blob/master/test/perf/perf_big_decimal.cc
good idea, thanks
In v3 (bc3acd5)
2489444 to c86978a
c86978a to bc3acd5
In v3 (bc3acd5):
🔴 CI State: ABORTED
        return compare_endpoints(address, a1, a2) < 0;
    });
    if (can_sort_by_proximity()) {
        do_sort_by_proximity(address, addresses);
    }
}

void topology::sort_by_proximity(locator::host_id address, host_id_vector_replica_set& addresses) const {
@gleb-cloudius do we still need duplicates of these functions?
Eventually it will be removed. You just queued a series that removes some of its users, maybe the last ones. Need to check.
It looks like the function can be removed once the series that moves streaming/repair to host IDs is merged.
There are still callers that use it with inet_address, like range_streamer::get_all_ranges_with_sources_for, row_level_repair::sort_peer_nodes, and storage_service::get_new_source_ranges.
Cc @asias
@bhalevy did you check in the next branch?
No. Now I see 5a849b0 (Merge "Move more subsystems to use host ids instead of ips" from Gleb)
I'll rebase on top of it once it's promoted
@@ -643,6 +643,7 @@ def find_ninja():
     'test/perf/perf_idl',
     'test/perf/perf_vint',
     'test/perf/perf_big_decimal',
sort_by_proximity_topology.perf_sort_by_proximity 994665 996.540ns 0.900ns 995.639ns 998.420ns 0.000 0.000 13230.4 2982.9
13k instructions to sort by proximity?!
I'll check if that's normalized by iteration, as each call does 64 iterations.
It is normalized by the number of iterations in each run.
See https://github.com/scylladb/seastar/blob/master/tests/perf/perf_tests.cc#L332
So we need 1100 instructions to sort three items?
At least you get 4.5 IPC.
btw std::sort is likely terrible for such low counts. Bubble sort would likely be better.
(or insertion sort)
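To make the insertion-sort suggestion concrete, here is a minimal sketch for tiny replica vectors. The `host_distance` struct and the function name are hypothetical stand-ins, not code from this series, and whether this actually beats std::sort for 3 to 15 entries would have to be measured with the benchmark.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-replica record; not the PR's type.
struct host_distance {
    int id;        // placeholder for a host_id
    int distance;  // smaller means closer to the reference node
};

// Stable insertion sort by distance. O(n^2) comparisons in the worst case,
// but no allocation and no recursion, which suits very small n.
inline void insertion_sort_by_distance(std::vector<host_distance>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        host_distance cur = v[i];
        std::size_t j = i;
        // Shift strictly farther entries one slot to the right.
        // Entries at an equal distance are not moved, so the sort is stable.
        while (j > 0 && v[j - 1].distance > cur.distance) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = cur;
    }
}
```

For example, entries with (id, distance) of (1,2), (2,0), (3,2), (4,1) end up in id order 2, 4, 1, 3.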
locator/topology.cc
Outdated
@@ -580,20 +580,31 @@ void topology::sort_by_proximity(locator::host_id address, host_id_vector_replic

template <typename T>
void topology::do_sort_by_proximity(T address, utils::small_vector<T, 3>& addresses) const {
2.8k instructions is much better than 13k instructions, but still outrageous.
What are we measuring here?
sort_by_proximity of 15 nodes using a reference node that is one of them
locator/topology.cc
Outdated
        host_infos.emplace_back(id, distance(address, loc, id, loc1));
    }
    std::ranges::sort(host_infos, [&](const info& i1, const info& i2) {
        return i1.distance < i2.distance;
    });
std::ranges::sort(host_infos, std::ranges::less(), std::mem_fn(&host_info::distance));
locator/topology.cc
Outdated
    for (const auto& id : addresses) {
        const auto& loc1 = get_location(id);
        host_infos.emplace_back(id, distance(address, loc, id, loc1));
    }
Using std::ranges transformations and std::ranges::to would be faster since it can use the from_range_t constructor which avoids checks after every emplace_back (which the compiler may not be able to eliminate).
The distance calculations could be done during effective_replication_map preparation. In fact we could prepare all the shuffled variants and pick one randomly, for small replication factors. But out of scope for now.
> Using std::ranges transformations and std::ranges::to would be faster since it can use the from_range_t constructor which avoids checks after every emplace_back (which the compiler may not be able to eliminate).
good idea, will do
locator/topology.cc
Outdated
@@ -580,6 +581,9 @@ void topology::sort_by_proximity(locator::host_id address, host_id_vector_replic

template <typename T>
void topology::do_sort_by_proximity(T address, utils::small_vector<T, 3>& addresses) const {
    static thread_local std::mt19937_64 random_engine(std::random_device{}());
Do we need some repeatable seed thing here for tests?
Can consider a simpler random engine like linear congruential (may or may not help).
The internet favors mt, so let's keep it.
There are others (xoshiro's) that are faster with smaller state size, I believe.
> Do we need some repeatable seed thing here for tests?
I'm not sure it will make any difference, but we can have the random_engine live in topology
and expose a private method for the test to seed it.
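A rough sketch of that idea, assuming the engine becomes a member instead of a function-local static. `topology_sketch`, `seed_random_engine` and `next_random` are hypothetical names for illustration and do not exist in locator::topology; the seeding hook is public here only to keep the sketch short.

```cpp
#include <cstdint>
#include <random>

// Hypothetical sketch; not the real locator::topology.
class topology_sketch {
    // Seeded non-deterministically by default, as in the diff.
    std::mt19937_64 _random_engine{std::random_device{}()};
public:
    // Test-only hook: reseed so that shuffles become repeatable.
    void seed_random_engine(std::uint64_t seed) {
        _random_engine.seed(seed);
    }

    // Stand-in for whatever consumes random bits while shuffling.
    std::uint64_t next_random() {
        return _random_engine();
    }
};
```

Two instances seeded with the same value then produce the same sequence, which is what a test would rely on.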
locator/topology.cc
Outdated
        }
        shuffler = std::rotr(shuffler, 1);
    }
    *it++ = host_infos[i-1].id;
This is executed n-1 times; in the previous code it was executed n times.
You're right, the last component must be copied too
@@ -11,6 +11,7 @@
 #include <seastar/core/on_internal_error.hh>
 #include <seastar/util/lazy.hh>
 #include <utility>
 #include <bit>
There's a huge discrepancy between the hit in instruction count and the hit in throughput.
What metrics do you consider having a huge discrepancy?
Before:
test                                               iterations  median     mad      min        max        allocs  tasks  inst     cycles
sort_by_proximity_topology.perf_sort_by_proximity  994665      996.540ns  0.900ns  995.639ns  998.420ns  0.000   0.000  13230.4  2982.9
After:
sort_by_proximity_topology.perf_sort_by_proximity  4260690     236.851ns  0.523ns  236.327ns  241.161ns  1.000   0.000  2847.9   710.0
Instructions: 13230.4 / 2847.9 = 4.65
Cycles: 2982.9 / 710.0 = 4.20
Iterations: 4260690 / 994665 = 4.28
Ah, I see what you mean. The last patch causes a much bigger hit in cycles than instructions.
    test_compare_endpoints(topo, address, a1, a2);
    test_compare_endpoints(topo, address, a2, a1);

    test_compare_endpoints(topo, bogus_address, bogus_address, bogus_address);
While you are here, make the test work on host IDs rather than IPs (I have a patch that does that, but it will conflict with your series). Also drop the bogus_address checks, since topology::get_location(host_id) does not work for a non-existing node. topology::get_location(inet_address) has a hack that makes it work, but the test was always bogus.
Yeah, can do
🔴 CI State: FAILURE ✅ - Build; Failed Tests (2065/36725)
bc3acd5 to 0c0ce62
3ce6840 to 6d58ccd
🟢 CI State: SUCCESS ✅ - Build
        auto it = std::ranges::find(ids, my_id);
        if (it != ids.end() && it != ids.begin()) {
            std::iter_swap(it, ids.begin());
        }
    }
It would be better to enhance SimpleSnitch to contain rudimentary location metadata (like: I'm close and everyone else is far) to save this extra test. But not in scope here.
Extract can_sort_by_proximity() out so it can be used later by storage_proxy, and introduce do_sort_by_proximity that sorts unconditionally.
Signed-off-by: Benny Halevy <[email protected]>
benchmark sort_by_proximity
Baseline results on my desktop for sorting 3 nodes:
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 1
random seed: 20241224
test                                               iterations  median     mad      min        max        allocs  tasks  inst     cycles
sort_by_proximity_topology.perf_sort_by_proximity  12808773    77.368ns   0.062ns  77.300ns   77.873ns   0.000   0.000  1194.2   231.6
Signed-off-by: Benny Halevy <[email protected]>
…ot sort by proximity
topology::sort_by_proximity already sorts the local node address first, if present, so look it up only when using SimpleSnitch, where sort_by_proximity() is a no-op.
Signed-off-by: Benny Halevy <[email protected]>
So we can use it for defining other small_vector deriving their internal capacity from another small_vector type.
Signed-off-by: Benny Halevy <[email protected]>
And use a temporary vector to use the precalculated distances.
A later patch will add some randomization to shuffle nodes at the same distance from the reference node.
This improves the function performance by 50% for 3 replicas, from 77.4 ns to 39.2 ns; larger replica sets show greater improvement (over 4X for 15 nodes):
Before:
test                                               iterations  median     mad      min        max        allocs  tasks  inst     cycles
sort_by_proximity_topology.perf_sort_by_proximity  12808773    77.368ns   0.062ns  77.300ns   77.873ns   0.000   0.000  1194.2   231.6
After:
sort_by_proximity_topology.perf_sort_by_proximity  25541973    39.225ns   0.114ns  38.966ns   39.339ns   0.000   0.000  588.5    116.6
Signed-off-by: Benny Halevy <[email protected]>
6d58ccd to 0451d0c
To improve balancing when reading in 1 < CL < ALL
This implementation has a moderate impact on the function performance, in contrast to a full std::shuffle of the vector before stable_sort-ing it (especially with a large number of nodes to sort).
Before:
test                                               iterations  median     mad      min        max        allocs  tasks  inst     cycles
sort_by_proximity_topology.perf_sort_by_proximity  25541973    39.225ns   0.114ns  38.966ns   39.339ns   0.000   0.000  588.5    116.6
After:
sort_by_proximity_topology.perf_sort_by_proximity  19689561    50.195ns   0.119ns  50.076ns   51.145ns   0.000   0.000  622.5    150.6
Signed-off-by: Benny Halevy <[email protected]>
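For readers who want to see the intent of shuffling equal-distance replicas, here is a simplified sketch: sort by distance, then shuffle each run of entries that share a distance. This is not the loop used in this series (which rotates a `shuffler` value and is not reproduced here), and it is also not the shuffle-then-stable_sort alternative the commit message compares against. `replica_info` and `sort_and_shuffle_ties` are hypothetical names.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical stand-in for the per-replica record.
struct replica_info {
    int id;        // placeholder for a host_id
    int distance;  // distance from the reference node
};

// Sort by distance, then randomize the order inside each equal-distance run,
// so that no single replica is always preferred among equally close ones.
template <typename Rng>
void sort_and_shuffle_ties(std::vector<replica_info>& v, Rng& rng) {
    std::sort(v.begin(), v.end(), [](const replica_info& a, const replica_info& b) {
        return a.distance < b.distance;
    });
    auto first = v.begin();
    while (first != v.end()) {
        // Find the end of the run that shares first->distance.
        auto last = std::find_if(first, v.end(), [d = first->distance](const replica_info& r) {
            return r.distance != d;
        });
        std::shuffle(first, last, rng);
        first = last;
    }
}
```

The distance ordering is preserved; only the relative order of replicas at the same distance varies from call to call.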
0451d0c to d1490bb
In v6 (d1490bb):
🟢 CI State: SUCCESS ✅ - Build
This series re-implements locator::topology::sort_by_proximity
and adds randomization that shuffles equal-distance replicas, improving load balancing
when reading with 1 < consistency level < replication factor.
This change also adds a manual test for benchmarking sort_by_proximity,
as it's not exercised by the single-node perf-simple-query.
The benchmark shows a performance improvement of over 20% (from about 71 ns to 56 ns
per call for 3-node vectors), mainly due to "calculate distance only once", which
pre-calculates the distance from the reference node for each replica once, rather than
each time the comparator is called by std::sort.
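The "calculate distance only once" idea can be sketched roughly as follows. The `location` struct, the 0/1/2 distance metric and the function names are assumptions made for illustration; the real topology types and distance values differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical location record; not the real locator types.
struct location {
    std::string dc;
    std::string rack;
};

// Assumed metric: 0 = same rack, 1 = same DC but another rack, 2 = another DC.
inline int distance(const location& a, const location& b) {
    if (a.dc != b.dc) {
        return 2;
    }
    return a.rack != b.rack ? 1 : 0;
}

inline void sort_by_precomputed_distance(const location& ref, std::vector<location>& nodes) {
    // Compute each distance exactly once, instead of inside the comparator,
    // where it would be recomputed on every comparison made by the sort.
    std::vector<std::pair<int, std::size_t>> keyed;  // (distance, original index)
    keyed.reserve(nodes.size());
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        keyed.emplace_back(distance(ref, nodes[i]), i);
    }
    // Pairs compare by distance first, then by original index.
    std::sort(keyed.begin(), keyed.end());

    std::vector<location> sorted;
    sorted.reserve(nodes.size());
    for (const auto& k : keyed) {
        sorted.push_back(std::move(nodes[k.second]));
    }
    nodes = std::move(sorted);
}
```

With n nodes this makes n distance calls, where a comparator-based sort makes two per comparison.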