Sort by proximity: shuffle equal-distance replicas #21958

bhalevy · 2024-12-17T17:33:45Z

This series re-implements locator::topology::sort_by_proximity
and adds some randomization to shuffle equal-distance replicas for improving load-balancing
when reading with 1 < consistency level < replication factor.

This change also adds a manual test for benchmarking sort_by_proximity,
as it's not exercised by the single-node perf-simple-query.

The benchmark shows performance improvement of over 20% (from about 71 ns to 56 ns
per call for 3 nodes vectors), mainly due to "calculate distance only once" which
pre-calculates the distance from the reference node for each replica once, rather than
each time to comparator is called by std::sort

Improvement. No backport needed

scylladb-promoter · 2024-12-17T18:21:53Z

🔴 CI State: FAILURE

❌ - Build

Build Details:

Duration: 47 min
Builder: i-003011695b0397d0b (m5d.12xlarge)

mykaul · 2024-12-18T07:45:07Z

service/storage_proxy.cc

+    if (topology.can_sort_by_proximity()) {
+        topology.do_sort_by_proximity(my_id, ids);
+    } else {
+        // FIXME: before dynamic snitch is implement put local address (if present) at the beginning


Suggested change

// FIXME: before dynamic snitch is implement put local address (if present) at the beginning

// FIXME: before dynamic snitch is implemented put local address (if present) at the beginning

Fixed in v2 (2489444)

bhalevy · 2024-12-18T08:10:57Z

In v2 (2489444):

rebased
fixed grammar in comment as per Sort by proximity: shuffle equal-distance replicas #21958 (comment)
fixed build error of test/boost/network_topology_strategy_test by specializing the topology::distance template in locator/topology.cc
locator/topology: sort_by_proximity: calculate distance only once
- cleaned up patch to reduce diff to the minimum

tchaikov · 2024-12-18T08:20:48Z

test/manual/sort_by_proximity_test.cc

+using clock_type = std::chrono::high_resolution_clock;
+
+// Called in a seastar thread
+void test_sort_by_proximity(clock_type::duration duration) {


probably we could leverage the facilities like PERF_TEST_F() provided by seastar? for instance, see https://github.com/scylladb/scylladb/blob/master/test/perf/perf_big_decimal.cc

good idea, thanks

In v3 (bc3acd5)

bhalevy · 2024-12-18T10:59:12Z

In v3 (bc3acd5):

moved manual test to test/perf/perf_sort_by_proximity
updated git log messages

scylladb-promoter · 2024-12-18T11:03:22Z

🔴 CI State: ABORTED

Build Details:

Duration: 14 min
Builder: i-01e8caf90d8ad6b4a (m5ad.8xlarge)

avikivity · 2024-12-18T12:47:46Z

locator/topology.cc

-            return compare_endpoints(address, a1, a2) < 0;
-        });
+    if (can_sort_by_proximity()) {
+        do_sort_by_proximity(address, addresses);
    }
 }

 void topology::sort_by_proximity(locator::host_id address, host_id_vector_replica_set& addresses) const {


@gleb-cloudius do we still need duplicates of these functions?

Eventually it will be remove. You just queued a series that removes some of its users. May be last once. Need to check.

It looks like after the series to move streaming/repair to host id will be merged the function can be removed.

There are still callers that use it with inet_address like: range_streamer::get_all_ranges_with_sources_for, row_level_repair::sort_peer_nodes, storage_service::get_new_source_ranges.

Cc @asias

@bhalevy did you check in the next branch?

No. Now I see 5a849b0 (Merge "Move more subsystems to use host ids instead of ips" from Gleb)

I'll rebase on top of it once it's promoted

avikivity · 2024-12-18T12:49:33Z

configure.py

@@ -643,6 +643,7 @@ def find_ninja():
    'test/perf/perf_idl',
    'test/perf/perf_vint',
    'test/perf/perf_big_decimal',


sort_by_proximity_topology.perf_sort_by_proximity 994665 996.540ns 0.900ns 995.639ns 998.420ns 0.000 0.000 13230.4 2982.9

13k instructions to sort by proximity?!

I'll check if that's normalized by iteration, as each call does 64 iterations.

It is normalized by the number of iterations in each run.
See https://github.com/scylladb/seastar/blob/master/tests/perf/perf_tests.cc#L332

So we need 1100 instructions to sort three items?

At least you get 4.5IPC.

btw std::sort is likely terrible for such low counts. Bubble sort would likely be better.

(or insertion sort)

avikivity · 2024-12-18T12:53:02Z

locator/topology.cc

@@ -580,20 +580,31 @@ void topology::sort_by_proximity(locator::host_id address, host_id_vector_replic

 template <typename T>
 void topology::do_sort_by_proximity(T address, utils::small_vector<T, 3>& addresses) const {


2.8k cycles is much better than 13k cycles, but still outrageous.

What are we measuring here?

sort_by_proximity of 15 nodes using a reference node that is one of them

avikivity · 2024-12-18T12:53:57Z

locator/topology.cc

+        host_infos.emplace_back(id, distance(address, loc, id, loc1));
+    }
+    std::ranges::sort(host_infos, [&](const info& i1, const info& i2) {
+        return i1.distance < i2.distance;
    });


std::ranges::sort(host_infos, std::ranges::less(), std::mem_fn(&host_info::distance));

avikivity · 2024-12-18T12:55:38Z

locator/topology.cc

+    for (const auto& id : addresses) {
+        const auto& loc1 = get_location(id);
+        host_infos.emplace_back(id, distance(address, loc, id, loc1));
+    }


Using std::range transformations and std::ranges::to would be faster since it can use the from_range_t constructor which avoids checks after every emplace_back (which the compiler may not be able to eliminate).

The distance calculations could be done during effective_replication_map preparation. In fact we could prepare all the shuffled variants and pick one randomly, for small replication factors. But out of scope for now.

Using std::range transformations and std::ranges::to would be faster since it can use the from_range_t constructor which avoids checks after every emplace_back (which the compiler may not be able to eliminate).

good idea, will do

avikivity · 2024-12-18T13:01:18Z

locator/topology.cc

@@ -580,6 +581,9 @@ void topology::sort_by_proximity(locator::host_id address, host_id_vector_replic

 template <typename T>
 void topology::do_sort_by_proximity(T address, utils::small_vector<T, 3>& addresses) const {
+    static thread_local std::mt19937_64 random_engine(std::random_device{}());


Do we need some repeatable seed thing here for tests?

Can consider a simpler random engine like linear congruential (may or may not help).

The internet favors mt, so let's keep it.

There are others (xoshiro's) that are faster with smaller state size, I believe.

Do we need some repeatable seed thing here for tests?

I'm not sure it will make any difference, but we can have the random_engine live in topology
and expose a private method for the test to seed it.

avikivity · 2024-12-18T13:05:57Z

locator/topology.cc

+            }
+            shuffler = std::rotr(shuffler, 1);
+        }
+        *it++ = host_infos[i-1].id;


This is executed n-1 times, in the previous code n times.

You're right, the last component must be copied too

avikivity · 2024-12-18T13:06:59Z

locator/topology.cc

@@ -11,6 +11,7 @@
 #include <seastar/core/on_internal_error.hh>
 #include <seastar/util/lazy.hh>
 #include <utility>
+#include <bit>



There's a huge discrepancy between the hit in instruction count and the hit in throughput.

What metrics do you consider having a huge discrepancy?

Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 994665 996.540ns 0.900ns 995.639ns 998.420ns 0.000 0.000 13230.4 2982.9 After: sort_by_proximity_topology.perf_sort_by_proximity 4260690 236.851ns 0.523ns 236.327ns 241.161ns 1.000 0.000 2847.9 710.0

Instructions: 13230.4 / 2847.9 = 4.65
Cycles: 2982.9 / 710.0 = 4.20
Iterations: 4260690. / 994665 = 4.28

Ah, I see what you mean. The last patch causes a much bigger hit in cycles than instructions.

gleb-cloudius · 2024-12-18T13:34:08Z

test/boost/network_topology_strategy_test.cc

+        test_compare_endpoints(topo, address, a1, a2);
+        test_compare_endpoints(topo, address, a2, a1);
+
+        test_compare_endpoints(topo, bogus_address, bogus_address, bogus_address);


While you are here make the test to work on host ids not ips (I have have the patch to do that but it will conflict with your series). And also drop bogus_address checks since topology::get_location(host_id) does not work for non existing node. topology::get_location(inet_address) has a hack that make it work, but the test was always bogus.

Yeah, can do

scylladb-promoter · 2024-12-18T18:20:02Z

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/network_topology_strategy_test
✅ - Docker Test
✅ - Offline-installer Artifact Tests
❌ - dtest
❌ - dtest with tablets
❌ - dtest with topology changes
❌ - Unit Tests

Failed Tests (2065/36725):

Build Details:

Duration: 7 hr 21 min
Builder: i-0c816aa108b9c683b (m5ad.8xlarge)

bhalevy · 2024-12-21T08:02:32Z

rebased
updated license
ready to re-review now that Retire topology::sort_by_proximity and compare_endpoints flavors using gms::inet_address #21976 is in

scylladb-promoter · 2024-12-21T12:20:27Z

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/network_topology_strategy_test
✅ - Docker Test
✅ - dtest with tablets
✅ - dtest with topology changes
✅ - dtest
✅ - Offline-installer Artifact Tests
✅ - Unit Tests

Build Details:

Duration: 4 hr 17 min
Builder: spider2.cloudius-systems.com

test/perf/perf_sort_by_proximity.cc

configure.py

avikivity · 2024-12-22T15:43:15Z

service/storage_proxy.cc

+        auto it = std::ranges::find(ids, my_id);
+        if (it != ids.end() && it != ids.begin()) {
+            std::iter_swap(it, ids.begin());
+        }
    }


It would be better to enhance SimpleSnitch to contain rudimentary location metadata (like: I'm close and everyone else is far) to save this extra test. But not in scope here.

locator/topology.cc

Extract can_sort_by_proximity() out so it can be used later by storage_proxy, and introduce do_sort_by_proximity that sorts unconditionally. Signed-off-by: Benny Halevy <[email protected]>

benchmark sort_by_proximity Baseline results on my desktop for sorting 3 nodes: single run iterations: 0 single run duration: 1.000s number of runs: 5 number of cores: 1 random seed: 20241224 test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6 Signed-off-by: Benny Halevy <[email protected]>

…ot sort by proximity topology::sort_by_proximity already sorts the local node address first, if present, so look it up only when using SimpleSnitch, where sort_by_proximity() is a no-op. Signed-off-by: Benny Halevy <[email protected]>

So we can use it for defining other small_vector deriving their internal capacity from another small_vector type. Signed-off-by: Benny Halevy <[email protected]>

And use a temporary vector to use the precalculated distances. A later patch will add some randomization to shuffle nodes at the same distance from the reference node. This improves the function performance by 50% for 3 replicas, from 77.4 ns to 39.2 ns, larger replica sets show greater improvement (over 4X for 15 nodes): Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6 After: sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6 Signed-off-by: Benny Halevy <[email protected]>

To improve balancing when reading in 1 < CL < ALL This implementation has a moderate impact on the function performance in contrast to full std::shuffle of the vector before stable_sort:ing it (especially with large number of nodes to sort). Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6 After: sort_by_proximity_topology.perf_sort_by_proximity 19689561 50.195ns 0.119ns 50.076ns 51.145ns 0.000 0.000 622.5 150.6 Signed-off-by: Benny Halevy <[email protected]>

bhalevy · 2024-12-24T11:01:11Z

In v6 (d1490bb):

rebased (and resolved conflicts)
test/perf: add perf_sort_by_proximity benchmark
- allocate 3 nodes per rack
- pre-calculate all possible replica sets with 1 replica per rack
- measure sort of each replica set against each node
- Note: there's no significant change in measured performance
locator/topology: sort_by_proximity: calculate distance only once
- use std::ranges::to

scylladb-promoter · 2024-12-24T15:20:25Z

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/network_topology_strategy_test
✅ - dtest with tablets
✅ - Docker Test
✅ - Offline-installer Artifact Tests
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

Duration: 4 hr 12 min
Builder: spider8.cloudius-systems.com

bhalevy added the backport/none Backport is not required label Dec 17, 2024

bhalevy requested review from tchaikov and denesb December 17, 2024 17:33

scylladbbot added status/ci in progress and removed status/ci in progress labels Dec 17, 2024

mykaul reviewed Dec 18, 2024

View reviewed changes

bhalevy force-pushed the randomize-sort-by-proximity branch 2 times, most recently from d1e35b0 to 2489444 Compare December 18, 2024 08:08

scylladbbot added the status/ci in progress label Dec 18, 2024

tchaikov reviewed Dec 18, 2024

View reviewed changes

bhalevy force-pushed the randomize-sort-by-proximity branch from 2489444 to c86978a Compare December 18, 2024 10:38

scylladbbot removed the status/ci in progress label Dec 18, 2024

bhalevy force-pushed the randomize-sort-by-proximity branch from c86978a to bc3acd5 Compare December 18, 2024 10:58

scylladbbot added the status/ci in progress label Dec 18, 2024

avikivity reviewed Dec 18, 2024

View reviewed changes

gleb-cloudius reviewed Dec 18, 2024

View reviewed changes

scylladbbot removed the status/ci in progress label Dec 18, 2024

bhalevy force-pushed the randomize-sort-by-proximity branch from bc3acd5 to 0c0ce62 Compare December 19, 2024 12:51

bhalevy force-pushed the randomize-sort-by-proximity branch from 3ce6840 to 6d58ccd Compare December 21, 2024 08:01

bhalevy marked this pull request as ready for review December 21, 2024 08:02

bhalevy requested a review from avikivity December 21, 2024 08:02

scylladbbot added status/ci in progress and removed status/ci in progress labels Dec 21, 2024