Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort by proximity: shuffle equal-distance replicas #21958

Merged

Conversation

bhalevy
Copy link
Member

@bhalevy bhalevy commented Dec 17, 2024

This series re-implements locator::topology::sort_by_proximity
and adds some randomization to shuffle equal-distance replicas for improving load-balancing
when reading with 1 < consistency level < replication factor.

This change also adds a manual test for benchmarking sort_by_proximity,
as it's not exercised by the single-node perf-simple-query.

The benchmark shows performance improvement of over 20% (from about 71 ns to 56 ns
per call for 3 nodes vectors), mainly due to "calculate distance only once" which
pre-calculates the distance from the reference node for each replica once, rather than
each time to comparator is called by std::sort

  • Improvement. No backport needed

@bhalevy bhalevy added the backport/none Backport is not required label Dec 17, 2024
@bhalevy bhalevy requested review from tchaikov and denesb December 17, 2024 17:33
@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

❌ - Build

Build Details:

  • Duration: 47 min
  • Builder: i-003011695b0397d0b (m5d.12xlarge)

if (topology.can_sort_by_proximity()) {
topology.do_sort_by_proximity(my_id, ids);
} else {
// FIXME: before dynamic snitch is implement put local address (if present) at the beginning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// FIXME: before dynamic snitch is implement put local address (if present) at the beginning
// FIXME: before dynamic snitch is implemented put local address (if present) at the beginning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in v2 (2489444)

@bhalevy bhalevy force-pushed the randomize-sort-by-proximity branch 2 times, most recently from d1e35b0 to 2489444 Compare December 18, 2024 08:08
@bhalevy
Copy link
Member Author

bhalevy commented Dec 18, 2024

In v2 (2489444):

  • rebased
  • fixed grammar in comment as per Sort by proximity: shuffle equal-distance replicas #21958 (comment)
  • fixed build error of test/boost/network_topology_strategy_test by specializing the topology::distance template in locator/topology.cc
  • locator/topology: sort_by_proximity: calculate distance only once
    • cleaned up patch to reduce diff to the minimum

using clock_type = std::chrono::high_resolution_clock;

// Called in a seastar thread
void test_sort_by_proximity(clock_type::duration duration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we could leverage the facilities like PERF_TEST_F() provided by seastar? for instance, see https://github.com/scylladb/scylladb/blob/master/test/perf/perf_big_decimal.cc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In v3 (bc3acd5)

@bhalevy bhalevy force-pushed the randomize-sort-by-proximity branch from 2489444 to c86978a Compare December 18, 2024 10:38
@bhalevy bhalevy force-pushed the randomize-sort-by-proximity branch from c86978a to bc3acd5 Compare December 18, 2024 10:58
@bhalevy
Copy link
Member Author

bhalevy commented Dec 18, 2024

In v3 (bc3acd5):

  • moved manual test to test/perf/perf_sort_by_proximity
  • updated git log messages

@scylladb-promoter
Copy link
Contributor

🔴 CI State: ABORTED

Build Details:

  • Duration: 14 min
  • Builder: i-01e8caf90d8ad6b4a (m5ad.8xlarge)

return compare_endpoints(address, a1, a2) < 0;
});
if (can_sort_by_proximity()) {
do_sort_by_proximity(address, addresses);
}
}

void topology::sort_by_proximity(locator::host_id address, host_id_vector_replica_set& addresses) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gleb-cloudius do we still need duplicates of these functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually it will be remove. You just queued a series that removes some of its users. May be last once. Need to check.

Copy link
Contributor

@gleb-cloudius gleb-cloudius Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like after the series to move streaming/repair to host id will be merged the function can be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still callers that use it with inet_address like: range_streamer::get_all_ranges_with_sources_for, row_level_repair::sort_peer_nodes, storage_service::get_new_source_ranges.

Cc @asias

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bhalevy did you check in the next branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Now I see 5a849b0 (Merge "Move more subsystems to use host ids instead of ips" from Gleb)

Copy link
Member Author

@bhalevy bhalevy Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rebase on top of it once it's promoted

@@ -643,6 +643,7 @@ def find_ninja():
'test/perf/perf_idl',
'test/perf/perf_vint',
'test/perf/perf_big_decimal',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort_by_proximity_topology.perf_sort_by_proximity 994665 996.540ns 0.900ns 995.639ns 998.420ns 0.000 0.000 13230.4 2982.9

13k instructions to sort by proximity?!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check if that's normalized by iteration, as each call does 64 iterations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is normalized by the number of iterations in each run.
See https://github.com/scylladb/seastar/blob/master/tests/perf/perf_tests.cc#L332

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need 1100 instructions to sort three items?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least you get 4.5IPC.

btw std::sort is likely terrible for such low counts. Bubble sort would likely be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or insertion sort)

@@ -580,20 +580,31 @@ void topology::sort_by_proximity(locator::host_id address, host_id_vector_replic

template <typename T>
void topology::do_sort_by_proximity(T address, utils::small_vector<T, 3>& addresses) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.8k cycles is much better than 13k cycles, but still outrageous.

What are we measuring here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort_by_proximity of 15 nodes using a reference node that is one of them

host_infos.emplace_back(id, distance(address, loc, id, loc1));
}
std::ranges::sort(host_infos, [&](const info& i1, const info& i2) {
return i1.distance < i2.distance;
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::ranges::sort(host_infos, std::ranges::less(), std::mem_fn(&host_info::distance));

for (const auto& id : addresses) {
const auto& loc1 = get_location(id);
host_infos.emplace_back(id, distance(address, loc, id, loc1));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::range transformations and std::ranges::to would be faster since it can use the from_range_t constructor which avoids checks after every emplace_back (which the compiler may not be able to eliminate).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distance calculations could be done during effective_replication_map preparation. In fact we could prepare all the shuffled variants and pick one randomly, for small replication factors. But out of scope for now.

Copy link
Member Author

@bhalevy bhalevy Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::range transformations and std::ranges::to would be faster since it can use the from_range_t constructor which avoids checks after every emplace_back (which the compiler may not be able to eliminate).

good idea, will do

@@ -580,6 +581,9 @@ void topology::sort_by_proximity(locator::host_id address, host_id_vector_replic

template <typename T>
void topology::do_sort_by_proximity(T address, utils::small_vector<T, 3>& addresses) const {
static thread_local std::mt19937_64 random_engine(std::random_device{}());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some repeatable seed thing here for tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can consider a simpler random engine like linear congruential (may or may not help).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internet favors mt, so let's keep it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are others (xoshiro's) that are faster with smaller state size, I believe.

Copy link
Member Author

@bhalevy bhalevy Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some repeatable seed thing here for tests?

I'm not sure it will make any difference, but we can have the random_engine live in topology
and expose a private method for the test to seed it.

}
shuffler = std::rotr(shuffler, 1);
}
*it++ = host_infos[i-1].id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is executed n-1 times, in the previous code n times.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the last component must be copied too

@@ -11,6 +11,7 @@
#include <seastar/core/on_internal_error.hh>
#include <seastar/util/lazy.hh>
#include <utility>
#include <bit>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a huge discrepancy between the hit in instruction count and the hit in throughput.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What metrics do you consider having a huge discrepancy?

Before:
test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity      994665   996.540ns     0.900ns   995.639ns   998.420ns       0.000       0.000     13230.4      2982.9

After:
sort_by_proximity_topology.perf_sort_by_proximity     4260690   236.851ns     0.523ns   236.327ns   241.161ns       1.000       0.000      2847.9       710.0

Instructions: 13230.4 / 2847.9 = 4.65
Cycles: 2982.9 / 710.0 = 4.20
Iterations: 4260690. / 994665 = 4.28

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean. The last patch causes a much bigger hit in cycles than instructions.

test_compare_endpoints(topo, address, a1, a2);
test_compare_endpoints(topo, address, a2, a1);

test_compare_endpoints(topo, bogus_address, bogus_address, bogus_address);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are here make the test to work on host ids not ips (I have have the patch to do that but it will conflict with your series). And also drop bogus_address checks since topology::get_location(host_id) does not work for non existing node. topology::get_location(inet_address) has a hack that make it work, but the test was always bogus.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, can do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scylladb-promoter
Copy link
Contributor

@bhalevy bhalevy force-pushed the randomize-sort-by-proximity branch from bc3acd5 to 0c0ce62 Compare December 19, 2024 12:51
@bhalevy bhalevy force-pushed the randomize-sort-by-proximity branch from 3ce6840 to 6d58ccd Compare December 21, 2024 08:01
@bhalevy bhalevy marked this pull request as ready for review December 21, 2024 08:02
@bhalevy
Copy link
Member Author

bhalevy commented Dec 21, 2024

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/network_topology_strategy_test
✅ - Docker Test
✅ - dtest with tablets
✅ - dtest with topology changes
✅ - dtest
✅ - Offline-installer Artifact Tests
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 17 min
  • Builder: spider2.cloudius-systems.com

auto it = std::ranges::find(ids, my_id);
if (it != ids.end() && it != ids.begin()) {
std::iter_swap(it, ids.begin());
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to enhance SimpleSnitch to contain rudimentary location metadata (like: I'm close and everyone else is far) to save this extra test. But not in scope here.

locator/topology.cc Outdated Show resolved Hide resolved
Extract can_sort_by_proximity() out so it can be used
later by storage_proxy, and introduce do_sort_by_proximity
that sorts unconditionally.

Signed-off-by: Benny Halevy <[email protected]>
benchmark sort_by_proximity

Baseline results on my desktop for sorting 3 nodes:

single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          1
random seed:              20241224

test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity    12808773    77.368ns     0.062ns    77.300ns    77.873ns       0.000       0.000      1194.2       231.6

Signed-off-by: Benny Halevy <[email protected]>
…ot sort by proximity

topology::sort_by_proximity already sorts the local node
address first, if present, so look it up only when
using SimpleSnitch, where sort_by_proximity() is a no-op.

Signed-off-by: Benny Halevy <[email protected]>
So we can use it for defining other small_vector
deriving their internal capacity from another small_vector
type.

Signed-off-by: Benny Halevy <[email protected]>
And use a temporary vector to use the precalculated distances.
A later patch will add some randomization to shuffle nodes
at the same distance from the reference node.

This improves the function performance by 50% for 3 replicas,
from 77.4 ns to 39.2 ns, larger replica sets show greater improvement
(over 4X for 15 nodes):

Before:
test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity    12808773    77.368ns     0.062ns    77.300ns    77.873ns       0.000       0.000      1194.2       231.6

After:
sort_by_proximity_topology.perf_sort_by_proximity    25541973    39.225ns     0.114ns    38.966ns    39.339ns       0.000       0.000       588.5       116.6

Signed-off-by: Benny Halevy <[email protected]>
To improve balancing when reading in 1 < CL < ALL

This implementation has a moderate impact on
the function performance in contrast to full
std::shuffle of the vector before stable_sort:ing it
(especially with large number of nodes to sort).

Before:
test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity    25541973    39.225ns     0.114ns    38.966ns    39.339ns       0.000       0.000       588.5       116.6

After:
sort_by_proximity_topology.perf_sort_by_proximity    19689561    50.195ns     0.119ns    50.076ns    51.145ns       0.000       0.000       622.5       150.6

Signed-off-by: Benny Halevy <[email protected]>
@bhalevy bhalevy force-pushed the randomize-sort-by-proximity branch from 0451d0c to d1490bb Compare December 24, 2024 11:00
@bhalevy
Copy link
Member Author

bhalevy commented Dec 24, 2024

In v6 (d1490bb):

  • rebased (and resolved conflicts)
  • test/perf: add perf_sort_by_proximity benchmark
    • allocate 3 nodes per rack
    • pre-calculate all possible replica sets with 1 replica per rack
    • measure sort of each replica set against each node
    • Note: there's no significant change in measured performance
  • locator/topology: sort_by_proximity: calculate distance only once
    • use std::ranges::to

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/network_topology_strategy_test
✅ - dtest with tablets
✅ - Docker Test
✅ - Offline-installer Artifact Tests
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 12 min
  • Builder: spider8.cloudius-systems.com

@scylladb-promoter scylladb-promoter merged commit f9c3ab0 into scylladb:master Dec 25, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/none Backport is not required promoted-to-master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants