[NEW] Primary replica role at the slot level #1372

Open
zvi-code opened this issue Nov 28, 2024 · 5 comments
Labels
client-changes-needed Client changes are required for this feature

Comments

@zvi-code
Contributor

The problem/use-case that the feature addresses

Currently, replicas provide availability, increased durability (from a failure-domain perspective), and performance improvements when using Read from Replica (RFR). However, performance scaling is limited to stale reads and does not extend to writes or regular reads. Many customers do not use RFR for various reasons. Additionally, when a primary node fails, all write traffic to that node fails, potentially requiring application-level logic to handle the failure.

Description of the feature

We propose redefining role assignments from the node level to the slot level. In this model, a node can be the primary for certain slots and a replica for others. This involves adjusting the codebase so that any primary/replica designations are applied to slots rather than nodes. Essentially, the node becomes a logical container of compute, memory, and services that manages atomic data entities (slots).
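To make this concrete, here is a minimal sketch of what per-slot roles might look like in the cluster state. All names below are hypothetical illustrations, not actual Valkey structures:

```c
/* Hypothetical sketch only, not actual Valkey code: the role moves
 * from the node to the slot, and a node tracks one role per slot. */
#define CLUSTER_SLOTS 16384

typedef struct clusterNode clusterNode; /* as in cluster.h */

typedef enum slotRole {
    SLOT_ROLE_NONE,    /* this node does not serve the slot */
    SLOT_ROLE_PRIMARY, /* this node accepts writes for the slot */
    SLOT_ROLE_REPLICA  /* this node replicates the slot from a primary */
} slotRole;

typedef struct slotState {
    slotRole role;        /* the local node's role for this slot */
    clusterNode *primary; /* current primary of the slot (may be self) */
} slotState;

/* One entry per slot replaces the single node-level role flag. */
slotState slot_states[CLUSTER_SLOTS];
```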

With this approach, we can scale the performance of both writes and reads with the number of nodes in a shard, eliminating the concept of a replica node. If a node fails, only the slots for which it was the primary are directly impacted, improving fault granularity and isolation.

The recent introduction of dict-per-slot has shifted many processes to operate at the slot level, which facilitates the transition to this model. As part of this feature, we will need to continue down this path for other flows in the system, such as bgsave.

This change would require client support, but for clients that lack it we can initially implement the feature in a degenerate form where all slots in a shard have the same primary node, maintaining backward compatibility.

Additional information

An added benefit of this approach is the potential to reduce code complexity by unifying the code paths of replication and slot migration, which are currently two similar processes for maintaining data consistency between nodes.

For Cluster Mode Disabled (CMD), we can consider all data to reside in slot 0. In the long term, we might consider enabling slots (or logical grouping) for CMD, allowing customers to gain the benefits of this model without adopting Cluster Mode Enabled (CME).
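As a trivial illustration of the CMD mapping (a hypothetical sketch, simply treating the whole keyspace as one atomic data entity):

```c
/* Hypothetical sketch: with cluster mode disabled, every key hashes
 * to slot 0, so the per-slot role machinery applies unchanged. */
static inline int keyHashSlotCMD(const char *key, size_t keylen) {
    (void)key;
    (void)keylen;
    return 0;
}
```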

@zvi-code
Contributor Author

Even if we support the idea directionally, there are many things to consider. I opened this issue to hear initial feedback on the direction, to see whether there is community interest, and to find out whether this has been discussed before.

@madolson
Member

madolson commented Dec 2, 2024

I've heard a lot of feedback that folks like the way Hazelcast supports "homogeneous clusters", the idea being that every node owns some set of data and replicates it to others, which sounds very similar to what you're proposing. Overall I really like this; it would dramatically simplify a lot of setups.

We also recently introduced the concept of a "shard", which makes this flexible slot ownership more difficult. We could either let a shard have multiple primaries over different parts of the data, or allow nodes to be part of multiple shards. For example, in a two-node shard, node A could be primary for slots 0-8191 and replica for the rest, with node B the reverse. This would also hopefully keep the replication stream fairly straightforward.

We also effectively need per-slot replication for atomic slot migration.

@madolson madolson added this to Roadmap Dec 2, 2024
@madolson madolson moved this to Researching in Roadmap Dec 2, 2024
@zvi-code
Contributor Author

zvi-code commented Dec 3, 2024

Interesting, I didn't explicitly think of non-homogeneous hardware. I was thinking of a mirror of the same problem: non-homogeneous slots (in CPU/memory usage). It is essentially the same technical problem, but with a different user/business problem (utilizing non-homogeneous hardware vs. handling unbalanced slots).
IMO, when going down this road we should choose an incremental path. Semantically, I suggest moving the primary/replica role to the slot (or call it an atomic data entity).
An example of this consideration is whether we keep the concept of a shard. For an incremental change, consider these two options: 1) no shard, meaning we place copies of a slot based on usage/availability and not on any additional grouping assumption; 2) keep the shard concept and place copies of a slot within a shard with some flexibility.
Option 2, keeping the shard concept with equal copies for each slot in the shard, is the natural incremental step from the current state.
I guess this can also serve the idea of a sub-cluster, in the context of the next-gen cluster design (cluster bus...), because we could associate a sub-cluster with a set of slots and let the sub-cluster reconcile the locations of its owned slots without involving the rest of the cluster. But TBH this idea is not well thought out on my part.

@murphyjacob4
Contributor

murphyjacob4 commented Dec 3, 2024

+1 to this idea. I think this is the right long-term vision for a feature like atomic slot migration. The main challenge will be that the code is tightly coupled to the idea of a node being either just a primary or just a replica, and this assumption shapes how some things are designed (e.g., we assume that serving traffic while full syncing isn't very important since the node is just a replica).

Brainstorming some incremental improvements that we can make to reach this goal:

  1. The "replication link" concept will need to go from a "one-to-one" relationship with a node to a "many-to-one" relationship with a node. We would refactor existing replication logic to remove dependencies on server level global variables and instead make them operate on a per-link basis. I.e. we can have many SYNCs, PSYNCs, active replication streams ongoing at the same time and need to de-globalize the state of those
  2. Async loading needs improvements. If a node is going to be a primary and a replica at the same time, we need to support a true asynchronous load of RDB (for full sync) that doesn't drastically disrupt the primary traffic. Right now, async loading is more of an afterthought where we sometimes process events (at most once every 100ms IIUC and at most every 2 MiB of loaded data) and that won't work for a primary serving production workloads. I think delegating async loading to a background thread could work - so long as we are not serving traffic in the dictionary that is being loaded. I think there will be considerations to be had regarding modules here (modules probably won't support thread safety when getting an RDB load callback from a background thread). Probably it will come with some module compatibility flag to support background thread load, and if not supported module RDB loads will have to be enqueued back to the main thread.
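To sketch what item 1 might look like (hypothetical names, not actual Valkey code; today much of this state lives in server-level globals), the per-link state could be something like:

```c
/* Hypothetical sketch: replication state is scoped to a link so a
 * node can run several inbound streams at once, each covering a
 * different set of slots. */
typedef struct clusterNode clusterNode; /* as in cluster.h */

typedef struct replicationLink {
    clusterNode *source;            /* primary this link replicates from */
    unsigned char slots[16384 / 8]; /* bitmap of slots carried by the link */
    int state;                      /* per-link CONNECT/SYNC/STREAMING state */
    long long offset;               /* replication offset of this link only */
    char replid[41];                /* replication ID scoped to this link */
    struct replicationLink *next;   /* a node keeps a list of links */
} replicationLink;
```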

Regarding application compatibility, I think the best path forward is probably to add a CLIENT CAPA for "replication-at-slot-level" or something like that. If it isn't supported, we can sanitize the CLUSTER SLOTS output as follows (a rough sketch of this logic appears after the list):

  • If a node has primaryship of any number of slots, it will be presented as the primary of those slots.
  • If a node has no primaryship of any slots, it will be displayed as a replica if and only if all the slots it replicates are owned by a single node.
  • If a node has no primaryship of any slots and is replicating slots from more than one other node, we can present it as a replica if it replicates all the slots of at least one of those nodes, i.e. we could present it as a replica of node A if node A owns slots 1-10 and we replicate slots 1-10 and 100-110.
  • If a node has no primaryship of any slots and is not replicating 100% of the slots of at least one primary, it will have to either be presented as an empty primary or be removed from the CLUSTER SLOTS output entirely.
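In pseudocode, the fallback could look roughly like this (all helper names are invented for illustration):

```c
/* Hypothetical sketch of the CLUSTER SLOTS fallback for clients that
 * lack the "replication-at-slot-level" capability. */
void presentNodeToLegacyClient(clusterNode *node) {
    if (nodeHasAnyPrimarySlots(node)) {
        /* Rule 1: present it as the primary of the slots it owns. */
        presentAsPrimaryOfOwnedSlots(node);
    } else if (coversAllSlotsOfSomePrimary(node)) {
        /* Rules 2-3: it replicates every slot owned by at least one
         * primary, so it can be presented as that primary's replica. */
        presentAsReplicaOf(fullyCoveredPrimary(node));
    } else {
        /* Rule 4: a partial replica of several primaries has no legacy
         * representation: show an empty primary or omit the node. */
        presentAsEmptyPrimaryOrOmit(node);
    }
}
```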

We can still have features like atomic slot migration at that point, but clients wouldn't be able to observe the intermediate state of "node B owns slot N and is a replica of slot M from node A"; instead, they would see an immediate transition from "node B owns slot N and node A owns slot M" to "node B owns slots N and M and node A owns nothing".

@asafpamzn

asafpamzn commented Dec 5, 2024

@madolson can you please add the label "Client support required"? It will make it easier for client maintainers to track these changes.
