[NEW] Rethink and Redesign the MEET Protocol to Address Increasing Brittleness and Complexity #1471

Open
PingXie opened this issue Dec 22, 2024 · 5 comments

@PingXie
Member

PingXie commented Dec 22, 2024

> This code flow has become too brittle and we are always in catch-up mode. I think we should step back and reconsider the `MEET` flow.

@madolson / @PingXie / @enjoy-binbin

Originally posted by @hpatro in #1436 (review)

@madolson
Member

madolson commented Dec 24, 2024

I'm not sure what you're suggesting. If we move to a single source of truth, then we would have a different meet system and would get consensus through the other library we pick.

The current meet approach just didn't handle partial timeouts, which is part of the price we pay for building a bespoke consensus algorithm.

@PingXie
Member Author

PingXie commented Dec 24, 2024

> If we move to a single source of truth,

Are you talking about cluster v2? If so, I don't think we should tie improvements to the existing cluster to the timeline of the v2 project. I think that we at least need to document the current meet behavior and then see if we can reason about future patches more easily. That said, "Redesign" does sound like the wrong word. Let me update the title to clarify the issue.

@madolson
Member

> Are you talking about cluster v2? If so, I don't think we should tie improvements to the existing cluster to the timeline of the v2 project.

Yeah.

> I think that we at least need to document the current meet behavior and then see if we can reason about future patches more easily. That said, "Redesign" does sound like the wrong word. Let me update the title to clarify the issue.

I'm always here for better documentation. The more I think about cluster v2, the more I think we should really just take a fresh pass at the current cluster code, clean it up, and try to simplify it. There are many undocumented assumptions and confusing code paths, and the algorithms that were used are brittle in failure modes. Some fuzz/chaos testing might also be a good idea.

@enjoy-binbin
Member

What I want to add to MEET is some form of authentication: an auth meet / auth handshake. We often suffer from cluster merging problems, so internally we added meet auth and handshake auth to control both CLUSTER MEET and clusterAddNode (the gossip path). Our current code relies too much on CLUSTER MEET / clusterAddNode. If the administrator enters the wrong IP/port, or there are some dead nodes in the cluster, the cluster will be in chaos when the IP/port is reused.
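
To make the idea of an authenticated handshake concrete, here is a minimal Python sketch, not the internal implementation mentioned above. It assumes a pre-shared secret and an HMAC-SHA256 tag over the sender's node ID and a nonce; `CLUSTER_SECRET`, `sign_handshake`, and `accept_node` are hypothetical names, and nothing like them exists in the current cluster code.

```python
import hashlib
import hmac

# Hypothetical pre-shared secret; how it would be configured is an assumption.
CLUSTER_SECRET = b"example-cluster-secret"


def sign_handshake(secret: bytes, node_id: str, nonce: bytes) -> bytes:
    """Return an HMAC-SHA256 tag binding the sender's node ID to a nonce."""
    return hmac.new(secret, node_id.encode() + b"|" + nonce, hashlib.sha256).digest()


def accept_node(secret: bytes, node_id: str, nonce: bytes, tag: bytes,
                known_nodes: set) -> bool:
    """Add the node to known_nodes only if its tag verifies against our secret."""
    expected = sign_handshake(secret, node_id, nonce)
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, tag):
        return False  # wrong or missing secret: ignore the node instead of adding it
    known_nodes.add(node_id)
    return True
```

With a check like this, a MEET sent to a mistyped or reused IP/port would be rejected by a node holding a different secret, so the two clusters would not merge. The sketch does not track nonces, so a real design would also need replay protection.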

@madolson
Member

> What I want to add to MEET is some form of authentication: an auth meet / auth handshake. We often suffer from cluster merging problems, so internally we added meet auth and handshake auth to control both CLUSTER MEET and clusterAddNode (the gossip path). Our current code relies too much on CLUSTER MEET / clusterAddNode. If the administrator enters the wrong IP/port, or there are some dead nodes in the cluster, the cluster will be in chaos when the IP/port is reused.

At AWS we talked about having a config to isolate a node, so that it would just silently drop all meet and pong packets and wouldn't try to communicate with a cluster. Having a unique cluster secret that is shared between all the nodes in a cluster also makes a lot of sense. I think we could add that with a ping extension?
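
The isolation idea can be sketched roughly as follows. This is a hypothetical Python illustration, assuming a boolean `isolated` setting and a simplified set of packet types; the names `Node`, `PacketType`, and `should_process` are invented for the example and do not correspond to any existing config or function.

```python
from dataclasses import dataclass
from enum import Enum


class PacketType(Enum):
    PING = "ping"
    PONG = "pong"
    MEET = "meet"


@dataclass
class Node:
    # Hypothetical config flag; no such option is confirmed in the thread.
    isolated: bool = False

    def should_process(self, ptype: PacketType) -> bool:
        """Return False for packets an isolated node should silently drop."""
        if self.isolated and ptype in (PacketType.MEET, PacketType.PONG):
            return False  # drop without replying, so the sender learns nothing
        return True
```

The sketch drops only meet and pong packets, as described above. Whether an isolated node should also ignore pings is left open here.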
