[NEW] Rethink and Redesign the MEET Protocol to Address Increasing Brittleness and Complexity #1471

PingXie · 2024-12-22T09:34:41Z

This code flow has become too brittle and we are always in catch-up mode. I think we should step back and reconsider the `MEET` flow.

@madolson / @PingXie / @enjoy-binbin

Originally posted by @hpatro in #1436 (review)


madolson · 2024-12-24T03:58:09Z

I'm not sure what you're suggesting. If we move to a single source of truth, then we would have a different meet system and would get consensus through the other library we pick.

The current meet approach just didn't handle partial timeouts, which is part of the price we pay for building a bespoke consensus algorithm.

PingXie · 2024-12-24T04:30:08Z

> If we move to a single source of truth,

Are you talking about cluster v2? If so, I don't think we should tie the improvement of existing cluster to the timeline of the v2 project. I think that we at least need to document the current meet behavior and then see if we can reason about future patches more easily. "Redesign" though does sound like the wrong word. Let me update the title to clarify the issue better.

madolson · 2024-12-24T05:19:23Z

> Are you talking about cluster v2? If so, I don't think we should tie the improvement of existing cluster to the timeline of the v2 project.

Yeah.

> I think that we at least need to document the current meet behavior and then see if we can reason about future patches more easily. "Redesign" though does sound like the wrong word. Let me update the title to clarify the issue better.

I'm always here for better documentation. The more I think about cluster v2, the more I think we should really just take a fresh pass at the current cluster code, clean it up, and try to simplify it. There are many undocumented assumptions and confusing code paths, and the algorithms that were used are brittle in failure modes. Some fuzz/chaos testing might also be a good idea.

enjoy-binbin · 2024-12-24T08:51:01Z

What I want to add to MEET is some kind of auth meet / auth handshake. We often suffer from cluster merging problems, so internally we added meet auth and handshake auth to control CLUSTER MEET and clusterAddNode (in the gossip path). Our current code relies too much on CLUSTER MEET / clusterAddNode. If the administrator enters the wrong IP/port, or there are dead nodes in the cluster, the cluster will fall into chaos when the IP/port is reused.
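
To make the idea concrete, here is a minimal sketch of gating both entry points (an explicit meet and a gossip-discovered node) behind one auth check. Everything here is hypothetical: `ClusterState`, `auth_token`, `handle_meet`, and `handle_gossip` are illustrative names, not the actual Valkey code or the internal patch described above.

```python
import hmac
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    # Hypothetical token configured identically on every node of one cluster.
    auth_token: bytes
    known_nodes: set = field(default_factory=set)

    def _authorized(self, presented: bytes) -> bool:
        # Constant-time comparison so the token is not leaked through timing.
        return hmac.compare_digest(self.auth_token, presented)

    def _add_node(self, addr, presented: bytes) -> bool:
        # Single choke point: a node is only recorded after the auth check.
        if not self._authorized(presented):
            return False
        self.known_nodes.add(addr)
        return True

    def handle_meet(self, addr, presented: bytes) -> bool:
        # Entry point 1: an explicit meet request from an administrator.
        return self._add_node(addr, presented)

    def handle_gossip(self, addr, presented: bytes) -> bool:
        # Entry point 2: a node learned about through gossip.
        return self._add_node(addr, presented)
```

With a check like this, a node from another cluster that now sits on a reused IP/port would be rejected on either path instead of being merged in.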

madolson · 2024-12-28T07:04:42Z

> What I want to add to MEET is some kind of auth meet / auth handshake. We often suffer from cluster merging problems, so internally we added meet auth and handshake auth to control CLUSTER MEET and clusterAddNode (in the gossip path). Our current code relies too much on CLUSTER MEET / clusterAddNode. If the administrator enters the wrong IP/port, or there are dead nodes in the cluster, the cluster will fall into chaos when the IP/port is reused.

At AWS we talked about having a config to isolate a node, so that it would just silently drop all meet and pong packets and wouldn't try to communicate with a cluster. Having a unique cluster secret that is shared between all the nodes in a cluster also makes a lot of sense. I think we could add that with a ping extension?
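
A rough sketch of how those two checks could combine on the receive path. The names (`NodeConfig`, `Packet`, `should_process`) are made up for illustration, the secret is modeled as an optional field rather than a real ping extension layout, and accepting everything when no secret is configured is an assumption chosen for backward compatibility.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    kind: str  # "MEET", "PING" or "PONG"
    # Hypothetical payload that a ping extension would carry.
    cluster_secret: Optional[bytes] = None


@dataclass
class NodeConfig:
    isolated: bool = False  # hypothetical "isolate this node" config
    cluster_secret: Optional[bytes] = None


def should_process(cfg: NodeConfig, pkt: Packet) -> bool:
    # An isolated node silently drops MEET and PONG packets.
    if cfg.isolated and pkt.kind in ("MEET", "PONG"):
        return False
    # Assumption: with no secret configured, behave as before.
    if cfg.cluster_secret is None:
        return True
    # A secret is configured, so the packet must carry a matching one.
    if pkt.cluster_secret is None:
        return False
    return hmac.compare_digest(cfg.cluster_secret, pkt.cluster_secret)
```

For example, `should_process(NodeConfig(isolated=True), Packet("MEET"))` is `False`, while a node with a configured secret only accepts packets that carry the same secret.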

PingXie mentioned this issue Dec 22, 2024

Only (re-)send MEET packet once every handshake timeout period #1441

Open
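
Going only by the title of #1441, the throttling rule could look roughly like the sketch below. The function and parameter names are hypothetical, and treating the boundary as inclusive (`>=`) is an assumption; the actual patch may differ.

```python
from typing import Optional


def should_send_meet(now_ms: int,
                     last_meet_sent_ms: Optional[int],
                     handshake_timeout_ms: int) -> bool:
    # No MEET has been sent to this node yet: send the first one.
    if last_meet_sent_ms is None:
        return True
    # Otherwise (re-)send at most once per handshake timeout period.
    return now_ms - last_meet_sent_ms >= handshake_timeout_ms
```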
