-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing new KafkaRoller #103
base: main
Are you sure you want to change the base?
Conversation
8c79a95
to
9c6154b
Compare
Signed-off-by: Gantigmaa Selenge <[email protected]>
9c6154b
to
c74f0b4
Compare
Signed-off-by: Gantigmaa Selenge <[email protected]>
ca68601
to
5abafe6
Compare
56d7a24
to
4baf73a
Compare
Made some improvements on the structure Signed-off-by: Gantigmaa Selenge <[email protected]>
4baf73a
to
33ec40e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?
c060c24
to
433316f
Compare
Tidy up Signed-off-by: Gantigmaa Selenge <[email protected]>
433316f
to
4f91a5a
Compare
Signed-off-by: Gantigmaa Selenge <[email protected]>
97bdef2
to
941fe43
Compare
Signed-off-by: Gantigmaa Selenge <[email protected]>
@fvaleri Thank you for the feedback. I have added an example of rolling update. Please let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice proposal. Thanks for it 👍 .
STs POV:
I think we would need to also design multiple tests to cover all states, which KafkaRoller v2
. We have a few tests but for sure that's not 100% coverage. So, we should maybe have a meeting to talk about this...
Side note about performance:
What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? For sure I can image that we would see significant improvement on RollingUpdates
of multiple nodes when we use batching mechanism...
Co-authored-by: Maros Orsak <[email protected]> Signed-off-by: Gantigmaa Selenge <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tinaselenge thanks for the example, it really helps.
I left some comments, let me know if something is not clear or you want to discuss further.
06x-new-kafka-roller.md
Outdated
- Cruise Control sends `removingReplicas` request to un-assign the partition from broker 2. | ||
- KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed. | ||
- The reassignment request is processed and foo-0 partition now has ISR [1, 4]. | ||
- KafkaRoller restarts broker 1 and foo-0 partition now has ISR [4] which is below the configured minimum in sync replica of 2 resulting in producers with acks-all no longer being able to produce to this partition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO), maybe you can mention this.
The roller should be able to call the CC's user_tasks endpoint, and check if there is any pending task. In that case, the roller has two options: wait for all tasks completion, or continue as today with the potential issue you describe here. You can't really stop the tasks because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.
I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user set or switch to "force" policy, then the roller would behave like today. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have a dedicated proposal IMO, but let's start by logging an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would calling the ListReassigningPartitions API be enough to know this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.
931adbd
to
1060fee
Compare
Add possible transitions Signed-off-by: Gantigmaa Selenge <[email protected]>
1060fee
to
e56d1f8
Compare
- Improve the names for categories and states - Remove restarted/reconfigured states - Add a configuration for delay between restarts - Add a configuration for delay between restart and trigger of preferred leader election - Restart NOT_RUNNING nodes in parallel for quicker recovery - Improve the overall algorithm section, to make it clearer and concise Signed-off-by: Gantigmaa Selenge <[email protected]>
033da99
to
ec38009
Compare
Thanks everyone who reviewed the PR! I believe I now have addressed all the review comments except an update of the diagram (will push that in a follow up commit). @scholzj @ppatierno @fvaleri @tombentley , could you please take another look when you have time? Thank you very much. |
06x-new-kafka-roller.md
Outdated
Contexts are recreated in each reconciliation with the above initial data. | ||
|
||
2. **Transition Node States:** | ||
Update each node's state based on information from abstracted sources. If failed to retrieve information, the reconciliation fails and restarts from step 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that this is about "loading" or "building" the current state of the node. Usually in our FSMs (i.e. rebalancing), this state is coming from a custom resource, here it's coming from different sources. Maybe we can say it better instead of "update each node's ...." .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what would be better, does Load each node's state based on information...
sound better?
We build a context in step 1, which has state UNKNOWN. In this step, we are updating this state based on the information from the sources. So to me, update each node's state
sounds fine.
Signed-off-by: Gantigmaa Selenge <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tinaselenge, I had another look at the example and I think is great. Left few more comments, but I think this would work.
4c035fa
to
bf71ae6
Compare
Signed-off-by: Gantigmaa Selenge <[email protected]>
bf71ae6
to
660f2ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for answering all my questions. Good job.
Thank you @fvaleri , I really appreciate you reviewing the proposal thoroughly. |
06x-new-kafka-roller.md
Outdated
- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)). | ||
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency. | ||
- Hard to reason about when things go wrong. The code is complex to understand and it's not easy to determine why a pod was restarted from logs that tend to be noisy. | ||
- Potential race condition between Cruise Control rebalance and KafkaRoller that could cause partitions under minimum in sync replica. This issue is described in more detail in the `Future Improvements` section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.
06x-new-kafka-roller.md
Outdated
Update each node's state based on information from abstracted sources. If failed to retrieve information, the current reconciliation immediately fails. When the next reconciliation is triggered, it will restart from step 1. | ||
|
||
3. **Handle `NOT_READY` Nodes:** | ||
Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Judging from the fact that the next step covers NOT_READY
, I'm guessing that we just fall through if the node is still NOT_READY
after operationTimeoutMs
. But you need to say that! And also explain, if we're prepared to fall though to the next step, why this timeout is even necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have explained why we do the wait and that it falls through the next step.
06x-new-kafka-roller.md
Outdated
- `NOP`: Nodes needing no operation. | ||
|
||
5. **Wait for Log Recovery:** | ||
Wait for `WAIT_FOR_LOG_RECOVERY` nodes to become `READY` within `operationTimeoutMs`. If timeout is reached and `numRetries` exceeds `maxRetries`, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all these steps I think it would be really valuable to explain the why. Here we're willing to wait for brokers in log recovery because the following steps will result in actions, like restarting other brokers, which will be directly visible to clients. We prefer to start from a cluster that's as close to fully functional as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to explain why are we willing to wait for log recovery here, but not willing to wait for all the brokers replicas to rejoin the ISR?
IIRC the reason we had was the KafkaRoller's job was to restart things, but we didn't want to give it any responsibility for throttling interbroker replication, and we can't guarantee (for all possible workloads) that brokers will always be able to catch up to the LEO (within a reasonable time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tombentley. I have added the reasons.
06x-new-kafka-roller.md
Outdated
Reconfigure nodes in the `RECONFIGURE` group: | ||
- Check if `numReconfigAttempts` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue. | ||
- Send `incrementalAlterConfig` request, transition state to `UNKNOWN`, and increment `numReconfigAttempts`. | ||
- Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, repeat from step 2, otherwise continue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By definition the node's state will already be READY (otherwise it would have been in the RESTART_NOT_RUNNING
group), therefore there is no transition to observe.
It's never been clear to me what would be a good safety check on a dynamic reconfig. Some of the reconfigurable configs could easily result in a borked cluster, so it feels like some kind of check is needed. I think we need to take into account any effects of the reconfiguring of this node on the other nodes in the cluster. I guess step 10 is intended to achieve this, but it's not clear to me how step 10 different from just always restarting from step 2 after each reconfiguration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The roller will transition the node to UNKNOWN state after taking an action so that the state can be observed again, but you are right, that would likely to return READY immediately. As you said, the step 10 will repeat from step 2 if at the point the reconfigured node has gone bad. When repeating from step 2, if the node is not ready but since there is no reason to restart and reconfigure (because it's already been reconfigured) , we would end up waiting for it to become ready until the reconciliation fails. Perhaps, we could fail the reconciliation with an error indicating that a node is not ready after reconfiguration, so we notify the human operator to do the investigation through the log.
Updated the text on the diagram Signed-off-by: Gantigmaa Selenge <[email protected]>
ad75fcf
to
e9a9859
Compare
Signed-off-by: Gantigmaa Selenge <[email protected]>
Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal? |
Converting this PR to Draft, as I'm currently rewriting the proposal to make it more focused on the proposed solutions, rather than implementation details. However, I'm not changing the core of the proposal. I'm hoping that the upcoming update will make it easier to review. |
e29b13a
to
9faf987
Compare
9faf987
to
1d97d60
Compare
Removed the implementation details such as the algorithm. This will be included in the draft PR for the POC instead. Signed-off-by: Gantigmaa Selenge <[email protected]>
1d97d60
to
e43e667
Compare
POC implementation