Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing new KafkaRoller #103

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

tinaselenge
Copy link
Contributor

@tinaselenge tinaselenge commented Jan 2, 2024

Signed-off-by: Gantigmaa Selenge <[email protected]>
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
Made some improvements on the structure

Signed-off-by: Gantigmaa Selenge <[email protected]>
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from c060c24 to 433316f Compare March 15, 2024 12:11
Tidy up

Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge tinaselenge marked this pull request as ready for review March 15, 2024 12:29
Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge
Copy link
Contributor Author

@fvaleri Thank you for the feedback. I have added an example of rolling update. Please let me know what you think.

Copy link
Member

@see-quick see-quick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice proposal. Thanks for it 👍 .


STs POV:

I think we would need to also design multiple tests to cover all states, which KafkaRoller v2. We have a few tests but for sure that's not 100% coverage. So, we should maybe have a meeting to talk about this...

Side note about performance:

What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? For sure I can image that we would see significant improvement on RollingUpdates of multiple nodes when we use batching mechanism...

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
Co-authored-by: Maros Orsak <[email protected]>
Signed-off-by: Gantigmaa Selenge <[email protected]>
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tinaselenge thanks for the example, it really helps.

I left some comments, let me know if something is not clear or you want to discuss further.

06x-new-kafka-roller.md Show resolved Hide resolved
- Cruise Control sends `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and foo-0 partition now has ISR [1, 4].
- KafkaRoller restarts broker 1 and foo-0 partition now has ISR [4] which is below the configured minimum in sync replica of 2 resulting in producers with acks-all no longer being able to produce to this partition.
Copy link
Contributor

@fvaleri fvaleri Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO), maybe you can mention this.

The roller should be able to call the CC's user_tasks endpoint, and check if there is any pending task. In that case, the roller has two options: wait for all tasks completion, or continue as today with the potential issue you describe here. You can't really stop the tasks because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.

I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user set or switch to "force" policy, then the roller would behave like today. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a dedicated proposal IMO, but let's start by logging an issue.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would calling the ListReassigningPartitions API be enough to know this?

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
Copy link
Contributor

@katheris katheris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 931adbd to 1060fee Compare April 30, 2024 13:58
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
Add possible transitions

Signed-off-by: Gantigmaa Selenge <[email protected]>
- Improve the names for categories and states
- Remove restarted/reconfigured states
- Add a configuration for delay between restarts
- Add a configuration for delay between restart and trigger of preferred leader election
- Restart NOT_RUNNING nodes in parallel for quicker recovery
- Improve the overall algorithm section, to make it clearer and concise

Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge
Copy link
Contributor Author

Thanks everyone who reviewed the PR! I believe I now have addressed all the review comments except an update of the diagram (will push that in a follow up commit). @scholzj @ppatierno @fvaleri @tombentley , could you please take another look when you have time? Thank you very much.

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Show resolved Hide resolved
Contexts are recreated in each reconciliation with the above initial data.

2. **Transition Node States:**
Update each node's state based on information from abstracted sources. If failed to retrieve information, the reconciliation fails and restarts from step 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that this is about "loading" or "building" the current state of the node. Usually in our FSMs (i.e. rebalancing), this state is coming from a custom resource, here it's coming from different sources. Maybe we can say it better instead of "update each node's ...." .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what would be better, does Load each node's state based on information... sound better?

We build a context in step 1, which has state UNKNOWN. In this step, we are updating this state based on the information from the sources. So to me, update each node's state sounds fine.

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tinaselenge, I had another look at the example and I think is great. Left few more comments, but I think this would work.

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
06x-new-kafka-roller.md Outdated Show resolved Hide resolved
@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 4c035fa to bf71ae6 Compare July 19, 2024 08:30
Signed-off-by: Gantigmaa Selenge <[email protected]>
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering all my questions. Good job.

@tinaselenge
Copy link
Contributor Author

Thanks for answering all my questions. Good job.

Thank you @fvaleri , I really appreciate you reviewing the proposal thoroughly.

- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- Hard to reason about when things go wrong. The code is complex to understand and it's not easy to determine why a pod was restarted from logs that tend to be noisy.
- Potential race condition between Cruise Control rebalance and KafkaRoller that could cause partitions under minimum in sync replica. This issue is described in more detail in the `Future Improvements` section.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
Update each node's state based on information from abstracted sources. If failed to retrieve information, the current reconciliation immediately fails. When the next reconciliation is triggered, it will restart from step 1.

3. **Handle `NOT_READY` Nodes:**
Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judging from the fact that the next step covers NOT_READY, I'm guessing that we just fall through if the node is still NOT_READY after operationTimeoutMs. But you need to say that! And also explain, if we're prepared to fall though to the next step, why this timeout is even necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have explained why we do the wait and that it falls through the next step.

06x-new-kafka-roller.md Outdated Show resolved Hide resolved
- `NOP`: Nodes needing no operation.

5. **Wait for Log Recovery:**
Wait for `WAIT_FOR_LOG_RECOVERY` nodes to become `READY` within `operationTimeoutMs`. If timeout is reached and `numRetries` exceeds `maxRetries`, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all these steps I think it would be really valuable to explain the why. Here we're willing to wait for brokers in log recovery because the following steps will result in actions, like restarting other brokers, which will be directly visible to clients. We prefer to start from a cluster that's as close to fully functional as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to explain why are we willing to wait for log recovery here, but not willing to wait for all the brokers replicas to rejoin the ISR?

IIRC the reason we had was the KafkaRoller's job was to restart things, but we didn't want to give it any responsibility for throttling interbroker replication, and we can't guarantee (for all possible workloads) that brokers will always be able to catch up to the LEO (within a reasonable time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tombentley. I have added the reasons.

Reconfigure nodes in the `RECONFIGURE` group:
- Check if `numReconfigAttempts` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
- Send `incrementalAlterConfig` request, transition state to `UNKNOWN`, and increment `numReconfigAttempts`.
- Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, repeat from step 2, otherwise continue.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By definition the node's state will already be READY (otherwise it would have been in the RESTART_NOT_RUNNING group), therefore there is no transition to observe.

It's never been clear to me what would be a good safety check on a dynamic reconfig. Some of the reconfigurable configs could easily result in a borked cluster, so it feels like some kind of check is needed. I think we need to take into account any effects of the reconfiguring of this node on the other nodes in the cluster. I guess step 10 is intended to achieve this, but it's not clear to me how step 10 different from just always restarting from step 2 after each reconfiguration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The roller will transition the node to UNKNOWN state after taking an action so that the state can be observed again, but you are right, that would likely to return READY immediately. As you said, the step 10 will repeat from step 2 if at the point the reconfigured node has gone bad. When repeating from step 2, if the node is not ready but since there is no reason to restart and reconfigure (because it's already been reconfigured) , we would end up waiting for it to become ready until the reconciliation fails. Perhaps, we could fail the reconciliation with an error indicating that a node is not ready after reconfiguration, so we notify the human operator to do the investigation through the log.

Updated the text on the diagram

Signed-off-by: Gantigmaa Selenge <[email protected]>
Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge
Copy link
Contributor Author

Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal?

@tinaselenge tinaselenge marked this pull request as draft December 19, 2024 13:18
@tinaselenge
Copy link
Contributor Author

Converting this PR to Draft, as I'm currently rewriting the proposal to make it more focused on the proposed solutions, rather than implementation details. However, I'm not changing the core of the proposal. I'm hoping that the upcoming update will make it easier to review.

Removed the implementation details such as the algorithm. This will be included in the draft PR for the POC instead.

Signed-off-by: Gantigmaa Selenge <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants