Add progress status for partition rebalances #140

Open. Wants to merge 7 commits into base: main

Member

@kyguy kyguy commented Nov 26, 2024

This proposal introduces a new feature to monitor the progression of an ongoing partition rebalance executed by a Strimzi-managed Cruise Control instance via a KafkaRebalance custom resource. Implementation of this proposal should help to address strimzi/strimzi-kafka-operator#10278

Comment on lines 175 to 178
### Progress Update Cadence

For ease of implementation and minimizing the load on the CruiseControl REST API server, we would only query the CruiseControlState endpoint and update the “progress” section upon `KafkaRebalance` resource reconciliation.
The progress section will never be more out of date longer than the reconciliation period and even if the rebalance runs into an error or “NotReady” state, the “progress” section would still be updated on that KafkaRebalance resource reconciliation along with any error.
Member

How do you avoid a tight reconciliation loop, given that an update to the status will trigger a new reconciliation that will update the status, trigger a new reconciliation, etc.?

That is a very good point. Maybe we need to post a timestamp of last progress check and if it is less than the reconciliation period then skip?

Member

The general rule is to not include things like that in the status. Using a timestamp for that would probably need to be handled when getting the progress data and not when updating the status, as that is shared code and it might be complicated to put it there.

Contributor

@tinaselenge tinaselenge Dec 3, 2024

"The progress section will never be more out of date longer than the reconciliation period": this part might not always be true. For example, if the CC REST API returned an error for some reason and the executor state could not be retrieved, would we wait for the next reconciliation to retry? In that case, the progress section would be out of date.

Member Author

@kyguy kyguy Dec 3, 2024

> That is a very good point. Maybe we need to post a timestamp of last progress check and if it is less than the reconciliation period then skip?

> The general rule is to not include things like that in the status. Using a timestamp for that would probably need to be handled when getting the progress data and not when updating the status, as that is shared code and it might be complicated to put it there.

I should be able to use the existing timestamp in metadata.managedFields[].time field of KafkaRebalance resource to know when the resource was last updated, then only update the progress section if that timestamp is older than the reconciliation period.

Member Author

> I don't think you should rely on metadata.managedFields[].time as those are internal Kubernetes fields with completely different purposes.

If we are not allowed to maintain a timestamp in the progress section specifying when it was last changed or rely on the metadata.managedFields[].time field of the custom resource then we will either have to find another way of tracking when the resource was last updated or try another approach for preventing tight reconciliation loops.

I'll see what I come up with and get back to you.

Member

I'm not sure you cannot have a timestamp in the status. The question is how you work with the timestamp, how you use it, and when/how you update it. But in general, the easiest solution is to store the progress in a config map which you can simply update in every reconciliation, and as you don't watch it you do not need to be worried about what it triggers. Events might be another option for the progress tracking, maybe? I do not like them very much and I think they are pretty useless for tracking the restart events. But if you publish the events to the KafkaRebalance resource, it might be more useful than for Pods.

Member Author

@kyguy kyguy Dec 12, 2024

> I'm not sure you cannot have a timestamp in the status. The question is how you work with the timestamp, how you use it, and when/how you update it.

If we were to maintain a timestamp in the status we would add it with the progress section upon initial creation, then update it on the next reconciliation when its value was older than reconciliation period. With a timestamp of when the progress was last changed, we could easily avoid triggering unwanted reconciliations.

> But in general, the easiest solution is to store the progress in a config map which you can simply update in every reconciliation, and as you don't watch it you do not need to be worried about what it triggers.

This is a really interesting idea. TBH I hadn't thought of storing the progress information in a ConfigMap instead of the KafkaRebalance status. We were planning on maintaining a ConfigMap for executor state information anyway and yes, in this way we could avoid triggering reconciliations upon progress updates. We would still need to add a progress section with a reference to the ConfigMap but this would only need to be added/removed once per state change.

Although maintaining the progress information in the ConfigMap would be the simplest solution, I still feel that the UX of maintaining the progress information in the KafkaRebalance status would still be worth the added implementation complexity. Any thoughts on this @ppatierno @tomncooper ?

> Events might be another option for the progress tracking, maybe? I do not like them very much and I think they are pretty useless for tracking the restart events. But if you publish the events to the KafkaRebalance resource, it might be more useful than for Pods.

I hadn't thought of using events either but I'll think more on this. My only concern for storing the progress in the events would be the UX of getting the progress information, the initial idea was that the progress information would be easily found and read by users in the KafkaRebalance resource.

Member

> If we were to maintain a timestamp in the status we would add it with the progress section upon initial creation, then update it on the next reconciliation when its value was older than reconciliation period. With a timestamp of when the progress was last changed, we could easily avoid triggering unwanted reconciliations.

Yes, you would need some custom logic such as if the timestamp is older than X minutes, update the progress. If not, just reuse the old progress.
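
The "older than X minutes" rule described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `should_refresh_progress` and its parameters are hypothetical and not part of the proposal or of Strimzi, and the threshold is passed in by the caller rather than read from any operator configuration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_refresh_progress(last_update: Optional[datetime],
                            now: datetime,
                            max_age: timedelta) -> bool:
    """Decide whether the stored progress should be re-fetched.

    Hypothetical helper: refresh when there is no previous timestamp,
    or when the previous timestamp is at least `max_age` old.
    Otherwise the old progress is reused.
    """
    if last_update is None:
        return True
    return now - last_update >= max_age


if __name__ == "__main__":
    now = datetime(2024, 12, 12, 12, 0, tzinfo=timezone.utc)
    two_minutes = timedelta(minutes=2)
    # Progress written 30 seconds ago is reused; progress from 5 minutes ago is refreshed.
    print(should_refresh_progress(now - timedelta(seconds=30), now, two_minutes))
    print(should_refresh_progress(now - timedelta(minutes=5), now, two_minutes))
```

Keeping the check at the point where progress data is fetched, rather than in the shared status-update code, matches the concern raised earlier in this thread.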

> I hadn't thought of using events either but I'll think more on this. My only concern for storing the progress in the events would be the UX of getting the progress information, the initial idea was that the progress information would be easily found and read by users in the KafkaRebalance resource.

kubectl describe kr should show you the events I think. Also most UIs would normally show the events when you list the custom resource.

Member Author

Discussed w/ Paolo and Tom, they agreed storing the progress information in the ConfigMap would simplify the implementation and that doing so wouldn't significantly change the UX. Given the executor state information is already going to be stored in the ConfigMap it probably makes the most sense to maintain our progress information there as well. Let me update the proposal to show what it would look like

@tomncooper tomncooper left a comment

Had a first pass. A lot of my comments are optional style/grammar/formatting suggestions, so feel free to ignore them.

My main comments are:

  • @scholzj makes a very good point about avoiding infinite reconciliation after a status update. You will need to solve that.
  • I think we should include a minimum estimated time for optimization proposals. Even if it is a ballpark figure, it is a very useful guide. But let's see what others think.

Nit: Can you add the .md suffix so GH can apply the right syntax highlighting.


In this “progress” section, we include the following fields:

- estimatedTimeToCompletion: The minimum estimated amount time it will take in minutes until partition rebalance is complete.

Judging from the formula used for this, this value is a prediction based on the past average data transfer rate. The rate could increase in the future, so this estimate is not a minimum.


### Supported KafkaRebalance States

For initial implementation we will focus on including the “progress” section only in the following KafkaRebalance states:

Suggested change
For initial implementation we will focus on including the “progress” section only in the following KafkaRebalance states:
For the initial implementation, we will focus on including the “progress” section only in the following KafkaRebalance states:


helps users understand the cost of an ongoing partition rebalance, decide whether or not they should continue or cancel it, and know when future operations will be able to be safely executed.

Further, having this information readily available and easily accessible via `KafkaRebalance` custom resources allows users and third-party tools like the Kubernetes CLI or Strimzi Console to easily track the progression of a partition rebalance.

Suggested change
Further, having this information readily available and easily accessible via `KafkaRebalance` custom resources allows users and third-party tools like the Kubernetes CLI or Strimzi Console to easily track the progression of a partition rebalance.
Further, having this information readily available and easily accessible via `KafkaRebalance` custom resources, allows users and third-party tools like the Kubernetes CLI or Strimzi Console to easily track the progression of a partition rebalance.

- How much time an ongoing partition rebalance has left to take
- How much data an ongoing partition rebalance has left to transfer

helps users understand the cost of an ongoing partition rebalance, decide whether or not they should continue or cancel it, and know when future operations will be able to be safely executed.

Style Nit: I am not sure the bullet points make things clearer? Feels a bit disjointed when you read it. Maybe just make this a single sentence?


#### Adding “progress” section for other KafkaRebalance states

In addition to the “progress” the “Rebalancing” and “Stopped” KafkaRebalance states, we could provide the “progress” section for other states as well such as the “ProposalReady” and “Ready” states.

I have previously suggested putting these states in code quotes (``), but double-quotes is fine too, so long as they are consistent throughout the doc.


In addition to the “progress” the “Rebalancing” and “Stopped” KafkaRebalance states, we could provide the “progress” section for other states as well such as the “ProposalReady” and “Ready” states.
Firstly, this would help emphasize that a rebalance had not started or had completed by having a percentageComplete: 0% on "ProposalReady" and a percentageComplete: 100% on "Ready".
This emphasis could help clear up ambiguity surrounding what the KafkaRebalance “Ready” state or “optimizationResult” field means.

That would be nice.

This feature would be of great value to users.
However, providing an accurate estimation for this is non-trivial, namely the “estimatedTimeToCompletion” field for “ProposalReady" state, is non-trivial.

Leveraging the Cruise Control configurations and user-provided network capacity settings, we could provide a rough estimate for “estimatedTimeToCompletetion” field for inter-broker balances.

Suggested change
Leveraging the Cruise Control configurations and user-provided network capacity settings, we could provide a rough estimate for “estimatedTimeToCompletetion” field for inter-broker balances.
Leveraging the Cruise Control configurations and user-provided network capacity settings, we could provide a rough estimate for “estimatedTimeToCompletetion” field for inter-broker movements.

# The maximum number of partition movements given CC partition movement cap
max_partition_movements= min(<# of brokers> *
num.concurrent.partition.movements.per.broker)
max_partition_movements=min(max_partition_movements, max.num.cluster.partition.movements)

Why are these two variables called the same thing?

Contributor

I was wondering the same thing, like are we overwriting the previous value? I think a different name would be better.

Member Author

@kyguy kyguy Dec 11, 2024

Fixed with the reformatting
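
For readers following this thread, one way the two caps could be expressed with distinct names is sketched below. This is an illustration only, not the reformatted text from the proposal: the function and variable names are hypothetical, and the parameters simply stand in for the `num.concurrent.partition.movements.per.broker` and `max.num.cluster.partition.movements` settings quoted above.

```python
def max_concurrent_partition_movements(num_brokers: int,
                                       per_broker_cap: int,
                                       cluster_cap: int) -> int:
    """Upper bound on partition movements that can run at the same time.

    per_broker_cap stands in for num.concurrent.partition.movements.per.broker,
    cluster_cap stands in for max.num.cluster.partition.movements.
    """
    # Movements allowed if only the per-broker limit applied.
    broker_limited_movements = num_brokers * per_broker_cap
    # The cluster-wide cap can only lower that figure, never raise it.
    return min(broker_limited_movements, cluster_cap)


if __name__ == "__main__":
    # Illustrative numbers only: a small cluster is limited by the per-broker cap,
    # a large one by the cluster-wide cap.
    print(max_concurrent_partition_movements(3, 5, 1250))
    print(max_concurrent_partition_movements(300, 5, 1250))
```

Using two names makes it visible that the second step narrows the first result rather than overwriting an unrelated value.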

```
estimatedTimeToCompletion = intraBrokerDataToMoveMB / throughput
```

Given that its inclusion is not completely necessary and adds significant complexity to the proposal, it is out of scope for this proposal.

We could just ignore intra-broker movements and just base the estimate on inter-broker movements (they will take up the bulk of the time anyway). We could document it as a theoretical minimum and state that it will take longer than this. But it would give a ball park estimate, which is better than the current situation.

Member Author

> We could just ignore intra-broker movements and just base the estimate on inter-broker movements (they will take up the bulk of the time anyway).

Assuming the disk throughput is always faster than the network throughput!

> We could document it as a theoretical minimum and state that it will take longer than this. But it would give a ball park estimate, which is better than the current situation.

Let me think more on this

Member Author

@kyguy kyguy Dec 18, 2024

Still looking into this, investigating possible alternatives and hacking it in the prototype to gauge how complicated it would be to implement. If it isn't too complicated, I'll add it into this proposal and we can aim for supporting all KafkaRebalance states in one go

Contributor

@fvaleri fvaleri left a comment

Thanks @kyguy, this seems to be useful.

I left a few comments for your consideration. Please also fix the formatting.

[1] The “progress” section will be visible during the KafkaRebalance “Rebalancing” and “Stopped” states.
[2] The minimum estimated time it will take the rebalance to complete.
[3] The percentage complete of the ongoing rebalance in the range [0-100]%
[4] The ConfigMap where “non-verbose” JSON payload from Executor State from CruiseControlState endpoint is stored.
Contributor

Why do we need to store the state in a config map? Maybe we could simply document how to recover that from the REST endpoint in case it is needed for troubleshooting.

##### Rebalancing

```
rate = (finishedDataMovement)/(<task_trigger_time> - <current_time>)
```
Contributor

In order to avoid mistakes, we should always specify the unit in the variable's name (e.g. finishedDataMovementMB).
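
As a rough sketch of what unit-suffixed names might look like for the rate calculation in the `Rebalancing` state: every name below is hypothetical, the total amount of data to move is assumed to be available from the executor state, and elapsed time is taken as current time minus trigger time so that the rate comes out positive.

```python
from typing import Optional


def estimated_minutes_to_completion(finished_data_movement_mb: float,
                                    total_data_to_move_mb: float,
                                    task_trigger_time_s: float,
                                    current_time_s: float) -> Optional[float]:
    """Estimate remaining minutes from the average transfer rate so far.

    Returns None when no rate can be computed yet (no elapsed time or
    no data moved), since dividing by zero would be meaningless.
    """
    elapsed_s = current_time_s - task_trigger_time_s
    if elapsed_s <= 0 or finished_data_movement_mb <= 0:
        return None
    rate_mb_per_s = finished_data_movement_mb / elapsed_s
    remaining_mb = max(total_data_to_move_mb - finished_data_movement_mb, 0.0)
    return remaining_mb / rate_mb_per_s / 60.0


if __name__ == "__main__":
    # 600 MB moved in 60 s gives 10 MB/s; 1200 MB remain.
    print(estimated_minutes_to_completion(600.0, 1800.0, 1000.0, 1060.0))
```

Because the estimate relies on the past average rate, it is a prediction rather than a guaranteed lower bound, as noted elsewhere in this review.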

When querying the Executor State of the CruiseControlState endpoint directly, we have the option to add a “verbose” parameter to request additional information surrounding the state.
The additional information could be of interest to third-party UI tools for exposing more details of a rebalance or to users debugging a problematic rebalance at the partition level.
However, to reduce the complexity of this initial enhancement, we have chosen not to use the “verbose” parameter.
One concern is that some of the fields like the “pendingParitionMovements” field can cause the JSON output to grow quite large.
Contributor

Suggested change
One concern is that some of the fields like the “pendingParitionMovements” field can cause the JSON output to grow quite large.
One concern is that some of the fields like the “pendingPartitionMovements” field can cause the JSON output to grow quite large.

Contributor

@katheris katheris left a comment

Generally the proposal looks good to me. I agree with the comments from others and just had one comment about the field name of percentageComplete and a suggestion for an additional field we could include

provisionRecommendation: ""
provisionStatus: RIGHT_SIZED
recentWindows: 1
progress:
Contributor

Suggested change
progress:
progress: [1]

In this “progress” section, we include the following fields:

- estimatedTimeToCompletion: The minimum estimated amount time it will take in minutes until partition rebalance is complete.
- percentageComplete: The percentage of the partition rebalance that is completed e.g. values in the range [0-100]%
Contributor

Based on the calculations listed below, I wonder if we should be explicit that this is the percentage based on data movement, rather than the percentage of partitions done. We could also consider adding a separate field for percentagePartitionMovementComplete, but that depends on whether it would be interesting to people or not.

Suggested change
- percentageComplete: The percentage of the partition rebalance that is completed e.g. values in the range [0-100]%
- percentageDataMovementComplete: The percentage of the partition rebalance that is completed e.g. values in the range [0-100]%
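
To illustrate why the two percentages can diverge, here is a small sketch. The helper name and the input figures are hypothetical, and treating "nothing to move" as 100% is an assumption made for the example, not something the proposal specifies.

```python
def percentage_complete(done: float, total: float) -> int:
    """Return completed work as a whole percentage clamped to [0, 100].

    Assumption for this sketch: when there is nothing to move
    (total <= 0), the work is reported as 100% complete.
    """
    if total <= 0:
        return 100
    return int(min(max(100.0 * done / total, 0.0), 100.0))


if __name__ == "__main__":
    # Illustrative figures: a few large partitions finished first, so most
    # of the data has moved although most partitions are still pending.
    percentage_data_movement_complete = percentage_complete(750.0, 1000.0)
    percentage_partition_movement_complete = percentage_complete(10, 100)
    print(percentage_data_movement_complete, percentage_partition_movement_complete)
```

With skewed partition sizes the two values tell different stories, which is the reason for making the field name explicit about what is being measured.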


- estimatedTimeToCompletion: The minimum estimated amount time it will take in minutes until partition rebalance is complete.
- percentageComplete: The percentage of the partition rebalance that is completed e.g. values in the range [0-100]%
- rebalanceProgressConfigMap: The ConfigMap where “non-verbose” JSON payload from Executor State from CruiseControlState endpoint is stored.
Member

I would not reference the internal class CruiseControlState representing this endpoint, but rather the real user-facing REST endpoint: /kafkacruisecontrol/state?substates=executor
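
For reference, a request URL for that user-facing endpoint could be assembled along these lines. This is a sketch under assumptions: the host name and port in the example are placeholders, the function name is hypothetical, and `json` and `verbose` are passed as plain query parameters alongside `substates=executor`.

```python
from urllib.parse import urlencode, urlunsplit


def executor_state_url(host: str, port: int,
                       verbose: bool = False,
                       scheme: str = "http") -> str:
    """Build the URL for the Cruise Control executor state query."""
    query = {"substates": "executor", "json": "true"}
    if verbose:
        # The proposal chooses NOT to request the verbose payload;
        # the flag is shown only to make the difference visible.
        query["verbose"] = "true"
    return urlunsplit((scheme, f"{host}:{port}",
                       "/kafkacruisecontrol/state", urlencode(query), ""))


if __name__ == "__main__":
    # Placeholder host and port for illustration.
    print(executor_state_url("my-cluster-cruise-control", 9090))
```

Referring to the path and query string in the proposal keeps the text meaningful to users who only ever see the REST API.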

We could provide the “progress” section for other states as well such as the “ProposalReady” and “Ready” states but it is not completely necessary, nor is it trivial.
Further explanation as to why that is and why it should be saved as a future improvement is explained in the Future Improvements section near the bottom of this proposal.

All information required for estimating the values of “estimatedTimeToCompletion” and “percentageComplete” fields can be derived from either Cruise Control server configurations or CruiseControlState endpoint.
Member

Again, let's refer to the user-facing REST endpoint, not the CruiseControlState class.

Further explanation as to why that is and why it should be saved as a future improvement is explained in the Future Improvements section near the bottom of this proposal.

All information required for estimating the values of “estimatedTimeToCompletion” and “percentageComplete” fields can be derived from either Cruise Control server configurations or CruiseControlState endpoint.
That being said, the method of estimation for these fields depends on the state of the KafkaRebalance resource.
Member

Maybe a language issue on my side but what do you mean by the "method of estimation ... depends on the state ..."?

##### Stopped

Once a rebalance has been stopped, it cannot be completed.
Therefore, there is no “estimationTimeToCompletion” for a stopped rebalance, so we set estimatedTimeToCompletion = null to emphasize this.
Member

What does it mean we set estimatedTimeToCompletion = null in terms of custom resource? Do you really want something like estimatedTimeToCompletion: null? Maybe N/A or just removing the field? @tomncooper wdyt?
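
One of the options raised here, simply leaving the field out, might look like the sketch below. It is not the agreed design: the function name is hypothetical, the field names are taken from the proposal text quoted in this review, and the choice to omit rather than write `null` or `N/A` is only one of the alternatives under discussion.

```python
from typing import Any, Dict, Optional


def build_progress_section(state: str,
                           eta_minutes: Optional[int],
                           percentage_complete: int,
                           config_map_name: str) -> Dict[str, Any]:
    """Assemble the progress section, omitting the ETA when it does not apply."""
    progress: Dict[str, Any] = {
        "percentageComplete": percentage_complete,
        "rebalanceProgressConfigMap": config_map_name,
    }
    # A stopped rebalance can never complete, so no ETA field is emitted
    # instead of serialising an explicit null.
    if state == "Rebalancing" and eta_minutes is not None:
        progress["estimatedTimeToCompletion"] = eta_minutes
    return progress


if __name__ == "__main__":
    print(build_progress_section("Rebalancing", 12, 40, "my-rebalance"))
    print(build_progress_section("Stopped", None, 40, "my-rebalance"))
```

Omitting the key avoids having to define what a null value means for clients reading the custom resource.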


#### rebalanceProgressConfigMap

Will only be present in “Rebalancing” and “Stopped” states.
Member

Does it mean that the ConfigMap is deleted when the rebalance is in the other states? Will this field be just removed from the progress as well? should we make it clearer?

Member Author

Added a line to make this clearer

[1] The “progress” section will be visible during the KafkaRebalance “Rebalancing” and “Stopped” states.
[2] The minimum estimated time it will take the rebalance to complete.
[3] The percentage complete of the ongoing rebalance in the range [0-100]%
[4] The ConfigMap where “non-verbose” JSON payload from Executor State from CruiseControlState endpoint is stored.
Member

@fvaleri which kind of troubleshooting are you talking about? The information stored in the ConfigMap comes from the state?substates=executor endpoint, which has data only while something is running; otherwise it just returns NO_TASK_IN_PROGRESS. So in case of issues, you can't get anything interesting from here AFAIK.


The “non-verbose” JSON payload from the ExecutorState is already too verbose to include in the `KafkaRebalance` status in its entirety.
However, having the information available to users is still useful especially when debugging the state of a partition rebalance.
Therefore, we will store the JSON payload in its own ConfigMap, “rebalanceProgressConfigMap”.
Member

We should also define the name of this ConfigMap. It's not configurable and users cannot decide its name. I think it will be something pre-formatted starting from the KafkaRebalance name?


Given that its inclusion is not completely necessary and adds significant complexity to the proposal, it is out of scope for this proposal.

#### Configurable verbosity for Executor State
Member

I am not sure we want to mention this and also have it as a future improvement. Today, we cannot specify verbose when getting the proposal either. It's not exposed to the user.

Member Author

Are we worried that users might ask for it if it is included in the proposal? Would we ever want to provide the verbose optimization proposal to the user in the future?

Member

I don't know the answer to either question. We can anyway create a new proposal at some point for the verbosity configuration if someone comes to us and asks for it. I would just avoid making a commitment for the future right now. @tomncooper wdyt?

Member Author

What if we moved it to the "Rejected Alternatives" section? This way we avoid the commitment and we have the reasons why it was rejected documented there in case users ask for it in the future.

TBH I don't mind stripping the section out completely if we are worried keeping it will result in user fixation or confusion!

Member

I don't think it would be a rejected alternative, but just an addition. I am for removing this section.

@kyguy kyguy force-pushed the kr-exec-progress branch 7 times, most recently from d294906 to ff9df7e Compare December 11, 2024 00:07
@kyguy kyguy force-pushed the kr-exec-progress branch 8 times, most recently from a310c4d to b55824e Compare December 18, 2024 02:10
Signed-off-by: Kyle Liberti <[email protected]>

## Proposal

This proposal extends the status section of the `KafkaRebalance` custom resource to include a `progress` section with a nested `rebalanceProgressConfigMap` field that references a `ConfigMap` that contains information related to an ongoing partition rebalance.
Member

Do we need a new CM? Can't we use the existing one from the proposal?

Member Author

We were considering using the existing ConfigMap, "afterBeforeLoadConfigMap" to store this progress information, but were concerned the additional data would contribute to hitting the 1 MB ConfigMap limit sooner. It is not so much of an issue for the constant "non-verbose" executor state information we plan on providing as part of this proposal. However, if we were to extend the feature to provide the variable "verbose" executor state information in the future, it would increase the chance of hitting the limit for larger production clusters that have a larger number of brokers and partitions.

If we have no requests/plans for providing "verbose" executor state information in the future, I don't see much of a problem with storing the information in the existing ConfigMap. At the least, it would simplify the proposal implementation. Any thoughts @ppatierno @tomncooper?

Member

As mentioned before, I am not sure we'll ever have support for "verbose", so maybe we could think about the present and not the future. From this perspective, using the same ConfigMap seems to be reasonable.
Anyway you can keep the rebalanceProgressConfigMap field pointing to that ConfigMap.
When/if one day we have support for "verbose", that field will just point to a different ConfigMap. It should not be a big issue for users.

Member

You don't have to use the same CM just because I asked about it.

Member

No I just think it's a good compromise. Anyway let's see what @kyguy @tomncooper think?

Member Author

Discussed offline with Paolo and Tom: since the progress information is constant, we can safely add it to the existing ConfigMap maintained for and tied to the KafkaRebalance resource. This keeps KafkaRebalance information organized in one place, simplifies the proposal implementation, and has an insignificant impact on the storage of the ConfigMap. Refactored and added this note to the proposal.

- lastTransitionTime: "2024-11-05T15:28:23.995129903Z"
status: "True"
type: Rebalancing
message: "Failed to retrieve rebalance progress"
Member

This should be probably a separate warning condition and not be part of the rebalancing condition as it is not clear what it means really?

Member Author

Yes, you are right, just updated this to a "Warning condition"

```
rebalanceProgressConfigMap: my-rebalance-progress
```

### Future Improvements
Member

Is this chapter empty? Or do you need to fix the headers from here down?

Member Author

Removed

We could provide this information for other states as well, such as the `ProposalReady` and `Ready` states, but it is not completely necessary, nor is it trivial.
Further discussion on the inclusion of the progress information for these other states can be found in the [Future Improvements](#future-improvements) section near the bottom of this proposal.

All the information required for estimating the values of `estimatedTimeToCompletion` and `percentageDataMovementComplete` fields can be derived from either the Cruise Control server configurations or the [/kafkacruisecontrol/state?substates=executor](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control) REST API endpoint.
Member

"can be derived from either the Cruise Control server configurations or the REST API endpoint." ... this sounds like we have a choice of where to estimate the values from, while the proposal should be clear about which one we are using, which I guess is the REST API endpoint, right?

Member Author

Updated "or" -> "and" now that the ProposalReady state is being implemented as part of the proposal. This statement now makes more sense since the ProposalReady state estimations depend on the information from the Cruise Control server configurations, while the other states depend on the information from the /kafkacruisecontrol/state?substates=executor endpoint.


$$
\text{executorState} = \langle \text{Previous JSON payload from "/kafkacruisecontrol/state?substates=executor" endpoint} \rangle
$$
Member

what's the value of showing the above formulas? I mean we are just saying the executorState field will contain the above JSON returned by the state endpoint. I think we already mentioned it a few times, or?


it is best if we maintain the progress information somewhere else.

#### Including “ExecutorState” in “afterBeforeLoadConfigmap”
Member

if we go with using the same CM we have to remove the section from here.

- `Stopped`

These are the states where this progress information will be able to be most accurately calculated and most useful for users.
We could provide this information for other states as well, such as the `ProposalReady` and `Ready` states, but it is not completely necessary, nor is it trivial.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still argue that having a lower bound for the estimated time of a KafkaRebalance in the ProposalReady state, based on the proposed data to move and the throttle settings, would be a useful thing to have. It would obviously not be accurate as there is no disk throughput available.

But it is still a useful guide and helps users gauge the impact of a proposed rebalance, which data-to-move values alone don't give.

$$

Notes
- [1] `finishedDataMovement` is the number of megabytes already moved by rebalance, provided by [/kafkacruisecontrol/state?substates=executor](#field-executorstate) REST API endpoint.
@tomncooper tomncooper Dec 19, 2024

You don't need the numerical indicators ([1]) in the formulas. You state the full name of the variable anyway so they don't help.

Knowing things like how much time an ongoing partition rebalance has left to take and how much data an ongoing partition rebalance has left to transfer helps users understand the cost of an ongoing partition rebalance.
This information helps users decide whether they should continue or cancel an ongoing rebalance, and know when future operations will be able to be safely executed.

Further, having this information readily available and easily accessible via Kubernetes primitives, allows users and third-party tools like the Kubernetes CLI or Strimzi Console to easily track the progression of a partition rebalance.
Contributor

@fvaleri fvaleri Dec 20, 2024

Can you add a simple example of how the Kubernetes CLI could be used to get the Rebalance progress information?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Strimzi Console"? I think you mean the StreamsHub Console.

[2] The `ConfigMap` containing information related to the ongoing partition rebalance, generated with the name "<kafka_rebalance_resource_name>-progress".

In the `ConfigMap`, we will include the following fields:
- **estimatedTimeToCompletion**: The estimated amount of time, in minutes, until the partition rebalance is complete.
Contributor

@fvaleri fvaleri Dec 20, 2024

Like in optimizationResult, I think that it would be good to add a unit suffix to this key, i.e. estimatedCompletionTimeInMinutes. IMO, we should follow this pattern in general, and avoid using unit suffixes in values. This would also make the formulas self explanatory.


In the `ConfigMap`, we will include the following fields:
- **estimatedTimeToCompletion**: The estimated amount of time, in minutes, until the partition rebalance is complete.
- **percentageDataMovementComplete**: The percentage of the data movement of the partition rebalance that is completed e.g. values in the range [0-100]%
Contributor

Should this be named completedDataMovementPercentage, similar to monitoredPartitionsPercentage in optimizationResult?

"triggeredUserTaskId":"0230d401-6a36-430e-9858-fac8f2edde93"
}
```
[1] The estimated time it will take the rebalance to complete based on the average rate of data transfer.
Contributor

Which time unit? There are other places in which the time unit is not defined.

Comment on lines 79 to 80
estimatedTimeToCompletionInMinutes: 5m [1]
completedDataMovementPercentage: 80% [2]
Member

You should either have the unit in the value and have users parse it or have it in the name and use an integer / double only. I'm fine with both ways, but you should pick one.

Member

@ppatierno ppatierno Jan 7, 2025

My view is ...
For the completedDataMovementPercentage field it doesn't make much sense to have the % symbol in the value, so I would just remove it and have completedDataMovementPercentage: 80.
Regarding estimatedTimeToCompletionInMinutes, it could depend on whether we want the flexibility of showing the value in a different unit, but I don't see any value in that. I mean, we could have estimatedTimeToCompletion: 300000ms or estimatedTimeToCompletion: 5m to say the same thing, but does it really make sense? For this reason I would go with just estimatedTimeToCompletionInMinutes: 5. Rebalancing is a long process, and showing 1 minute or just 0 minutes for a remaining time of less than a minute could make sense (instead of something like estimatedTimeToCompletion: 36s).
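
A minimal sketch of the "unit in the key, bare number in the value" convention discussed in this thread. The helper name `progress_entries` is hypothetical, and rounding a sub-minute remainder up to 1 is an assumption; the thread leaves open whether such a remainder is shown as 1 or 0.

```python
import math


def progress_entries(remaining_seconds: float, completed_percentage: int) -> dict:
    """Build ConfigMap-style entries with units in the keys, not the values.

    Hypothetical helper: rounding up to whole minutes is an assumption,
    not something the proposal specifies.
    """
    minutes = math.ceil(remaining_seconds / 60)
    # ConfigMap data values are strings, so the bare numbers are stringified.
    return {
        "estimatedTimeToCompletionInMinutes": str(minutes),
        "completedDataMovementPercentage": str(completed_percentage),
    }


# 300 seconds remaining and 80 percent moved, with no "m" or "%" suffix.
example = progress_entries(300, 80)
```

With this shape a consumer can parse both values as plain integers without stripping suffixes.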

Comment on lines 112 to 114
- `ProposalReady`
- `Rebalancing`
- `Stopped`
Member

What will be the actual values in these states?

  • Proposal ready -> 0% completion and the estimated time from once it would be approved?
  • Rebalancing -> an up to date information?
  • Stopped -> the last info before it was stopped?
  • Ready -> 100% and 0 minutes remaining?

Maybe you can describe it here in bullet points in a human readable form and leave the formulas below for experts.

Contributor

I think a summary like this would be useful

Member

+1

name: my-rebalance
data:
estimatedTimeToCompletionInMinutes: 5m [1]
Member

How is the estimation counted? Is it reliable? How much is it affected by the issues with unknown real network capacity?

Member Author

@kyguy kyguy Jan 7, 2025

How is the estimation counted?

Depends on the KafkaRebalance state, the specific details per state are in the "Field: estimatedTimeToCompletionInMinutes" section of the proposal.

Is it reliable?

In general, yes. The values for the Stopped and Ready states are hardcoded, and the value for the Rebalancing state is based on the average rate of data transfer and is easily calculated without the need for any user or capacity settings. The only state that could be problematic is ProposalReady, whose estimation relies on accurate network capacity configuration from the user.

How much is it affected by the issues with unknown real network capacity?

If the default or user-configured network capacity differs significantly from the real network capacity, the estimation for the ProposalReady state could be inaccurate. If the real network capacity is underestimated, the rebalance could take much less time than estimatedTimeToCompletionInMinutes to complete. If the real network capacity is overestimated, the rebalance could take much more time than estimatedTimeToCompletionInMinutes to complete. The latter case wouldn't be as much of an issue, as we advertise estimatedTimeToCompletionInMinutes as a theoretical minimum estimation in the ProposalReady state. However, the former case would be an issue since the estimatedTimeToCompletionInMinutes value wouldn't be a theoretical minimum.

To avoid issues like this, the current plan is to document that users must provide accurate network capacity settings to have accurate estimatedTimeToCompletionInMinutes values in the ProposalReady state. We already document that users must provide accurate network capacity settings to have accurate rebalances based on network capacity and distribution anyway.
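
To make the sensitivity concrete, here is a rough sketch of how a misconfigured capacity skews a ProposalReady estimate. The proposal's actual formula is not reproduced in this thread, so the simplest possible form (data to move divided by bandwidth) is assumed; the function name, the MB and MB/s units, and all numbers are invented for illustration.

```python
import math


def proposal_ready_estimate_minutes(data_to_move_mb: float, capacity_mb_per_s: float) -> int:
    """Assumed estimate: data to move divided by bandwidth, in whole minutes.

    Hypothetical helper; not the formula from the proposal.
    """
    if capacity_mb_per_s <= 0:
        raise ValueError("capacity must be positive")
    return math.ceil(data_to_move_mb / capacity_mb_per_s / 60)


# Estimate shown to the user, computed from a configured capacity of 100 MB/s.
shown = proposal_ready_estimate_minutes(60_000, 100)

# Real capacity is 200 MB/s (configured value underestimates it):
# the rebalance finishes sooner, so the shown value is no longer a minimum.
real_when_underestimated = proposal_ready_estimate_minutes(60_000, 200)

# Real capacity is 50 MB/s (configured value overestimates it):
# the rebalance takes longer, so the shown value still holds as a lower bound.
real_when_overestimated = proposal_ready_estimate_minutes(60_000, 50)
```

Under these assumed numbers the shown estimate is 10 minutes, against 5 minutes when capacity is underestimated and 20 minutes when it is overestimated.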

Member

To avoid issues like this, the current plan is to document that users must provide accurate network capacity settings to have accurate estimatedTimeToCompletionInMinutes values in the ProposalReady state. We already document that users must provide accurate network capacity settings to have accurate rebalances based on network capacity and distribution anyway.

I'm not sure this is a real solution. Do you really believe they configure the accurate network capacity? Do we even know how they would find out the accurate network capacity? Or will we solve it on paper but 99% of users will have it misconfigured and these numbers will be useless?

Member Author

@kyguy kyguy Jan 7, 2025

Do you really believe they configure the accurate network capacity?

Users that are serious about their network resource usage and distribution do

Do we even know how would they find out the accurate network capacity?

I imagine they would use K8s CNI plugins or network performance benchmark tools

Or will we solve it on paper but 99% of users will have it misconfigured and these numbers will be useless?
I'm not sure this is a real solution.

For users that have network capacity properly configured, this feature is still useful. I admit that I don't know how many Strimzi users configure their network capacity settings but I would like to believe that those that are doing it are doing it accurately. In addition to the network capacity documentation, what if we were to only include this estimation in the ProposalReady state for users that explicitly configured their network capacity settings? Would that be a more reasonable solution?

Member

In addition to the network capacity documentation, what if we were to only include this estimation in the ProposalReady state for users that explicitly configured their network capacity settings? Would that be a more reasonable solution?

I guess it could be a viable solution. If the user is setting the network capacity I would assume they know what to put there, if not and they put a wrong/bad value, they should know that it's going to screw up the estimation. The documentation should state that. If we think it's not a viable solution then we should remove the estimatedTimeToCompletionInMinutes from the overall proposal. But I am for taking it and documenting it properly.

@tomncooper wdyt about the above discussion?

Contributor

@fvaleri fvaleri Jan 8, 2025

IMO the ProposalReady estimation is useful, because what really matters to users is knowing whether it would take minutes, hours, or days (see Windows file copy). If we could compute the average bandwidth from Kafka metrics, then we could use this value to provide a more accurate estimation independently of the user configuration.


The estimated time it will take in minutes for a rebalance to complete based on the average rate of data transfer.

The formulas used to calculate field value per `KafkaRebalance` state:
Member

So the formulas here ... will the CO calculate these? Or does CC calculate this and we just show the numbers?

Member

It's the CO calculating these.

Member Author

Added a sentence in the section above this sentence to make this more clear

"All the information required for the Cluster Operator to estimate the values of estimatedTimeToCompletionInMinutes and completedDataMovementPercentage fields"

Contributor

@PaulRMellor PaulRMellor left a comment

The proposal clearly outlines the approach for providing progress status and how the information necessary for the calculations will be provided. I’ve left a couple of questions for clarification and some minor suggestions for consistency.

progress: [1]
rebalanceProgressConfigMap: my-rebalance [2]
```
[1] The `progress` section will be visible during the `ProposalReady`, `Rebalancing`, `Stopped` and `Ready` states.
Contributor

Why Stopped but not PausedReconciliation or Not Ready? should we explain here?

Member

PausedReconciliation is not a valid rebalancing state.

Member Author

@kyguy kyguy Jan 7, 2025

Why Stopped but not PausedReconciliation or Not Ready? should we explain here?

The PausedReconciliation and NotReady states are not related to the rebalance operation but more to the proposal generation. Therefore, these states don't have any rebalance progress information associated with them.

Member

The PausedReconciliation and NotReady states are not related to the rebalance operation but more to the proposal generation.

Well, actually even during a rebalancing you can get errors from CC and the KafkaRebalance ends in the NotReady state, right? But PausedReconciliation is not a valid rebalancing state at all.

Member Author

Well, actually even during a rebalancing you can get errors from CC and the KafkaRebalance ends in the NotReady state, right?

Ah, yes you are right, any configuration or rebalance errors will put KafkaRebalance resource in NotReady state. Sorry @PaulRMellor, I was incorrect, NotReady should be supported as well for the same reason the Stopped state is supported, to show how far the rebalance got before it failed. That was a nice spot!

But PausedReconciliation is not a valid rebalancing state at all.

Is that because it is related to the resource and not the rebalance itself? What determines whether it is a valid rebalancing state? I am confused because there is an enum for PausedReconciliation listed in the KafkaRebalanceState class [1]

[1] https://github.com/strimzi/strimzi-kafka-operator/blob/0.45.0/api/src/main/java/io/strimzi/api/kafka/model/rebalance/KafkaRebalanceState.java#L77

Member

Is that because it is related to the resource and not the rebalance itself? What determines whether it is a valid rebalancing state? I am confused because there is an enum for PausedReconciliation listed in the KafkaRebalanceState class [1]

But it's ReconciliationPaused not PausedReconciliation! :-P
Joking apart ... I was trying to defend myself because I totally missed this state in the rebalance FSM :-D

Member Author

I didn't know about it either until Paul mentioned it!

```
[1] The `progress` section will be visible during the `ProposalReady`, `Rebalancing`, `Stopped` and `Ready` states.

[2] The `ConfigMap` containing information related to the ongoing partition rebalance
Contributor

Suggested change
[2] The `ConfigMap` containing information related to the ongoing partition rebalance
[2] The `ConfigMap` containing information related to the ongoing partition rebalance.

In the `ConfigMap`, we will add the following fields:
- **estimatedTimeToCompletionInMinutes**: The estimated amount of time, in minutes, until the partition rebalance is complete.
- **completedDataMovementPercentage**: The percentage of the data movement of the partition rebalance that is completed e.g. values in the range [0-100]%
- **executorState**: The “non-verbose” JSON payload from the `/kafkacruisecontrol/state?substates=executor` endpoint.
Contributor

Maybe we should be more specific about the contents of the executorState field?

Member

I would not provide a detailed list of everything, maybe just a link to the OpenAPI definition of it on the Cruise Control repo?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree you should give a brief summary of what this is and link to the OpenAPI def upstream.

Member Author

Added the link and the text suggested by Paul, which should be a fair compromise.

In the `ConfigMap`, we will add the following fields:
- **estimatedTimeToCompletionInMinutes**: The estimated amount of time, in minutes, until the partition rebalance is complete.
- **completedDataMovementPercentage**: The percentage of the data movement of the partition rebalance that is completed e.g. values in the range [0-100]%
- **executorState**: The “non-verbose” JSON payload from the `/kafkacruisecontrol/state?substates=executor` endpoint.
Contributor

Suggested change
- **executorState**: The “non-verbose” JSON payload from the `/kafkacruisecontrol/state?substates=executor` endpoint.
- **executorState**: The “non-verbose” JSON payload from the `/kafkacruisecontrol/state?substates=executor` endpoint, providing details about the executor's current status, including partition movement progress, concurrency limits, and total data to move.

Member Author

Added

```
[1] The estimated time it will take in minutes for the rebalance to complete based on the average rate of data transfer.

[2] The percentage complete of the ongoing rebalance in the range [0-100]%
Contributor

Suggested change
[2] The percentage complete of the ongoing rebalance in the range [0-100]%
[2] The percentage complete of the ongoing rebalance in the range [0-100]%.


The percentage of the data movement of the partition rebalance that is completed.

The formulas used to calculate field value per `KafkaRebalance` state:
Contributor

Suggested change
The formulas used to calculate field value per `KafkaRebalance` state:
The formulas used to calculate the field value differ for each applicable `KafkaRebalance` state:

progress:
rebalanceProgressConfigMap: my-rebalance-progress
```
[1] Error message from failed Cruise Control REST API call
Contributor

Suggested change
[1] Error message from failed Cruise Control REST API call
[1] Error message from failed Cruise Control REST API call.

conditions:
- lastTransitionTime: "2024-11-05T15:28:23.995129903Z"
status: "True"
type: Warning
Contributor

Should we call out this property to highlight and distinguish between NotReady and (new?) Warning?

Member Author

The type of condition in the status could be any one of the KafkaRebalance states, it could also be Warning. Why would the NotReady type be tied/associated to the Warning type?

### Accessing progress fields using Kubernetes CLI

The progress information will be stored in a `ConfigMap` with the same name as the `KafkaRebalance` resource.
Using the name of the ConfigMap we can view its data from the command line using the Kubernetes CLI.
Contributor

Suggested change
Using the name of the ConfigMap we can view its data from the command line using the Kubernetes CLI.
Using the name of the `ConfigMap`, we can view its data from the command line using the Kubernetes CLI.
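
As an illustration of reading the progress fields from the command line, the sketch below parses the JSON that `kubectl get configmap my-rebalance -o json` would print and picks out the two progress keys named in this proposal. The `read_progress` helper, the `my-rebalance` name, and the sample payload values are invented for this example.

```python
import json

# Keys named in the proposal for the progress information.
PROGRESS_KEYS = (
    "estimatedTimeToCompletionInMinutes",
    "completedDataMovementPercentage",
)


def read_progress(configmap_json: str) -> dict:
    """Extract the progress entries from a ConfigMap rendered as JSON.

    Hypothetical helper; keys missing from the ConfigMap map to None.
    """
    data = json.loads(configmap_json).get("data", {})
    return {key: data.get(key) for key in PROGRESS_KEYS}


# Invented sample standing in for the kubectl output.
sample = json.dumps({
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "my-rebalance"},
    "data": {
        "estimatedTimeToCompletionInMinutes": "5",
        "completedDataMovementPercentage": "80",
    },
})

progress = read_progress(sample)
```

The values come back as strings because ConfigMap data is string-typed, so callers convert them to numbers themselves.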


### Rejected Alternatives

#### Maintaining progress fields in KafkaRebalance resource status
Contributor

Suggested change
#### Maintaining progress fields in KafkaRebalance resource status
#### Maintaining progress fields in `KafkaRebalance` resource status

Member

@ppatierno ppatierno left a comment

@kyguy I had a pass leaving some comments.
I think this proposal is missing the usual "Affected/not affected projects" section.

@@ -0,0 +1,374 @@
# Partition Rebalance Progress Status
Member

Should be more "Cluster rebalance progress status" (or rebalancing) ... not sure if "partition" (even using the singular) sounds really fine because it's about a cluster rebalancing and about moving one or more partitions across the cluster.

Member Author

Updated the title to "Adding progress updates for Cruise Control rebalances"

Knowing things like how much time an ongoing partition rebalance has left to take and how much data an ongoing partition rebalance has left to transfer helps users understand the cost of an ongoing partition rebalance.
This information helps users decide whether they should continue or cancel an ongoing rebalance, and know when future operations will be able to be safely executed.

Further, having this information readily available and easily accessible via Kubernetes primitives, allows users and third-party tools like the Kubernetes CLI or StreamsHub Console to easily track the progression of a partition rebalance.
Member

How many people in the Strimzi community would know about the "StreamsHub Console"? Maybe it deserves a link to the repo or not mentioning it at all?

## Proposal

This proposal extends the status section of the `KafkaRebalance` custom resource to include a `progress` section with a nested `rebalanceProgressConfigMap` field.
This field will reference the `KafkaRebalance`'s existing `ConfigMap`, which will be enhanced to contain information related to an ongoing partition rebalance.
Member

"the KafkaRebalance's existing ConfigMap" ... maybe we should mention this ConfigMap beforehand to explain what it contains currently and from where it is already referenced (see afterBeforeLoadConfigMap field).


The estimated time it will take in minutes for a rebalance to complete based on the average rate of data transfer.

The formulas used to calculate field value per `KafkaRebalance` state:
Member

It's the CO calculating these.

**Estimation for intra-broker rebalance:**

It is challenging to provide an accurate estimate for intra-broker rebalances without an estimate for disk read/write throughput and getting disk throughput is non-trivial for Strimzi.
However, by using the network bandwidth in place of the disk throughput, we can provide a rough estimate of how long the rebalance would take.
Member

"by using the network bandwidth in place of the disk throughput" why do you think that we can do this "replacement" to get a rough estimation? I am not sure about that. I was wondering if we should just avoid this estimation if we don't have a good way for that.

Member Author

Assuming the disk throughput is always greater than the network bandwidth, an estimate using the network bandwidth could serve as an upper bound (theoretical maximum) on how long an intra-broker rebalance would take, i.e. the rebalance won't take longer than this. However, this contradicts the definition provided for the inter-broker rebalance and may cause confusion. We could simply set the value to N/A for now to avoid confusion/inaccuracy.

Would you mind if we left this estimate out for intra-broker rebalances @tomncooper?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that is what we were going to do? Basically we don't include the intra estimate as we can't reliably calculate it. So the time to completion is always a theoretical minimum, it WILL take longer than this.

Member

Yes I would leave this estimation out imho.

Member Author

Updated proposal to explain this and suggest that we set it to "N/A"

$$

Notes
- [1] The number of megabytes already moved by rebalance, provided by [/kafkacruisecontrol/state?substates=executor](#field-executorstate) REST API endpoint.
Member

Can we put the specific field we are using from the returned JSON?

Notes
- [1] The number of megabytes already moved by rebalance, provided by [/kafkacruisecontrol/state?substates=executor](#field-executorstate) REST API endpoint.
- [2] The time when the rebalance task was started, extracted from `triggeredTaskReason` field from the [/kafkacruisecontrol/state?substates=executor](#field-executorstate) for that task.
- [3] The total number of megabytes planned to be moved for rebalance, provided from json payload of the [/kafkacruisecontrol/state?substates=executor](#field-executorstate) REST API endpoint.
Member

Can we put the specific field we are using from the returned JSON?


Notes
- [1] The number of megabytes already moved by rebalance, provided by [/kafkacruisecontrol/state?substates=executor](#field-executorstate) REST API endpoint.
- [2] The total number of megabytes planned to be moved for rebalance, provided by [/kafkacruisecontrol/state?substates=executor](#field-executorstate) REST API endpoint.
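
A minimal sketch of how the two values in the notes above could be combined into `completedDataMovementPercentage`. The function name, the integer rounding, and the handling of a zero total are assumptions made for illustration; the proposal does not specify them, and the inputs are passed as plain numbers rather than read from specific JSON fields.

```python
def completed_data_movement_percentage(moved_mb: float, total_mb: float) -> int:
    """Return the completed share of the planned data movement as an integer in [0, 100]."""
    if total_mb <= 0:
        # Assumption: with nothing planned to move, report the movement as complete.
        return 100
    # Clamp so that a stale or overshooting counter never leaves the [0, 100] range.
    moved = max(0.0, min(moved_mb, total_mb))
    return int(100 * moved / total_mb)


print(completed_data_movement_percentage(512, 2048))
```

Clamping keeps the reported value inside the documented range even if the two counters are read at slightly different moments.
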
Member

Can we put the specific fields we are using from the returned JSON for both the above?


For ease of implementation and minimizing the load on the CruiseControl REST API server, the operator will only query the `/kafkacruisecontrol/state?substates=executor` endpoint and update the `ConfigMap` upon `KafkaRebalance` resource reconciliation.

In the event that Cruise Control runs into an error when rebalancing, the operator will transition the `KafkaRebalance` resource to the `NotReady` state, remove the `progress` section, and delete the progress `ConfigMap`.
Member

"delete the progress ConfigMap" or "delete the progress section of the ConfigMap"?
Remember we are using the same ConfigMap for two purposes (also storing the before/after load data).

Is this consistent with what currently happens if we fail to get a response from CC in a single reconciliation? Does the CC client retry? Not sure setting the KR CR to NotReady for what could be a simple network blip is good UX.

Member Author

> "delete the progress ConfigMap" or "delete the progress section of the ConfigMap"?
> Remember we are using the same ConfigMap for two purposes (also storing the before/after load data).

I am going to update the behavior described here to retain the progress information and ConfigMap when the KafkaRebalance resource moves to the NotReady state. As discussed in a previous thread with Paul, the progress information in the NotReady state may be just as useful for debugging as it is in the Stopped state.

> Is this consistent with what currently happens if we fail to get a response from CC in a single reconciliation? Does the CC client retry? Not sure setting the KR CR to NotReady for what could be a simple network blip is good UX.

This line was intended to describe how the progress information will be updated when the CC server returns a "CompletedWithError" status for a task. From what I understood when writing this, this was the only situation where the KR resource was moved to the NotReady state. But looking closer at the code, it appears I am wrong.

It appears the CO will move the KR resource to the NotReady state when it fails to get a response from the CC server. It also looks like the CO CC client code does not retry when failing to get a response (unless I am missing some retry logic in the code). This means that if the CO fails to get a response from the CC server, it will set the KR CR to NotReady. I thought this only happened when the CC server returned a "CompletedWithError" response, not when an HTTP request failed.

The proposal suggests attempting to retrieve the executor status, but whether the retrieval succeeds or fails has no effect on the state of the KafkaRebalance resource.

Of course, the CC client, wherever it is used, should and will be implemented to retry
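
To illustrate the behaviour described in this reply, here is a rough sketch in which a failed executor-state retrieval is retried and, if it still fails, the previously stored progress is kept rather than treated as an error. Everything here is hypothetical: the function names, the retry count, the delay, and the use of `OSError` to stand in for a failed HTTP request are assumptions and do not reflect the operator's actual client code.

```python
import time
from typing import Callable, Optional


def fetch_executor_state(fetch: Callable[[], dict],
                         attempts: int = 3,
                         delay_seconds: float = 0.5) -> Optional[dict]:
    """Call `fetch` up to `attempts` times; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            # Assumption: network failures surface as OSError (e.g. ConnectionError).
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return None


def refresh_progress(previous: Optional[dict],
                     fetch: Callable[[], dict],
                     attempts: int = 3,
                     delay_seconds: float = 0.5) -> Optional[dict]:
    """Return fresh progress when available, otherwise keep the previous progress."""
    state = fetch_executor_state(fetch, attempts, delay_seconds)
    # A failed retrieval leaves the stored progress (and the resource state) untouched.
    return previous if state is None else state
```

With this shape, a single network blip during one reconciliation only delays the progress update until the next reconciliation.
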

tomncooper left a comment

As discussed, I think you should not estimate the time to complete intra-broker movements in the proposal and should just exclude them. The estimate will always be a theoretical minimum, but it is useful to have a ballpark.

@@ -0,0 +1,374 @@
# Partition Rebalance Progress Status

Suggested change
# Partition Rebalance Progress Status
# Adding progress updates for Cruise Control rebalances


At this time, Strimzi users are able to execute partition rebalances via `KafkaRebalance` custom resources but can only monitor the progression of those partition rebalances in two ways:

- Manually querying the Cruise Control REST API endpoint directly.

Don't we lock these down with a Network Policy? Would the user have to alter the default set up to get access?

Member Author

Yes there are a couple of things that would need special configuration to enable a user to access the CC REST API directly

In the `ConfigMap`, we will add the following fields:
- **estimatedTimeToCompletionInMinutes**: The estimated amount of time, in minutes, until the partition rebalance is complete.
- **completedDataMovementPercentage**: The percentage of the data movement of the partition rebalance that is completed e.g. values in the range [0-100]%
- **executorState**: The “non-verbose” JSON payload from the `/kafkacruisecontrol/state?substates=executor` endpoint.
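
As a rough illustration of the three fields listed above, the sketch below assembles them into the string-valued `data` map that a Kubernetes `ConfigMap` requires. The helper name, the flat key layout, and the sample executor payload are assumptions for illustration only; the `N/A` value follows the proposal's suggestion for cases where no estimate can be made.

```python
import json
from typing import Optional


def build_progress_data(executor_state: dict,
                        completed_percentage: int,
                        estimated_minutes: Optional[int] = None) -> dict:
    """Build string-valued ConfigMap data entries for rebalance progress."""
    return {
        # "N/A" is used when no estimate is available (e.g. intra-broker rebalances).
        "estimatedTimeToCompletionInMinutes":
            "N/A" if estimated_minutes is None else str(estimated_minutes),
        "completedDataMovementPercentage": str(completed_percentage),
        # The executor payload is stored as a JSON string, since ConfigMap values are strings.
        "executorState": json.dumps(executor_state),
    }


# Illustrative payload only; the real response contains more fields.
sample_state = {"state": "INTER_BROKER_REPLICA_MOVEMENT_TASK_IN_PROGRESS"}
print(build_progress_data(sample_state, 25, 12))
```
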

Yeah, I agree you should give a brief summary of what this is and link to the OpenAPI def upstream.


For ease of implementation and minimizing the load on the CruiseControl REST API server, the operator will only query the `/kafkacruisecontrol/state?substates=executor` endpoint and update the `ConfigMap` upon `KafkaRebalance` resource reconciliation.

In the event that Cruise Control runs into an error when rebalancing, the operator will transition the `KafkaRebalance` resource to the `NotReady` state, remove the `progress` section, and delete the progress `ConfigMap`.

Is this consistent with what currently happens if we fail to get a response from CC in a single reconciliation? Does the CC client retry? Not sure setting the KR CR to NotReady for what could be a simple network blip is good UX.

progress: [1]
rebalanceProgressConfigMap: my-rebalance [2]
```
[1] The `progress` section will be visible during the `Ready`, `Rebalancing`, `Stopped` and `Ready` states.
Contributor

Suggested change
[1] The `progress` section will be visible during the `Ready`, `Rebalancing`, `Stopped` and `Ready` states.
[1] The `progress` section will be visible during the `ProposalReady`, `Rebalancing`, `Stopped` and `Ready` states.

This estimate will be a theoretical minimum derived from Cruise Control capacity and throttle configurations.
This means that the cluster rebalance would take at least the estimated amount of time to complete.

$$\text{maxPartitionMovements}_{[1]} = \min(\text{numberOfBrokers} \times \text{num.concurrent.partition.movements.per.broker}),\text{max.num.cluster.partition.movements})$$
Contributor

I can't follow this calculation, is it calculating the maximum number of non-concurrent partition movements? I can't work out why the number of brokers is needed. Isn't the worst case scenario that all movements have to happen from a single broker, so should it be max.num.cluster.partition.movements/num.concurrent.partition.movements.per.broker? Also it looks like you are missing a bracket somewhere, possibly at the beginning of numberOfBrokers x num.concurrent.partition.movements.per.broker?

Member Author

> is it calculating the maximum number of non-concurrent partition movements?

The maximum number of concurrent partition movements. Just updated the variable name to make this more clear.

> I can't work out why the number of brokers is needed. Isn't the worst case scenario that all movements have to happen from a single broker, so should it be

This calculation is meant to be a theoretical minimum, the best case scenario: the least amount of time a rebalance would take to complete given ideal conditions (if the maximum allowed number of concurrent partition movements per broker were moved concurrently and the available bandwidth was perfectly utilized). In reality, the rebalance will take longer than the theoretical minimum, but it is still useful to know that the rebalance will take at least this estimated amount of time.

In the best case scenario, we are moving as many partitions concurrently as the brokers will allow. To calculate how many partitions can be moved concurrently cluster-wide, we need the number of brokers.

Does that make sense? Would it help if I added annotations/descriptions for the CC configurations that are used in the formulas?

> Also it looks like you are missing a bracket somewhere, possibly at the beginning of numberOfBrokers x num.concurrent.partition.movements.per.broker?

Yes, that is a typo! Thanks for spotting!
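
The best-case reasoning in this reply can be sketched as follows, with the brackets balanced as the reviewer requested. The function and parameter names are hypothetical stand-ins for `numberOfBrokers`, `num.concurrent.partition.movements.per.broker`, and `max.num.cluster.partition.movements`, and the values passed in are purely illustrative.

```python
def max_concurrent_partition_movements(number_of_brokers: int,
                                       per_broker_limit: int,
                                       cluster_limit: int) -> int:
    """Best-case number of partitions that can move at once across the whole cluster."""
    # Every broker moves its maximum allowed partitions, capped by the cluster-wide limit.
    return min(number_of_brokers * per_broker_limit, cluster_limit)


# Small cluster: the per-broker limit is the binding constraint.
print(max_concurrent_partition_movements(3, 5, 1250))
# Large cluster: the cluster-wide cap is the binding constraint.
print(max_concurrent_partition_movements(400, 5, 1250))
```

The number of brokers matters because the per-broker limit only becomes a cluster-wide figure once it is multiplied across all brokers.
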

It is challenging to provide an accurate estimate for intra-broker rebalances without an estimate for disk read/write throughput and getting disk throughput is non-trivial for Strimzi.
Since we cannot accurately estimate `estimatedTimeToCompletionInMinutes` without knowing the disk throughput, we set `estimatedTimeToCompletionInMinutes` to `N/A`.

$$\text{maxPartitionMovements}_{[1]} = \min\left(\text{numberOfBrokers} \times \text{num.concurrent.intra.broker.partition.movements.per.broker}),\text{max.num.cluster.movements}\right)$$
Contributor

Similar here to previous comments, you're missing a bracket in the formula, but I'm also not clear what maxPartitionMovements is really representing

Signed-off-by: Kyle Liberti <[email protected]>