Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve safe time to reboot handling and more #214

Merged
merged 14 commits into from
Jun 24, 2024

Conversation

slintes
Copy link
Member

@slintes slintes commented Jun 11, 2024

Why we need this PR

There are several issues in the current implementation of handling the safe time to reboot.
Context: the safe time reboot determines how long we wait until we assume that the node rebooted and no workloads are running anymore, before continuing with the remediation process and accelerating workload rescheduling by deleting pods, or using the ungraceful node shutdown feature by applying the OutOfService taint.
That time can be set in the SNRConfig (SafeTimeToAssumeNodeRebootedSeconds). We also calculate a minimum, which depends on other values in the config, as well as cluster size and watchdog timeout.
The minimum time is calculated by all SNR agents on pod start, stored by unhealthy node's agent on the SNR CR during remediation, and then used by both the unhealthy node's agent and the manager for further remediation.

The issues are:

a) the agents crashloop in case the calculated time is lower than the specified time.
b) our own default value can be too low.
c) the calculated time might not be accurate anymore during remediation because of cluster size change.
d) the unhealthy node's agent might not be able to store it's calculated time during remediation, so the manager won't use it it and fallback to the specified value.
e) even when the agent would be able to store the value, there is a race condition with the manager which wants to store the specified value.
f) both unhealthy agent and the manager are running most of the remediation code in parallel, which leads to unneeded conflicts on resource updates, and hard to understand and maintain code execution flows.

Changes made

These changes fix a+c+d+e:

  • don't let the agent calculate the reboot duration on start.
  • instead, just store the watchdog timeout on the agent's node, and let the manager calculate the reboot time during remediation using the watchdog timeout from the unhealthy node, and the current cluster size.

This fixes a+b:

  • make the specified value optional and don't set a default anymore. Don't fail in case it is given and lower than the calculated value, just use the calculated value instead.

This fixes f:

  • make a clear separation in the SN config reconcile (=remediation) code flow between agent and manager code flow. The agent just reboots when needed. Everything else is handled by the manager alone.

More fixes done during development:

  • fix reboot calculation by using MaxTimeForNoPeersResponse correctly
  • deduplicate batch size calculation
  • add unit tests for both reboot duration and batch size calculation

Which issue(s) this PR fixes

Fixes ECOPROJECT-1875
and more
Supersedes #197

Test plan

  • Unit tests are updated as needed
  • TODO: add missing unit test for minimum reboot duration calculation
  • e2e tests should not need any change, since this PR doesn't introduce any feature change

Copy link
Contributor

openshift-ci bot commented Jun 11, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@slintes
Copy link
Member Author

slintes commented Jun 11, 2024

/test 4.15-openshift-e2e

@slintes slintes requested a review from mshitrit June 11, 2024 06:14
@slintes slintes force-pushed the improve-safe-time-handling branch from e7fda78 to 1259ec5 Compare June 11, 2024 08:48
@slintes
Copy link
Member Author

slintes commented Jun 11, 2024

/test 4.15-openshift-e2e

@slintes slintes changed the title WIP Improve safe time to reboot handling and more Improve safe time to reboot handling and more Jun 11, 2024
@slintes slintes force-pushed the improve-safe-time-handling branch 2 times, most recently from 5caab5f to be3ac95 Compare June 11, 2024 17:36
@slintes
Copy link
Member Author

slintes commented Jun 11, 2024

/test 4.15-openshift-e2e

Copy link
Contributor

@clobrano clobrano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, but didn't finish the review. I'll come back tomorrow

pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/utils/annotations.go Outdated Show resolved Hide resolved
@slintes
Copy link
Member Author

slintes commented Jun 17, 2024

/test 4.15-openshift-e2e

1 similar comment
@slintes
Copy link
Member Author

slintes commented Jun 17, 2024

/test 4.15-openshift-e2e

@slintes slintes force-pushed the improve-safe-time-handling branch from bf6ffbe to a7fd2de Compare June 17, 2024 19:00
@slintes
Copy link
Member Author

slintes commented Jun 17, 2024

/test 4.15-openshift-e2e

pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
slintes added 7 commits June 18, 2024 17:08
- wrap long log lines
- improve comment

Signed-off-by: Marc Sluiter <[email protected]>
- return an error if config isn't set yet
- in case of an error, retry instead of using hard coded fallback
- set client and logger in main.go
- renaming to avoid package name duplication

Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
- deduplicate batch size calculations
- better structure and comments for reboot duration calculation
- fix usage of MaxTimeForNoPeersResponse in calculation
- unit tests for both

Signed-off-by: Marc Sluiter <[email protected]>
@slintes slintes force-pushed the improve-safe-time-handling branch from 7ddf56a to a487c23 Compare June 18, 2024 15:18
@slintes
Copy link
Member Author

slintes commented Jun 18, 2024

/test 4.15-openshift-e2e

Decreases risk of introducing issues with the MaxTimeForNoPeersResponse
usage change

Signed-off-by: Marc Sluiter <[email protected]>
@slintes
Copy link
Member Author

slintes commented Jun 19, 2024

/test 4.15-openshift-e2e

@slintes
Copy link
Member Author

slintes commented Jun 19, 2024

/retest

is one of the new tests flaky...?

/test 4.12-test
/test 4.13-test
/test 4.14-test
/test 4.15-test
/test 4.16-test

@slintes
Copy link
Member Author

slintes commented Jun 19, 2024

crap, retest was wrong for the github workflow of course 🙈

pkg/utils/peers_test.go Outdated Show resolved Hide resolved
Signed-off-by: Marc Sluiter <[email protected]>
@clobrano
Copy link
Contributor

/lgtm
Keeping the "hold" because there are other threads open
/hold

Copy link
Contributor

openshift-ci bot commented Jun 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes slintes marked this pull request as ready for review June 24, 2024 09:57
@openshift-ci openshift-ci bot requested review from beekhof and mshitrit June 24, 2024 09:58
@slintes
Copy link
Member Author

slintes commented Jun 24, 2024

all discussions resolved

/hold cancel

@slintes slintes merged commit 4b91e5c into medik8s:main Jun 24, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants