Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Dynamiclly set safe time to assume node rebooted seconds #197

Conversation

mshitrit
Copy link
Member

@mshitrit mshitrit commented Apr 18, 2024

Why we need this PR

Safe Time to reboot is a configured value, however it can't be lower than a minimum calculated value.
The calculated value may differ between clusters so we need a mechanism to override the configured Safe Time to reboot in case it's lower than the calculated value.
This conflict can cause to agents crash-loop when the operator is installed since the agents will not not run with an invalid (lower than calculated value) Safe Time to reboot value.

Changes made

  • min time to reboot is calculated when snr agent is initialized and set in an annotation on the configuration.
  • in case the calculated value is lower than the SafeTime in the configuration spec, the value in the configuration is overridden (this can happen when another field that affect the calculation is changed)
  • when SafeTime is changed in the configuration a webhook is used to verify this value against the calculated min value that was set in the config annotation.

Which issue(s) this PR fixes

ECOPROJECT-1875

Test plan

- overrriding SafeTimeToReboot value in configuration in case it's invalid

Signed-off-by: Michael Shitrit <[email protected]>
Copy link
Contributor

openshift-ci bot commented Apr 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Apr 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mshitrit mshitrit force-pushed the dynamiclly-set-SafeTimeToAssumeNodeRebootedSeconds branch from 414e9ef to 7efc1e4 Compare April 18, 2024 16:37
@mshitrit
Copy link
Member Author

/test 4.15-openshift-e2e

Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit mshitrit force-pushed the dynamiclly-set-SafeTimeToAssumeNodeRebootedSeconds branch from 7efc1e4 to 46c4744 Compare April 25, 2024 13:26
@mshitrit
Copy link
Member Author

/test 4.15-openshift-e2e

@clobrano
Copy link
Contributor

/lgtm
Giving a chance to get more reviews
/hold

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I don't think it's a good idea to set the spec in a controller, even less when it's overwriting a configuration value which can be set by users
  • Even more, the value we want to set is dynamic, it depends on node specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting for the "right" value

@mshitrit
Copy link
Member Author

  • I don't think it's a good idea to set the spec in a controller, even less when it's overwriting a configuration value which can be set by users

I agree it's not ideal, however we will override the value only in 2 use cases:

  • When SNR starts and the default value will cause SNR agent crash, at that stage user didn't have a chance to set any value yet and any value that will be set by the user will be verified by the webhook.
  • When user changes another configuration field whose change will cause SafeTime to become invalid because minSafe time change (which will also lead to SNR agent crashing)
  • Even more, the value we want to set is dynamic, it depends on node specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting for the "right" value

In case several agents has different values which makes SafeTime invalid, the highest (safest) one will be applied, the annotation used by the webhook may still have a too low value though (we can discuss whether/if to address this later in case it's relevant).

IMO among the alternatives this is the best solution but I think it's worth to share the alternatives in case you disagree or had something else in mind:

  • Removing SafeTime from the Spec (major API change)
  • Removing default value, so initial value will always be set dynamically
  • Doing nothing (keep the crush loop bug)
  • Setting a fixed higher default value for SafeTime (will reduce the occurrence of this bug without really fixing it while increasing overall remediation time for small clusters)

@slintes
Copy link
Member

slintes commented Apr 30, 2024

we will override the value only in 2 use cases

and when users set a too low value, not?

Removing SafeTime from the Spec (major API change)
Removing default value, so initial value will always be set dynamically
Doing nothing (keep the crush loop bug)
Setting a fixed higher default value for SafeTime (will reduce the occurrence of this bug without really fixing it while increasing overall remediation time for small clusters)

just do not modify the spec but use the status instead, for reporting problems and values which are used instead?

Spec is always the DESIRED state. When it can't be reached, modifying the spec isn't the solution.

@mshitrit
Copy link
Member Author

mshitrit commented Apr 30, 2024

and when users set a too low value, not?

In this case it'll be rejected by the webhook.

just do not modify the spec but use the status instead, for reporting problems and values which are used instead?

Hmm I don't quite follow, maybe you can give an example ?

@beekhof
Copy link

beekhof commented May 1, 2024

I'm with Marc on this one, operators should not modify the spec.
If the value is missing or too low, either:

  1. refuse to start (taking advantage of crash-loop-backoff)
  2. refuse to make progress and report an error via the status
  3. use a calculated value and report a warning via the status

@mshitrit
Copy link
Member Author

mshitrit commented May 1, 2024

I'm with Marc on this one, operators should not modify the spec. If the value is missing or too low, either:

  1. refuse to start (taking advantage of crash-loop-backoff)
  2. refuse to make progress and report an error via the status
  3. use a calculated value and report a warning via the status

I think option 1 isn't really viable - having the operator crash-loop from the get go with default configuration seems to me like an awfully user experience.

Number 2 and 3 are better in that aspect, but still not great IMO.

How about removing the default value of SafeTime and making it optional ?

  • In case it's not assigned - we can use the calculated value
  • In case the user tries to assign an invalid valuewe can use the webhook to reject it.
  • In case the user manipulate other field making minSafeTime higher than SafeTime we can fallback to option 2 or 3
  • Same as above for an upgrade use case where value is already invalid.

@beekhof
Copy link

beekhof commented May 2, 2024

No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?

@mshitrit
Copy link
Member Author

mshitrit commented May 2, 2024

No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?

I'm using the fallback to cover edge case which are rare, using this mechanism for all use cases is a terrible user experience IMO.

For example let's say that we always use option 2, consider the following scenario:

  • User installs SNR for the first time without modifying any of the defaults
  • SNR default safe time is 300S
  • For that specific cluster the calculate min Safe Time is 315s
  • As a result SNR will not work (and report problem in the status)

I think that as a user it's a reasonable expectation to have the operator working with default configuration which will not be the case here.
Best case scenario the user notices it after installation and worst case user assumes that SNR is working.
In any case I think it's a bad user experience.

@slintes
Copy link
Member

slintes commented May 2, 2024

As a result SNR will not work (and report problem in the status)

It can work, just report in the status what you are doing, e.g. using the calculated value. However I agree that the user experience isn't great, when the default value doesn't work...

How about removing the default value of SafeTime and making it optional ?

Wfm. Plus always report the calculated value in the status, plus a warning in case it's higher than the spec's values in case it's filled.

In case several agents has different values which makes SafeTime invalid, the highest (safest) one will be applied

that will make fencing slower than needed on nodes with lower watchdog timeout, but I have no better idea without bigger changes to the process... wfm

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Copy link
Member Author

mshitrit commented May 2, 2024

Plus always report the calculated value in the status

ATM it's stored in an annotation, is that good enough or do you think it's mandatory to have it in the status ?

plus a warning in case it's higher than the spec's values in case it's filled.

using an example in order to clarify to make sure we mean the same thing:

  • in case min Safe time changed (due to change of another field that affects it)
  • if the new value is higher than Safe Time
  • than add a warning to Config status

@mshitrit
Copy link
Member Author

mshitrit commented May 6, 2024

/test 4.15-openshift-e2e

- Make sure safe time can be deleted
- Additional log and events
- e2e fix

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit mshitrit force-pushed the dynamiclly-set-SafeTimeToAssumeNodeRebootedSeconds branch from 9d10cfa to d200dce Compare May 6, 2024 18:13
@mshitrit
Copy link
Member Author

mshitrit commented May 6, 2024

/test 4.15-openshift-e2e

Copy link
Contributor

@clobrano clobrano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only minor of comments from my side

pkg/reboot/calculator.go Show resolved Hide resolved
api/v1alpha1/selfnoderemediationconfig_types.go Outdated Show resolved Hide resolved
Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Copy link
Member Author

mshitrit commented May 8, 2024

/test 4.15-openshift-e2e

@@ -46,8 +46,7 @@ type SelfNodeRemediationConfigSpec struct {
// node will likely lead to data corruption and violation of run-once semantics.
// In an effort to prevent this, the operator ignores values lower than a minimum calculated from the
// ApiCheckInterval, ApiServerTimeout, MaxApiErrorThreshold, PeerDialTimeout, and PeerRequestTimeout fields.
// +kubebuilder:validation:Minimum=0
// +kubebuilder:default=180
// +kubebuilder:validation:Minimum=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Context] I've added this change in order to differentiate between 0 value which is just the default (when field is empty) and 0 value field in by the user (This differentiation was needed in the webhook).

in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...

That's a good point, after some thinking I think we are still in the clear though.

Here is my line of thought:

  • A 0 value of an older version would cause the removal of SafeTimeToAssumeNodeRebootedSeconds .
  • So IIUC the risk here is a user missing SafeTimeToAssumeNodeRebootedSeconds (that was set to 0) after an upgrade
  • since I think both the risk is low and the consequences of this risk materializing are minor, I think we can go ahead with this change.

api/v1alpha1/selfnoderemediationconfig_types.go Outdated Show resolved Hide resolved
api/v1alpha1/selfnoderemediationconfig_types.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Show resolved Hide resolved
return nil
}

// manageSafeRebootTimeInConfiguration does two things:
// 1. It sets Status.MinSafeTimeToAssumeNodeRebootedSeconds in case it's changed by latest calculation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still gives me a bit of headache that we set a dynamic value (depends on size of cluster, and watchdog timeout on node)... and the last agent being started just wins... (which solved the cluster size issue when it's increased at least... but not if it decreases. And having potentially different watchdog configs on the nodes is a topic on its own...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are valid points.
I do have some ideas on how to improve this.
Mainly keeping a map with a separate value for each agent - which should address all of the issues (cluster size change is a bit tricky but still can be managed by comparing other agents entries), but I don't think it's a good idea to do it in this PR, but the least we can do is discussing it.

api/v1alpha1/selfnoderemediationconfig_types.go Outdated Show resolved Hide resolved
pkg/reboot/calculator.go Outdated Show resolved Hide resolved
@@ -172,3 +175,26 @@ func validateToleration(toleration v1.Toleration) error {
}
return nil
}

func (r *SelfNodeRemediationConfig) validateMinRebootTime() error {
if r.Status.MinSafeTimeToAssumeNodeRebootedSeconds == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to this comment ("below" as "in a comment for a later line of code 😉" ) : #197 (comment)

It's weird that validation fails because a status field (or annotation, doesn't matter) isn't set yet, isn't it? 🤔
I think this might be a usecase for the new warning, which can be returned instead of an error. That needs some dependency updates though, IIUC (see new method signature e.g. here: https://github.com/medik8s/node-healthcheck-operator/blob/9d59a0387a11c4d38ee45f8fb055a37727e02b74/api/v1alpha1/nodehealthcheck_webhook.go#L68)

mshitrit added 3 commits May 23, 2024 12:49
Signed-off-by: Michael Shitrit <[email protected]>
- improve method readability

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

api/v1alpha1/selfnoderemediationconfig_types.go Outdated Show resolved Hide resolved
api/v1alpha1/selfnoderemediationconfig_types.go Outdated Show resolved Hide resolved
mshitrit added 2 commits May 30, 2024 13:57
- use midsentence lowercase
- remove redundant display name

Signed-off-by: Michael Shitrit <[email protected]>
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

openshift-ci bot commented Jun 6, 2024

@mshitrit: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.12-ci-bundle-self-node-remediation-bundle 04a6964 link true /test 4.12-ci-bundle-self-node-remediation-bundle
ci/prow/4.13-ci-bundle-self-node-remediation-bundle 04a6964 link true /test 4.13-ci-bundle-self-node-remediation-bundle
ci/prow/4.14-ci-bundle-self-node-remediation-bundle 04a6964 link true /test 4.14-ci-bundle-self-node-remediation-bundle
ci/prow/4.15-ci-bundle-self-node-remediation-bundle 04a6964 link true /test 4.15-ci-bundle-self-node-remediation-bundle
ci/prow/4.16-ci-bundle-self-node-remediation-bundle 04a6964 link true /test 4.16-ci-bundle-self-node-remediation-bundle

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mshitrit
Copy link
Member Author

closing in favor of #214

@mshitrit mshitrit closed this Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants