-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Dynamiclly set safe time to assume node rebooted seconds #197
[WIP] Dynamiclly set safe time to assume node rebooted seconds #197
Conversation
- overrriding SafeTimeToReboot value in configuration in case it's invalid Signed-off-by: Michael Shitrit <[email protected]>
…ime to reboot Signed-off-by: Michael Shitrit <[email protected]>
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mshitrit The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
414e9ef
to
7efc1e4
Compare
/test 4.15-openshift-e2e |
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
7efc1e4
to
46c4744
Compare
/test 4.15-openshift-e2e |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I don't think it's a good idea to set the spec in a controller, even less when it's overwriting a configuration value which can be set by users
- Even more, the value we want to set is dynamic, it depends on node specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting for the "right" value
I agree it's not ideal, however we will override the value only in 2 use cases:
In case several agents has different values which makes SafeTime invalid, the highest (safest) one will be applied, the annotation used by the webhook may still have a too low value though (we can discuss whether/if to address this later in case it's relevant). IMO among the alternatives this is the best solution but I think it's worth to share the alternatives in case you disagree or had something else in mind:
|
and when users set a too low value, not?
just do not modify the spec but use the status instead, for reporting problems and values which are used instead? Spec is always the DESIRED state. When it can't be reached, modifying the spec isn't the solution. |
In this case it'll be rejected by the webhook.
Hmm I don't quite follow, maybe you can give an example ? |
I'm with Marc on this one, operators should not modify the spec.
|
I think option 1 isn't really viable - having the operator crash-loop from the get go with default configuration seems to me like an awfully user experience. Number 2 and 3 are better in that aspect, but still not great IMO. How about removing the default value of SafeTime and making it optional ?
|
No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism? |
I'm using the fallback to cover edge case which are rare, using this mechanism for all use cases is a terrible user experience IMO. For example let's say that we always use option 2, consider the following scenario:
I think that as a user it's a reasonable expectation to have the operator working with default configuration which will not be the case here. |
It can work, just report in the status what you are doing, e.g. using the calculated value. However I agree that the user experience isn't great, when the default value doesn't work...
Wfm. Plus always report the calculated value in the status, plus a warning in case it's higher than the spec's values in case it's filled.
that will make fencing slower than needed on nodes with lower watchdog timeout, but I have no better idea without bigger changes to the process... wfm |
Signed-off-by: Michael Shitrit <[email protected]>
ATM it's stored in an annotation, is that good enough or do you think it's mandatory to have it in the status ?
using an example in order to clarify to make sure we mean the same thing:
|
…value can't yet be verified Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
/test 4.15-openshift-e2e |
- Make sure safe time can be deleted - Additional log and events - e2e fix Signed-off-by: Michael Shitrit <[email protected]>
9d10cfa
to
d200dce
Compare
/test 4.15-openshift-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only minor of comments from my side
Signed-off-by: Michael Shitrit <[email protected]>
/test 4.15-openshift-e2e |
@@ -46,8 +46,7 @@ type SelfNodeRemediationConfigSpec struct { | |||
// node will likely lead to data corruption and violation of run-once semantics. | |||
// In an effort to prevent this, the operator ignores values lower than a minimum calculated from the | |||
// ApiCheckInterval, ApiServerTimeout, MaxApiErrorThreshold, PeerDialTimeout, and PeerRequestTimeout fields. | |||
// +kubebuilder:validation:Minimum=0 | |||
// +kubebuilder:default=180 | |||
// +kubebuilder:validation:Minimum=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Context] I've added this change in order to differentiate between 0 value which is just the default (when field is empty) and 0 value field in by the user (This differentiation was needed in the webhook).
in theory at least this can be considered as an API change. It makes CRs with 0 values invalid...
That's a good point, after some thinking I think we are still in the clear though.
Here is my line of thought:
- A 0 value of an older version would cause the removal of
SafeTimeToAssumeNodeRebootedSeconds
. - So IIUC the risk here is a user missing
SafeTimeToAssumeNodeRebootedSeconds
(that was set to 0) after an upgrade - since I think both the risk is low and the consequences of this risk materializing are minor, I think we can go ahead with this change.
return nil | ||
} | ||
|
||
// manageSafeRebootTimeInConfiguration does two things: | ||
// 1. It sets Status.MinSafeTimeToAssumeNodeRebootedSeconds in case it's changed by latest calculation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still gives me a bit of headache that we set a dynamic value (depends on size of cluster, and watchdog timeout on node)... and the last agent being started just wins... (which solved the cluster size issue when it's increased at least... but not if it decreases. And having potentially different watchdog configs on the nodes is a topic on its own...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are valid points.
I do have some ideas on how to improve this.
Mainly keeping a map with a separate value for each agent - which should address all of the issues (cluster size change is a bit tricky but still can be managed by comparing other agents entries), but I don't think it's a good idea to do it in this PR, but the least we can do is discussing it.
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
@@ -172,3 +175,26 @@ func validateToleration(toleration v1.Toleration) error { | |||
} | |||
return nil | |||
} | |||
|
|||
func (r *SelfNodeRemediationConfig) validateMinRebootTime() error { | |||
if r.Status.MinSafeTimeToAssumeNodeRebootedSeconds == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to this comment ("below" as "in a comment for a later line of code 😉" ) : #197 (comment)
It's weird that validation fails because a status field (or annotation, doesn't matter) isn't set yet, isn't it? 🤔
I think this might be a usecase for the new warning, which can be returned instead of an error. That needs some dependency updates though, IIUC (see new method signature e.g. here: https://github.com/medik8s/node-healthcheck-operator/blob/9d59a0387a11c4d38ee45f8fb055a37727e02b74/api/v1alpha1/nodehealthcheck_webhook.go#L68)
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
- improve method readability Signed-off-by: Michael Shitrit <[email protected]>
/test 4.16-openshift-e2e |
- use midsentence lowercase - remove redundant display name Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@mshitrit: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
closing in favor of #214 |
Why we need this PR
Safe Time to reboot is a configured value, however it can't be lower than a minimum calculated value.
The calculated value may differ between clusters so we need a mechanism to override the configured Safe Time to reboot in case it's lower than the calculated value.
This conflict can cause to agents crash-loop when the operator is installed since the agents will not not run with an invalid (lower than calculated value) Safe Time to reboot value.
Changes made
Which issue(s) this PR fixes
ECOPROJECT-1875
Test plan