Ensure update-agent waits for all volumes to be detached before rebooting #30
Scenario: running a Ceph cluster using the Rook operator. During drain, the volumes are detached, but it can take some time for the unmount to propagate to the kernel. I have not looked into the details, but according to @martin31821 this is caused by the ceph kernel client doing extra work during unmount, so changing this from userspace is not possible. #62 introduces a quick workaround by adding some wait time after draining the node.
Maybe we can solve this by introducing the possibility to run one or more Kubernetes Jobs prior to rebooting, which could be used, for example, to change DNS records, wait a certain amount of time, or run host commands before the reboot.
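As a rough illustration of that idea (everything here is hypothetical; FLUO has no such hook mechanism today), the simplest such Job would just sleep for a grace period:

```yaml
# Hypothetical pre-reboot hook; FLUO does not run Jobs like this today.
apiVersion: batch/v1
kind: Job
metadata:
  name: pre-reboot-wait # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: wait
          image: busybox
          # Give CSI drivers time to finish detaching volumes after the drain.
          command: ["sleep", "60"]
```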
Not ideal, but I guess we could test against that on Lokomotive, as we have a pipeline there testing FLUO and Rook together. CC @surajssd
Note: the existing capability for running hooks runs before the node is drained, which indeed makes it impossible right now to deploy a custom hook that could ensure this. Perhaps this could be addressed.
As part of #37, I'm analyzing in detail how FLUO works, as there is no documentation or tests. What comes to mind is that the hooks model could perhaps be extended, so that it's possible to run a workflow between each of the significant actions taken.
However, the existing state tracking model is overly complex, and right now I don't feel comfortable adding another step to it. Perhaps we should try to simplify it first, then extend it with an extra hook.
We are affected by this as well. A few seconds of sleep after draining, as in #62, would help mitigate it.
Just realized I think I hit this issue on my cluster as well 😄
This commit provides a PoC version of the agent waiting for all volumes attached to the node to be detached, as a step after draining the node. Shutting down a Pod does not mean its volumes have been detached: usually a CSI agent runs as a DaemonSet on the node and takes care of detaching volumes once the pod shuts down.

This improves the rebooting experience: right now, if the CSI agent does not have enough time to detach the volumes, the node gets rebooted while the volumes are still attached, so they cannot be attached to other nodes, which effectively increases the downtime for stateful workloads.

This commit still requires tests and a better interface for users. If someone wants to try this feature on their own cluster, I've published the image I've been testing with: quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e

Closes #30

Signed-off-by: Mateusz Gozdek <[email protected]>
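A minimal sketch of the waiting step described above, using client-go (this is an illustration under my own assumptions, not the actual code from the commit; the function and names are hypothetical):

```go
// Minimal sketch, not the actual commit: poll the storage.k8s.io/v1
// VolumeAttachment API after draining and return once no attachment
// references the node anymore.
package agent

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForVolumesDetached blocks until no VolumeAttachment object
// references nodeName, or until ctx is cancelled.
func waitForVolumesDetached(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		// VolumeAttachments are cluster-scoped; filter client-side on
		// spec.nodeName, which records the node a volume is attached to.
		attachments, err := client.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
		if err != nil {
			return fmt.Errorf("listing volume attachments: %w", err)
		}

		remaining := 0
		for _, va := range attachments.Items {
			if va.Spec.NodeName == nodeName {
				remaining++
			}
		}
		if remaining == 0 {
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```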
Created a PoC/draft PR to play around with this, and things seem to improve nicely: #169.
Original issue: coreos/container-linux-update-operator#191
Perhaps waiting for `kubectl get volumeattachments` to become empty, with the right selector, would be sufficient?
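For reference, one way to list the attachments still bound to a given node (assuming `spec.nodeName` is not usable as a server-side field selector here, the filtering is done client-side with JSONPath; `<node-name>` is a placeholder):

```sh
kubectl get volumeattachments \
  -o jsonpath='{range .items[?(@.spec.nodeName=="<node-name>")]}{.metadata.name}{"\n"}{end}'
```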