Windows node - VolumeAttachment stuck #2100
Thank you for filing this issue. This daily restart really should not be needed. I will try to reproduce the issue this week with a similar cron job.
What kind of debug option would have been helpful for you?
/priority important-soon
Also, would you consider filing a customer support request with AWS about this? Going through AWS support would make it easier for the EBS team to collect more details as to why the volume is not detaching, especially given the comment above. Thank you.
Also, the following AWS FAQ item might be helpful for understanding the root cause: Resolve an EBS volume stuck in the detaching state | AWS re:Post. The log message above is strange, because the CSI driver node pod should have already succeeded in unmounting the volume by the time the CSI controller pod calls EC2 DetachVolume. I will continue to investigate this issue. Thank you.
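For reference, a minimal sketch of how the attachment state could also be checked from the AWS side (the volume ID is a placeholder):

```bash
# A sketch of checking the volume's attachment state from the AWS side.
# vol-0123456789abcdef0 is a placeholder volume ID.
aws ec2 describe-volumes \
  --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].Attachments[].{Instance:InstanceId,State:State,Device:Device}' \
  --output table
```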
@AndrewSirenko thanks a lot for jumping right into this topic. Hope you will get lucky reproducing the problem.
I believe you can already see what calls are made to AWS with a higher controller log level. You can also get more detailed CSI Driver node pod logs with node.logLevel.
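A minimal sketch of how both log levels might be raised, assuming the upstream Helm chart and a release named aws-ebs-csi-driver:

```bash
# A sketch of raising controller and node verbosity via Helm values.
# Chart and release names assume the upstream kubernetes-sigs aws-ebs-csi-driver chart.
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --reuse-values \
  --set controller.logLevel=7 \
  --set node.logLevel=7
```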
Understood. If/when we reproduce the issue, then I can file a support ticket myself. In the meantime, while we try to reproduce on our end, if you could send any high verbosity (node.logLevel=7) node pod logs from the relevant node (or even kubelet logs), it would help us see whether there was an issue with the CSI Driver unmount operation that might have caused the EC2 DetachVolume API call to get stuck in the busy state. Thank you!
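As an illustration, the node pod logs could be collected with something like the following (the pod name is a placeholder and the container name is an assumption):

```bash
# A sketch of collecting high-verbosity logs from the Windows node pod.
# The pod name is a placeholder; the ebs-plugin container name is an assumption.
kubectl -n kube-system get pods -o wide | grep ebs-csi-node-windows
kubectl -n kube-system logs ebs-csi-node-windows-xxxxx -c ebs-plugin --since=24h > node-plugin.log
```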
Thanks for the details, I was not aware that you can set separate log levels for the controller and node. I set up the increased log level; I have to observe how much log it produces, and what exactly gets pushed into the output, before I enable this on production, where the issue reproduces at a much faster rate. I'll let you know when I succeed with it.
The volume is not yet stuck on the Windows node, but here is what you can see already in the node pod's log:
The controller log shows normal activity if I search for the volume ID:
Still, these gRPC errors look problematic. Do you have any idea how I can debug this further to see which operations are in progress and in contention?
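For reference, a sketch of how the controller-side logs might be filtered by volume ID (the deployment and container names are assumptions for a default install):

```bash
# A sketch of filtering controller-side logs by volume ID.
# The deployment and container names (ebs-csi-controller, ebs-plugin, csi-attacher)
# are assumptions for a default install; the volume ID is a placeholder.
kubectl -n kube-system logs deployment/ebs-csi-controller -c ebs-plugin   | grep vol-0123456789abcdef0
kubectl -n kube-system logs deployment/ebs-csi-controller -c csi-attacher | grep vol-0123456789abcdef0
```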
This error typically means that a node driver remote procedure call (e.g. NodeStageVolume) is still in progress or was slow to complete. Increased node debug logs might show you which operation was slow. We can take an action item to make this error message clearer by at least noting which gRPC call it was, but I don't think this is of concern for now. Thanks for attaching these logs! Still no luck on my Windows EKS cluster, but perhaps the weekend will bring the issue up locally.
Hello @AndrewSirenko, the issue did not come up during the weekend. Meanwhile, debug mode was switched on and, as far as I can see and understand, the operations go well, both attach and detach. This was a successful execution of operations on the volume I had the issue with last time:
Sorry, but it looks like I still have to wait.
This morning the issue occurred again, and I've seen two volumes, vol-0a959b8cf5f76a3eb and vol-05adad3e3cf7854ab, stuck for hours. For debugging purposes I searched for the string "vol-05adad3e3cf7854ab", which I mentioned earlier in the ticket. This way I was able to pinpoint the exact time the issue happened. In the attached picture you can see that normal execution happened every hour yesterday until 3:30, when something happened and the volume ID was logged continuously in the node's log for 210 minutes. During this time the pod was active; then, due to our Argo workflow setup, the pod was eventually cancelled.

I checked the two timepoints to extract the logs around them. In the csi-attacher it seems that the last successful attachment was at 03:30, and detaching was never called up until the pod was cancelled. Other than this I only found messages I was already familiar with. What messages or strings should I look for?
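For reference, a sketch of the kind of command that could pull logs from one of these timepoints onward (the timestamp and pod name are placeholders):

```bash
# A sketch of pulling node-pod logs from a specific timepoint onward.
# Pod name, container name, and timestamp are placeholders.
kubectl -n kube-system logs ebs-csi-node-windows-xxxxx -c ebs-plugin \
  --since-time='2024-01-01T03:25:00Z' | grep vol-05adad3e3cf7854ab
```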
Hey @weretheone, thank you for reproducing and for your patience. This does not look like intended CSI Driver behavior, but I believe we are getting closer to a solution. Your screenshots and timeline point to an issue in NodeStageVolume. I've taken an action item to take a closer look at our NodeStageVolume Windows code path. If the format + mount succeeded, there is no reason the gRPC error should keep occurring. Can you help answer a few follow-ups, given that I have yet to reproduce this on my Windows cluster?
Hi @AndrewSirenko, for sure, I will try to answer your follow-up questions.
These follow-up answers are very helpful. This looks like more of a "Kubernetes never realized that the volume was mounted" problem rather than a "volume cannot detach" problem, which narrows down our search. Regarding email, mine is "andrewsirenko" at gmail. Thank you for the AMI, I will use it to try to find a repro. I'm not sure if I can give you an ETA on a fix, but we will look into this this month, especially now that we have a hint as to where the problem lies. Thank you so much!
@AndrewSirenko sorry, I was away for personal reasons. I've just sent over the extracted and sanitized logs for you.
Hello @AndrewSirenko, I forgot about this issue because of other responsibilities, but it's still affecting our Windows operations. Did you have time to look into the topic?
Hi @weretheone, without a local reproduction of the issue on our end, we are a bit stuck. We would like to help, but we will need more info from you. Would you be able to pursue one of the following three options:
Sure, options 2 and 3 are fine for me. I wasn't able to reproduce the issue locally, so let's have a call. Then, if we need it, we can discuss which AWS support plan I would need to cover these types of support activities. I sent my logs from my business e-mail; you can reply there with the call details.
/kind bug
What happened?
In a mixed cluster, on a Windows node, a VolumeAttachment gets stuck for hours with a detach error:
We never had this issue on Linux-based nodes, but on Windows nodes, after a certain run time, we face the above issue. Usually it happens after roughly 2-3 days of uptime, even though the daily workload (and the volume of attach/detach operations) is identical every day.
The volume state is "In-use" in the AWS console, and this can be found in the logs:
This issue has been with us for years, and we were only able to mitigate it with a daily restart of the Windows machines, which somewhat works around the problem, but it would be nice to resolve the root cause.
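A sketch of how the stuck attachment can be inspected from the Kubernetes side (the VolumeAttachment name is a placeholder):

```bash
# A sketch of inspecting a stuck attachment from the Kubernetes side.
# The VolumeAttachment name is a placeholder taken from the first command's output.
kubectl get volumeattachments
kubectl describe volumeattachment csi-0123456789abcdef | grep -A3 -i error
```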
What you expected to happen?
Volumes get detached correctly.
How to reproduce it (as minimally and precisely as possible)?
Configure and start up the driver with the option
enableWindows: true
on a mixed cluster with a Windows node. Define the StorageClass and PVC for your workload. We use Argo Workflows to schedule jobs which utilize the defined PVCs and attach volumes to pods. The pods execute their activities and, upon completion, the volumes detach. After a few days of correct execution, the communication between the controller (which runs on a Linux machine) and the ebs-csi-node-windows pod (which runs on a Windows machine) experiences problems, which later result in volumes getting stuck. A sketch of the corresponding Helm setup is included below.
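For reference, a minimal sketch of the corresponding Helm install, assuming the upstream chart (only the enableWindows option comes from our actual setup):

```bash
# A sketch of installing the driver with Windows support enabled.
# Chart location and release name are assumptions; node.enableWindows is the
# Helm value corresponding to the enableWindows option referenced above.
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.enableWindows=true
```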
Anything else we need to know?:
We tried various debug methods and read through a ton of issues connected to gRPC problems, but we were not able to pinpoint the root cause. Debugging is extremely hard because, after a node restart, it can go for days without a problem, so if you can provide any debug option, that would be really nice.
Environment