Gvisor pod cannot be terminated properly #417
Comments
Upstream issue: google/gvisor#9834 (comment) |
Just hit this after upgrading to Talos 1.8.0 |
Also have been experiencing this |
Gvisor is still broken with talos main:
Warning FailedKillPod 17s kubelet error killing pod: failed to "KillPodSandbox" for "01ee1caf-9da0-40af-a663-5408d37d8a0e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded" |
Can you try with |
Seems when adding gvisor debug it's still using the
❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/gvisor-debug.part
[debug]
level = "debug"
[plugins."io.containerd.runtime.v1.linux"]
shim_debug = true
❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/runsc.toml
[runsc_config]
❯ talosctl -n 10.5.0.3 get extensions
WARNING: 10.5.0.3: server version 1.8.0-alpha.2-70-ga9bff3a1d-dirty is older than client version 1.8.1
NODE NAMESPACE TYPE ID VERSION NAME VERSION
10.5.0.3 runtime ExtensionStatus 0 1 gvisor-debug v1.0.0
10.5.0.3 runtime ExtensionStatus 1 1 gvisor 20240826.0 |
not sure why |
I wonder if that's the order of extensions? |
I think we should integrate gvisor debug with the general gvisor extension and just add them as additional runtimes. They remain unusable unless someone configures a runtimeclass for debugging, and this would help reduce the overhead we see here right now. |
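For illustration, the "additional runtimes" idea could look roughly like the containerd config fragment below. This is only a sketch under assumptions: the runtime name `runsc-debug` and the path `/etc/cri/conf.d/runsc-debug.toml` are hypothetical and are not taken from the Talos extensions, and the plugin section name assumes containerd v2 (containerd 1.x uses `io.containerd.grpc.v1.cri` instead).

```toml
# Regular gVisor runtime, reading its runsc options from the existing file.
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"

[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/cri/conf.d/runsc.toml"

# Hypothetical second runtime with its own, debug-enabled runsc config.
# It stays idle until a RuntimeClass with handler "runsc-debug" selects it.
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runsc-debug]
  runtime_type = "io.containerd.runsc.v1"

[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runsc-debug.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/cri/conf.d/runsc-debug.toml"  # hypothetical path
```

With a layout like this, the debug settings would live only in the second runsc config file, so regular gVisor pods would not pick them up.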
attaching support zip and runsc logs |
I don't see any errors in the logs you posted so far. |
yeh, that's the thing, it's just the pod fails to terminate |
I'm quite sure it's a containerd vs gvisor-shim problem. Given how many breaking changes containerd v2 introduced in that space: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking
|
would you like to create an upstream issue then? |
I think containerd removed its own runc.v1 shim, totally unrelated to gvisor, but still there might be some issue of course. |
containerd issue: containerd/containerd#10891 |
New gvisor issue here: google/gvisor#11308 |
@SISheogorath @smira I saw that containerd v2.0.1 was released just 5 days back: https://github.com/containerd/containerd/releases/tag/v2.0.1. Have you been on containerd v2 from before that? From your investigation in google/gvisor#11308 (comment), your intuition does feel correct. Something at the shim level is misbehaving (i.e. the shim is not being invoked like it's expecting to be). |
until a solution is available, is there a workaround to clean up those resources on talos's end? |
not really. Triggering a reboot would clean them up, as talos will forcefully remove the pods. Support for containerd v2 is coming from the gvisor side |
sadge |
The Gvisor test pod used in the talos e2e-extensions test never terminates successfully. This causes the reboot/shutdown sequence to hang and eventually time out; the kubelet shows a "failed to delete pod sandbox" error. The Gvisor test is going to be disabled until this is addressed.