LXD containers randomly drop/lose /proc/cpuinfo when using ZFS and Linux 6.8 (Noble) #14178
I have upgraded lxd on the hosts to 6.1/stable to see if the new major revision mitigates this issue.
I tried reproducing this on
It didn't result in a
This sounds like a fuse/lxcfs issue at first glance, but I wonder if the attached GPU or the ES CPU could have anything to do with it. @Qubitium it's a long shot, but I see many fuse-related changes in the next kernel point release (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.10.10), so you might want to consider upgrading to the latest 6.10.x release.
@simondeziel I will upgrade to 6.10.10 to see if the problem persists. To add more info:
Very strange.
@mihalicyn could this be related to the LXCFS fixes you're working on?
Reproduced on kernel 6.10.12-x64v4-xanmod1. Host: Ubuntu 24.04, kernel 6.10.12, snap lxd 6.1/stable.
This server/host was rebooted yesterday, so it happened within 24 hours. Again, it's quite random when and to which container it happens. I checked
The host has a GPU passed to a separate container.

EDIT: ALL containers on this host lost access to the relevant /proc/* entries, not just this one. I checked all containers, about 8-10, and they all have broken /proc/cpuinfo and related access.
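A quick way to check every container at once could be a loop like the following. This is a sketch, not from the thread; it assumes the snap LXD `lxc` client on the host and that the containers are running:

```shell
# Hypothetical check loop: a /proc/cpuinfo that reads as empty means
# lxcfs has stopped serving the file in that container.
for c in $(lxc list --format csv -c n); do
    n=$(lxc exec "$c" -- sh -c 'cat /proc/cpuinfo 2>/dev/null | wc -l')
    if [ "$n" -eq 0 ]; then
        echo "BROKEN: $c"
    else
        echo "ok: $c ($n lines)"
    fi
done
```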
@simondeziel @tomponline @mihalicyn Found the cause! This is very good news. The lxd daemon had an internal crash related to
EDIT: looks like an attempt to free an invalid pointer in
@Qubitium thanks, that's really useful. I've punted the issue to Aleks, who's one of the
@simondeziel Should I track this here or will there be a second issue on github/lxcfs?
Indeed, that might require a bug in the
Hi @Qubitium,

Thanks a lot for reporting this issue to us! Maybe my question sounds unrelated, but are you using ZFS? If yes, then your case looks similar to lxc/lxcfs#644

See also:
@mihalicyn I am using ZFS, but I did not get the same kernel crashes as posted in the ZFS GitHub issue. But... I found your comment in that issue thread, and I am doing hourly flushes of cache buffers exactly as you do not recommend. Oof. 😢 Can you explain why this level 3 flush is dangerous? I am using it so ext4 and ZFS buffers are flushed on the host on a regular basis, so that containers don't OOM due to memory allocations.
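For reference, the "level 3 flush" discussed above is presumably the standard drop_caches interface; the actual cron job isn't shown in the thread, but it would be along these lines:

```shell
# Level 3 = free pagecache (1) + reclaimable dentries and inodes (2).
# drop_caches only releases clean, unused cache, so sync first to write
# back dirty pages and make them droppable. Must run as root.
sync
echo 3 > /proc/sys/vm/drop_caches
```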
Yeah, the reporter of that issue also wasn't experiencing crashes until some point, and they were relatively rare. But once he got one, it was clear evidence of a serious issue with the ZFS kernel driver. I'm not saying that your issue is 100% the same as that other report, but it looks too similar:
What I would suggest you do is install an older kernel, like 6.5, and check whether the issue is still there. If not, then you at least have a workaround until all the problems with ZFS are solved. Now the question is how to install an older kernel on Noble, taking into account that Ubuntu Noble has 6.8 as its base kernel. I would try downloading a 6.6.x kernel from https://kernel.ubuntu.com/mainline/v6.6.51/ (the 6.6 choice is not random; it's an official upstream LTS kernel, see https://kernel.org/)
These flushes are not dangerous if they are done on a non-buggy kernel. But I had a hint that something is wrong with the ZFS ARC cache, and forcing a drop of caches may trigger a buggy codepath in the kernel and stimulate a kernel crash (which is good for debugging, but can have really bad consequences when you run it on a production system, causing data loss or corruption on your disk).
@mihalicyn Thank you for the deep dive. Looks like I hit a rabbit hole that may not be solvable in the near term unless someone can reproduce it in a non-random workload. I will definitely test downgrading the kernel to 6.6 and report back on stability.
Hey @Qubitium, if you are still on Ubuntu Noble's default kernel, you can also try enabling the KFENCE detector, as it may (if we are lucky enough) help to identify the issue and help with fixing it in the future. As a root user:
or even better (but will take more CPU resources):
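The commands above were lost from the scrape; they are presumably along these lines. This is a sketch assuming the stock Noble kernel is built with CONFIG_KFENCE and exposes the standard sysfs knob, and the specific interval values are my assumption, not Aleksandr's exact suggestion:

```shell
# Sketch (assumption): enable KFENCE sampling at runtime via its module
# parameter. The interval is in milliseconds; smaller means more frequent
# sampling, more CPU overhead, and better odds of catching the bad access.
# A value of 0 disables KFENCE.
echo 100 > /sys/module/kfence/parameters/sample_interval

# "even better" variant (more CPU resources): sample far more aggressively
echo 1 > /sys/module/kfence/parameters/sample_interval
```

KFENCE reports land in the kernel log, so `dmesg -w` is the place to watch afterwards.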
It is relatively safe and designed for debugging in production environments. After enabling this you need to watch your

Upd: you may consider this https://gist.github.com/melver/7bf5bdfa9a84c52225b8313cbd7dc1f9 script too.

Upd 2: You can also enable SLUB debugging by editing
and then
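The elided edit above is presumably to the kernel command line. A sketch for Ubuntu with GRUB follows; the specific `slub_debug` flag combination is my assumption, not necessarily the one intended in the thread:

```shell
# Sketch (assumption): boot with SLUB debugging by adding slub_debug to
# /etc/default/grub, e.g. GRUB_CMDLINE_LINUX="... slub_debug=FZPU"
# Flags: F = sanity checks, Z = red zoning, P = poisoning, U = user tracking.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&slub_debug=FZPU /' /etc/default/grub
sudo update-grub   # then reboot for the new command line to take effect
```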
@mihalicyn Thanks for the tips. Can I combine KFENCE with SLUB debugging? Can they coexist peacefully?
Yes, absolutely!
Required information
Issue description
The LXD container (both host and container are Ubuntu 24.04.1) randomly drops /proc/cpuinfo. I have no idea why this is happening. Force-stopping and then starting the container fixes the issue, until it happens next time. The chance of it happening is about once per week.
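The force-restart workaround would look like this (assuming a container named `c1`, as in the attached config):

```shell
# Force-stop and restart the affected container; /proc/cpuinfo comes back
# because the lxcfs FUSE overlays are re-mounted on container start.
lxc stop --force c1
lxc start c1
lxc exec c1 -- head -n 5 /proc/cpuinfo   # verify processor info is served again
```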
Want to add: this server/container has a single Nvidia 4070 GPU passed through via device=gpu type=gpu.

Steps to reproduce
Happened more than once, randomly, on different AMD single-socket servers: EPYC 9004, 32 cores/64 threads. (CPU has no official model ID: engineering sample.)
Information to attach
Correct cpuinfo:
lxc config show c1 (the container)