Extend LXCFS integration #1072

Merged: PhilippWendler merged 6 commits into main from extend-lxcfs-integration on Aug 19, 2024


PhilippWendler
Member

We already use LXCFS to provide better isolation of the containers and to virtualize /proc/uptime. But LXCFS can do more: for example, it can also provide a virtualized view of the CPU-core information provided by the kernel, so that applications see only the cores they are allowed to use. This was implemented only incompletely and has not worked so far.

Currently RunExecutor creates a cgroup instance and uses that.
But sometimes the underlying executor needs to create a nested cgroup
and put the tool into that. Now we pass it back to RunExecutor
such that it can make use of it for the measurements.
The effect is that we can use different cgroups for limits
and for measurements (and killing processes).
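
The hand-back described above can be sketched as follows. This is only an illustration under assumptions: `Cgroup`, `execute_tool`, `run`, the child name `tool`, and the path `/sys/fs/cgroup/benchmark` are hypothetical and are not BenchExec's actual classes or API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Cgroup:
    """Hypothetical stand-in for a cgroup handle (not BenchExec's real class)."""

    path: str

    def child(self, name: str) -> "Cgroup":
        # Only computes the path of a nested cgroup; nothing is created on disk.
        return Cgroup(f"{self.path}/{name}")


def execute_tool(limits_cgroup: Cgroup, nest: bool) -> Cgroup:
    """Hypothetical executor step: optionally nest the tool and report
    which cgroup the tool actually ended up in."""
    tool_cgroup = limits_cgroup.child("tool") if nest else limits_cgroup
    # ... a real executor would start the tool inside tool_cgroup here ...
    return tool_cgroup


def run(nest: bool) -> Tuple[str, str]:
    """Return (cgroup used for limits, cgroup used for measurements)."""
    limits = Cgroup("/sys/fs/cgroup/benchmark")  # assumed example path
    measured = execute_tool(limits, nest)  # the caller measures and kills via this
    return limits.path, measured.path
```

Without nesting, both roles fall on the same cgroup; with nesting, the caller keeps configuring limits on the parent while measuring the child.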

But in this commit we still pass only the original cgroups instance back,
so there is no behavior change.

We already do that for cgroups v2,
such that the cgroup of the benchmarked tool is a child cgroup
of the cgroup where we configure the limits.
For cgroups v1 it is not necessary to do this so far,
but it might be good for consistency
and it is required for better integration of LXCFS.

Building on the last commit, we now change the behavior
and pass back the cgroup that the actual tool is in,
instead of the parent cgroup
(at least on cgroups v2, no change for cgroups v1).
This should still not result in any visible changes,
because measurements and killing processes
should be the same for the parent cgroup and the tool cgroup
- there is nothing else in the parent cgroup.

We recommend installing LXCFS together with BenchExec,
because we use that to virtualize for example /proc/uptime in the container.
However, a main use case of LXCFS is to virtualize files
that contain information about the system such as the available CPU
cores in /proc/cpuinfo.
We never advertised this, but I assumed it had been working all along.
I found out that it never worked, though.

The reason is that LXCFS is using the limits configured for the init
process of the container, but our init process has no limits,
it is not part of the same cgroup as the other processes in the container
(on purpose, because we do not want to measure its resource consumption).
So now we create yet another cgroup for the init process
that is below the one with the limits
but outside of the one that is used for measurements.
Note: A single runexec execution will now create up to 5 cgroups.

This is made possible by the separation between the cgroups
for limits and for measurements in the last commits.
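
The relative placement of these cgroups can be sketched as below. The names `init` and `tool`, the helper functions, and the example path are hypothetical; only the relationships (init below the limits cgroup, but outside the measurement cgroup) come from the description above.

```python
from pathlib import PurePosixPath


def plan_cgroups(limits_path: str) -> dict:
    """Hypothetical layout: init and measurement cgroups are siblings
    below the cgroup that carries the limits."""
    limits = PurePosixPath(limits_path)
    return {
        "limits": limits,
        "init": limits / "init",  # sees the limits, but is not measured
        "measurement": limits / "tool",  # holds the benchmarked tool
    }


def is_below(child: PurePosixPath, parent: PurePosixPath) -> bool:
    """True if `child` is a strict descendant of `parent`."""
    return parent in child.parents
```

With this layout, anything that inspects the cgroup of the init process finds the limits of its parent, while the resource usage of the init process stays out of the measurement cgroup.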

With this change, /proc/cpuinfo now shows only the cores available in the
container if LXCFS is running.
This helps processes in the container to see how many CPU cores
they are allowed to use and for example to decide how many threads to spawn.
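
As an illustration of how a process in the container might use this, here is a minimal sketch that counts the cores listed in /proc/cpuinfo. It assumes the common layout with one `processor : N` line per core (as on x86; some architectures format the file differently), and the function names are hypothetical.

```python
def count_processors(cpuinfo_text: str) -> int:
    """Count 'processor : N' entries in the text of /proc/cpuinfo."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


def choose_thread_count(cpuinfo_text: str, fallback: int = 1) -> int:
    """Use one thread per visible core, or `fallback` if none were found."""
    return count_processors(cpuinfo_text) or fallback
```

A tool would call it as `choose_thread_count(open("/proc/cpuinfo").read())`; the result reflects the container's cores only if LXCFS provides the virtualized file.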

Fixes #1070

As with /proc, LXCFS provides a virtualized /sys/devices/system/cpu
that only shows the allowed cores.
Of course we want to mount that in the container as well,
at least if the user has not requested that /sys be hidden
or that the container get full access to the host directory.

Fixes #1069
@PhilippWendler added the labels "enhancement" and "container" (related to container mode) on Aug 16, 2024
@PhilippWendler
Member Author

@schroeding A code review and testing in as many scenarios as possible would be good.

Contributor

@schroeding schroeding left a comment


Looks good to me. The test code also works with all the different grep implementations I could find on Debian and NixOS.

As a side note, the output of /proc/cpuinfo is inconsistent with other related outputs in /sys/devices/system/cpu/cpu*/. We do everything correctly and mount the correct directory from /var/lib/lxcfs, but the output of LXCFS in the /sys/devices/system/cpu/cpu*/ directories is itself inconsistent (at least on my AMD-powered test system) with /proc/cpuinfo and with the information directly in /sys/devices/system/cpu/*.

Programs that parse /sys/devices/system/cpu/cpu* in detail (e.g. cpu-info, some versions of htop) are thus still confused for now, but I don't see anything we can do about it. This has to be fixed by LXCFS (see e.g. lxc/lxcfs#627).
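
One way to spot such a mismatch is sketched below. This is only one possible check, not necessarily the inconsistency described above: it compares a kernel-style CPU list such as `0-3,8` (the format used by /sys/devices/system/cpu/online) with the `cpuN` directory names found next to it. All function names are hypothetical.

```python
import re


def parse_cpu_list(text: str) -> set:
    """Parse a kernel CPU list such as '0-3,8' into a set of core IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        first, sep, last = part.partition("-")
        if sep:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus


def cpu_ids_from_dir_names(names) -> set:
    """Extract N from entries named 'cpuN'; ignore 'cpufreq', 'online', etc."""
    ids = set()
    for name in names:
        match = re.fullmatch(r"cpu(\d+)", name)
        if match:
            ids.add(int(match.group(1)))
    return ids


def find_mismatch(online_text: str, dir_names) -> tuple:
    """Return (IDs listed but without a directory, directories not listed)."""
    online = parse_cpu_list(online_text)
    dirs = cpu_ids_from_dir_names(dir_names)
    return online - dirs, dirs - online
```

Inside a container, the inputs would be the content of /sys/devices/system/cpu/online and `os.listdir("/sys/devices/system/cpu")`; two empty sets mean that both views agree.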

@PhilippWendler
Member Author

Thanks, also for explaining the LXCFS problem. I think the problem is not bad enough to need a workaround.

@PhilippWendler PhilippWendler merged commit 09baf16 into main Aug 19, 2024
15 checks passed
@PhilippWendler PhilippWendler deleted the extend-lxcfs-integration branch August 19, 2024 10:55
Development

Successfully merging this pull request may close these issues.

Virtualized /proc/cpuinfo via LXCFS not working
Use LXCFS to provide virtualized /sys as well
2 participants