Extend LXCFS integration #1072

PhilippWendler · 2024-08-16T07:54:02Z

We already use LXCFS to improve the isolation of containers, for example by virtualizing /proc/uptime. But LXCFS can do more: it can also provide a virtualized view of the CPU-core information provided by the kernel, such that applications see only the cores that they are allowed to use. So far this was only partially implemented and did not work.

Currently RunExecutor creates a cgroup instance and uses that. But sometimes the underlying executor needs to create a nested cgroup and put the tool into that. Now we pass it back to RunExecutor such that it can make use of it for the measurements. The effect is that we can use different cgroups for limits and for measurements (and killing processes). But in this commit we still pass only the original cgroups instance back, so there is no behavior change.

We already do that for cgroups v2, such that the cgroup of the benchmarked tool is a child cgroup of the cgroup where we configure the limits. For cgroups v1 it is not necessary to do this so far, but it might be good for consistency and it is required for better integration of LXCFS.

Building on the last commit, we now change the behavior and pass back the cgroup that the actual tool is in, instead of the parent cgroup (at least on cgroups v2; there is no change for cgroups v1). This should still not result in any visible changes, because measurements and killing processes should behave the same for the parent cgroup and the tool cgroup: there is nothing else in the parent cgroup.
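The following is only a toy model of the argument above, not BenchExec's actual cgroup code. The `Cgroup` class, its methods, and the names `benchmark` and `tool` are hypothetical and exist purely for illustration. It shows why measuring the nested tool cgroup gives the same set of processes as measuring the parent, as long as the parent contains nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    """Toy stand-in for a cgroup; not BenchExec's real cgroup class."""

    name: str
    pids: set = field(default_factory=set)
    children: list = field(default_factory=list)

    def create_child(self, name):
        """Create a nested cgroup below this one."""
        child = Cgroup(f"{self.name}/{name}")
        self.children.append(child)
        return child

    def all_pids(self):
        """PIDs in this cgroup and all of its descendants."""
        result = set(self.pids)
        for child in self.children:
            result |= child.all_pids()
        return result


# Cgroup where the limits are configured (hypothetical name).
limits_cgroup = Cgroup("benchmark")
# Nested cgroup that actually contains the tool (hypothetical name).
tool_cgroup = limits_cgroup.create_child("tool")
tool_cgroup.pids.update({1001, 1002})

# Nothing else lives in the parent, so both views cover the same processes.
assert limits_cgroup.all_pids() == tool_cgroup.all_pids()
```

As soon as a process is placed in the parent or in a sibling cgroup, the two views differ, which is what makes a separate cgroup for measurements useful.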

We recommend installing LXCFS together with BenchExec, because we use it to virtualize for example /proc/uptime in the container. However, a main use case of LXCFS is to virtualize files that contain information about the system, such as the available CPU cores in /proc/cpuinfo. We never advertised this, but I assumed it had been working all along. I found out that it never worked, though.

The reason is that LXCFS uses the limits configured for the init process of the container, but our init process has no limits: it is not part of the same cgroup as the other processes in the container (on purpose, because we do not want to measure its resource consumption). So now we create yet another cgroup for the init process, which is below the cgroup with the limits but outside of the cgroup that is used for measurements. Note: A single runexec execution will now create up to 5 cgroups. This is made possible by the separation between the cgroups for limits and for measurements in the last commits.

With this change, /proc/cpuinfo now shows only the cores available in the container if LXCFS is running. This helps processes in the container to see how many CPU cores they are allowed to use and, for example, to decide how many threads to spawn. Fixes #1070
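As an illustration of how a process inside the container might use the virtualized /proc/cpuinfo to decide how many threads to spawn, here is a minimal sketch. It is not part of BenchExec; the function names are hypothetical, and it assumes the x86-style format in which each core has a line starting with `processor` followed by a colon.

```python
import os


def count_cpuinfo_processors(cpuinfo_text):
    """Count the 'processor : N' entries in the text of /proc/cpuinfo."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


def pick_thread_count(cpuinfo_path="/proc/cpuinfo"):
    """Choose a thread count from cpuinfo, falling back to os.cpu_count()."""
    try:
        with open(cpuinfo_path) as f:
            n = count_cpuinfo_processors(f.read())
    except OSError:
        n = 0  # file missing or unreadable
    return n or os.cpu_count() or 1
```

A tool would call `pick_thread_count()` once at startup. Without the virtualized file it would see every core of the host, including cores it is not allowed to use.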

As for /proc, LXCFS provides a virtualized /sys/devices/system/cpu that shows only the allowed cores. Of course we want to mount that in the container as well, unless the user has requested that /sys be hidden or that the container get full access to the host directory. Fixes #1069
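For illustration, a process in the container could read the allowed cores from the virtualized directory roughly as follows. This is a sketch, not BenchExec code: `parse_cpu_list` and `read_online_cores` are hypothetical helpers, and the sketch assumes that the `online` file is present and uses the simple comma-separated list of ranges (for example `0-3,8`).

```python
def parse_cpu_list(text):
    """Parse a CPU list such as '0-3,8' into a list of core numbers."""
    cores = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        first, sep, last = part.partition("-")
        if sep:
            cores.extend(range(int(first), int(last) + 1))
        else:
            cores.append(int(first))
    return cores


def read_online_cores(path="/sys/devices/system/cpu/online"):
    """Return the core numbers listed in the given 'online' file."""
    with open(path) as f:
        return parse_cpu_list(f.read())
```

Inside a container restricted to a subset of cores, `read_online_cores()` should then return only those cores if the virtualized directory is mounted.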

PhilippWendler · 2024-08-16T07:54:18Z

@schroeding A code review and testing in as many scenarios as possible would be good.

schroeding

Looks good to me. The test code also works with all the different grep implementations I could find on Debian and NixOS.

As a side note, the output of /proc/cpuinfo is inconsistent with other related outputs in /sys/devices/system/cpu/cpu*/. We do everything correctly and mount the correct directory from /var/lib/lxcfs, but the output of LXCFS in the /sys/devices/system/cpu/cpu*/ directories is itself inconsistent (at least on my AMD-powered test system) with /proc/cpuinfo and with the information directly in /sys/devices/system/cpu/*.

Programs that parse /sys/devices/system/cpu/cpu* in detail (e.g. cpu-info, some versions of htop) are thus still confused for now. I don't see anything we can do about it; this has to be fixed in LXCFS (see e.g. lxc/lxcfs#627).

PhilippWendler · 2024-08-19T07:21:12Z

Thanks, also for explaining the LXCFS problem. I don't think the problem is bad enough that we need a workaround.

PhilippWendler added 6 commits August 16, 2024 09:51

Refactoring: rename variable

a0bff5e

PhilippWendler added the labels "enhancement" and "container" (related to container mode) Aug 16, 2024

PhilippWendler requested a review from schroeding August 16, 2024 07:54

schroeding approved these changes Aug 18, 2024

This was linked to issues Aug 19, 2024

Use LXCFS to provide virtualized /sys as well #1069

Closed

Virtualized /proc/cpuinfo via LXCFS not working #1070

Closed

PhilippWendler merged commit 09baf16 into main Aug 19, 2024
15 checks passed

PhilippWendler deleted the extend-lxcfs-integration branch August 19, 2024 10:55
