seize: enable support for frozen containers #2514

rst0git
Member

@rst0git rst0git commented Nov 4, 2024

Container runtimes like CRI-O and containerd use the freezer cgroup to create a consistent snapshot of container rootfs changes. In this case, the container is frozen before CRIU is invoked. Once CRIU completes successfully, a copy of the container rootfs diff is saved, and the container is then unfrozen. To support GPU checkpointing with these runtimes, we need to unfreeze the cgroup during the dump and restore it to its original state at the end.
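
The unfreeze-and-restore sequence described above can be sketched roughly as follows. This is an illustration only, not the code from this patch: it assumes a cgroup v1 style `freezer.state` file holding `THAWED`, `FREEZING`, or `FROZEN` (cgroup v2 uses `cgroup.freeze` instead), and the helper names `read_freezer_state`, `write_freezer_state`, `thaw_for_dump`, and `restore_after_dump` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the state line from a freezer.state-like file.
 * Returns 0 on success, -1 on error. The trailing newline is stripped. */
int read_freezer_state(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Hypothetical helper: write a new state ("THAWED" or "FROZEN"). */
int write_freezer_state(const char *path, const char *state)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(state, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Remember whether the cgroup was frozen (or freezing), then thaw it so
 * that tasks can be stopped by other means during the dump. */
int thaw_for_dump(const char *path, int *was_frozen)
{
	char state[32];

	if (read_freezer_state(path, state, sizeof(state)))
		return -1;
	*was_frozen = strcmp(state, "THAWED") != 0;
	if (!*was_frozen)
		return 0;
	return write_freezer_state(path, "THAWED");
}

/* Put the cgroup back into the state the container runtime left it in. */
int restore_after_dump(const char *path, int was_frozen)
{
	return was_frozen ? write_freezer_state(path, "FROZEN") : 0;
}
```

A caller would invoke `thaw_for_dump()` before seizing the tasks and `restore_after_dump()` once the dump has finished, so the runtime still finds the container frozen afterwards.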

When the CUDA plugin is installed, container checkpointing with Kubernetes fails, even for containers that don't use GPUs. This patch aims to resolve this issue.

Fixes: #2508

@rst0git rst0git requested a review from avagin November 4, 2024 20:06
@rst0git rst0git marked this pull request as ready for review November 4, 2024 20:07
@rst0git rst0git requested a review from adrianreber November 4, 2024 20:08
@rst0git rst0git force-pushed the 2024-11-04-seize-checkpointing-freezen-containers branch from 70b3ad9 to 511d073 Compare November 4, 2024 20:08
@adrianreber
Member

Doesn't this break the expectations of the container engines? You wrote that they freeze the container to avoid changes to the container file system. Does the container now continue to run with your change?

@rst0git
Member Author

rst0git commented Nov 5, 2024

> Does the container now continue to run with your change?

No. During checkpointing, we seize processes without using the freezer cgroup (see #2475 and #2470).
After `criu dump` completes, the container should remain in a frozen state.

@rst0git rst0git force-pushed the 2024-11-04-seize-checkpointing-freezen-containers branch 2 times, most recently from d39fc15 to 979e277 Compare November 7, 2024 12:06
@avagin
Member

avagin commented Nov 7, 2024

> Doesn't this break the expectations of the container engines. You wrote they freeze the container to avoid changes to the container file-system. Does the container now continue to run with your change?

Strictly speaking, this expectation was not correct even before this change. CRIU modifies the file system while dumping processes; for example, it creates ghost files.

I think the right expectation here is that file systems are not changed after the processes have been dumped, and that expectation isn't affected by this change.

@avagin
Member

avagin commented Nov 8, 2024

LGTM. Thanks.

Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.

However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads.  To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.

To reflect this new behavior, the following changes are applied:
 - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
 - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
 - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`

Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.
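
The note above can be illustrated with a small decision helper. This is a sketch under stated assumptions, not the logic in `criu/seize.c`: the enum, the function name `pick_stop_method`, and the `have_freezer_cgroup` parameter (meaning a freezer cgroup was supplied for the dump) are hypothetical and only show the intended choice between the two mechanisms.

```c
#include <stdbool.h>

/* Hypothetical labels for the two ways of keeping tasks from running. */
enum stop_method {
	STOP_BY_FREEZER,   /* stands in for freeze_processes() */
	STOP_BY_INTERRUPT, /* stands in for compel_interrupt_task() per task */
};

/* Pick how tasks are stopped during dump. The freezer cgroup is used only
 * when one is available and interrupt-only mode has not been requested.
 * (Falling back to interrupts when no freezer cgroup is available is an
 * assumption of this sketch.) */
enum stop_method pick_stop_method(bool compel_interrupt_only_mode,
				  bool have_freezer_cgroup)
{
	if (have_freezer_cgroup && !compel_interrupt_only_mode)
		return STOP_BY_FREEZER;
	return STOP_BY_INTERRUPT;
}
```

With interrupt-only mode set, the freezer cgroup is never used to stop tasks, which matters here because `cuda-checkpoint` cannot lock frozen threads.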

Fixes: checkpoint-restore#2508

Signed-off-by: Radostin Stoyanov <[email protected]>
@rst0git rst0git force-pushed the 2024-11-04-seize-checkpointing-freezen-containers branch from 979e277 to 495e39e Compare November 8, 2024 13:44
@avagin avagin merged commit 31b38d6 into checkpoint-restore:criu-dev Nov 12, 2024
38 of 41 checks passed
@rst0git rst0git deleted the 2024-11-04-seize-checkpointing-freezen-containers branch November 12, 2024 09:20
Successfully merging this pull request may close these issues.

seccomp: Can't find entry on tid_real