AMDGPU: add parallel restore of BO content to accelerate restore #2527

Open: wants to merge 8 commits into base: criu-dev

wweewrwer

TL;DR:

This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.

The target issue:

In the current restore procedure for AMDGPU applications, the content of each AMDGPU buffer object (BO) is restored synchronously in CR_PLUGIN_HOOK__RESTORE_EXT_FILE. This step usually takes a significant amount of time, during which the target process cannot perform any other restore operations. However, it has no logical dependencies on the other restore operations, so running it in parallel with them can speed up the restore.

The parallel restore approach in this PR:

The core idea of this patch series is to offload the restore of BO content from the target process to the main CRIU process (the main CRIU process is the parent process, and the target process is the child process created by the fork). To achieve this, we introduce a new hook, CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, that runs in the main CRIU process. With the AMDGPU plugin, the target process no longer restores BO contents in CR_PLUGIN_HOOK__RESTORE_EXT_FILE; it only sends the relevant BOs to the main CRIU process. The main CRIU process receives these BOs in CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS and begins restoring their content. Meanwhile, the target process continues with the other parts of the restore without being blocked by BO content restoration. The full design is described in the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.
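
The offloading idea can be illustrated with a minimal Python sketch. It is not CRIU code: a helper thread stands in for the main CRIU process, a queue stands in for the IPC channel, and every name in it (`restore_bo_content`, `offloaded_restore`) is hypothetical.

```python
import queue
import threading


def restore_bo_content(bo):
    """Stand-in for copying one buffer object's saved content back to the GPU."""
    return (bo, "restored")


def offloaded_restore(bos, other_steps):
    """Hand BO restores to a helper and keep running the other restore steps."""
    work = queue.Queue()
    restored = []

    def helper():
        # Plays the role of the main CRIU process: restore BOs as they arrive.
        while True:
            bo = work.get()
            if bo is None:  # sentinel: no more BOs will be sent
                break
            restored.append(restore_bo_content(bo))

    worker = threading.Thread(target=helper)
    worker.start()

    for bo in bos:
        work.put(bo)  # "send" the BO without waiting for its content
    work.put(None)

    # The target process continues with its other restore steps meanwhile.
    done_steps = [step() for step in other_steps]

    # This sketch waits for the helper before reporting completion.
    worker.join()
    return restored, done_steps
```

In this model the total time approaches the longer of the two activities rather than their sum, which is the effect the PR aims for.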

Tests:

We evaluated performance under the settings listed below. The results show that parallel restore reduces restore time by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.

Results:

| | From disk | From page cache |
|---|---|---|
| Sequential restore | 1728 ms | 254 ms |
| Parallel restore | 1596 ms | 167 ms |
| Speed-up | 7.6% | 34.3% |
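
The speed-up figures follow from the measured times if speed-up is taken as the reduction in restore time relative to the sequential run. That definition is an assumption, but it reproduces the reported percentages, as this short check shows (the function name is illustrative):

```python
def speedup_percent(sequential_ms, parallel_ms):
    """Reduction in restore time as a percentage of the sequential time."""
    return (sequential_ms - parallel_ms) / sequential_ms * 100


print(f"From disk: {speedup_percent(1728, 1596):.1f}%")
print(f"From page cache: {speedup_percent(254, 167):.1f}%")
```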

Settings:

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Memory: DDR4, 2x8GB

GPU: AMD MI50

Disk: 512GB, Samsung SSD 860

Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1

Example program:

example.py: a ResNet18 application. Enter 'y' to exit, or press any other key to perform inference.

import time
import sys
import torch
import torchvision.models as models
import torchvision.transforms as transforms

# Inference only: gradients are not needed
torch.set_grad_enabled(False)

# ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type
device = "cuda:0"

# ResNet18 with the default pretrained weights, moved to the GPU
model = models.resnet18(weights='DEFAULT')
model = model.to(device)
model.eval()

# Random input batch, normalized with the usual ImageNet mean and std
batch_size = 1
channels = 3
height = 224
width = 224
input_tensor = torch.randn(batch_size, channels, height, width)
preprocess = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_tensor)

# Each input line other than "y" triggers one timed inference
while input() != "y":
    st = time.time()
    input_tensor = input_tensor.to(device)
    output = model(input_tensor)
    output = output.to("cpu")
    _, predicted_idx = torch.max(output, 1)
    torch.cuda.synchronize()
    ed = time.time()
    print("test time:", ed - st)
    sys.stdout.flush()

Steps:

  1. Install CRIU

    Follow the standard CRIU installation process. Ensure you set the environment variable CRIU_LIBS_DIR to the plugins/amdgpu path.

  2. Dump checkpoint image

    #In one shell
    python3 example.py
    #In another shell
    mkdir -p /tmp/criu-dump
    criu dump -t $(pgrep python3) -D /tmp/criu-dump -j --file-locks
    
  3. Restore from disk

    Test for sequential restore:

    #Clear page cache
    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    criu restore -D /tmp/criu-dump -j --file-locks
    cat stats-restore | crit decode --pretty | grep restore_time
    

    Test for parallel restore:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat stats-restore | crit decode --pretty | grep restore_time
    
  4. Restore from page cache

    Install vmtouch for caching images:

    sudo apt install vmtouch
    

    Test:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    #Cache image in memory
    vmtouch -l /tmp/criu-dump
    #Warm up environment 
    criu restore -D /tmp/criu-dump -j --file-locks
    #Begin to Test
    criu restore -D /tmp/criu-dump -j --file-locks
    cat stats-restore | crit decode --pretty | grep restore_time
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat stats-restore | crit decode --pretty | grep restore_time
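
As an alternative to `grep`, the decoded statistics can be parsed as JSON. The sketch below is hedged: the PR does not show the layout of the `crit decode` output, so the helper searches the whole decoded tree for `restore_time` instead of assuming where it sits. The function names are hypothetical.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a decoded JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            else:
                yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def restore_times(decoded_text):
    """Return all restore_time values found in the text printed by crit decode."""
    return list(find_key(json.loads(decoded_text), "restore_time"))
```

To use it, pass the text printed by `cat stats-restore | crit decode --pretty` to `restore_times`, for example by reading it from standard input.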
    

criu/crtools.c (outdated, resolved)
criu/cr-restore.c (outdated, resolved)
criu/cr-restore.c (outdated, resolved)
@Ddnirvana

Thanks for the above comments @avagin @rst0git , we are fixing and polishing the PR. Will update ASAP.

@rst0git
Member

rst0git commented Nov 25, 2024

@Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@Ddnirvana

> @Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@rst0git No problem. We will add proper description in the next version.

Contributor

@dayatsin-amd left a comment

Thank you @wweewrwer. Some minor nit picks, but overall the code looks good to me.

plugins/amdgpu/amdgpu_socket_utils.c (outdated, resolved)
plugins/amdgpu/amdgpu_socket_utils.c (outdated, resolved)
plugins/amdgpu/amdgpu_socket_utils.c (outdated, resolved)
plugins/amdgpu/amdgpu_socket_utils.c (outdated, resolved)
plugins/amdgpu/amdgpu_socket_utils.c (outdated, resolved)
@wweewrwer
Author

@rst0git @avagin @dayatsin-amd Hi maintainers, thanks for your prior reviews and comments. We have fixed all the issues, as follows:

  1. Use the proper APIs to allocate (xmalloc, etc.)
  2. Enable the optimizations by default
  3. Change the name of the hook
  4. Fix the issues with running in Podman containers
  5. Other fixes (line width, comments, etc.)
  6. Add descriptions in the README to explain the optimizations.

Please let us know if you have any further comments

Contributor

@dayatsin-amd left a comment

Thank you @wweewrwer

@rst0git
Member

rst0git commented Nov 28, 2024

@wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase?
https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@wweewrwer
Author

wweewrwer commented Nov 29, 2024

> @wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase? https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@rst0git Thanks for your comment! I have merged the fixup commits into the previous commits using git rebase. Please let me know if you have any further comments.

criu/cr-restore.c (outdated, resolved)
Currently, in the target process, device-related restore operations and
other restore operations run almost sequentially. While the target
process executes the corresponding CRIU hook functions, it cannot perform
other restore operations. However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be offloaded to the main CRIU process, allowing the
target process to perform other restore operations in parallel.

- POST_FORKING

*POST_FORKING: Hook to enable the main CRIU process to perform some
restore operations of plugins.

Signed-off-by: Yanning Yang <[email protected]>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not
initialized. However, during the plugin restore procedure, there may be
some common file operations used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use
`fdstore` to place these file operations.

Signed-off-by: Yanning Yang <[email protected]>
@wweewrwer force-pushed the parallel_restore branch 2 times, most recently from cb6b91d to 37e3813 (December 5, 2024 13:47)
plugins/amdgpu/README.md (outdated, resolved)
Currently, `restore_wait_inprogress_tasks` is a static function and can
only be called within `cr-restore.c`. However, to implement parallel
restore, the amdgpu plugin also needs to check the tasks' state to decide
whether to stop the parallel restore server. Therefore, this patch moves
the declaration of `restore_wait_inprogress_tasks` to `restore.h` so
that the plugin can call it.

Signed-off-by: Yanning Yang <[email protected]>
Parallel restore needs an interface to know if there is only one process
to restore. This patch adds a `has_children` function in `pstree.h`.

Signed-off-by: Yanning Yang <[email protected]>
When enabling `POST_FORKING`, the target process and the main CRIU
process need an IPC interface to communicate and transfer file
descriptors. This patch adds a stream-oriented Unix domain socket and
stores this socket in `fdstore`.

Signed-off-by: Yanning Yang <[email protected]>
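
The commit message above relies on a Unix domain socket being able to carry file descriptors between processes. The following sketch shows that mechanism with Python's `socket.send_fds` and `socket.recv_fds` (Python 3.9 or later, Unix only). It does not reflect the plugin's actual C implementation; the helper names and the `b"bo"` tag are made up for illustration.

```python
import os
import socket


def send_fd(sock, fd, tag=b"bo"):
    """Send one file descriptor plus a short tag over a Unix domain socket."""
    socket.send_fds(sock, [tag], [fd])


def recv_fd(sock):
    """Receive the tag and a duplicate of the sender's file descriptor."""
    tag, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return tag, fds[0]


def demo():
    # A connected pair of stream sockets stands in for the two processes.
    target, main = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()

    send_fd(target, read_end)          # "target process" hands over the fd
    tag, received = recv_fd(main)      # "main process" gets its own copy

    os.write(write_end, b"ping")
    data = os.read(received, 4)        # the received fd refers to the same pipe

    for fd in (read_end, write_end, received):
        os.close(fd)
    target.close()
    main.close()
    return tag, data
```

The receiver gets a new descriptor number that refers to the same open file, which is what lets one process operate on a file that another process opened.
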
Currently, restoring buffer objects consumes a significant amount of
time. However, this part has no logical dependencies on other restore
operations. This patch introduces some structures and helper functions
that let the target process offload this task to the main CRIU
process.

Signed-off-by: Yanning Yang <[email protected]>
@wweewrwer
Author

@rst0git @avagin
Dear maintainers,

We have pushed version 4 (V4) of the PR, which addresses all issues raised since the last version. Specifically, we: (1) support multiple commands (from a single process), (2) support restoring multiple processes, and (3) fix the other minor issues mentioned.

Details:

  • Replaced UDP with TCP to distinguish messages between different processes and commands.
  • Multiple-command support: Instead of receiving a command only once, the hook function now launches a dedicated thread that keeps receiving commands until all tasks finish their restore stage. The main thread in this hook uses restore_wait_inprogress_tasks to determine when the tasks have finished. Once they have, it sends an exit command to the parallel restore thread so that it stops receiving commands.
  • Multi-process support: When there are multiple processes, they are already restored in parallel (by different processes) by default, so they do not benefit from the parallel optimization. Therefore, we introduce a flag (called parallel_disabled) that enables the optimization only for the single-process case (the common case) as a fast path, and falls back to the original restore otherwise.
  • Multi-GPU parallel restore support: In the original restore, when a process has multiple GPUs, the content on each GPU is restored in parallel. This version supports multi-GPU parallel restore by reusing the original design.
  • Other issues: Big thanks to Andrei and Radostin for the other issues and suggestions, all of which have been fixed accordingly.
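
The multiple-command design in the list above can be sketched as a receiver thread that reads framed commands from a stream socket until it sees an exit command. This is a simplified model: the 4-byte length prefix, the command strings, and all function names are assumptions for illustration and are not taken from the plugin.

```python
import socket
import struct
import threading


def send_msg(sock, payload):
    """Send one command, prefixed with its length as a 4-byte big-endian integer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes; a stream socket may return fewer per call."""
    buf = b""
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one length-prefixed command."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def command_loop(sock, handled):
    """Keep receiving commands until an exit command arrives."""
    while True:
        cmd = recv_msg(sock)
        if cmd == b"exit":
            break
        handled.append(cmd)  # a real server would restore BO content here


def demo():
    sender, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    handled = []
    worker = threading.Thread(target=command_loop, args=(server, handled))
    worker.start()

    send_msg(sender, b"restore bo 0")
    send_msg(sender, b"restore bo 1")
    # In the PR the hook's main thread decides when to send exit, after
    # restore_wait_inprogress_tasks reports completion; this sketch sends it directly.
    send_msg(sender, b"exit")

    worker.join()
    sender.close()
    server.close()
    return handled
```

A stream socket has no message boundaries of its own, so some framing such as the length prefix used here is needed to tell consecutive commands apart.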

We have rerun all the tests with the above changes. The PR still reduces restore latency by 31% in the single-process case and achieves the same results in multi-process scenarios.

Please let me know if you have any further comments.

@wweewrwer
Author

@rst0git @avagin Just a friendly reminder about the updates in this PR (in case maintainers miss the prior notifications)

This patch implements the entire logic to enable the offloading of
buffer object content restoration. It has two parts: the first replaces
the restoration of buffer objects in the target process by sending a
parallel restore command to the main CRIU process; the second implements
the `POST_FORKING` hook in the amdgpu plugin to enable buffer object
content restoration in the main CRIU process.

Signed-off-by: Yanning Yang <[email protected]>
@avagin
Member

avagin commented Dec 16, 2024

Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore BOs asynchronously in the context of its process. In this case, two BOs will be restored concurrently.

@wweewrwer
Author

> Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore BOs asynchronously in the context of its process. In this case, two BOs will be restored concurrently.

Yes, we have investigated the approach of forking a thread in the background, but it cannot work as it conflicts with the restore logic of CRIU.

Specifically, when CRIU tries to restore its memory state, it will unmap all old mappings. However, some mappings may be needed by the background thread for BO restoring. Therefore, a thread can only run in parallel with shorter procedures (possibly before entering the restorer blob), while offloading the restore of BO content to a new process (in this PR) can be parallelized with almost the entire restore procedure.

The figure below illustrates the issue: BO restore must finish before the CPU memory state is restored:

[Figure: image 71b554be0823fc028c0fd9a8f9c554f]

@wweewrwer
Author

@rst0git @avagin Dear maintainers/reviewers, just want to know if there are any further issues/concerns about the latest version?

@dayatsin-amd
Contributor

I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression.

Thank you for this patch!

@wweewrwer
Author

> I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression.
>
> Thank you for this patch!

Sure. Thanks!

@Ddnirvana

> I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression.
>
> Thank you for this patch!

Dear David @dayatsin-amd, we just wanted to ask whether there is any progress or any results from the internal regression test. Thank you again for the assistance, and happy new year btw :)

@avagin closed this Jan 7, 2025
@avagin reopened this Jan 7, 2025
@avagin
Member

avagin commented Jan 7, 2025

> > Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore BOs asynchronously in the context of its process. In this case, two BOs will be restored concurrently.
>
> Yes, we have investigated the approach of forking a thread in the background, but it cannot work as it conflicts with the restore logic of CRIU.
>
> Specifically, when CRIU tries to restore its memory state, it will unmap all old mappings. However, some mappings may be needed by the background thread for BO restoring. Therefore, a thread can only run in parallel with shorter procedures (possibly before entering the restorer blob), while offloading the restore of BO content to a new process (in this PR) can be parallelized with almost the entire restore procedure.

Everything that happens in the restore blob should be fast. All mappings are restored before switching into the restore blob. There, the restored mappings are just remapped to the proper addresses. I am still not convinced that restoring buffer objects from the main process is really what we need here. I may be missing something, but I want to see a clear explanation, with numbers, of why the proposed solution is valuable.

Additionally, I see two potential issues:

  • Sequential Restoration: This change seems to introduce a new bottleneck by restoring buffer objects sequentially. Could this cause performance problems for workloads with many buffer objects across multiple processes? It would be helpful to understand how this approach scales.
  • Plugin Hook Execution: Running the plugin hook in the main CRIU process for an extended period and making it dependent on other processes is problematic. This deviates from the expectation that multiple plugins should operate independently with equal capabilities.
