-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AMDGPU: add parallel restore of BO content to accelerate restore #2527
base: criu-dev
Are you sure you want to change the base?
Conversation
@Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes: |
@rst0git No problem. We will add proper description in the next version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wweewrwer. Some minor nit picks, but overall the code looks good to me.
@rst0git @avagin @dayatsin-amd hi maintainers, thanks for your prior reviews and comments. We have fixed all the issues, as the following:
Please let us know if you have any further comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wweewrwer
@wweewrwer Would you be able to merge the fixup commits into the previous commits using |
6d2c563
to
0b97df8
Compare
@rst0git Thanks for your comment! I have merged the fixup commits into the previous commits using git rebase. Please let me know if you have any further comments. |
Currently, in the target process, device-related restore operations and other restore operations almost run sequentially. When the target process executes the corresponding CRIU hook functions, it can't perform other restore operations. However, for GPU applications, some device restore operations have no logical dependencies on other common restore operations and can be offloaded to the main CRIU process, allowing the target process to perform other restore operations in parallel. - POST_FORKING *POST_FORKING: Hook to enable the main CRIU process to perform some restore operations of plugins. Signed-off-by: Yanning Yang <[email protected]>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not initialized. However, during the plugin restore procedure, there may be some common file operations used in multiple hooks. This patch moves `cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use `fdstore` to place these file operations. Signed-off-by: Yanning Yang <[email protected]>
cb6b91d
to
37e3813
Compare
Currently, `restore_wait_inprogress_tasks` is a static function and can only be called within `cr-restore.c`. However, to implement parallel restore, amdgpu plugin also needs to check the tasks' state to decide whether to stop the parallel restore server. Therefore, this patch moves the declaration of `restore_wait_inprogress_tasks` to `restore.h` so that it can be called by the plugin. Signed-off-by: Yanning Yang <[email protected]>
Parallel restore needs an interface to know if there is only one process to restore. This patch adds a `has_children` function in `pstree.h`. Signed-off-by: Yanning Yang <[email protected]>
37e3813
to
834c4e6
Compare
When enabling `POST_FORKING`, the target process and the main CRIU process need an IPC interface to communicate and transfer file descriptors. This patch adds a Unix domain TCP socket and stores this socket in `fdstore`. Signed-off-by: Yanning Yang <[email protected]>
Currently the restore of buffer object comsumes a significant amount of time. However, this part has no logical dependencies with other restore operations. This patch introduce some structures and some helper functions for the target process to offload this task to the main CRIU process. Signed-off-by: Yanning Yang <[email protected]>
834c4e6
to
d533e0c
Compare
@rst0git @avagin We have pushed the V4 version of the PR, completing all mentioned issues since the last version. Specifically, we: (1) support multiple commands (from a single process), (2) support multiple processes restore, and (3) fix other minor issues mentioned. Details:
We have performed all the tests with the above changes. The PR can still bring 31% decrease for the restore latency in the case of single process, and achieves the same results for mutlti-process scenarios. Please let me know if you have any further comments. |
d533e0c
to
1f48cd3
Compare
This patch implements the entire logic to enable the offloading of buffer object content restoration. It has two parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the `POST_FORKING` hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process. Signed-off-by: Yanning Yang <[email protected]>
Signed-off-by: Yanning Yang <[email protected]>
1f48cd3
to
c7ca1b3
Compare
Have you investigated other approaches of parallel restoring of BO? For example, it is possible to fork a thread and restoring BO asynchronously in context of its process. In this case, two BO will be restored concurrently. |
I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression. Thank you for this patch! |
Sure. Thanks! |
Dear David @dayatsin-amd , just want to know are there any progress/results about the internal regression test. Thank you again for the assistance and happy new year btw :) |
Everything what is happening in the restore blob should be fast. All mappings are restored before switching into the restore blob. There, the restored mappings are just remapped to proper addresses. I am still not convinced that the idea of restoring buffer objects from the main process is really what we need here.. I can miss something, but I want to see a clear explanation with numbers why the proposed solution is a valuable one. Additionally, I see two potential issues:
|
TL;DR:
This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.
The target issue:
In the current restore procedure of AMDGPU applications, the content of the AMDGPU buffer object (BO) is restored synchronously in
CR_PLUGIN_HOOK__RESTORE_EXT_FILE
. This procedure usually takes a significant amount of time, and during this time the target process cannot perform any other restore operations. However, this restoration has no logical dependencies with other restore operations. Parallelizing this part with other restore operations can speed up the restoration.The parallel restore approach in this PR:
The core idea of these patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process refers to the parent process, and the target process refers to the child process created during the fork). To achieve this, we introduce a new hook,
CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS
, in the main CRIU process. For the AMDGPU plugin, the target process will no longer restore BO contents inCR_PLUGIN_HOOK__RESTORE_EXT_FILE
and just send the relevant BOs to the main CRIU process. the main CRIU process will receive the corresponding BOs inCR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS
and begin the restoration. Meanwhile, the target process can continue with other parts of the restoration without being blocked by the BO content restoration. The full design of the idea can also be referred with the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.Tests:
We evaluated the performance according to the following settings. The results show that parallel restore can speed up by 34.3% when images cached in the page cache, and 7.6% when restoring from disk.
Results:
Settings:
CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Memory: DDR4, 2x8GB
GPU: AMD MI50
Disk: 512GB, Samsung SSD 860
Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1
Example program:
example.py
: a ResNet18 application. Enter 'y' to exit, or press any other key to perform inference.Steps:
Install CRIU
Follow the standard CRIU installation process. Ensure you set the environment variable
CRIU_LIBS_DIR
to theplugins/amdgpu
path.Dump checkpoint image
Restore from disk
Test for sequential restore:
Test for parallel restore:
Restore from page cache
Install vmtouch for caching images:
Test: