[BUG] FP Process Restart Creates New Directory Instead of Continuing Unfinished Tasks #1678

Open
chenggoj opened this issue Nov 23, 2024 · 1 comment
Labels
bug Something isn't working

Bug summary

I encountered unexpected behavior when restarting an interrupted FP (first-principles labeling) stage in iter0. Instead of continuing with the unfinished tasks, the restart created a new scratch directory and started from scratch.
Current Behavior:

Initial run:

Successfully completed training and exploration stages
Generated 200 candidates for FP calculation
Due to HPC resource limitations, only the first batch (50 tasks) was submitted
These 50 tasks completed successfully

Attempted restart:

Command used: dpgen run param.json machine.json
Tasks were submitted successfully
However, instead of continuing from the previous state, a new scratch directory was created
Previous progress (50 completed tasks) appears to be ignored

Evidence:
Task completion monitoring script output:
First run (Directory ID: 7fd10183c41f2c7c92a969ac0ceffd9361de0d31):

Total directories scanned: 200
Completed tasks: 50
Incomplete tasks: 150
Completion rate: 25.00%

After restart (Directory ID: 989863a025b55a9ba0bba9363fb8d70d7532605b):

Total directories scanned: 200
Completed tasks: 0
Incomplete tasks: 200
Completion rate: 0.00%
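
The monitoring script itself is not included in the report. As a rough illustration only, a summary like the ones above could be produced along these lines. The layout (one subdirectory per task under a scratch directory) and the completion marker file name `DONE` are assumptions made for this sketch, not details taken from the issue or from DP-GEN.

```python
from pathlib import Path


def summarize(root: Path, marker: str = "DONE") -> dict:
    """Count task subdirectories under `root` and how many contain `marker`.

    `marker` is a hypothetical completion-flag file name; adjust it to
    whatever file signals a finished task in your setup.
    """
    task_dirs = [p for p in root.iterdir() if p.is_dir()]
    completed = sum(1 for p in task_dirs if (p / marker).exists())
    total = len(task_dirs)
    rate = 100.0 * completed / total if total else 0.0
    return {
        "total": total,
        "completed": completed,
        "incomplete": total - completed,
        "rate": rate,
    }


def report(summary: dict) -> str:
    """Format the summary in the same shape as the output shown above."""
    return "\n".join(
        [
            f"Total directories scanned: {summary['total']}",
            f"Completed tasks: {summary['completed']}",
            f"Incomplete tasks: {summary['incomplete']}",
            f"Completion rate: {summary['rate']:.2f}%",
        ]
    )
```

Calling `print(report(summarize(Path("path/to/scratch_dir"))))` on each of the two directories would make the difference between the first run and the restart visible at a glance.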

Questions:

Is this the expected behavior for the restart process?
Is there a specific flag or different command that should be used to continue from the previous state?

DP-GEN Version

0.12.1

Platform, Python Version, Remote Platform, etc

No response

Input Files, Running Commands, Error Log, etc.

No Inputs.

Steps to Reproduce

See Bug summary.

Further Information, Files, and Links

No response

chenggoj added the bug label Nov 23, 2024

chenggoj commented Nov 23, 2024

I found the cause. I had changed the wall time (from 48:00:00 to 24:00:00) in machine.json for the second submission. As a result, dpgen could not find the original submission_hash, so it created a new folder and started from scratch. After I restored the original wall-time setting, the restart resumed from the correct folder.

However, I find this design too rigid: a restart requires the submission settings to be exactly the same as before.
If limited computing resources force me to change any of those settings, I cannot resume the unfinished tasks, and the work already completed is wasted.
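
The behavior described in this comment is what one would expect if the directory ID is a hash of the full submission settings. The following is a minimal sketch of that idea, not the actual dpgen or dpdispatcher code: the function `submission_hash`, the settings dictionaries, and their key names are all hypothetical. SHA-1 is used here only because the directory IDs above are 40 hexadecimal characters long.

```python
import hashlib
import json


def submission_hash(settings: dict) -> str:
    """Return a SHA-1 hex digest of the settings, serialized with sorted keys."""
    payload = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


# Hypothetical settings; the key names are illustrative, not DP-GEN's schema.
first_run = {"command": "fp", "wall_time": "48:00:00"}
restart = {"command": "fp", "wall_time": "24:00:00"}

# Changing only the wall time yields a different hash, hence a new directory.
print(submission_hash(first_run) == submission_hash(restart))  # False

# Identical settings reproduce the original hash, so the old directory is found.
print(submission_hash(first_run) == submission_hash(dict(first_run)))  # True
```

Under this assumption, any field that feeds into the hash has to stay unchanged across restarts, including fields such as wall time that do not affect the results of the tasks themselves.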
