Bug summary
I've encountered unexpected behavior when trying to restart an interrupted FP (first-principles labeling) stage in iter0. Instead of continuing with the unfinished tasks, the restart seems to create a new scratch directory and start from scratch.
Current Behavior:
Initial run:
Successfully completed training and exploration stages
Generated 200 candidates for FP calculation
Due to HPC resource limitations, only the first batch (50 tasks) was submitted
These 50 tasks completed successfully
Attempted restart:
Command used: dpgen run param.json machine.json
Tasks were submitted successfully
However, instead of continuing from the previous state, a new scratch directory was created
Previous progress (50 completed tasks) appears to be ignored
Evidence:
Task completion monitoring script output:
First run (Directory ID: 7fd10183c41f2c7c92a969ac0ceffd9361de0d31):
After restart (Directory ID: 989863a025b55a9ba0bba9363fb8d70d7532605b):
Questions:
Is this the expected behavior for the restart process?
Is there a specific flag or different command that should be used to continue from the previous state?
DP-GEN Version
0.12.1
Platform, Python Version, Remote Platform, etc
No response
Input Files, Running Commands, Error Log, etc.
No Inputs.
Steps to Reproduce
See Bug summary.
Further Information, Files, and Links
No response
I found the cause. For the second submission I changed the wall-time in the machine.json file from 48:00:00 to 24:00:00. As a result, dpgen could not find the original submission_hash, so it created a new folder and started from scratch. After I restored the original wall-time setting, the restart resumed from the correct folder.
However, I find this design too rigid: a restart only works if the job submission settings are exactly the same as before.
If limited computing resources force me to change some of those settings, I cannot recover the unfinished tasks, and the work already completed is wasted.
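To illustrate why a small edit to machine.json can lead to a different directory, here is a minimal sketch of content-based hashing. It is not DP-GEN's actual code: the function `settings_hash`, the dictionary layout, and the field names are all hypothetical, and the choice of SHA-1 is an assumption based only on the 40-character hexadecimal directory IDs shown above.

```python
import hashlib
import json


def settings_hash(settings: dict) -> str:
    """Return a SHA-1 hex digest of a settings dict (hypothetical helper).

    Keys are sorted so that the digest depends only on the content,
    not on the order in which the keys were written.
    """
    payload = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


# Illustrative settings only; these field names are assumptions,
# not the real machine.json schema.
first_run = {"stage": "fp", "resources": {"wall_time": "48:00:00"}}
restart = {"stage": "fp", "resources": {"wall_time": "24:00:00"}}
restored = {"stage": "fp", "resources": {"wall_time": "48:00:00"}}

# Changing only the wall-time yields a different digest, so a lookup
# keyed on the old digest would not find the earlier directory.
print(settings_hash(first_run) == settings_hash(restart))   # False

# Restoring the original value reproduces the original digest.
print(settings_hash(first_run) == settings_hash(restored))  # True
```

If the submission hash is computed along these lines, any hashed field that changes between runs would produce a new identifier, which matches the behavior I observed when only the wall-time was edited.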