[BUG] FP Process Restart Creates New Directory Instead of Continuing Unfinished Tasks #1678

chenggoj · 2024-11-23T15:36:43Z

Bug summary

I encountered unexpected behavior when restarting an interrupted FP (first-principles labeling) stage in iter0. Instead of continuing with the unfinished tasks, the restart creates a new scratch directory and starts over.
Current Behavior:

Initial run:

Successfully completed training and exploration stages
Generated 200 candidates for FP calculation
Due to HPC resource limitations, only the first batch (50 tasks) was submitted
These 50 tasks completed successfully

Attempted restart:

Command used: dpgen run param.json machine.json
Tasks were submitted successfully
However, instead of continuing from the previous state, a new scratch directory was created
Previous progress (50 completed tasks) appears to be ignored

Evidence:
Task completion monitoring script output:
First run (Directory ID: 7fd10183c41f2c7c92a969ac0ceffd9361de0d31):

Total directories scanned: 200
Completed tasks: 50
Incomplete tasks: 150
Completion rate: 25.00%

After restart (Directory ID: 989863a025b55a9ba0bba9363fb8d70d7532605b):


Total directories scanned: 200
Completed tasks: 0
Incomplete tasks: 200
Completion rate: 0.00%

Questions:

Is this the expected behavior for the restart process?
Is there a specific flag or different command that should be used to continue from the previous state?

DP-GEN Version

0.12.1

Platform, Python Version, Remote Platform, etc

No response

Input Files, Running Commands, Error Log, etc.

No Inputs.

Steps to Reproduce

See Bug summary.

Further Information, Files, and Links

No response


chenggoj · 2024-11-23T19:44:01Z

I found the reason. I changed the wall-time (from 48:00:00 to 24:00:00) in machine.json for the second submission. As a result, dpgen could not find the original submission_hash, so it created a new folder and started from scratch. When I restored the original wall-time setting, the restart picked up the correct folder again.

However, I find this design too rigid: when restarting, the submission settings must be exactly the same as in the original run.
If I am forced to change some submission settings because of limited computing resources, I cannot recover the unfinished tasks, and the work already completed is wasted.
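
To illustrate why a changed wall-time leads to a different directory, here is a minimal sketch. It assumes the submission hash is a SHA-1 digest over a serialized description of the submission that includes the resource settings. The directory IDs above are 40 hex characters, which is consistent with SHA-1, but the function name `submission_hash`, the payload layout, and the task commands are hypothetical and not taken from the dpgen or dpdispatcher source.

```python
import hashlib
import json


def submission_hash(work_base, resources, task_commands):
    """Return a SHA-1 hex digest over the full submission description.

    Hypothetical helper: the payload layout is an assumption made for
    illustration, not the actual dpdispatcher serialization.
    """
    payload = {
        "work_base": work_base,
        "resources": resources,
        "tasks": task_commands,
    }
    # sort_keys makes the serialization deterministic for equal dicts.
    text = json.dumps(payload, sort_keys=True)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical task commands standing in for the 200 FP candidates.
tasks = [f"run_fp task.{i:06d}" for i in range(200)]
work_base = "iter.000000/02.fp"  # illustrative path

first = submission_hash(work_base, {"wall_time": "48:00:00"}, tasks)
second = submission_hash(work_base, {"wall_time": "24:00:00"}, tasks)
restored = submission_hash(work_base, {"wall_time": "48:00:00"}, tasks)

print(first == second)    # False: changing the wall-time changes the hash
print(first == restored)  # True: restoring the setting recovers the hash
```

Under this assumption, any edit to the hashed settings yields a new ID, so the restart looks for a directory that does not exist yet and creates it. That matches the observation that restoring the original wall-time makes the restart find the old folder.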

chenggoj added the bug Something isn't working label Nov 23, 2024
