ModelCheckpoint Doesn't Delete Old Best Checkpoints When Resuming Training #18687
Comments
@danielzeng-gt Thanks for submitting the issue. I read your description multiple times but I don't understand the problem. Can you try to formulate it with an example? Is it related to #17912?
Hey Adrian, thanks for the prompt response! I generated an example with GPT4, and I read over it and it is quite accurate in describing the problem. Please let me know if it's still confusing:
Example:
Suppose Alice is training a neural network to classify images of cats and dogs on a cloud-based preemptible instance. She's interested in keeping two kinds of checkpoints:
To achieve this, Alice uses two
Training Run 1:
Training Resumption:
Expected Behavior:
Actual Behavior:
This indicates that the
Implication:
Conclusion:
The bug seems to arise from a state restoration issue in the
I hit the same issue. I understand that fixing it may be a breaking change; can we add an option to handle that?
Bug description
Description:
When using `ModelCheckpoint` with the parameters `top_k=1` and `monitor='val_loss'` during a single training run, the behavior is as expected and only one 'best_val_confidence-epoch...' checkpoint is retained.
However, in the context of cloud-based training where instances may be preempted or restarted from a checkpoint:
`ModelCheckpoint`.
`ModelCheckpoint` state was restored incorrectly.
`ModelCheckpoint` creates a new checkpoint but fails to delete the old one.
Thus, if there's a single preemption/restart during the training run, we end up with two 'best_val_loss' checkpoints.
It should be noted that we load and write checkpoints with `fsspec`, which allows checkpoints to be written to and loaded directly from Google Cloud Storage (GCS).
Code Details:
There are currently two `ModelCheckpoint` callbacks in use:
The first is for saving the latest checkpoint:
The second is for saving the best validation loss checkpoint:
Environment:
What version are you seeing the problem on?
v1.9
How to reproduce the bug
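A minimal, library-independent sketch of the reported behavior, assuming the retention logic can only delete files it is currently tracking. `TopKTracker` and `run` are hypothetical names made up for illustration; this is neither Lightning's implementation nor the reporter's original script. It only models why a callback whose tracking state is not restored after a restart would leave the pre-restart "best" checkpoint on disk.

```python
import os
import tempfile


class TopKTracker:
    """Toy stand-in for a 'keep the k best checkpoints' callback (hypothetical)."""

    def __init__(self, dirpath, k=1):
        self.dirpath = dirpath
        self.k = k
        self.best = {}  # checkpoint path -> monitored val_loss (lower is better)

    def state_dict(self):
        return {"best": dict(self.best)}

    def load_state_dict(self, state):
        self.best = dict(state["best"])

    def on_validation_end(self, epoch, val_loss):
        # Skip saving if k checkpoints are tracked and this one is no better.
        if len(self.best) >= self.k and val_loss >= max(self.best.values()):
            return
        path = os.path.join(self.dirpath, f"best_val_loss-epoch={epoch}.ckpt")
        with open(path, "w") as f:
            f.write(str(val_loss))
        self.best[path] = val_loss
        # Evict the worst tracked checkpoints; untracked files are never touched.
        while len(self.best) > self.k:
            worst = max(self.best, key=self.best.get)
            del self.best[worst]
            os.remove(worst)


def run(restore_state):
    """Train two epochs, simulate a preemption, resume for one more epoch."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        first = TopKTracker(ckpt_dir, k=1)
        first.on_validation_end(epoch=0, val_loss=0.9)
        first.on_validation_end(epoch=1, val_loss=0.7)
        saved = first.state_dict()

        # Simulated restart: a fresh callback object in a new process.
        resumed = TopKTracker(ckpt_dir, k=1)
        if restore_state:
            resumed.load_state_dict(saved)
        resumed.on_validation_end(epoch=2, val_loss=0.5)
        return sorted(os.listdir(ckpt_dir))


print(run(restore_state=True))
print(run(restore_state=False))
```

Under these assumptions, `run(restore_state=True)` leaves only the epoch-2 file, while `run(restore_state=False)` leaves both the epoch-1 and epoch-2 files, which matches the two 'best_val_loss' checkpoints reported after a single preemption.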
Error messages and logs
Environment
Current environment
More info
No response
cc @carmocca @awaelchli