
ModelCheckpoint Doesn't Delete Old Best Checkpoints When Resuming Training #18687

Open
danielzeng-gt opened this issue Oct 2, 2023 · 3 comments
Labels
bug (Something isn't working) · callback: model checkpoint · repro needed (The issue is missing a reproducible example) · ver: 1.9.x

Comments

danielzeng-gt commented Oct 2, 2023

Bug description

Description:
When using ModelCheckpoint with save_top_k=1 and monitor='val_loss' during a single training run, the behavior is as expected: only one 'best_val_confidence-epoch...' checkpoint is retained.

However, in the context of cloud-based training where instances may be preempted or restarted from a checkpoint:

  • The training resumes from a checkpoint labeled "last.ckpt", which was initially created by a different ModelCheckpoint.
  • There aren't any explicit warnings indicating that the ModelCheckpoint state was restored incorrectly.
  • Post-resumption, ModelCheckpoint creates a new checkpoint but fails to delete the old one. Thus, if there's a single preemption/restart during the training run, we end up with two 'best_val_loss' checkpoints.

Note that we read and write checkpoints with fsspec, which allows them to be written to and loaded directly from Google Cloud Storage (GCS).
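
For context, a minimal sketch of how such a GCS-backed checkpoint directory can be inspected with fsspec (this assumes gcsfs is installed; the bucket and path below are illustrative, not taken from the actual setup):

    import fsspec

    # Hypothetical GCS checkpoint directory; the real path is not given in the report.
    ckpt_dir = "gs://my-bucket/experiments/run-01/checkpoints"

    # fsspec resolves the gs:// protocol to a GCSFileSystem (via gcsfs).
    fs, dir_path = fsspec.core.url_to_fs(ckpt_dir)

    # List whatever checkpoint files are currently stored there.
    for path in fs.ls(dir_path):
        print(path)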

Code Details:

Two ModelCheckpoint callbacks are currently in use:

  1. The first is for saving the latest checkpoint:

    last_ckpt_callback = ModelCheckpoint(
        save_top_k=-1,
        save_last=True,
        dirpath=self.checkpoint_dir,
    )
    last_ckpt_callback.CHECKPOINT_NAME_LAST = _CHECKPOINT_NAME_LAST
  2. The second is for saving the best validation loss checkpoint:

    best_val_loss_ckpt_callback = ModelCheckpoint(
        monitor='val_loss',
        mode='min',
        save_top_k=1,
        auto_insert_metric_name=False,
        filename='best_val_confidence-epoch{epoch}-val_loss{val_loss:.4e}',
        dirpath=self.checkpoint_dir,
    )

Environment:

  • Lightning Component: ModelCheckpoint object
  • PyTorch Lightning Version: 1.9.2
  • PyTorch Version: 1.13.0
  • Python Version: 3.10.12
  • OS: Linux
  • CUDA/cuDNN version: Build cuda_11.6.r11.6/compiler.31057947_0
  • GPU models: Nvidia A100
  • How you installed Lightning: Conda
  • Cloud: Running on GCP Cluster

What version are you seeing the problem on?

v1.9

How to reproduce the bug

1. Set up a training loop on the cloud with the aforementioned `ModelCheckpoint` callbacks (a minimal sketch of these steps follows below).
2. Intentionally interrupt the training to simulate preemption.
3. Resume the training from the "last.ckpt".
4. Post-resumption, inspect the stored checkpoints. There should be two 'best_val_loss' checkpoints instead of one.

**Expected behavior**: Only one 'best_val_confidence-epoch...' checkpoint should remain after resumption.

**Actual behavior**: Multiple 'best_val_confidence-epoch...' checkpoints are observed after training preemption and resumption.
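
For reference, a minimal sketch of such a repro: the toy model, local checkpoint directory, and back-to-back Trainer runs below are illustrative stand-ins (the real setup runs on GCS via fsspec with preemptible instances), not the original training code.

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint


    class ToyModel(pl.LightningModule):
        # Hypothetical stand-in for the real model.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(4, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def validation_step(self, batch, batch_idx):
            x, y = batch
            self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)


    def make_loader():
        return DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)


    def make_callbacks(ckpt_dir):
        # Same two callbacks as described above.
        last_cb = ModelCheckpoint(save_top_k=-1, save_last=True, dirpath=ckpt_dir)
        best_cb = ModelCheckpoint(
            monitor='val_loss',
            mode='min',
            save_top_k=1,
            auto_insert_metric_name=False,
            filename='best_val_confidence-epoch{epoch}-val_loss{val_loss:.4e}',
            dirpath=ckpt_dir,
        )
        return [last_cb, best_cb]


    ckpt_dir = "checkpoints"  # local stand-in; the report uses a gs:// directory via fsspec

    # Run 1: train for one epoch, then stop (simulating a preemption).
    trainer = pl.Trainer(max_epochs=1, limit_train_batches=2, callbacks=make_callbacks(ckpt_dir))
    trainer.fit(ToyModel(), make_loader(), make_loader())

    # Run 2: a fresh Trainer resumes from last.ckpt, as a restarted instance would.
    trainer = pl.Trainer(max_epochs=2, limit_train_batches=2, callbacks=make_callbacks(ckpt_dir))
    trainer.fit(ToyModel(), make_loader(), make_loader(), ckpt_path=os.path.join(ckpt_dir, "last.ckpt"))

    # Expected: a single best_val_confidence-* file; the report observes two.
    print(sorted(os.listdir(ckpt_dir)))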

Error messages and logs

No response


More info

No response

cc @carmocca @awaelchli

@danielzeng-gt danielzeng-gt added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Oct 2, 2023
@awaelchli awaelchli added callback: model checkpoint and removed needs triage Waiting to be triaged by maintainers labels Oct 2, 2023
awaelchli (Contributor) commented Oct 2, 2023

@danielzeng-gt Thanks for submitting the issue.

I read your description multiple times but I don't understand the problem. Can you try to formulate it with an example? Is it related to #17912?

danielzeng-gt (Author) commented Oct 2, 2023

Hey Adrian, thanks for the prompt response!
I looked at #17912 and it doesn't seem to be related.

I generated an example with GPT-4 and reviewed it; it describes the problem accurately. Please let me know if it's still confusing:

Example:

Suppose Alice is training a neural network to classify images of cats and dogs on a cloud-based preemptible instance. She's interested in keeping two kinds of checkpoints:

  1. The latest checkpoint, irrespective of its performance on validation data.
  2. The checkpoint with the best validation loss.

To achieve this, Alice uses two ModelCheckpoint callbacks as described.

Training Run 1:

  1. Alice starts her training.
  2. After epoch 1, the validation loss is 0.5. The system saves:
    • last.ckpt (The latest checkpoint)
    • best_val_confidence-epoch1-val_loss0.5e (The best checkpoint based on validation loss)
  3. Suddenly, the preemptible instance is terminated.

Training Resumption:

  1. Alice's setup detects the preemption and decides to restart the training from the last checkpoint.
  2. It loads last.ckpt and continues training.
  3. After epoch 2, the validation loss improves to 0.4. The system now tries to save:
    • A new last.ckpt (Replacing the older one)
    • best_val_confidence-epoch2-val_loss0.4e (A new best checkpoint)

Expected Behavior:
Since Alice specified save_top_k=1 for the best validation loss checkpoint, she expects to find only one such checkpoint in her directory, i.e., best_val_confidence-epoch2-val_loss0.4e.

Actual Behavior:
Alice finds two best validation loss checkpoints:

  • best_val_confidence-epoch1-val_loss0.5e
  • best_val_confidence-epoch2-val_loss0.4e

This indicates that the ModelCheckpoint callback did not delete the older "best" checkpoint upon resumption, leading to multiple "best" checkpoints being saved.

Implication:
This behavior is problematic, especially if Alice trains for many epochs and faces multiple preemptions. Over time she would accumulate multiple "best" checkpoints, making it hard to identify the genuine best one.

Conclusion:

The bug seems to arise from a state restoration issue in the ModelCheckpoint callback when resuming training from a checkpoint. It fails to remember its previous "best" state and does not delete older checkpoints as it should.
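
One way to check what the callback considers its "best" state after a restart is to inspect the callback states stored inside last.ckpt; a hedged sketch (the local path is illustrative, and the exact keys under "callbacks" depend on each callback's state_key):

    import torch

    # Illustrative local path; in the reported setup this would be a gs:// URL read via fsspec.
    ckpt = torch.load("checkpoints/last.ckpt", map_location="cpu")

    # Lightning stores per-callback state under the "callbacks" key of the checkpoint.
    for state_key, state in ckpt.get("callbacks", {}).items():
        print(state_key)
        # ModelCheckpoint state includes fields such as best_model_path,
        # best_model_score and best_k_models.
        if isinstance(state, dict):
            for k in ("best_model_path", "best_model_score", "best_k_models"):
                if k in state:
                    print(" ", k, "=", state[k])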

@danielzeng-gt danielzeng-gt changed the title ModelCheckpoint Doesn't Overwrite Old Checkpoints When Resuming Training ModelCheckpoint Doesn't Delete Old Best Checkpoints When Resuming Training Oct 13, 2023
@awaelchli awaelchli added the repro needed The issue is missing a reproducible example label Jan 29, 2024
leng-yue (Contributor) commented May 7, 2024

I ran into the same issue. I understand that fixing it might be a breaking change; could we add an option to handle it?
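
Until such an option exists, one possible workaround is to prune stale "best" files against the callback's current best_model_path, e.g. right after trainer.fit(...) returns. This is a hedged sketch, not an official Lightning API: `prune_stale_best` is a hypothetical helper and the filename prefix is illustrative.

    import fsspec

    def prune_stale_best(best_cb, prefix="best_val_confidence-"):
        # best_cb is the val_loss ModelCheckpoint; fsspec handles both local
        # directories and gs:// URLs (via gcsfs) the same way.
        fs, dir_path = fsspec.core.url_to_fs(best_cb.dirpath)
        keep = best_cb.best_model_path.rsplit("/", 1)[-1]  # filename of the current best
        for path in fs.ls(dir_path):
            name = path.rsplit("/", 1)[-1]
            if name.startswith(prefix) and name != keep:
                fs.rm(path)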
