
save_last: True saves 2 checkpoints every time #18670

Closed
stas00 opened this issue Sep 29, 2023 · 6 comments · Fixed by #18748
Labels: callback: model checkpoint, docs, feature, help wanted, repro needed
Comments


stas00 commented Sep 29, 2023

Description & Motivation

I hope this is pure PTL and not a NeMo override, but I'm observing that:

checkpoint_callback_params.save_last: True

leads to saving 2 copies of the same checkpoint. Being on a slow NFS, I can watch it write the checkpoint once and then a second time.

For training a huge model this is not the most efficient choice, as it doubles the time training is blocked from making progress during checkpoint saving.

Is there any reason not to use a symlink from the actual checkpoint to the one named foo-last.ckpt, which would do the exact same thing but cost essentially zero time and space?
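For illustration, here is a minimal sketch of the symlink idea (a hypothetical helper, not existing Lightning code), assuming the "-last" file lives in the same directory as the checkpoint it points to:

import os

def link_as_last(checkpoint_path: str, last_name: str = "last.ckpt") -> None:
    # Hypothetical helper: instead of writing the checkpoint a second time
    # under the "-last" name, point a symlink at the file that was just saved.
    last_path = os.path.join(os.path.dirname(checkpoint_path), last_name)
    if os.path.lexists(last_path):
        os.remove(last_path)
    # Use a relative target so the link stays valid if the whole checkpoint
    # directory is moved or mounted elsewhere.
    os.symlink(os.path.basename(checkpoint_path), last_path)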

FWIW, in other frameworks like Megatron-LM and DeepSpeed this is implemented completely differently: there is just a file called last which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.
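As a rough sketch of that scheme (the latest file name and the helper below are placeholders, not the actual Megatron-LM/DeepSpeed code), resuming only requires reading one tiny tracker file:

import os

def find_resume_checkpoint(ckpt_dir: str, tracker_name: str = "latest") -> str | None:
    # Hypothetical tracker file whose sole content is the filename (or id)
    # of the most recently saved checkpoint.
    tracker = os.path.join(ckpt_dir, tracker_name)
    if not os.path.exists(tracker):
        return None  # nothing to resume from yet
    with open(tracker) as f:
        name = f.read().strip()
    return os.path.join(ckpt_dir, name)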

The reason I mention this other approach to tracking which file to resume from is that I've just ended up with this:

ls -l checkpoints/mp_rank_00/
total 6.6G
-rw-rw-rw- 1 stas stas 842M Sep 29 01:04 step=10.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=20.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=30.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=40.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=50-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=50.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60.ckpt

I have no idea how it came to be, but clearly this is broken: which one is the last here? Having a single last file would overcome this situation.

Edit: actually, I think it happened due to a race condition inherent in the current approach: I happened to kill the job before it was able to delete the previous last checkpoint.


Related: isn't save_last: True a requirement rather than an option? I find that if I set it to False, the trainer starts from scratch and doesn't resume from the latest checkpoint. I guess it doesn't know which checkpoint is the latest, but nevertheless this doesn't seem to be optional.


Related: this doc is also broken. I searched for save_last on your docs site and the first hit links to:

https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#lightning.pytorch.callbacks.ModelCheckpoint.params.save_last

which has no mention of save_last, and I can't find any other documentation for this option.

Thank you.

Pitch

No response

Alternatives

No response

Additional context

No response

cc @Borda @carmocca @awaelchli

stas00 added the feature and needs triage labels on Sep 29, 2023
awaelchli commented

Hi @stas00
I'll answer some of your questions.

The symlink is definitely something we wanted to do: #14973
Someone just needs to do it and handle the edge cases with different filesystems etc.

> FWIW, in other frameworks like Megatron-LM and DeepSpeed this is implemented completely differently: there is just a file called last which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.

The trainer can resume from last when the user sets trainer.fit(ckpt_path="last").
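For example, a minimal sketch assuming the Lightning 2.x API (TinyModule and the random dataset are placeholders just to make the resume call concrete):

import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

class TinyModule(pl.LightningModule):
    # Placeholder module standing in for a real model.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

train_dl = DataLoader(TensorDataset(torch.randn(64, 4)), batch_size=8)
trainer = pl.Trainer(max_steps=20, callbacks=[ModelCheckpoint(save_last=True)])
# ckpt_path="last" tells the Trainer to locate the most recent last.ckpt
# written by the ModelCheckpoint callback and resume training from it.
trainer.fit(TinyModule(), train_dataloaders=train_dl, ckpt_path="last")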

> I have no idea how it came to be, but clearly this is broken: which one is the last here? Having a single last file would overcome this situation.

There should only be a single last.ckpt file. It gets overwritten every time a new checkpoint is saved. It is meant to provide the user with a deterministic filename that can always be accessed to inspect or load the "latest"/"last"/"most recent" checkpoint produced by the Trainer.

I have no idea how you ended up with step=50-last.ckpt. Could you help us reproduce this? The only way I could imagine this happening is if the checkpoint_callback.CHECKPOINT_NAME_LAST was modified to include the additional filename pattern.
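For reference, a guess at a configuration that could produce such names (not a confirmed reproduction; the values below are made up):

from lightning.pytorch.callbacks import ModelCheckpoint

ckpt_cb = ModelCheckpoint(save_last=True, every_n_train_steps=10)
# The default template is CHECKPOINT_NAME_LAST = "last", which yields a single
# last.ckpt that gets overwritten. A step-dependent template like the one
# below would instead produce step=50-last.ckpt, step=60-last.ckpt, and so on,
# leaving the previous "last" file to be deleted separately.
ckpt_cb.CHECKPOINT_NAME_LAST = "{step}-last"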

I agree with your comments about the docs.

awaelchli added the help wanted, docs, repro needed, and callback: model checkpoint labels and removed the needs triage label on Sep 29, 2023

stas00 commented Sep 29, 2023

Thank you for the pointer to the old feature request; I posted there in the hope of maybe reviving it.

> The trainer can resume from last when the user sets trainer.fit(ckpt_path="last").

In the case of NeMo the user has no access to this, only a config file.

> I have no idea how you ended up with step=50-last.ckpt.

As I added in my original post, it was a race condition. I was saving checkpoints at a short interval while testing, and the job was killed while a second last checkpoint was being written, before the previous one had been deleted.

awaelchli commented

@stas00 I was able to land #18748 in time for 2.1. NeMo was recently updated to 2.0, and since 2.1 doesn't have API changes, you will probably be fine updating to 2.1, although minor fixes might be needed since NeMo sometimes reaches into protected APIs of Lightning or has some overrides.


stas00 commented Oct 12, 2023

You rock, Adrian! Thank you very much! That would be super useful for our work.

awaelchli commented

Thank you, I'm glad you're happy with it.


stas00 commented Oct 27, 2023

Adrian, FWIW, I think that while your symlink change is far superior to what was there before, it will still fail at times because the operation of creating a new -last file and deleting the previous one isn't atomic. So every so often, if the timing is "right", the user will end up with 2 -last symlinks.

If you switch to having a single file, latest, where you write out the latest checkpoint number/name as its sole one-line content, then at the very least you will never have 2 last files. This is what Megatron-LM and DeepSpeed do; not sure about the rest.
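A minimal sketch of that idea (the latest file name and the helper are hypothetical), using a write-to-temp-file-then-rename so that killing the job mid-save never leaves two "last" markers:

import os

def record_latest(checkpoint_path: str, tracker_name: str = "latest") -> None:
    # Write the newest checkpoint's filename into a one-line tracker file.
    # os.replace() renames atomically on POSIX filesystems, so a reader always
    # sees either the old name or the new one, never a half-written or
    # duplicated state.
    ckpt_dir = os.path.dirname(checkpoint_path)
    tracker = os.path.join(ckpt_dir, tracker_name)
    tmp = tracker + ".tmp"
    with open(tmp, "w") as f:
        f.write(os.path.basename(checkpoint_path) + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, tracker)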

I'm not insisting but just flagging a possible better resolution.
