
save_last: True saves 2 checkpoints every time #18670

Closed
stas00 opened this issue Sep 29, 2023 · 6 comments · Fixed by #18748
Labels: callback: model checkpoint, docs, feature, help wanted, repro needed
Comments


stas00 commented Sep 29, 2023

Description & Motivation

I hope this is pure PTL and not a NeMo override, but I'm observing that:

checkpoint_callback_params.save_last: True

leads to saving 2 copies of the same checkpoint. Being on a slow NFS, I can watch it write the checkpoint once and then a second time.

For training a huge model this is not the most efficient choice, as it doubles the time training is blocked from making progress during checkpoint saving.

Is there any reason not to use a symlink from the actual checkpoint to the one named foo-last.ckpt, which would do the exact same thing but cost essentially zero time and space?
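For illustration, here is a minimal sketch of the symlink idea (a hypothetical helper, not existing Lightning code), assuming the "-last" file lives in the same directory as the checkpoint it points to:

import os

def link_as_last(checkpoint_path: str, last_name: str = "last.ckpt") -> None:
    # Hypothetical helper: instead of writing the checkpoint a second time
    # under the "-last" name, point a symlink at the file that was just saved.
    last_path = os.path.join(os.path.dirname(checkpoint_path), last_name)
    if os.path.lexists(last_path):
        os.remove(last_path)
    # Use a relative target so the link stays valid if the whole checkpoint
    # directory is moved or mounted elsewhere.
    os.symlink(os.path.basename(checkpoint_path), last_path)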

FWIW, in other frameworks like Megatron-LM and DeepSpeed this is implemented completely differently: there is just a file called last which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.
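As a rough sketch of that scheme (the latest file name and the helper below are placeholders, not the actual Megatron-LM/DeepSpeed code), resuming only requires reading one tiny tracker file:

import os

def find_resume_checkpoint(ckpt_dir: str, tracker_name: str = "latest") -> str | None:
    # Hypothetical tracker file whose sole content is the filename (or id)
    # of the most recently saved checkpoint.
    tracker = os.path.join(ckpt_dir, tracker_name)
    if not os.path.exists(tracker):
        return None  # nothing to resume from yet
    with open(tracker) as f:
        name = f.read().strip()
    return os.path.join(ckpt_dir, name)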

The reason I mention this other approach to tracking which file to resume from is that I've just ended up with this:

ls -l checkpoints/mp_rank_00/
total 6.6G
-rw-rw-rw- 1 stas stas 842M Sep 29 01:04 step=10.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=20.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=30.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=40.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=50-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=50.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60.ckpt

I have no idea how it came to be, but clearly this is broken: which one is the last here? Having a single last file would overcome this situation.

Edit: actually, I think it happened due to a race condition inherent in the current approach: I happened to kill the job before it was able to delete the previous last checkpoint.


Related: isn't save_last: True a requirement rather than an option? I find that if I set it to False, the trainer starts from scratch and doesn't resume from the latest checkpoint. I guess it doesn't know which checkpoint is the latest, but nevertheless this doesn't seem to be optional.


Related: this doc is also broken. I searched for save_last on your docs site and the first hit links to:

https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#lightning.pytorch.callbacks.ModelCheckpoint.params.save_last

which has no mention of save_last, and I can't find any other documentation for this option.

Thank you.

Pitch

No response

Alternatives

No response

Additional context

No response

cc @Borda @carmocca @awaelchli

stas00 added the feature and needs triage labels on Sep 29, 2023
awaelchli commented

Hi @stas00
I'll answer some of your questions.

The symlink is definitely something we wanted to do: #14973
Someone just needs to do it and handle the edge cases with different filesystems etc.

> FWIW, in other frameworks like Megatron-LM and DeepSpeed this is implemented completely differently: there is just a file called last which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.

The trainer can resume from last when the user sets trainer.fit(ckpt_path="last").
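For example, a minimal sketch assuming the Lightning 2.x API (TinyModule and the random dataset are placeholders just to make the resume call concrete):

import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

class TinyModule(pl.LightningModule):
    # Placeholder module standing in for a real model.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

train_dl = DataLoader(TensorDataset(torch.randn(64, 4)), batch_size=8)
trainer = pl.Trainer(max_steps=20, callbacks=[ModelCheckpoint(save_last=True)])
# ckpt_path="last" tells the Trainer to locate the most recent last.ckpt
# written by the ModelCheckpoint callback and resume training from it.
trainer.fit(TinyModule(), train_dataloaders=train_dl, ckpt_path="last")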

> I have no idea how it came to be, but clearly this is broken: which one is the last here? Having a single last file would overcome this situation.

There should only be a single last.ckpt file. It gets overwritten every time a new checkpoint is saved. It is meant to provide the user with a deterministic filename that can always be accessed to inspect or load the "latest"/"last"/"most recent" checkpoint produced by the Trainer.

I have no idea how you ended up with step=50-last.ckpt. Could you help us reproduce this? The only way I could imagine this happening is if the checkpoint_callback.CHECKPOINT_NAME_LAST was modified to include the additional filename pattern.
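For reference, a guess at a configuration that could produce such names (not a confirmed reproduction; the values below are made up):

from lightning.pytorch.callbacks import ModelCheckpoint

ckpt_cb = ModelCheckpoint(save_last=True, every_n_train_steps=10)
# The default template is CHECKPOINT_NAME_LAST = "last", which yields a single
# last.ckpt that gets overwritten. A step-dependent template like the one
# below would instead produce step=50-last.ckpt, step=60-last.ckpt, and so on,
# leaving the previous "last" file to be deleted separately.
ckpt_cb.CHECKPOINT_NAME_LAST = "{step}-last"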

I agree with your comments about the docs.

awaelchli added the help wanted, docs, repro needed, and callback: model checkpoint labels and removed the needs triage label on Sep 29, 2023

stas00 commented Sep 29, 2023

Thank you for the pointer to the old feature request; I posted there in the hope of maybe reviving it.

> The trainer can resume from last when the user sets trainer.fit(ckpt_path="last").

In the case of NeMo the user has no access to this, only a config file.

> I have no idea how you ended up with step=50-last.ckpt.

As I added in my original post, it was a race condition. I was saving checkpoints at a short interval while testing, and the job was killed while a second last checkpoint was being written, before the previous one had been deleted.

awaelchli commented

@stas00 I was able to land #18748 in time for 2.1. NeMo was recently updated to 2.0, and since 2.1 doesn't have API changes, you will probably be fine updating to 2.1, although minor fixes might be needed since NeMo sometimes reaches into protected APIs of Lightning or has some overrides.


stas00 commented Oct 12, 2023

You rock, Adrian! Thank you very much! That would be super useful for our work.

awaelchli commented

Thank you, I'm glad you're happy with it.


stas00 commented Oct 27, 2023

Adrian, FWIW, I think that while your symlink change is far superior to what was there before, it will still fail at times because the operation of creating a new -last file and deleting the previous one isn't atomic. So every so often, if the timing is "right", the user will end up with 2 -last symlinks.

If you switch to having a single file, latest, where you write out the latest checkpoint number/name as its sole one-line content, then at the very least you will never have 2 last files. This is what Megatron-LM and DeepSpeed do; not sure about the rest.
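A minimal sketch of that idea (the latest file name and the helper are hypothetical), using a write-to-temp-file-then-rename so that killing the job mid-save never leaves two "last" markers:

import os

def record_latest(checkpoint_path: str, tracker_name: str = "latest") -> None:
    # Write the newest checkpoint's filename into a one-line tracker file.
    # os.replace() renames atomically on POSIX filesystems, so a reader always
    # sees either the old name or the new one, never a half-written or
    # duplicated state.
    ckpt_dir = os.path.dirname(checkpoint_path)
    tracker = os.path.join(ckpt_dir, tracker_name)
    tmp = tracker + ".tmp"
    with open(tmp, "w") as f:
        f.write(os.path.basename(checkpoint_path) + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, tracker)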

I'm not insisting but just flagging a possible better resolution.
