`save_last: True` saves 2 checkpoints every time #18670
Comments
Hi @stas00

The symlink is definitely something we wanted to do: #14973

The trainer can resume from the last checkpoint when the user sets `ckpt_path="last"` in `Trainer.fit()`. There should only be a single last checkpoint; I have no idea how you ended up with two. I agree with your comments about the docs.
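For reference, a minimal sketch of what that resume call looks like, assuming Lightning 2.x; `BoringModel` is Lightning's bundled toy model, used here only as a stand-in:

```python
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel  # toy model shipped with Lightning

model = BoringModel()
trainer = Trainer(max_epochs=10, default_root_dir="runs/")

# "last" resolves to the most recent last checkpoint written by ModelCheckpoint,
# so training continues from where the previous run stopped.
trainer.fit(model, ckpt_path="last")
```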
Thank you for the pointer to the old feature request, I posted there in the hope of maybe reviving it.

In the case of NeMo the user has no access to this, only a config file.

As I added in my original post, it was a race condition. I was doing short-interval saving while testing and the job was killed while a 2nd last checkpoint was being written, so the previous one wasn't deleted.
You rock, Adrian! Thank you very much! That would be super useful for our work.

Thank you, I'm glad you're happy with it.
Adrian, FWIW: while your symlink change is far superior to what was there before, I think this will still fail at times, because creating the new `-last` symlink and deleting the previous one isn't atomic. So every so often, if the timing is "right", the user will end up with 2 `-last` symlinks. If you switch to having a single `last` file the problem goes away. I'm not insisting, just flagging a possible better resolution.
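A minimal sketch of the atomic-swap idea being flagged here (not Lightning's actual implementation; names and paths are made up): build the new symlink under a temporary name, then rename it over the old one, so a killed job never leaves zero or two `-last` entries behind.

```python
import os

def update_last_symlink(ckpt_path: str, last_name: str = "last.ckpt") -> None:
    """Point the `last` symlink at ckpt_path with no window in which a
    killed job leaves zero or two "last" entries behind."""
    ckpt_dir = os.path.dirname(ckpt_path) or "."
    tmp_link = os.path.join(ckpt_dir, ".last.ckpt.tmp")
    last_link = os.path.join(ckpt_dir, last_name)

    # Build the new link under a temporary name first ...
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.basename(ckpt_path), tmp_link)

    # ... then rename it over the old link; on POSIX, rename() is atomic,
    # so readers always see exactly one valid "last" pointer.
    os.replace(tmp_link, last_link)
```

The same temp-name-plus-rename trick also works for the single `last` marker file discussed in the issue body below.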
Description & Motivation
I hope this is pure PTL and not a NeMo override, but I'm observing that `save_last: True` leads to saving 2 copies of the same checkpoint. Being on a slow NFS I can see it writing one and then the second one.

For training a huge model this is not the most efficient choice, as it doubles the time training is blocked from progressing during checkpoint saving.
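For anyone reading this outside NeMo, the plain-Lightning equivalent of that config line is roughly the following (paths and the `save_top_k` value are illustrative):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# save_last=True asks the callback to maintain a "last" checkpoint in
# addition to the regular top-k ones; at the time of this issue that meant
# writing a second full copy of the same file.
checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/", save_top_k=2, save_last=True)
trainer = Trainer(callbacks=[checkpoint_cb])
```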
Is there any reason for not using a symlink from the actual checkpoint to the one named `foo-last.ckpt`, which would do the exact same thing but cost 0 time and space?

FWIW, in other frameworks like Megatron-LM and DeepSpeed this is implemented completely differently: there is just a file called `last` which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.

The reason I mention this other approach to tracking which file to resume from is that I've just gotten this:

[directory listing omitted; it showed two `*-last.ckpt` checkpoints]
I have no idea how it came to be, but clearly this is broken - which is the last one here? Having a single `last` file would overcome this situation.

edit: actually, I think it happened due to a race condition inherent in the current approach - I happened to kill the job before it was able to delete the previous `last` checkpoint.
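A rough sketch of the Megatron-LM/DeepSpeed-style approach described above; this is not the API of either library, just the idea of a single marker file that names the newest checkpoint:

```python
import os
from typing import Optional

def record_last_checkpoint(ckpt_dir: str, ckpt_name: str) -> None:
    # Write the marker to a temp file, then rename it into place, so the
    # "last" pointer is updated atomically even if the job is killed mid-save.
    tmp = os.path.join(ckpt_dir, ".last.tmp")
    with open(tmp, "w") as f:
        f.write(ckpt_name + "\n")
    os.replace(tmp, os.path.join(ckpt_dir, "last"))

def find_last_checkpoint(ckpt_dir: str) -> Optional[str]:
    # Resume never has to guess which checkpoint is newest: it reads the marker.
    marker = os.path.join(ckpt_dir, "last")
    if not os.path.exists(marker):
        return None
    with open(marker) as f:
        return os.path.join(ckpt_dir, f.read().strip())
```

Because the marker is replaced with an atomic rename, there is never a moment where two candidates for "last" exist on disk.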
Related: isn't `save_last: True` a requirement rather than an option? I find that if I set it to `False`, the trainer starts from scratch and doesn't resume from the latest checkpoint. I guess it doesn't know which one is the latest, but nevertheless this doesn't seem to be optional.

Also related: this doc is broken. I searched for `save_last` on your docs site, got the first hit, linking to:

https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#lightning.pytorch.callbacks.ModelCheckpoint.params.save_last

which has no mention of `save_last`, and I can't find any other doc for this option.

Thank you.
Pitch
No response
Alternatives
No response
Additional context
No response
cc @Borda @carmocca @awaelchli