Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store last.ckpt as symlink when appropriate to save space #14973

Closed
ZhaofengWu opened this issue Oct 3, 2022 · 6 comments · Fixed by #18748
Closed

Store last.ckpt as symlink when appropriate to save space #14973

ZhaofengWu opened this issue Oct 3, 2022 · 6 comments · Fixed by #18748
Labels
checkpointing Related to checkpointing feature Is an improvement or enhancement good first issue Good for newcomers help wanted Open to be worked on
Milestone

Comments

@ZhaofengWu
Copy link
Contributor

ZhaofengWu commented Oct 3, 2022

🚀 Feature

Currently, if I'm understanding correctly, last.ckpt is always stored as a separate model file (unless it's disabled, of course). But sometimes the same checkpoint already exists, for example as the best checkpoint, or as one of the top-k checkpoints. When this happens, it'd be very helpful that only a symlink is stored to reduce space usage.

Motivation

Modern models could take up a lot of disk space. In the typical use case where only the best and last checkpoints are stored, this could reduce the space usage by ~half.

cc @Borda @awaelchli

@ZhaofengWu ZhaofengWu added the needs triage Waiting to be triaged by maintainers label Oct 3, 2022
@carmocca
Copy link
Contributor

carmocca commented Oct 4, 2022

This has been already proposed in #4335, but this cannot be done because the fsspec API does not support it. It wouldn't work for remote filesystems

@carmocca carmocca added feature Is an improvement or enhancement checkpointing Related to checkpointing and removed needs triage Waiting to be triaged by maintainers labels Oct 4, 2022
@ZhaofengWu
Copy link
Contributor Author

Thanks for the context! I'm not very familiar with fsspec and how lightning uses it, but why wouldn't something like os.symlink work? Because some people might work with two sets of file systems, and the path doesn't distinguish between them? If so, I believe it's a valid concern, but maybe not the most common use case? Why not just implement the proposed flag symlink_to_last in that issue and let the user decide when to enable it, and users who work with multiple file systems can choose to disable it?

@carmocca
Copy link
Contributor

carmocca commented Oct 4, 2022

Yes, it can be implemented. It's open to community contributions.

@ZhaofengWu
Copy link
Contributor Author

I'm happy to give this a try. But to verify the best way to do it: current I'm considering changing _save_topk_checkpoint to return the ckpt path iff it saved the current state. And we pass this flag into _save_last_checkpoint (most of the calls to these methods are preceded by _save_topk_checkpoint), which uses the flag to determine if it should symlink or not. Does that make sense?

@ZhaofengWu
Copy link
Contributor Author

And of course, if we don't want this behavior to be the default, we can introduce a flag like symlink_last_ckpt.

@stale stale bot added the won't fix This will not be worked on label Apr 14, 2023
@carmocca carmocca added the help wanted Open to be worked on label Apr 15, 2023
@stale stale bot removed the won't fix This will not be worked on label Apr 15, 2023
@carmocca carmocca added this to the future milestone Apr 15, 2023
@Borda Borda added the good first issue Good for newcomers label Apr 17, 2023
@Lightning-AI Lightning-AI deleted a comment from stale bot Apr 17, 2023
@stas00
Copy link
Contributor

stas00 commented Sep 29, 2023

could this be revived? This is not only a problem with disc space, the way it's done now is also taking 2x longer to save the checkpoint which for 100B model could take a really long time.

Additionally as I shared in #18670 there is a race condition and one could end up with 2 last files.

I proposed an alternatively solution used by other frameworks which uses an actual file latest which is updated to point to the last checkpoint's filename - no race condition in this situation since you only ever have one latest file.

But my main need is to remove the overhead of saving the file 2 times, which is a significant problem, especially on a slow filesystem.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
checkpointing Related to checkpointing feature Is an improvement or enhancement good first issue Good for newcomers help wanted Open to be worked on
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants