🐛 Describe the bug
I'm facing a shared-memory issue when training LLM models with TRLX:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.transformer.wte.weight', 'base_model.lm_head.weight'}].
A potential way to correctly save your model is to use save_model.
Most forums recommend the following configuration to fix this issue in non-RL applications:
save_safetensors=False
Unfortunately, the TRLX library doesn't expose this argument, which belongs to the Transformers module. Is there an equivalent way to set it in order to resolve the "tensors share memory" problem?
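For reference, this is roughly what the forum-recommended fix looks like in plain Transformers (a minimal non-RL sketch; the output_dir value and the commented-out Trainer setup are placeholders):

# Minimal non-RL sketch: disable safetensors in plain transformers.
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="out",          # placeholder path
    save_safetensors=False,    # checkpoints fall back to torch.save
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()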
Which trlX version are you using?
0.7.0
Additional system and package information
Linux 20.04
python 3.11.8
pytorch 2.2.2
I seem to have solved this problem by setting safe_serialization=False on line 99 of python3.10/site-packages/accelerate/checkpointing.py; saving the model then falls back to the torch.save() method by default.
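If it helps, here is a sketch of an alternative that avoids patching the installed accelerate package: save the underlying Hugging Face model yourself with safetensors disabled. The trainer.model.base_model handle below is an assumption based on the parameter names in the error above; adapt it to however your trlX trainer exposes the model.

import torch

# Assumed handle to the wrapped Hugging Face model (see the
# 'base_model.*' names in the error message); adjust to your trainer.
hf_model = trainer.model.base_model

# Disable safetensors so the tied tensors are serialized via torch.save.
hf_model.save_pretrained("checkpoint_dir", safe_serialization=False)

# Or bypass save_pretrained entirely:
torch.save(hf_model.state_dict(), "checkpoint_dir/pytorch_model.bin")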