🐛 Describe the bug
I'm facing a shared-memory issue when training LLM models with TRLX:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.transformer.wte.weight', 'base_model.lm_head.weight'}].
A potential way to correctly save your model is to use save_model.
Most forums recommend the following configuration to fix this issue in non-RL applications:
save_safetensors=False
Unfortunately, the TRLX library doesn't expose this argument, which belongs to the Transformers module. Is there an equivalent way to set it in order to resolve the "tensors share memory" problem?
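For reference, this is roughly what the forum-recommended fix looks like in plain Transformers (a minimal non-RL sketch; the output_dir value and the commented-out Trainer setup are placeholders):

# Minimal non-RL sketch: disable safetensors in plain transformers.
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="out",          # placeholder path
    save_safetensors=False,    # checkpoints fall back to torch.save
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()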
Which trlX version are you using?
0.7.0
Additional system and package information
Linux 20.04
python 3.11.8
pytorch 2.2.2
I seem to have solved this problem by setting safe_serialization=False on line 99 of python3.10/site-packages/accelerate/checkpointing.py; saving the model then falls back to the torch.save() method by default.
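If it helps, here is a sketch of an alternative that avoids patching the installed accelerate package: save the underlying Hugging Face model yourself with safetensors disabled. The trainer.model.base_model handle below is an assumption based on the parameter names in the error above; adapt it to however your trlX trainer exposes the model.

import torch

# Assumed handle to the wrapped Hugging Face model (see the
# 'base_model.*' names in the error message); adjust to your trainer.
hf_model = trainer.model.base_model

# Disable safetensors so the tied tensors are serialized via torch.save.
hf_model.save_pretrained("checkpoint_dir", safe_serialization=False)

# Or bypass save_pretrained entirely:
torch.save(hf_model.state_dict(), "checkpoint_dir/pytorch_model.bin")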