Bug description
I initialized my trainer with two GPUs. Distributed training initializes on both GPUs, but only one of them actually gets used. The GPUs also sit idle during the validation loop. How can I get both GPUs working and speed up my training?
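A minimal sketch of the kind of setup described, since the report does not include code (TinyModel and the random dataset are hypothetical stand-ins, not from the original report):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L


class TinyModel(L.LightningModule):
    """Hypothetical stand-in for the reporter's model."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":  # required guard when DDP launches subprocesses
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=64)

    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,        # both GPUs are requested...
        strategy="ddp",   # ...and DDP spawns one process per device
        max_epochs=2,
    )
    trainer.fit(TinyModel(), loader)
```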
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response
The difference between the two utilization indicators may be that one process is CPU-bound (e.g. rank 0 doing logging) while the other isn't. The model seems really small, so CPU-side operations essentially dominate the step time.
I suggest you increase the size of the model, or the batch size, to bring the actual GPU utilization up.
You can also verify this by passing barebones=True to the Trainer: this minimizes non-model-related operations, and the two GPUs should then show much more similar utilization.
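A sketch of that check, assuming the same two-GPU DDP setup as above (barebones is a real Trainer flag in Lightning 2.x; the surrounding configuration is carried over from the hypothetical example):

```python
import lightning as L

# barebones=True disables loggers, checkpointing, the progress bar, and
# similar non-essential features, minimizing per-rank CPU overhead.
# If GPU utilization evens out with this flag, CPU-side bookkeeping on
# rank 0 was likely the bottleneck.
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    barebones=True,
)
```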