Why only one GPU is getting used in the kaggle kernel #20424

Open
KeesariVigneshwarReddy opened this issue Nov 16, 2024 · 1 comment
Labels: waiting on author (Waiting on user action, correction, or update)

Comments

@KeesariVigneshwarReddy

Bug description

[Screenshot 2024-11-16 201845: GPU utilization during training]

I initialized my trainer:

import lightning as L
from lightning.pytorch.callbacks import DeviceStatsMonitor, StochasticWeightAveraging

trainer = L.Trainer(max_epochs=5,
                    devices=2,
                    strategy='ddp_notebook',
                    num_sanity_val_steps=0,
                    profiler='simple',
                    default_root_dir="/kaggle/working",
                    callbacks=[DeviceStatsMonitor(),
                               StochasticWeightAveraging(swa_lrs=1e-2),
                               # EarlyStopping(monitor='train_Loss', min_delta=0.001, patience=100, verbose=False, mode='min'),
                              ],
                    enable_progress_bar=True,
                    enable_model_summary=True,
                   )

Distributed training is initialized on both GPUs, but only one of them actually gets used.

The GPUs are also idle during the validation loop.

[Screenshot 2024-11-16 202120: GPU utilization during the validation loop]

How can I resolve this so that both GPUs are used and my training runs faster?

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

@KeesariVigneshwarReddy added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Nov 16, 2024
@lantiga
Collaborator

lantiga commented Nov 18, 2024

It actually looks like both GPUs are being used.

The difference between the two utilization indicators may be that one process is CPU bound (e.g. rank 0 doing logging) while the other isn't. The model seems really small, so CPU operations essentially dominate.

I suggest you increase the size of the model, or the size of the batch, to bring the actual utilization up.
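
For example, a rough sketch of increasing the batch size (train_dataset and the batch size value are placeholders, not taken from this issue):

from torch.utils.data import DataLoader

# Larger batches give each rank more GPU work per step, so the CUDA kernels
# outweigh the per-step Python/CPU overhead (e.g. logging on rank 0).
train_loader = DataLoader(train_dataset,   # placeholder dataset
                          batch_size=256,  # try a larger value than before
                          num_workers=4,
                          pin_memory=True,
                          shuffle=True)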

You can also verify this by passing barebones=True to the Trainer: this should minimize non-model related operations and the two GPUs will probably look more similar.
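
A minimal sketch of such a barebones run (LitModel and dm are placeholder names for your model and datamodule, which aren't shown in the issue):

import lightning as L

# barebones=True strips loggers, checkpointing, the progress bar and other
# non-model overhead, so GPU utilization reflects the model itself.
trainer = L.Trainer(devices=2,
                    strategy='ddp_notebook',
                    barebones=True)
trainer.fit(LitModel(), datamodule=dm)  # placeholder model/datamodule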

@lantiga added the waiting on author (Waiting on user action, correction, or update) label and removed the needs triage, bug, and ver: 2.4.x labels on Nov 18, 2024