Why only one GPU is getting used in the kaggle kernel #20424

Open
KeesariVigneshwarReddy opened this issue Nov 16, 2024 · 1 comment
Labels: waiting on author (Waiting on user action, correction, or update)

Comments

@KeesariVigneshwarReddy

Bug description

[Screenshot 2024-11-16 201845: GPU utilization during training]

I initialized my trainer:

import lightning as L
from lightning.pytorch.callbacks import DeviceStatsMonitor, StochasticWeightAveraging

trainer = L.Trainer(max_epochs=5,
                    devices=2,
                    strategy='ddp_notebook',
                    num_sanity_val_steps=0,
                    profiler='simple',
                    default_root_dir="/kaggle/working",
                    callbacks=[DeviceStatsMonitor(),
                               StochasticWeightAveraging(swa_lrs=1e-2),
                               # EarlyStopping(monitor='train_Loss', min_delta=0.001, patience=100, verbose=False, mode='min'),
                              ],
                    enable_progress_bar=True,
                    enable_model_summary=True,
                   )

Distributed training is initialized on both GPUs, but only one of them actually gets used.

The GPUs are also idle during the validation loop.

[Screenshot 2024-11-16 202120: GPU utilization during the validation loop]

How can I resolve this so that both GPUs are used and my training runs faster?

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

@KeesariVigneshwarReddy added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Nov 16, 2024
@lantiga
Collaborator

lantiga commented Nov 18, 2024

It actually looks like both GPUs are being used.

The difference between the two utilization indicators may be that one process is CPU bound (e.g. rank 0 doing logging) while the other isn't. The model seems really small, so CPU operations essentially dominate.

I suggest you increase the size of the model, or the size of the batch, to bring the actual utilization up.
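
For example, a rough sketch of increasing the batch size (train_dataset and the batch size value are placeholders, not taken from this issue):

from torch.utils.data import DataLoader

# Larger batches give each rank more GPU work per step, so the CUDA kernels
# outweigh the per-step Python/CPU overhead (e.g. logging on rank 0).
train_loader = DataLoader(train_dataset,   # placeholder dataset
                          batch_size=256,  # try a larger value than before
                          num_workers=4,
                          pin_memory=True,
                          shuffle=True)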

You can also verify this by passing barebones=True to the Trainer: this should minimize non-model related operations and the two GPUs will probably look more similar.
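
A minimal sketch of such a barebones run (LitModel and dm are placeholder names for your model and datamodule, which aren't shown in the issue):

import lightning as L

# barebones=True strips loggers, checkpointing, the progress bar and other
# non-model overhead, so GPU utilization reflects the model itself.
trainer = L.Trainer(devices=2,
                    strategy='ddp_notebook',
                    barebones=True)
trainer.fit(LitModel(), datamodule=dm)  # placeholder model/datamodule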

@lantiga added the waiting on author (Waiting on user action, correction, or update) label and removed the needs triage, bug, and ver: 2.4.x labels on Nov 18, 2024