[k8s] L40 GPUs get detected as L4s #4404

Closed
romilbhardwaj opened this issue Nov 24, 2024 · 0 comments · Fixed by #4511
Labels: help wanted, k8s


romilbhardwaj commented Nov 24, 2024

A user reported that the L40 GPUs in their cluster are shown as L4 GPUs in sky show-gpus.

sky show-gpus:

Kubernetes GPUs (context: default)
GPU  REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
L4   1, 2, 4                   4           4

Kubernetes per node GPU availability
NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
node          L4        4           4

Labels:

                    nvidia.com/cuda.driver-version.major=535
                    nvidia.com/cuda.driver-version.minor=104
                    nvidia.com/cuda.driver-version.revision=12
                    nvidia.com/cuda.driver.major=535
                    nvidia.com/cuda.driver.minor=104
                    nvidia.com/cuda.driver.rev=12
                    nvidia.com/cuda.runtime-version.full=12.2
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=2
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1732465495
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=4
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.memory=46068
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-L40
                    nvidia.com/gpu.replicas=1
                    nvidia.com/gpu.sharing-strategy=none
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false

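The `canonical_name in value` check below is the likely culprit. It is a plain substring test, 'L40' is not in the canonical list, and 'L4' is a substring of the label value 'NVIDIA-L40', so the L40 node gets reported as an L4: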

def get_accelerator_from_label_value(cls, value: str) -> str:
    """Searches against a canonical list of NVIDIA GPUs and pattern
    matches the canonical GPU name against the GFD label.
    """
    canonical_gpu_names = [
        'A100-80GB', 'A100', 'A10G', 'H100', 'K80', 'M60', 'T4g', 'T4',
        'V100', 'A10', 'P4000', 'P100', 'P40', 'P4', 'L4'
    ]
    for canonical_name in canonical_gpu_names:
        # A100-80G accelerator is A100-SXM-80GB or A100-PCIE-80GB
        if canonical_name == 'A100-80GB' and re.search(
                r'A100.*-80GB', value):
            return canonical_name
        elif canonical_name in value:
            return canonical_name
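
One possible direction, sketched here as a standalone function and not necessarily what the fix in #4511 does: add 'L40' to the canonical list and replace the substring test with a match that rejects names embedded in a longer alphanumeric token. The function name `match_canonical_gpu`, the module-level list, and the `None` return for unknown GPUs are assumptions made for this illustration; they are not part of the existing code.

```python
import re
from typing import Optional

# Canonical names from the snippet above, plus 'L40' (an assumed addition).
CANONICAL_GPU_NAMES = [
    'A100-80GB', 'A100', 'A10G', 'H100', 'K80', 'M60', 'T4g', 'T4',
    'V100', 'A10', 'P4000', 'P100', 'P40', 'P4', 'L40', 'L4'
]


def match_canonical_gpu(value: str) -> Optional[str]:
    """Return the canonical GPU name found in a GFD label value, or None.

    Hypothetical helper for illustration only.
    """
    for canonical_name in CANONICAL_GPU_NAMES:
        # A100-80GB shows up as A100-SXM-80GB or A100-PCIE-80GB.
        if canonical_name == 'A100-80GB':
            if re.search(r'A100.*-80GB', value):
                return canonical_name
            continue
        # The name must not be preceded or followed by a letter or digit,
        # so 'L4' cannot match inside 'L40' and 'A10' cannot match inside
        # 'A10G' or 'A100'.
        pattern = (r'(?<![A-Za-z0-9])' + re.escape(canonical_name) +
                   r'(?![A-Za-z0-9])')
        if re.search(pattern, value):
            return canonical_name
    return None


if __name__ == '__main__':
    for label in ('NVIDIA-L40', 'NVIDIA-L4', 'NVIDIA-A100-SXM4-80GB'):
        print(label, '->', match_canonical_gpu(label))
```

With the boundary check, the result no longer depends on the order of the list, so a later addition such as another L-series name would not need to be placed ahead of 'L4' by hand. A label value whose GPU is not in the list (for example 'NVIDIA-L40S' with the list above) yields `None` in this sketch instead of being mapped to a shorter name.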
