What happened:
A PyTorchJob occupying three nodes (1 master, 2 workers) and 24 GPUs is running normally with workload priority 1000. A second PyTorchJob with the same configuration and priority 1000 is then submitted to the same ClusterQueue. The high-priority job cannot preempt the low-priority job and reports the following error:
```yaml
conditions:
- lastTransitionTime: "2024-12-25T02:15:59Z"
  message: 'couldn''t assign flavors to pod set master: insufficient unused quota
    for nvidia.com/gpu in flavor multi-node-h100, 8 more needed, insufficient unused
    quota for nvidia.com/gpu in flavor single-node-h100, 8 more needed; couldn''t
    assign flavors to pod set worker: insufficient quota for nvidia.com/gpu in flavor
    multi-node-h100, request > maximum capacity (16 > 8), insufficient quota for
    nvidia.com/gpu in flavor single-node-h100, request > maximum capacity (24 >
    16)'
  observedGeneration: 1
  reason: Pending
  status: "False"
  type: QuotaReserved
```
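For context, the quota shape implied by the message above corresponds roughly to a ClusterQueue like the sketch below. The flavor names come from the error message; the queue name, quota values, and preemption policy are assumptions for illustration, not taken from the report:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: example-cluster-queue            # hypothetical name
spec:
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority    # lets pending workloads preempt lower-priority ones in this queue
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: multi-node-h100              # flavor name from the error message
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8                  # assumed; the message reports a maximum capacity of 8 here
    - name: single-node-h100             # flavor name from the error message
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16                 # assumed; the message reports a maximum capacity of 16 here
```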
However, a PyTorchJob with the same topology (1 master, 2 workers) requesting 16 GPUs in total, submitted to the same ClusterQueue with priority 1000, works.
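For reference, the PyTorchJobs described above have roughly the shape sketched below (1 master and 2 workers, 8 GPUs per replica in the 24-GPU case). The job name, namespace, image, and queue/priority-class labels are placeholders, not taken from the report:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-pytorchjob                          # hypothetical name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: example-local-queue  # hypothetical LocalQueue name
    kueue.x-k8s.io/priority-class: high-priority    # hypothetical WorkloadPriorityClass name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest           # placeholder image
            resources:
              requests:
                nvidia.com/gpu: 8                   # 8 GPUs for the master pod
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest           # placeholder image
            resources:
              requests:
                nvidia.com/gpu: 8                   # 8 GPUs per worker pod (2 x 8 = 16)
              limits:
                nvidia.com/gpu: 8
```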
What you expected to happen:
The high-priority PyTorchJob, configured identically to the low-priority one, should be able to preempt the low-priority PyTorchJob.
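A workload priority of 1000, as mentioned above, is typically defined through a Kueue WorkloadPriorityClass and attached to the job via the kueue.x-k8s.io/priority-class label shown in the sketch above; the class name below is an assumption:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority          # hypothetical name, matching the label in the PyTorchJob sketch above
value: 1000                    # the priority value mentioned in the report
description: "Priority for the preempting PyTorchJob"
```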
How to reproduce it (as minimally and precisely as possible):
1. In a ClusterQueue with two H100 GPU flavors (multi-node-h100, single-node-h100), run a PyTorchJob with 1 master, 2 workers, and 24 GPUs in total at workload priority 1000.
2. Submit a second PyTorchJob with the same configuration and priority 1000 to the same ClusterQueue.
3. The second workload stays Pending with the QuotaReserved=False condition shown above instead of preempting the running job.
4. A PyTorchJob with the same topology but only 16 GPUs in total, submitted to the same ClusterQueue, works.
Anything else we need to know?:
Environment:
- Kubernetes version (`kubectl version`): 1.29.5
- Kueue version (`git describe --tags --dirty --always`): 0.10.0
- OS (`cat /etc/os-release`): Ubuntu 22.04.5 LTS
- Kernel (`uname -a`): Linux scl-c26-r3-svr05 5.15.0-117-generic