
Preemption does not meet expectations #3909

Open

dafu-wu opened this issue Dec 25, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


dafu-wu commented Dec 25, 2024

What happened:
A PyTorchJob occupying three nodes (1 master, 2 workers) and 24 GPUs is running normally with workload priority 1000. A PyTorchJob with the same configuration but a higher priority (the high-priority class in the repro below) is then submitted to the same ClusterQueue. The high-priority job cannot preempt the low-priority one and reports the following error:

  conditions:
  - lastTransitionTime: "2024-12-25T02:15:59Z"
    message: 'couldn''t assign flavors to pod set master: insufficient unused quota
      for nvidia.com/gpu in flavor multi-node-h100, 8 more needed, insufficient unused
      quota for nvidia.com/gpu in flavor single-node-h100, 8 more needed; couldn''t
      assign flavors to pod set worker: insufficient quota for nvidia.com/gpu in flavor
      multi-node-h100, request > maximum capacity (16 > 8), insufficient quota for
      nvidia.com/gpu in flavor single-node-h100, request > maximum capacity (24 >
      16)'
    observedGeneration: 1
    reason: Pending
    status: "False"
    type: QuotaReserved 

However, a PyTorchJob with 1 master, 2 workers, and 16 GPUs in total, submitted to the same ClusterQueue with priority 1000, is admitted and works.
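
As a rough illustration of the numbers in the condition message (a hedged sketch, not Kueue's flavorassigner code; the "maximum capacity" values are taken verbatim from the message and happen to equal the flavors' nominalQuota in the ClusterQueue below):

```go
// Hedged sketch of the capacity comparisons quoted in the workload condition;
// an illustration only, not Kueue's implementation.
package main

import "fmt"

func main() {
	// "maximum capacity" values as reported in the condition message.
	const multiNodeCap, singleNodeCap = 8, 16

	// Pod set requests of the failing job.
	const masterRequest = 1 * 8 // 1 master x 8 GPUs
	const workerRequest = 2 * 8 // 2 workers x 8 GPUs

	// multi-node-h100: "request > maximum capacity (16 > 8)"
	fmt.Println("worker fits multi-node-h100:", workerRequest <= multiNodeCap) // false
	// single-node-h100: "request > maximum capacity (24 > 16)"; the 24 looks like
	// worker + master counted against the same flavor (assumption).
	fmt.Println("worker+master fit single-node-h100:", workerRequest+masterRequest <= singleNodeCap) // false
	// A job needing only 16 GPUs in total stays within single-node-h100 (16 <= 16),
	// which matches the 16-GPU job above being admitted.
	fmt.Println("16-GPU job fits single-node-h100:", 16 <= singleNodeCap) // true
}
```

In both flavors the 24-GPU job's request exceeds the reported maximum capacity, which suggests it is rejected at the flavor-assignment step rather than ever reaching preemption.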

What you expected to happen:
The high-priority PyTorchJob, configured identically to the low-priority one, should be able to preempt the low-priority PyTorchJob.

How to reproduce it (as minimally and precisely as possible):

  • ClusterQueue, ResourceFlavors, and LocalQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue-asr
spec:
  namespaceSelector: {} 
  cohort: h100-dev
  flavorFungibility:
    whenCanBorrow: TryNextFlavor
    whenCanPreempt: Preempt
  preemption:
    borrowWithinCohort:
      maxPriorityThreshold: 100
      policy: LowerPriority
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: single-node-h100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"
        borrowingLimit: "8"
    - name: multi-node-h100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "8"
        borrowingLimit: "8"
  stopPolicy: None
---

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: multi-node-h100
spec:
  nodeLabels:
    mlp: multi-node

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: single-node-h100
spec:
  nodeLabels:
    mlp: single-node
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: asr
spec:
  clusterQueue: cluster-queue-asr
  • pytorch-simple-low-priority
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple-low-priority
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: asr
    kueue.x-k8s.io/priority-class: low-priority
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
#              If you have gpu, pytorch-mnist-gpu would be helpful. pytorch-mnist-gpu is approximately 22GB
#              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=100000"
              resources:
                limits:
                  nvidia.com/gpu: 8

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
#              If you have gpu, pytorch-mnist-gpu would be helpful. pytorch-mnist-gpu is approximately 22GB
#              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=100000"
              resources:
                limits:
                  nvidia.com/gpu: 8
  • pytorch-simple-high-priority
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple-high-priority
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: asr
    kueue.x-k8s.io/priority-class: high-priority
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
#              If you have gpu, pytorch-mnist-gpu would be helpful. pytorch-mnist-gpu is approximately 22GB
#              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
              resources:
                limits:
                  nvidia.com/gpu: 8

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.29.5
  • Kueue version (use git describe --tags --dirty --always): 0.10.0
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.5 LTS
  • Kernel (e.g. uname -a): Linux scl-c26-r3-svr05 5.15.0-117-generic
dafu-wu added the kind/bug label Dec 25, 2024
dafu-wu (Author) commented Dec 25, 2024

Should low-priority workloads be considered in maxCapacity?

maxCapacity := a.cq.PotentialAvailable(fr)

https://github.com/kubernetes-sigs/kueue/blame/95bc3b28da49862186f5e38924e1507ce4b7c703/pkg/scheduler/flavorassigner/flavorassigner.go#L608C22-L608C40
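
To make the question concrete, here is a hedged sketch (not the flavorassigner code; the low-priority job's per-flavor usage is an assumed number) contrasting a fit check that compares the request only against the flavor's maximum capacity with a hypothetical variant that also counts quota held by preemptible, lower-priority workloads:

```go
// Hedged sketch of the question above; maxCapacity stands in for the value
// returned by PotentialAvailable, and preemptibleUsage is an assumption.
package main

import "fmt"

// fits models a per-flavor check for a single resource (nvidia.com/gpu here).
func fits(request, maxCapacity, preemptibleUsage int64, countPreemptible bool) bool {
	if countPreemptible {
		// Hypothetical variant: quota currently held by lower-priority
		// workloads is treated as capacity that preemption could free up.
		return request <= maxCapacity+preemptibleUsage
	}
	// Reading suggested by the condition message: the request is compared
	// against the flavor's maximum capacity alone.
	return request <= maxCapacity
}

func main() {
	const workerRequest = 16   // 2 workers x 8 GPUs from the high-priority job
	const maxCapacity = 8      // value reported for multi-node-h100
	const preemptibleUsage = 8 // assumed GPUs held by the low-priority job in this flavor

	fmt.Println("ignoring preemptible usage:", fits(workerRequest, maxCapacity, preemptibleUsage, false)) // false -> stays Pending
	fmt.Println("counting preemptible usage:", fits(workerRequest, maxCapacity, preemptibleUsage, true))  // true  -> preemption could be attempted
}
```

Whether PotentialAvailable is meant to include quota that preemption could reclaim is the open question here.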
