
PyTorch job with a pod exception is unable to recover after retry #2300

Open
shaoqingyang opened this issue Oct 22, 2024 · 3 comments

@shaoqingyang

What happened?

I created a PyTorchJob that uses three pods.

[screenshot of the PyTorchJob attached]

When I delete one of the pods (a worker), it is recreated, but it cannot rejoin the training cluster.
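
For reference, the failure mode can be reproduced roughly as follows. This is only a sketch: the job name, namespace, and replica layout are placeholders, since the actual spec is only visible in the screenshot above.

# assuming, for illustration, a PyTorchJob named "my-job" with one master and two workers
$ kubectl delete pod my-job-worker-0 -n <namespace>
# the training operator recreates the pod, but it does not rejoin the running training job
$ kubectl get pods -n <namespace>
$ kubectl logs my-job-worker-0 -n <namespace>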

What did you expect to happen?

The recreated pod should rejoin the cluster and continue training.

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@shaoqingyang
Author

[screenshot attachment]

@Syulin7
Contributor

Syulin7 commented Oct 25, 2024

@shaoqingyang Are you using DeepSpeed for model training?

This issue is not caused by the training-operator. You need to confirm whether the training framework you are using supports job recovery when one of the processes exits and is restarted. IIRC, DeepSpeed does not support this.
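
For context on what "supports job recovery" means here, below is a minimal sketch assuming the workers are launched with plain PyTorch elastic (torchrun) rather than the DeepSpeed launcher. With a c10d rendezvous and a non-zero --max-restarts, a restarted worker can rejoin the rendezvous; the endpoint, job id, and script name are placeholders.

# illustrative torchrun launch that tolerates a worker being restarted
$ torchrun \
    --nnodes=3 \
    --nproc_per_node=1 \
    --max-restarts=3 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=<master-host>:29400 \
    --rdzv_id=<job-id> \
    train.py

If the job is launched with the DeepSpeed launcher instead, this kind of single-worker recovery is, as noted above, not supported, and the whole job has to be restarted.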

@Syulin7
Contributor

Syulin7 commented Oct 25, 2024

Re-creating all of the PyTorchJob's pods, as proposed in #2269, is another solution.
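
Until that lands, a similar effect can be achieved manually by deleting all of the job's pods so that they restart together and form a new rendezvous. This is only an illustration: the label key may differ between training-operator versions, and the namespace and job name are placeholders.

# illustrative workaround: restart every pod of the job at once
$ kubectl delete pods -n <namespace> -l training.kubeflow.org/job-name=<pytorchjob-name>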
