
PyTorch job with a pod exception is unable to recover after retry #2300

Open
shaoqingyang opened this issue Oct 22, 2024 · 3 comments

@shaoqingyang

What happened?

I created a PyTorchJob that uses three pods.

[screenshot of the PyTorchJob attached]

When I delete one of the pods (a worker), it is recreated, but it cannot rejoin the training cluster.
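
For reference, the failure mode can be reproduced roughly as follows. This is only a sketch: the job name, namespace, and replica layout are placeholders, since the actual spec is only visible in the screenshot above.

# assuming, for illustration, a PyTorchJob named "my-job" with one master and two workers
$ kubectl delete pod my-job-worker-0 -n <namespace>
# the training operator recreates the pod, but it does not rejoin the running training job
$ kubectl get pods -n <namespace>
$ kubectl logs my-job-worker-0 -n <namespace>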

What did you expect to happen?

The recreated pod should rejoin the cluster and continue training.

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@shaoqingyang
Author

[screenshot attachment]

@Syulin7
Contributor

Syulin7 commented Oct 25, 2024

@shaoqingyang Are you using DeepSpeed for model training?

This issue is not caused by the training-operator. You need to confirm whether the training framework you are using supports job recovery when one of the processes exits and is restarted. IIRC, DeepSpeed does not support this.
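
For context on what "supports job recovery" means here, below is a minimal sketch assuming the workers are launched with plain PyTorch elastic (torchrun) rather than the DeepSpeed launcher. With a c10d rendezvous and a non-zero --max-restarts, a restarted worker can rejoin the rendezvous; the endpoint, job id, and script name are placeholders.

# illustrative torchrun launch that tolerates a worker being restarted
$ torchrun \
    --nnodes=3 \
    --nproc_per_node=1 \
    --max-restarts=3 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=<master-host>:29400 \
    --rdzv_id=<job-id> \
    train.py

If the job is launched with the DeepSpeed launcher instead, this kind of single-worker recovery is, as noted above, not supported, and the whole job has to be restarted.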

@Syulin7
Contributor

Syulin7 commented Oct 25, 2024

Re-creating all of the PyTorchJob's pods, as proposed in #2269, is another solution.
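
Until that lands, a similar effect can be achieved manually by deleting all of the job's pods so that they restart together and form a new rendezvous. This is only an illustration: the label key may differ between training-operator versions, and the namespace and job name are placeholders.

# illustrative workaround: restart every pod of the job at once
$ kubectl delete pods -n <namespace> -l training.kubeflow.org/job-name=<pytorchjob-name>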
