@shaoqingyang Are you using DeepSpeed for model training?
This issue is not caused by training-operator. You need to confirm whether the training framework you are using supports job recovery if one of the processes exits and is restarted. IIRC, DeepSpeed does not support it.
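To make "supports job recovery" concrete, here is a minimal, framework-agnostic sketch of the resume-from-checkpoint pattern a training script needs so that a restarted process can pick up where it left off. It uses only the Python standard library; `load_step`, `save_step`, `train`, and the `crash_at` parameter are hypothetical names made up for illustration, not part of DeepSpeed, PyTorch, or training-operator. It only covers the single-process side: in a real distributed job the surviving workers must also accept the restarted worker back into the process group, which this sketch does not attempt.

```python
import json
import os
import tempfile


def load_step(path):
    """Return the last completed step recorded at `path`, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path, step):
    """Write the checkpoint to a temp file, then rename it so a crash mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path, total_steps, crash_at=None):
    """Resume from the last checkpointed step; `crash_at` simulates a worker pod being deleted."""
    step = load_step(path)
    while step < total_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated worker exit")
        step += 1  # stand-in for one real training step
        save_step(path, step)
    return step


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(ckpt, total_steps=10, crash_at=4)  # first run is interrupted
    except RuntimeError:
        pass
    print("resuming from step", load_step(ckpt))
    print("finished at step", train(ckpt, total_steps=10))  # restarted run continues
```

If the framework in use has no equivalent of this resume step, or cannot re-admit a restarted rank, recreating the pod will not be enough for training to continue.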
What happened?
I created a PyTorchJob that uses three pods.
When I delete a worker pod, it is recreated, but it cannot rejoin the cluster.
What did you expect to happen?
The recreated pod rejoins the cluster and training continues.
Environment
Kubernetes version:
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍