10 Sep 15:14

andreyvelich

v1.8.1

0f8735f

v1.8.1 release Latest

This is the Training Operator v1.8.1 release.

Bug Fixes

[Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

New Contributors

@mszadkow made their first contribution in #2243
@helenxie-bit made their first contribution in #2180

Contributors

mszadkow and helenxie-bit

23 Jul 18:10

andreyvelich

v1.8.0

f8687ca

v1.8.0 release

This is the Training Operator v1.8.0 release.

This release introduces a new Python API for LLM fine-tuning that makes it simpler to fine-tune foundation models on distributed PyTorch nodes.

Install the Kubeflow Training SDK as follows to try it:

pip install -U "kubeflow-training[huggingface]"

LLMs Fine-Tuning API

Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
[SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
Train api dataset download changes (#1959 by @deepanker13)
Train api init container creation (#1958 by @deepanker13)
[SDK] Add docstring for Train API (#2075 by @andreyvelich)

Breaking Changes

[SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)
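
The flag rename listed above means that operator arguments which still pass the old flag need updating before an upgrade. As a minimal sketch, assuming the new flag is spelled `webhook-server-port` and that arguments are available as a plain list of strings, the rewrite could look like this. The helper `migrate_args` and the port value in the usage note are hypothetical and are not part of the operator.

```python
OLD_FLAG = "monitoring-port"
NEW_FLAG = "webhook-server-port"  # assumed spelling of the renamed flag


def migrate_args(args):
    """Return a copy of args with the old flag name replaced by the new one.

    Hypothetical helper for illustration; it only renames the flag and
    leaves every value and every other argument untouched.
    """
    migrated = []
    for arg in args:
        # Split "--flag=value" into the flag name and its optional value.
        name, sep, value = arg.partition("=")
        stripped = name.lstrip("-")
        if name.startswith("-") and stripped == OLD_FLAG:
            # Preserve the original number of leading dashes.
            dashes = name[: len(name) - len(stripped)]
            migrated.append(f"{dashes}{NEW_FLAG}{sep}{value}")
        else:
            migrated.append(arg)
    return migrated
```

For example, `migrate_args(["--monitoring-port=9443"])` yields `["--webhook-server-port=9443"]`, where 9443 is only an illustrative value.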

New Features

Control Plane Updates

Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
Implement webhook validation for the TFJob (#2051 by @tenzen-y)
Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
Upgrade Go version to v1.22 (#2046 by @tenzen-y)

SDK Improvements

[SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
[SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
[SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
Training operator SDK unit test (#1938 by @deepanker13)
[SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)
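
Since the SDK now requires Kubernetes v1.27.2 or newer, it can be useful to check a cluster's reported version before using it. The following is a rough sketch of such a comparison using only the standard library. The function names are hypothetical, and how the cluster version string is obtained is left out.

```python
import re

MIN_KUBERNETES_VERSION = "v1.27.2"  # minimum stated in the release notes


def parse_version(version):
    """Parse 'v1.27.2' (or '1.27', or 'v1.27.3-suffix') into an int tuple."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def meets_minimum(cluster_version, minimum=MIN_KUBERNETES_VERSION):
    """Return True if cluster_version is at least the minimum version."""
    return parse_version(cluster_version) >= parse_version(minimum)
```

Tuple comparison keeps the check simple; any build suffix after the patch number is ignored rather than compared.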

Bug Fixes

[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
[SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
[SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
Fix volcano podgroup update issue (#2079 by @ckyuto)
Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
Updated examples for train API (#2077 by @shruti2522)
Fail job for non-retryable exit codes (#2071 by @kellyaa)
E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
fix wrong filepath in the simple example command (#2062 by @qzoscar)
fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
Fix URL in python SDK setup.py (#2011 by @garymm)
Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
train api jupyternotebook fix (#1984 by @deepanker13)
fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
[fix] replace ${go env GOPATH} with $(go env GOPATH) (#1952 by @double12gzh)
Fixing issues with providing existing service account (#1918 by @rpemsel)
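
One of the fixes above fails a job when a container exits with a non-retryable code. The sketch below illustrates that kind of decision. It assumes the convention used by the `ExitCode` restart policy, where exit codes of 128 and above are treated as retryable and codes 1 to 127 as permanent errors. That threshold and both function names are assumptions here, not taken from the release notes.

```python
RETRYABLE_THRESHOLD = 128  # assumed boundary between permanent and retryable codes


def is_retryable_exit_code(exit_code):
    """Return True if the exit code is treated as retryable (assumed rule)."""
    return exit_code >= RETRYABLE_THRESHOLD


def should_fail_job(exit_code):
    """Return True if a container exit should mark the job as failed.

    Exit code 0 means success, so the job is not failed. Any other code
    below the assumed threshold is a permanent error.
    """
    return exit_code != 0 and not is_retryable_exit_code(exit_code)
```

Under these assumptions, an exit code of 1 fails the job, while 137 (a process killed by SIGKILL) leaves it eligible for a restart.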

Misc

Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
Update training operator image to latest (#2089 by @johnugeorge)
Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
docs: updating docs for local development (#2074 by @franciscojavierarceo)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
[docs] development guide update (#1995 by @shashank-iitbhu)
Add Kubeflow Website links to README (#1983 by @andreyvelich)
publish trainer hugging face image (#1985 by @deepanker13)
Adding Training image needed for train api (#1963 by @deepanker13)
Add test to create PyTorchJob from func (#1979 by @andreyvelich)
Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
utils changes needed to add train api (#1954 by @deepanker13)
Adding parallel support for coveralls (#1956 by @johnugeorge)
chore: pkg import only once (#1950 by @testwill)
fix nproc env in elas...

Contributors

garymm, rpemsel, and 23 other contributors

28 Apr 18:37

johnugeorge

v1.8.0-rc.0

643af3d

v1.8.0-rc.0 release Pre-release

New features

Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
Adding Training image needed for train api #1963 (deepanker13)
[SDK] Train API #1962 (deepanker13)
Train api dataset download changes #1959 (deepanker13)
Train api init container creation #1958 (deepanker13)
Publish trainer hugging face image #1985 (deepanker13)
Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
Implement webhook validation for the TFJob #2051 (tenzen-y)
Implement webhook warnings for the MXJob #2058 (tenzen-y)
Implement webhook validations for the PaddleJob #2057 (tenzen-y)
Fail job for non-retryable exit codes #2071 (kellyaa)
Adding fine tune example with s3 as the dataset store #2006 (deepanker13)

Bug fixes

fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
fix: volcano podgroup should have a non-empty queue name #1977 (lowang-bh)
Fix Master Label for PyTorchJob #1974 (andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
Upgrade controller-gen to v0.14.0 #2026 (champon1020)
Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
Fix URL in python SDK setup.py #2011 (garymm)

Misc

Adding parallel support for coveralls #1956 (johnugeorge)
torchrun example with cpu version pytorch #1965 (kuizhiqing)
[SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
[SDK] Add information about TrainingClient logging #1973 (andreyvelich)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
E2E: Replace outdated images with latest ones #2083 (tenzen-y)
Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)

01 Nov 07:49

johnugeorge

v1.7.0

5525468

v1.7.0 release

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid depending on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

07 Aug 13:00

johnugeorge

v1.7.0-rc.0

434cef7

v1.7.0-rc.0 release Pre-release

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid depending on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

21 Mar 19:37

johnugeorge

v1.6.0

66aa635

v1.6.0 release

Note: Because scheduler-plugins has changed its API group from sigs.k8s.io to x-k8s.io, future releases of the Training Operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773

Note: The latest Python SDK (v1.6) does not support earlier Training Operator versions; it requires Training Operator v1.6.0 or later. Related: #1702

New Features

Support for k8s v1.25 in CI #1684 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
Adopting coscheduling plugin #1724 (tenzen-y)
Support for Paddlepaddle #1675 (kuizhiqing)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
[SDK] Create Unify Training Client #1719 (andreyvelich)

Bug fixes

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Fix XGBoost conditions bug #1737 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Fix status lost #1697 (ggaaooppeenngg)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)

Misc

Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
Fix Python installation in CI #1759 (tenzen-y)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Update join Slack link #1750 (Syulin7)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
Add Yuki to reviewer group #1739 (johnugeorge)
Trim down CRD descriptions #1735 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Removing deprecated Job Labels #1702 (johnugeorge)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Update deployment.yaml #1668 (OmriShiv)
Upgrade Go version to v1.19 #1663 (tenzen-y)
Upgrade kubernetes version for test #1667 (tenzen-y)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
Update the cmd to support MPI operator in ReadME #1656 (denkensk)

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [#1745](https://github.com/kubeflow/t...
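
The name validation referenced in this release checks job names against the DNS-1035 label format. As an illustrative sketch only, and not the operator's own implementation, the rule can be expressed with the standard DNS-1035 label definition: lowercase alphanumerics or `-`, starting with a letter, ending with an alphanumeric, and at most 63 characters. The function name is hypothetical.

```python
import re

# Standard DNS-1035 label pattern: starts with a lowercase letter,
# ends with a lowercase letter or digit, hyphens allowed in between.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name):
    """Return True if name is a valid DNS-1035 label."""
    if len(name) > DNS1035_MAX_LENGTH:
        return False
    return DNS1035_LABEL.fullmatch(name) is not None
```

Names such as `pytorch-job-1` pass, while names that start with a digit, contain uppercase letters or underscores, or end with a hyphen are rejected. Resources derived from a job name may add suffixes, so a practical limit on the job name itself can be shorter than 63 characters.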

14 Feb 09:05

johnugeorge

v1.6.0-rc.1

27e5499

v1.6.0-rc.1 release Pre-release

Note: Because scheduler-plugins has changed its API group from sigs.k8s.io to x-k8s.io, future releases of the Training Operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Fix Python installation in CI #1759 (tenzen-y)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Update join Slack link #1750 (Syulin7)
Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
Add Yuki to reviewer group #1739 (johnugeorge)
Fix XGBoost conditions bug #1737 (tenzen-y)
Add E2E test for gang-scheduling #1736 (tenzen-y)
Trim down CRD descriptions #1735 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
Adopting coscheduling plugin #1724 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
[SDK] Create Unify Training Client #1719 (andreyvelich)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
Removing deprecated Job Labels #1702 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Fix status lost #1697 (ggaaooppeenngg)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Support for k8s v1.25 in CI #1684 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
[PaddlePaddle] support paddlejob #1675 (kuizhiqing)
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Update deployment.yaml #1668 (OmriShiv)
Upgrade kubernetes version for test #1667 (tenzen-y)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
Upgrade Go version to v1.19 #1663 (tenzen-y)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
Update the cmd to support MPI operator in ReadME #1656 (denkensk)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
*job API(master) cannot compatible with old job [#1725](https://github.com/kubeflow/training-opera...

26 Jan 13:32

johnugeorge

v1.6.0-rc.0

b8004ae

v1.6.0-rc.0 release Pre-release

v1.6.0-rc.0 release

19 Aug 13:56

johnugeorge

v1.5.0

1435d57

v1.5.0 release

New Features

Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob #1610 (tenzen-y)
Add all generation tools to Makefile #1609 (johnugeorge)
Adding MPI python sdk #1608 (johnugeorge)
Adding XGboost Python sdk #1607 (johnugeorge)
Generating MPI python sdk #1606 (johnugeorge)
Update k8s dependencies to v0.24.1 #1604 (johnugeorge)
Migrate test framework to GHA #1603 (johnugeorge)
Add mpi in update-codegen.sh #1600 (ggaaooppeenngg)
MXNet SDK with Status check fix #1618 (johnugeorge)

Bug Fixes

fix: MPIJob worker still running when NotEnoughResources #1621 (hackerboy01)
fix comments for pytorch-controller #1620 (hackerboy01)
fix: requeue when expire time is not up yet #1614 (Garrybest)
Look for fully-qualified job role label in Python sdk #1588 (person142)
fix torch env typo #1573 (kuizhiqing)
Restart job on failure for Always,OnFailure Policy #1572 (georgkaleido)
Increase success threshold #1568 (haoxins)
update status.startTime for pytorchjob and xgboostjob #1567 (cheimu)
fix: add mpijobs to kubeflow training role #1565 (henrysecond1)
fix PyTorchJob status inaccuracy when task replica scale down #1593 (PeterChg)
fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set #1557 (cheimu)
fix api reader issue #1551 (zw0610)
fix label and CleanPodPolicy for mpi-controller #1550 (zw0610)
fix UpdateJobStatusInApiServer when gang-scheduling is enabled #1549 (zw0610)
fix: add namespace filtering when getting pods/services for jobs #1545 (henrysecond1)
fix: set mpijob runPolicy.cleanPodPolicy to default none #1554 (cheimu)

Misc

Update training controller image to latest #1625 (johnugeorge)
Update SDK version to 1.5.0 #1624 (johnugeorge)
Upgrade common to v0.4.3 #1623 (johnugeorge)
Adding GHA for automatic image build and push #1615 (johnugeorge)
Remove presubmit test depending on optional-test-infra #1596 (aws-kf-ci-bot)
chore: stop action on first fail #1595 (jasonliu747)
update img url in design doc #1591 (zw0610)
Remove uncalled mpi-controller DeletePodsAndServices() #1558 (cheimu)
Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy #1556 (cheimu)
Remove table-logger dependency #1544 (person142)
Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf_operator #1542 (dependabot[bot])

28 Jun 18:31

johnugeorge

v1.5.0-rc.0

8c6eab2

v1.5.0-rc.0 release Pre-release

Closed issues:

MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
unable to fetch TFJob when I use client.go run tfjob #1612
Pytorchjob dist-mnist no training logs #1601
kubectl get tfjob -o yaml, but not status output #1598
missing image in tf_job_design_doc.md #1590
Labels in Python client are out of date #1587
PyTorchJob Pods "Not Ready" After Completing Training #1577
cannot use "github.com/go-openapi/spec".Schema{...} (type "github.com/go-openapi/spec".Schema) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value #1576
PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570
pytorchjob doesn't have status.startTime. #1566
Optional-test-infra Deprecation Notice - Training #1561
Should we update MPIJob unit test CleanPodPolicy field? #1555
--enable-gang-scheduling=true doesn't work for MPIJob #1548
PyTorchJob fails when creating a task with a different namespace but the same name #1543
Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: "null" after enable-gang-scheduling #1538
Job TTLs not working #1533
Support PodGroup in scheduler-plugins/coscheduling #1518
support elastic training #1515
Modified the configuration of RootLogger #1514
Add checking import order in CI #1510
Scale down of pytorchJob cause workers pod to restart #1509
Support label selector based success/failure conditions #1507
[feat] Support SuccessPolicy in PyTorchJob #1505
pytorch elastic scheduler error #1504
Could you add the example of MPIJob in this repository #1502
[Feature] Create a Informer/ClientSet for PyTorch Jobs #1499
[feature] Make init container injection logic available to all jobs #1498
Roadmaps for 1.4 release #1496
[bug] (MpiJob)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. #1494
Reconcile PyTorch Job error Operation cannot be fulfilled on pytrochjobs.kubeflow.org #1492
Python PytorchJob: no attribute openapi_types for example code #1481
PyTorch DistributedDataParallel training with multi nodes #1475
Installing kubeflow-training breaks import for other kubeflow packages (katib, fairing, etc.) #1471
Deprecate ksonnet and use python/golang to submit jobs #1468
Help Wanted in ParameterServerStrategy Example. #1459
Bug: SomeTimes Coredumped using tfjob #1456
[question] PyTorchJob MNIST example training speed #1454
tfjob status not match when EnableDynamicWorker set true #1452
training-operator set scheduler error #1447
[sdk]: Replace TableLogger component in the SDK for better support with ipykernel>=6.x #1446
SDK: wait_for_job reports typeError #1445
Update prometheus monitoring doc #1443
Master branch should provide a nightly image #1433
Clean up test folder before testing #1429
Clean up TF specific docs #1424
[feature] Support SchedulingPolicy in PyTorchJob #1414
Hyperlinks in the "Overview" section is incorrect/not found #1411
add workqueue metric #1407
Validation fails for MXJob Tune example #1402
Rate exceeded for aws ecr image #1400
change layout to follow the standard of kubebuilder? #1397
[example] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist #1393
Update kubeflow/website for 1.4 release #1392
Cut beta release of tf-operator for 1.4 release #1385
"invalid memory address or nil pointer dereference" #1382
some questions about job sync #1379
Provides a default Grafana dashboard #1376
[feature] Support different PS/worker types #1369
Need to copy all (mainly pytorch) framework's example dir to tf-operator/examples #1366
Add more CRD validations markers to block invalid job on client apply #1363
Update presubmit and post submit job triggers #1354
Optimize post submit jobs flow #1353
Enable leader election in controller manager using controllermanagerconfig #1350
Support mpi jobs in universal operator #1345
post-submit job failure in master branch #1343
Improve observability of universal operator #1340
Best practice to organize main.go and Dockerfile? #1333
Should training operator keep clientset in the same repository? #1332
Test image has incorrect tag? #1329
Prepare e2e tests for all frameworks #1323
Reduce e2e replica-restart-policy-tests running time #1319
Improve logs structure by consolidating libs from controller runtime and controllers #1313
Enable tests for all frameworks #1311
[bug] The pod will be recreated until the expectation expires #1306
Upgrade CRDs to apiextensions.k8s.io/v1 #1304
Add role details as new columns to kubectl get jobs output for CRD. #1301
How to handle long pending pods in a TF-job? #1282
Could you release a new version of Python SDK #1279
Update swagger.json schema for TFJobSpec to include RunPolicy [#1278](https://github.com/kubeflow...

Releases: kubeflow/training-operator
