Name	Modified	Size	Downloads / Week
README.md	2025-01-21	15.1 kB	0
v1.9.0 release source code.tar.gz	2025-01-21	2.6 MB	0
v1.9.0 release source code.zip	2025-01-21	3.2 MB	0
Totals: 3 Items		5.8 MB	0

This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.
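
As a rough illustration of what a JAXJob might look like, the sketch below builds a manifest as a plain Python dict. The field layout (`jaxReplicaSpecs`, a `Worker` replica type, a container named `jax`) is an assumption based on how the other v1 job kinds are structured, and the job name and image are hypothetical placeholders. Check the JAXJob API reference in the repository before relying on any of these names.

```python
import json


def build_jaxjob_manifest(name, image, workers=2, namespace="default"):
    """Return a JAXJob manifest as a plain dict.

    The field names below are assumptions modeled on the other
    kubeflow.org/v1 job kinds; they are not taken from the release notes.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "JAXJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed replica-spec key, by analogy with pytorchReplicaSpecs.
            "jaxReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            # Container name "jax" is an assumed default.
                            "containers": [{"name": "jax", "image": image}]
                        }
                    },
                }
            }
        },
    }


# Hypothetical job name and image, for illustration only.
manifest = build_jaxjob_manifest("jaxjob-demo", "example.com/jax-train:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming the v1.9.0 CRDs are installed in the cluster.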

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.
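
A minimal sketch of how the managedBy field might be set on a job manifest follows. It assumes the field lives under `spec.runPolicy.managedBy` and that `kueue.x-k8s.io/multikueue` is the value that hands the job to MultiKueue; neither detail is stated in these release notes, so treat both as assumptions to verify against PR #2203. The helper name `with_managed_by` and the sample job are hypothetical.

```python
import copy

# Assumed manager identifier for MultiKueue; verify against the API docs.
MULTIKUEUE_MANAGER = "kueue.x-k8s.io/multikueue"


def with_managed_by(job, manager=MULTIKUEUE_MANAGER):
    """Return a copy of a job manifest with runPolicy.managedBy set.

    The input dict is left untouched so the same base manifest can be
    reused for jobs that stay under the built-in controller.
    """
    updated = copy.deepcopy(job)
    run_policy = updated.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["managedBy"] = manager
    return updated


# Hypothetical minimal PyTorchJob manifest, for illustration only.
job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo"},
    "spec": {"pytorchReplicaSpecs": {}},
}
managed = with_managed_by(job)
print(managed["spec"]["runPolicy"])
```

Returning a copy rather than mutating in place keeps the original manifest usable for jobs that should not be delegated to an external controller.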

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
JAX example for MNIST SPMD and add CI testing (#2390 by @saileshd1402)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Trainer V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Added test for create-pytorchjob.ipynb python notebook (#2274 by @saileshd1402)
Remove zw0610 from approvers (#2343 by @zw0610)
Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
Refine the server-side apply installation args (#2337 by @tenzen-y)
Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
Update tf job examples to tf v2 (#2270 by @YosiElias)
Remove Prometheus Monitoring doc (#2301 by @sophie0730)
Upgrade Deepspeed demo dependencies (#2294 by @Syulin7)
[SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
[SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
[SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
[SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
[SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
[SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
[SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
[SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
Update README and out-of-date docs (#2252 by @andreyvelich)
Clean up Go modules (#2238 by @tenzen-y)
Change isort profile to black for full compatibility (#2234 by @Ygnas)
Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
Implement pre-commit hooks (#2184 by @droctothorpe)
Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
Update JAX integration proposal (#2165 by @sandipanpanda)
Update release document (#2153 by @andreyvelich)
update volcano to v1.9.0 (#2148 by @lowang-bh)
Update Slack Invitation (#2142 by @andreyvelich)
Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
Add GitHub Issue Template (#2129 by @andreyvelich)
Update the images to the latest tag in master branch (#2128 by @johnugeorge)
Updated Github Action Workflows as per issue [#2117] (#2123 by @hkiiita)
changed package name to flake8 to fix pytests pip install (#2109 by @ChristopheBrown)
chore(fix): isort xgboost (#2098 by @harshithbelagur)
Fix isort on examples/pytorch (#2094 by @marcmaliar)

Source: README.md, updated 2025-01-21

Kubeflow Training Operator Files

Distributed ML Training and Fine-Tuning on Kubernetes
