This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.
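
As a rough illustration of what a JAXJob could look like, the sketch below builds a manifest as a plain Python dictionary. The `kubeflow.org/v1` API version, the `jaxReplicaSpecs` field, the `jax` container name, and the image and command are assumptions modeled on the other v1 job kinds, not details taken from these release notes; check the CRD shipped with v1.9.0 before relying on them.

```python
import json


def build_jaxjob(name: str, image: str, workers: int = 2) -> dict:
    """Return a JAXJob manifest as a dict (field names are assumptions)."""
    return {
        "apiVersion": "kubeflow.org/v1",  # assumed to match the other v1 job kinds
        "kind": "JAXJob",
        "metadata": {"name": name},
        "spec": {
            "jaxReplicaSpecs": {  # assumed field name
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "jax",  # assumed container name
                                    "image": image,  # hypothetical image
                                    "command": ["python", "train.py"],
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


# Hypothetical job name and image, used only for illustration.
manifest = build_jaxjob("jax-demo", "example.com/jax-train:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with the usual cluster tooling once the field names have been verified against the installed CRD.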

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.
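
To make the managedBy addition more concrete, here is a minimal sketch that marks a job as managed by an external controller. It assumes the field lives at `spec.runPolicy.managedBy` and that MultiKueue is identified by the value `kueue.x-k8s.io/multikueue`; neither detail is stated in these notes, and the job dictionary is a hypothetical placeholder.

```python
import copy

# Assumed identifier for the MultiKueue controller; verify against the Kueue docs.
MULTIKUEUE_MANAGER = "kueue.x-k8s.io/multikueue"


def set_managed_by(job: dict, manager: str = MULTIKUEUE_MANAGER) -> dict:
    """Return a copy of the job with spec.runPolicy.managedBy set (assumed path)."""
    patched = copy.deepcopy(job)
    run_policy = patched.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["managedBy"] = manager
    return patched


# Hypothetical, heavily trimmed PyTorchJob used only to show the patch.
job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo"},
    "spec": {"pytorchReplicaSpecs": {}},
}

patched = set_managed_by(job)
print(patched["spec"]["runPolicy"])
```

The helper copies the input so the same base definition can be submitted with or without delegation to an external controller.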

Breaking Changes

  • Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
  • Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
  • Update the name of PVC in train API (#2187 by @helenxie-bit)
  • Remove support for MXJob (#2150 by @tariq-hasan)
  • Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

New Examples

  • FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
  • Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

  • Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
  • [Feature] Support managed by external controller (#2203 by @mszadkow)
  • Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
  • Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
  • Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
  • ARM64 supported in PyTorch examples (#2116 by @danielsuh05)
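
The first item in the list above adds a validation rule for elastic PyTorchJobs. The idea can be sketched as follows; this is a Python illustration of the rule as described in the PR title, not the operator's actual validation code, and the `elasticPolicy` and `pytorchReplicaSpecs` field names are assumed from the v1 PyTorchJob API.

```python
from typing import List


def validate_elastic_workers(job: dict) -> List[str]:
    """Return validation errors for a PyTorchJob-like dict (illustrative only)."""
    spec = job.get("spec", {})
    errors: List[str] = []
    # The rule only applies when an elastic policy is present.
    if spec.get("elasticPolicy") is not None:
        worker = spec.get("pytorchReplicaSpecs", {}).get("Worker")
        if worker is None:
            errors.append(
                "elasticPolicy is set but no Worker replica spec is configured"
            )
    return errors


# Hypothetical specs: one with workers, one with only a Master replica.
valid_job = {
    "spec": {
        "elasticPolicy": {"minReplicas": 1, "maxReplicas": 3},
        "pytorchReplicaSpecs": {"Worker": {"replicas": 2}},
    }
}
invalid_job = {
    "spec": {
        "elasticPolicy": {"minReplicas": 1, "maxReplicas": 3},
        "pytorchReplicaSpecs": {"Master": {"replicas": 1}},
    }
}

print(validate_elastic_workers(valid_job))
print(validate_elastic_workers(invalid_job))
```

Returning a list of messages rather than raising on the first problem mirrors how admission-style validation usually reports every issue at once; that design choice is part of the sketch, not something the release notes specify.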

SDK Updates

  • [SDK] Adding env vars (#2285 by @tarekabouzeid)
  • [SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
  • [SDK] move env var to constants.py (#2268 by @varshaprasad96)
  • [SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
  • [SDK] Read namespace from the current context (#2255 by @andreyvelich)
  • [SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
  • [SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Trainer V2

  • KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
  • KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
  • Always update TrainJob status on errors (#2352 by @astefanutti)
  • Fix TrainJob status comparison and update (#2353 by @astefanutti)
  • Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
  • KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
  • KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
  • KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
  • KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
  • KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
  • KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
  • KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
  • KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
  • KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
  • KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
  • KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
  • KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
  • KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
  • KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
  • KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
  • KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
  • [v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
  • KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
  • KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
  • KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
  • KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
  • KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
  • KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
  • KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
  • KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
  • KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
  • KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

  • [release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
  • Pin accelerate package version in trainer (#2340 by @gavrissh)
  • [fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
  • [SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
  • [SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
  • [Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
  • [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
  • Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
  • [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
  • fix volcano podgroup update issue (#2079 by @ckyuto)
  • [SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

  • [release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
  • Add e2e test for train API (#2199 by @helenxie-bit)
  • buildx link was broken (#2356 by @Veer0x1)
  • Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
  • Upgrade Go version to v1.23 (#2302 by @tenzen-y)
  • Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
  • Added test for create-pytorchjob.ipynb python notebook (#2274 by @saileshd1402)
  • Remove zw0610 from approvers (#2343 by @zw0610)
  • Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
  • Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
  • Refine the server-side apply installation args (#2337 by @tenzen-y)
  • Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
  • Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
  • Update tf job examples to tf v2 (#2270 by @YosiElias)
  • Remove Prometheus Monitoring doc (#2301 by @sophie0730)
  • Upgrade Deepspeed demo dependencies (#2294 by @Syulin7)
  • [SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
  • [SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
  • [SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
  • [SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
  • [SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
  • [SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
  • [SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
  • [SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
  • Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
  • Update README and out-of-date docs (#2252 by @andreyvelich)
  • Clean up Go modules (#2238 by @tenzen-y)
  • Change isort profile to black for full compatibility (#2234 by @Ygnas)
  • Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
  • Implement pre-commit hooks (#2184 by @droctothorpe)
  • Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
  • Update JAX integration proposal (#2165 by @sandipanpanda)
  • Update release document (#2153 by @andreyvelich)
  • update volcano to v1.9.0 (#2148 by @lowang-bh)
  • Update Slack Invitation (#2142 by @andreyvelich)
  • Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
  • Add GitHub Issue Template (#2129 by @andreyvelich)
  • Update the images to the latest tag in master branch (#2128 by @johnugeorge)
  • Updated Github Action Workflows as per issue [#2117] (#2123 by @hkiiita)
  • changed package name to flake8 to fix pytests pip install (#2109 by @ChristopheBrown)
  • chore(fix): isort xgboost (#2098 by @harshithbelagur)
  • Fix isort on examples/pytorch (#2094 by @marcmaliar)
Source: README.md, updated 2025-01-21