This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Quickstart

Install the Kubeflow Trainer control plane:

:::sh
# Deploy the controller manager (the pod listing below shows JobSet running alongside it).
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

# Check that the control plane pods are running. Example output:
kubectl get pods -n kubeflow-system
# NAME                                                  READY   STATUS    RESTARTS   AGE
# jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
# kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

# Deploy the training runtimes.
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

Install Kubeflow Python SDK:

:::sh
pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

Run your first TrainJob by following the getting started guide.
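For readers who want to see roughly what a TrainJob looks like before opening the guide, here is a minimal sketch that builds a manifest as a plain Python dictionary. It assumes the `trainer.kubeflow.org/v1alpha1` API version, the `runtimeRef` and `trainer.numNodes` fields, and a runtime named `torch-distributed`; none of these are stated in these release notes, so check them against the CRDs installed in your cluster. The helper name `build_trainjob` is hypothetical.

```python
import json


def build_trainjob(name: str, runtime: str, num_nodes: int,
                   namespace: str = "default") -> dict:
    """Build a minimal TrainJob manifest as a dictionary.

    Assumption: the apiVersion and field names below match the TrainJob CRD
    shipped with Kubeflow Trainer 2.0; verify them before applying.
    """
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed version
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Reference to a runtime installed by the runtimes overlay.
            "runtimeRef": {"name": runtime},
            "trainer": {"numNodes": num_nodes},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be piped
    # to `kubectl apply -f -`.
    manifest = build_trainjob("pytorch-demo", "torch-distributed", 2)
    print(json.dumps(manifest, indent=2))
```

The getting started guide remains the authoritative reference; this sketch only illustrates the shape of the resource.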

Breaking Changes

  • Migrate SDK to the kubeflow/sdk repository (#2657 by @eoinfennessy)
  • KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
  • Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
  • Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
  • Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
  • Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)
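Because the API group is now `trainer.kubeflow.org`, manifests written against an earlier group will no longer match the installed CRDs. The sketch below flags such manifests. It assumes that older resources used the bare `kubeflow.org` group, which this changelog does not state; the function names are hypothetical.

```python
NEW_GROUP = "trainer.kubeflow.org"
OLD_GROUP = "kubeflow.org"  # assumption: the group used before the rename


def api_group(api_version: str) -> str:
    """Return the group part of a 'group/version' string ('' for core APIs)."""
    group, sep, _version = api_version.rpartition("/")
    return group if sep else ""


def needs_migration(manifest: dict) -> bool:
    """Report whether a manifest still targets the assumed old API group."""
    return api_group(manifest.get("apiVersion", "")) == OLD_GROUP


if __name__ == "__main__":
    old = {"apiVersion": "kubeflow.org/v2alpha1", "kind": "TrainJob"}
    new = {"apiVersion": "trainer.kubeflow.org/v1alpha1", "kind": "TrainJob"}
    print(needs_migration(old), needs_migration(new))
```

Note that this check would also flag Training Operator V1 resources, which is reasonable given that the V1 source code was removed in this release.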

New Features

LLM Trainer V2

  • KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
  • KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
  • KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
  • KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
  • KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
  • KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
  • KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP-2170: Adding CEL validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

  • [feature]: add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
  • Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
  • Implement MPI Plugin for OpenMPI (#2493 by @tenzen-y)
  • Implement MPI plugin UTs (#2481 by @tenzen-y)
  • Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
  • Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
  • Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
  • Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
  • KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

  • Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
  • KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
  • Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
  • Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

  • Add question-answer example for v2 trainer (#2580 by @solanyn)
  • KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

  • feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
  • feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
  • feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
  • feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

  • [release-2.0] fix(manifests): add rbac config of events for event recorders (#2733 by @rudeigerc)
  • [release-2.0] fix(manifests): fix position of labels of dataset-initializer from pod to job (#2720 by @rudeigerc)
  • [release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
  • [cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
  • [release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
  • [release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)
  • Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
  • fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
  • fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
  • fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
  • Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
  • fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
  • fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
  • Fix MPI Test runnable errors (#2570 by @tenzen-y)
  • Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
  • fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
  • fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
  • fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
  • fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
  • fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
  • fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
  • [hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
  • [hotfix] fix docker cred (#2530 by @mahdikhashan)
  • fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
  • Fix [#2407]: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
  • fix type in model initializer entrypoint (#2489 by @szaher)
  • fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
  • fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
  • Fix missing external types in apply configurations (#2429 by @astefanutti)
  • Fix API Group for Torch Runtime (#2424 by @andreyvelich)
  • Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
  • ControlPlane: Fix flaky integration tests due to missing the latest version of object (#2414 by @tenzen-y)
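One fix in the list above caps `nproc_per_node` based on CPU resources for PyTorch TrainJobs (#2492). The controller's actual logic is not described here, so the following is only an illustration of the general idea: parse a Kubernetes CPU quantity and never launch more processes per node than the CPU allocation allows. The function names, the rounding up of fractional CPUs, and the minimum of one process are all assumptions made for this sketch.

```python
def cpu_quantity_to_millis(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('2', '0.5', '500m') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cap_nproc_per_node(requested: int, cpu_limit: str) -> int:
    """Cap the requested process count at the number of allocated CPUs.

    Assumption: fractional CPUs are rounded up and at least one process
    is always allowed; the real controller may choose differently.
    """
    millis = cpu_quantity_to_millis(cpu_limit)
    whole_cpus = max(1, -(-millis // 1000))  # ceiling division
    return min(requested, whole_cpus)


if __name__ == "__main__":
    print(cap_nproc_per_node(8, "4"))
    print(cap_nproc_per_node(8, "500m"))
```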

Misc

  • [release-2.0] chore: update github runners to oci gh arc runners (#2741 by @koksay)
  • [release-2.0] feat(operator): force trainjob name to be compliant with RFC 1035 for jobset (#2736 by @rudeigerc)
  • [release-2.0] chore: Upgrade JobSet to version 0.8.2 (#2727 by @google-oss-robot)
  • [release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
  • [release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
  • [cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
  • [release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
  • [release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
  • [release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
  • [Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)
  • [release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
  • [release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)
  • Tag Docker images with GitHub release tags (#2662 by @kramaranya)
  • feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
  • Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
  • chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
  • chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
  • [chore] update stale action version to latest (#2642 by @mahdikhashan)
  • Remove TrainJobCreated condition (#2621 by @astefanutti)
  • ci: refactor build-push-images workflow (#2607 by @milinddethe15)
  • Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
  • test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
  • ci: add k8s v1.32 for tests env (#2613 by @milinddethe15)
  • chore(deps): bump torch from 2.5.0 to 2.6.0 in /cmd/runtimes/deepspeed (#2606 by @dependabot[bot])
  • chore(deps): bump golang.org/x/net from 0.36.0 to 0.38.0 (#2602 by @dependabot[bot])
  • test(runtime): add UT for jobset runtime valid function. (#2562 by @Harshal292004)
  • Add Helm chart for kubeflow trainer (#2435 by @ChenYi015)
  • chore(test): Removed the no longer needed github-trigger-rerun-test.yaml (#2589 by @hbelmiro)
  • Add PodNetwork plugin to KEP-2170 Job Pipeline Framework description (#2578 by @tenzen-y)
  • chore(docs): Update Slack channel (#2569 by @andreyvelich)
  • docs: update CONTRIBUTING.md for Kubeflow Trainer V2 (#2561 by @muzzlol)
  • test(runtime): add UT for torch runtime valid function. (#2560 by @IRONICBo)
  • feat(doc): add Runtime API design in KEP-2401. (#2501 by @Electronic-Waste)
  • Update CONTRIBUTING.md (#2512 by @MuhammedgitAli)
  • feat: add replicatedJobs.replicas validations in validateReplicatedJobs function. (#2533 by @IRONICBo)
  • Construct Trainer based on trainer.kubeflow.org/trainjob-ancestor-step label (#2548 by @tenzen-y)
  • chore: Enable GCI for golangci-lint (#2540 by @tenzen-y)
  • [feature] merge GHCR and DockerHub CI jobs (#2537 by @ashwinr64)
  • feat(controller): Refactor the Initializer APIs of TrainJob (#2523 by @andreyvelich)
  • Migrate InfoOptions.podSpecReplias and info.Scheduler.TotalRequests to info.TemplateSpec.PodSet (#2524 by @tenzen-y)
  • [feature] pull images in manifest from ghcr (#2529 by @mahdikhashan)
  • [feature] migrate images to ghcr (#2455 by @mahdikhashan)
  • KEP-2170: Adding validation webhook for v2 trainjob (#2307 by @akshaychitneni)
  • Migrate Info.Trainer to Info.TemplateSpec.PodSet (#2520 by @tenzen-y)
  • Implement E2E for OpenMPI workload (#2500 by @tenzen-y)
  • Bump golang.org/x/net from 0.33.0 to 0.36.0 (#2514 by @dependabot[bot])
  • Move TrainJob marker defaulting and validation integration tests to test/integration/webhooks pkg (#2486 by @tenzen-y)
  • feat(controller): Integrate DependsOn API (#2484 by @andreyvelich)
  • Store E2E manifests to artifacts directory (#2478 by @tenzen-y)
  • Use large runner for building container image (#2475 by @tenzen-y)
  • chore(test): Upload artifacts from dir (#2473 by @andreyvelich)
  • Implement UTs for PlainML plugin (#2469 by @tenzen-y)
  • chore(test): Add E2E tests for Kubeflow Trainer (#2470 by @andreyvelich)
  • KEP-2170: Add Kubeflow Trainer Pipeline Framework Design (#2439 by @tenzen-y)
  • Replace Kueue PodRequests helper with core k/k one (#2461 by @tenzen-y)
  • KEP-2170: Use SSA to reconcile TrainJob components (#2431 by @astefanutti)
  • Bump golang.org/x/net from 0.30.0 to 0.33.0 (#2451 by @dependabot[bot])
  • Use the correct apiVersion name (#2444 by @runzhen)
  • Add 'KEP Usage' KEP and template link (#2423 by @anishasthana)
  • KEP-2170: Add validation to Torch numProcPerNode field (#2409 by @astefanutti)
  • update migration url on readme file (#2436 by @varodrig)
  • IntegrationTests: Wait for expected conditions before emulating the JobSet controller manager (#2425 by @tenzen-y)
  • Nominate @Electronic-Waste as a reviewer (#2427 by @andreyvelich)
  • Update the naming conventions for Kubeflow Trainer (#2415 by @andreyvelich)
  • Rename paddlepaddle_defaults.go file name (#2399 by @ChristianZaccaria)
  • Bump golang.org/x/net from 0.30.0 to 0.33.0 (#2391 by @dependabot[bot])
  • KEP-2170: Add unit and Integration tests for model and dataset initializers (#2323 by @seanlaii)
  • Testing CI in JAX example (#2385 by @saileshd1402)
  • Upgrade huggingface_hub to v0.27.x in dataset initializer v2 (#2379 by @astefanutti)
  • Add Changelog for Training Operator v1.9.0-rc.0 (#2380 by @andreyvelich)
  • Add release branch to the image push trigger (#2376 by @andreyvelich)
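One entry in the list above forces TrainJob names to comply with RFC 1035 for JobSet (#2736). To check a name before submitting a job, the sketch below applies the DNS label rule that Kubernetes uses for RFC 1035 names: lowercase alphanumerics or `-`, starting with a letter, ending with an alphanumeric, at most 63 characters. Whether the operator enforces exactly this length limit is an assumption; the function name is hypothetical.

```python
import re

# RFC 1035 label as used by Kubernetes: starts with a lowercase letter,
# ends with a lowercase letter or digit, hyphens allowed in between.
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63  # assumption: the operator may enforce a shorter limit


def is_rfc1035_label(name: str) -> bool:
    """Return True if name is a valid RFC 1035 DNS label."""
    return (len(name) <= _MAX_LABEL_LENGTH
            and _DNS1035_LABEL.fullmatch(name) is not None)


if __name__ == "__main__":
    for candidate in ("my-trainjob", "1st-job", "Job", "job-"):
        print(candidate, is_rfc1035_label(candidate))
```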
Source: README.md, updated 2025-07-17