This is the Kubeflow Trainer v2.1.0 release.

:::bash
# Install the Kubeflow Trainer controller manager
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"

# Verify that the controller pods are running
kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

# Install the Kubeflow Trainer runtimes
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"

You can now install the controller manager with Helm charts 🚀

:::bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0

Install the Kubeflow Python SDK:

:::bash
pip install -U kubeflow

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)
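
Because PodSpecOverrides is replaced by PodTemplateOverrides, existing TrainJob manifests may need updating before the upgrade. The following is a minimal sketch for locating affected files, not an official migration tool. It assumes the deprecated field is spelled `podSpecOverrides` in YAML manifests; the function name `find_deprecated_overrides` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list YAML manifests that still use the deprecated field.
# Assumption: the old field appears as a `podSpecOverrides:` key in the manifest.
find_deprecated_overrides() {
  local dir="${1:-.}"
  # -r: recurse, -l: print matching file names only, -E: extended regex.
  # Match the key only at the start of a line (after optional indentation).
  grep -rlE --include='*.yaml' --include='*.yml' \
    '^[[:space:]]*podSpecOverrides:' "$dir" | sort
}
```

Running `find_deprecated_overrides ./manifests` prints one path per line for each file that still needs attention, and prints nothing when no file matches.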

New Features

Distributed AI Data Cache

feat(cache): KEP-2655: Adding default runtime with cache and example (#2928 by @akshaychitneni)
feat(cache): KEP-2655 - Supporting readiness probes on cache nodes (#2920 by @akshaychitneni)
feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)

Stream data directly to your GPU nodes with zero-copy transfers from an in-memory cache cluster powered by Apache Arrow and Apache DataFusion. This allows users to load massive tabular datasets efficiently, maximize GPU utilization, and minimize I/O for large-scale pre- or post-training distributed AI workloads.

Explore more about data cache in:

LLM Post-Training

feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)

Kueue Enhancements

Support Topology Aware Scheduling for TrainJobs ([kubernetes-sigs/kueue#7249](https://github.com/kubernetes-sigs/kueue/issues/7249) by @kaisoz)
fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)

Check out the official Kueue docs.

Volcano Scheduler

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
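
Since TrainJob names are now validated against RFC 1035, it can help to check a name locally before submitting a job. The sketch below follows the Kubernetes DNS-1035 label convention (lowercase alphanumerics and `-`, starting with a letter, ending with a letter or digit, at most 63 characters). The exact checks the controller applies are not spelled out in these notes, so treat this as an approximation; the function name `is_rfc1035_label` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the argument looks like a
# DNS-1035 label as Kubernetes defines it, fail (non-zero) otherwise.
is_rfc1035_label() {
  local name="$1"
  # Use the C locale so that [a-z] never matches uppercase letters.
  local LC_ALL=C
  local re='^[a-z]([-a-z0-9]*[a-z0-9])?$'
  # A label must be 63 characters or fewer and match the pattern.
  [ "${#name}" -le 63 ] && [[ "$name" =~ $re ]]
}
```

For example, `is_rfc1035_label "llm-finetune-1" && echo ok` prints `ok`, while names such as `1job`, `My-Job`, or `job-` are rejected.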

Bug Fixes

[release-2.1] fix(ci): Fix the Kubeflow SDK installation with Docker (#2927 by @andreyvelich)
fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
fix: charts dependencies (#2892 by @ls-2018)
fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
fix: read only permission for PRs (#2829 by @jaiakash)
fix: read only permission for PRs (#2827 by @jaiakash)
fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
fix: teraform for oci gpu based vm (#2810 by @jaiakash)
fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
fix: update kubeflow sdk reference (#2780 by @kramaranya)
fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)

Misc

[release-2.1] feat: Adding local execution example notebook (#2924 by @Fiona-Waters)
feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
[release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
[release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)
feat(operator): Add validation for required containers in replicatedJobs (#2722 by @Electronic-Waste)
feat: add controller manager configuration helm chart (#2895 by @kapil27)
chore(ci): Enable Kubernetes API Linter (#2858 by @astefanutti)
feat(runtimes): implement clusterTrainingRuntime deprecation process (#2791 by @tdn21)
feat: add HF token and allow gpu workflow to run from pull request target (#2818 by @jaiakash)
feat(docs): KEP-2442-Support JAX Training Runtime (#2643 by @mahdikhashan)
chore(test): Support e2e cluster setup with Podman (#2861 by @astefanutti)
chore(runtimes): Upgrade torchtune version to v0.6.1 (#2876 by @Electronic-Waste)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
feat(docs): Update Trainer diagram and SDK release (#2867 by @andreyvelich)
feat(docs): Add changelog for Kubeflow Trainer v2.0.1 (#2864 by @andreyvelich)
fix(docs): Update the release document to push all changes (#2865 by @andreyvelich)
chore: Install released version of Kubeflow SDK (#2857 by @kramaranya)
chore(ci): Ignore generated files in .gitattributes (#2855 by @andreyvelich)
feat: Add a public function to create runtime info objects (#2837 by @kaisoz)
chore(test): add uts for coscheduling plugin. (#2582 by @IRONICBo)
feat(ci): Add Trivy Vulnerability Scan (#2826 by @andreyvelich)
chore: merge test cases using PodSpecOverrides into a single case (#2822 by @toVersus)
chore(runtimes): update torchtune CTRs with multiple dependson feature in jobset v0.9.0 (#2823 by @Electronic-Waste)
chore(operator): Bump JobSet to v0.9.0 version (#2821 by @andreyvelich)
feat(docs): How to release Python API modules (#2786 by @andreyvelich)
feat: support for managing gpu enabled self runner infra (#2762 by @jaiakash)
chore: Nominate @astefanutti as Kubeflow Trainer approver (#2808 by @andreyvelich)
chore: deflake test to ensure runtime is created before creating trainjob (#2807 by @toVersus)
feat: KEP-2432: GPU Testing for LLM Blueprints (#2689 by @jaiakash)
chore(docs): Add license scan report and status (#2788 by @fossabot)
chore: Remove tool.hatch.build.targets.wheel from pyproject (#2803 by @kramaranya)
chore: Add unit tests for pkg/apply (#2479 by @akagami-harsh)
chore(runtimes): Remove MPI pi Runtime (#2760 by @andreyvelich)
chore(runtimes): Update packages in DeepSpeed runtime and fix T5 example (#2781 by @andreyvelich)
feat: run workflows on /ok-to-test label (#2639 by @milinddethe15)
feat: Add security contexts to controller managers (#2759 by @kunal-511)
feat(docs): Introduce latest news to the README (#2769 by @andreyvelich)
feat(runtimes): Add Framework Label to the Runtimes (#2761 by @andreyvelich)
feat(runtimes): Remove command from the Runtimes with CustomTrainer (#2754 by @andreyvelich)
feat(docs): Kubeflow Trainer ROADMAP 2025 (#2748 by @andreyvelich)
chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 (#2743 by @andreyvelich)
chore: update github runners to oci gh arc runners (#2739 by @koksay)
feat(operator): force trainjob name to be compliant with RFC 1035 for jobset (#2734 by @rudeigerc)
chore(ci): Add GitHub action to verify PR titles (#2724 by @andreyvelich)
feat(docs): Guide to report security vulnerability (#2718 by @andreyvelich)
chore: Upgrade JobSet to version 0.8.2 (#2726 by @astefanutti)
Add Red Hat to ADOPTERS.md (#2714 by @terrytangyuan)
chore(docs): Add Changelog for v2.0.0-rc.1 (#2709 by @andreyvelich)
chore(docs): Update Release Guide (#2710 by @andreyvelich)
chore: Copy generated CRDs into Helm charts (#2703 by @astefanutti)
feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670 by @Electronic-Waste)
feat: Mutable PodSpecOverrides for suspended TrainJob (#2683 by @astefanutti)
chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2695 by @tenzen-y)
chore: Remove the vendor specific parameters (#2691 by @tenzen-y)
KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2382 by @Doris-xm)
chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2685 by @andreyvelich)
chore(helm): Sync ClusterRule in Helm chart (#2686 by @astefanutti)
Add Changelog for Trainer v2.0.0-rc.0 (#2666 by @kramaranya)
feat(initializer): Updated base image to Debian image and changed install commands compatible with Debian image (#2528 by @Debabrata47)

Source: README.md, updated 2025-11-07

Kubeflow Trainer Files

Distributed AI Model Training and LLM Fine-Tuning on Kubernetes