Axolotl v0.12.0 (released 2025-08-08)

We're introducing a major upgrade to our distributed training feature set, including support for ND-Parallelism for training at scale, support for DeepSpeed's Auto Tensor Parallelism, and FP8 training. We're also excited to announce support for fine-tuning the latest gpt-oss models (and many more!) and a host of fixes and dependency updates.

🎉 New features

ND-Parallel for Advanced Parallelism Strategies

Together with Accelerate, we've introduced ND-Parallel support, allowing you to compose different parallelism techniques like Context Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism to enable fine-tuning large models at scale. Check out the official Hugging Face blog post for more details!

  • Contributed by @SalmanMohammadi and @winglian in #2977 and #3019.
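
As a rough illustration, a config for an 8-GPU run might compose 2-way FSDP sharding with 2-way tensor parallelism and 2-way context parallelism. tensor_parallel_size and context_parallel_size are the option names referenced in these release notes; dp_shard_size is our assumption for the data-parallel sharding dimension, so treat this as a sketch and defer to the ND-Parallel docs for the exact schema.

    # Hypothetical 8-GPU layout: 2 (FSDP shard) x 2 (tensor parallel) x 2 (context parallel)
    dp_shard_size: 2            # assumed key name for FSDP-style sharded data parallelism
    tensor_parallel_size: 2     # shard individual layers across GPUs
    context_parallel_size: 2    # split long sequences across GPUs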

Expanded Model Support

We've added support for a new wave of powerful models:

  • GPT-OSS (#3020 by @winglian) - get up and running right away with our example configs (see the sketch after this list)!
  • Gemma 3 (#2852 by @NanoCode012)
  • Liquid Foundation Model 2 (#2905 by @winglian)
  • Voxtral & Magistral Small 1.1 (#2979 by @NanoCode012)
  • Devstral (#2896 by @NanoCode012)
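
For GPT-OSS specifically, a minimal starting point might look like the sketch below; the model ID, dataset, and LoRA settings are purely illustrative, and the bundled example configs remain the better reference.

    # Illustrative gpt-oss LoRA starting point; prefer the shipped example configs.
    base_model: openai/gpt-oss-20b
    adapter: lora
    lora_r: 16
    lora_alpha: 32
    datasets:
      - path: tatsu-lab/alpaca   # placeholder dataset
        type: alpaca
    sequence_len: 4096
    micro_batch_size: 1
    gradient_accumulation_steps: 4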

Experimental FP8 Mixed-Precision Training with torchao

Check out experimental FP8 mixed-precision training! By leveraging the torchao library, you can train with FP8 data types and perform all-gather ops in FP8, leading to significant memory savings and potential speedups. Read the docs to learn how to enable it.

  • Contributed by @djsaunde in #2926.
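
If you want to try it, the config change is small. The flag names below are assumptions based on the torchao integration described here (fp8 plus an FSDP float8 all-gather toggle); the FP8 docs are the authoritative reference.

    # Experimental FP8 mixed precision via torchao; key names are assumptions.
    fp8: true
    fp8_enable_fsdp_float8_all_gather: true   # do FSDP all-gathers in FP8 for extra memory savings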

Improved Slurm Support

We've fixed some issues that could freeze tasks during preprocessing and added an easy-to-use Slurm example for your large-cluster needs. Check out the README and example.

DeepSpeed Auto Tensor Parallelism (AutoTP)

You can now leverage DeepSpeed's Auto Tensor Parallelism to automatically shard your model's layers across multiple GPUs. This dramatically reduces the VRAM requirement for each GPU, enabling you to fine-tune much larger models than previously possible on the same hardware. Enable it in your YAML config by setting tensor_parallel_size to an integer value.

  • Contributed by @winglian in #2574.
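
As a minimal sketch, assuming you already launch with a DeepSpeed config, AutoTP should only require the option named above (the DeepSpeed config path is illustrative):

    deepspeed: deepspeed_configs/zero1.json   # illustrative path to your DeepSpeed JSON config
    tensor_parallel_size: 4                   # number of GPUs to shard each layer across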

TiledMLP Now Supports FSDP2 and Single GPU

TiledMLP, which reduces activation memory for long sequences, is now more versatile. It's fully compatible with our new FSDP2 implementation and can now be used on single-GPU setups, making it accessible for a wider range of training scenarios.
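
As a sketch, enabling it should be a one-line config change; tiled_mlp is our assumed key name for the feature described here, and the sequence length is only illustrative.

    tiled_mlp: true       # assumed key name; tiles the MLP block to cut activation memory
    sequence_len: 32768   # long-context setting where tiling pays off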

Dion Optimizer Support

We've added support for the Dion optimizer, a scalable and communication-efficient optimizer designed to speed up training across parallelism strategies, giving you another tool to fine-tune larger models on your hardware.

  • Contributed by @winglian in #3014.

Enabled LoRA kernels with FSDP2

FSDP2 and LoRA training is now significantly faster thanks to the integration of optimized kernels, reducing training overhead and speeding up your workflows.

  • Contributed by @djsaunde in #2992.
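
A hedged sketch of combining the two: the lora_*_kernel flags are our best guess at the names of the optimized-kernel toggles, and fsdp_version: 2 selects the FSDP2 path discussed in the deprecation section below (fsdp_config omitted for brevity).

    adapter: lora
    lora_r: 16
    lora_alpha: 32
    lora_qkv_kernel: true   # assumed flag names for the fused LoRA kernels
    lora_o_kernel: true
    lora_mlp_kernel: true
    fsdp_version: 2         # use the FSDP2 code path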

Quality-of-Life & Developer Experience Improvements

  • CLI Autocompletion: Speed up your workflow with new tab-completion for the Axolotl CLI. Simply run axolotl -h to see how to install it for your shell. (by @winglian in #2955)
  • Mid-Training Profiling: You can now start the PyTorch profiler in the middle of a training run, making it easier to debug performance bottlenecks without needing to restart. (by @winglian in #2899)
  • Generic Fused Kernels: Applied generic fused CCE and TiledMLP implementations from Liger to support a wider range of arbitrary models automatically. (by @winglian in #2908)
  • Activation Offloading with CUDA Streams: Reduced VRAM usage by offloading activations to CPU RAM using non-blocking CUDA streams for better GPU utilization. (by @winglian in #2900, fixed for LoRA in #2928)
  • New CLI Launcher: Added a --launcher option with support for launcher-specific arguments, plus cleanup and refactoring. (by @djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2924)
  • Support for lora_target_parameters: Allows targeting parameter names for LoRA, which is useful when targeting a module name isn't possible, as with MoE expert weights (see the sketch after this list). (by @winglian in #3006)
  • Cut Cross Entropy support for SmollM3, Granite, and GraniteMoE: (by @NanoCode012 in [#2993])
  • LoRA Kernels Now Support Biases: Our optimized QLoRA kernels can now be applied to bias terms, increasing the flexibility of your low-rank adaptations. (by @djsaunde in [#3025])
  • Custom Trainer via Module Path: As an alternative to a plugin, you can now define a trainer_cls in your YAML config by pointing to it as a module path (e.g., my_project.trainers.CustomTrainer); see the sketch after this list. (by @winglian in #3024)
  • Prime Intellect Integration: Added support for running jobs on the Prime Intellect platform. (by @winglian in [#3021])
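
A combined sketch of the two config-level additions above. The expert parameter names are illustrative (typical of MoE blocks that store experts as fused weights), and my_project.trainers.CustomTrainer is the hypothetical module path from the item above.

    # Target MoE expert weights directly, since they are parameters rather than separate modules.
    adapter: lora
    lora_target_parameters:
      - mlp.experts.gate_up_proj   # illustrative parameter name patterns
      - mlp.experts.down_proj

    # Swap in a custom trainer by module path instead of writing a plugin.
    trainer_cls: my_project.trainers.CustomTrainer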

📦 Dependency Updates

  • peft upgraded to 0.17.0 (#3006) and datasets to 4.0.0. (#2917)
  • trl upgraded to 0.20.0. (#2892, #2987)
  • accelerate upgraded to 1.9.0. (#2936)
  • liger upgraded to 0.6.1. (#2893, #2987)
  • torchao upgraded to 0.12.0. (#2968)
  • modal upgraded to 1.0.2. (#2925)
  • transformers upgraded to 4.55.0. (#2984, #3018)
  • bitsandbytes upgraded to 0.46.1. (#2992)

🚨 Upcoming deprecations

Upgrading from FSDP1 → FSDP2

Axolotl now recommends PyTorch's native FSDP2 instead of FSDP1. This brings performance improvements, better stability, additional features, and compatibility with the latest fine-tuning techniques.

For migration guidance, please refer to our FSDP documentation and the official PyTorch FSDP guide.

  • Contributed by @SalmanMohammadi and @winglian in #2760, #2910.
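
As a rough sketch of the migration, the FSDP2 path is selected via fsdp_version; the nested option names below are assumptions drawn from common FSDP settings, so defer to the FSDP documentation linked above for the exact schema.

    fsdp_version: 2
    fsdp_config:
      offload_params: false
      auto_wrap_policy: TRANSFORMER_BASED_WRAP
      transformer_layer_cls_to_wrap: LlamaDecoderLayer   # set to your model's decoder block class
      state_dict_type: FULL_STATE_DICT
      reshard_after_forward: true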

Rename of Sequence Parallel config

We have renamed sequence_parallel_degree to context_parallel_size for consistency with naming across the ecosystem. (by @SalmanMohammadi in #2977)
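
Concretely, the only change in your YAML is the key name:

    # Before (deprecated)
    # sequence_parallel_degree: 4

    # After
    context_parallel_size: 4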

🔧 Fixes & Improvements

Dataset & Preprocessing

  • Improved Dataset Processing: Significantly improved performance for dataset processing, sharding, and multiprocessing, resulting in faster startup times. (by @VarunGumma in #2918)
  • Smarter Defaults:
    • The warmup_ratio is now used as a better default over warmup_steps, as it adapts to your dataset size. (by @winglian in #2897)
    • pad_to_sequence_len now defaults to True if sample_packing is True, for more consistent and intuitive behavior (see the sketch after this list). (by @winglian in #2941)
  • SimPO Fix: Fixed an issue with using customized datasets with the SimPO trainer. (by @ganler in #2894)
  • Tool Usage Fix: Prevented the incorrect merging of tool arguments during data preprocessing. (by @greenhestu in #2909)
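
If you prefer being explicit about what the new defaults imply, the relevant knobs look like this (values are illustrative):

    warmup_ratio: 0.05          # scales warmup with dataset size, instead of a fixed warmup_steps
    sample_packing: true
    pad_to_sequence_len: true   # now the default whenever sample_packing is enabled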

Distributed Training & Memory

  • DDP & DeepSpeed Fixes: Addressed an issue causing incorrect step calculation with DDP (#2915) and prevented distributed initialization during preprocessing with DeepSpeed for faster startups (#2920).
  • Checkpoint Memory: Added garbage collection before saving a checkpoint to reduce peak memory usage and prevent OOM errors. (by @winglian in #2971)
  • Plugin Registration: Ensured plugins are correctly registered in Ray workers for more robust distributed setups. (by @drikster80 in #2901)
  • torch.compile: Removed an extra, unnecessary torch.compile call to streamline execution. (by @djsaunde in #2904)

Optimizers & Schedulers

  • RexLR Fix: Fixed a bug in the RexLR scheduler where the learning rate was not being correctly deep-copied. (by @nyxkrage in #3012)
  • Optimizer Validation: Added validation to prevent using low-bit torchao optimizers with unsupported configurations like parameter groups. (by @winglian in #3003)

Distributed Training & Kernels

  • Tensor Parallelism Guardrails: Added validation to prevent using Tensor Parallelism (TP) with models that have tied embeddings, which is an unsupported configuration. (by @winglian in #2999)
  • Multi-Node Improvements: The base Docker image now includes IB/RDMA libraries for improved multi-node performance out-of-the-box. (by @winglian in #3002)

Other Improvements

New Contributors

Full Changelog: https://github.com/axolotl-ai-cloud/axolotl/compare/v0.11.0.post1...v0.12.0
