Axolotl v0.12.0 (released 2025-08-08)

We're introducing a major upgrade to our distributed training feature set, including support for ND-Parallelism for training at scale, support for DeepSpeed's Auto Tensor Parallelism, and FP8 training. We're also excited to announce support for fine-tuning the latest gpt-oss models (and many more!) and a host of fixes and dependency updates.

🎉 New features

ND-Parallel for Advanced Parallelism Strategies

Together with Accelerate, we've introduced ND-Parallel support, allowing you to compose different parallelism techniques like Context Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism to enable fine-tuning large models at scale. Check out the official Hugging Face blog post for more details!

  • Contributed by @SalmanMohammadi and @winglian in #2977 and #3019.
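
As a rough illustration, a config for an 8-GPU run might compose 2-way FSDP sharding with 2-way tensor parallelism and 2-way context parallelism. tensor_parallel_size and context_parallel_size are the option names referenced in these release notes; dp_shard_size is our assumption for the data-parallel sharding dimension, so treat this as a sketch and defer to the ND-Parallel docs for the exact schema.

    # Hypothetical 8-GPU layout: 2 (FSDP shard) x 2 (tensor parallel) x 2 (context parallel)
    dp_shard_size: 2            # assumed key name for FSDP-style sharded data parallelism
    tensor_parallel_size: 2     # shard individual layers across GPUs
    context_parallel_size: 2    # split long sequences across GPUs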

Expanded Model Support

We've added support for a new wave of powerful models:

  • GPT-OSS (#3020 by @winglian) - get up and running right away with our example configs (see the sketch after this list)!
  • Gemma 3 (#2852 by @NanoCode012)
  • Liquid Foundation Model 2 (#2905 by @winglian)
  • Voxtral & Magistral Small 1.1 (#2979 by @NanoCode012)
  • Devstral (#2896 by @NanoCode012)
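
For GPT-OSS specifically, a minimal starting point might look like the sketch below; the model ID, dataset, and LoRA settings are purely illustrative, and the bundled example configs remain the better reference.

    # Illustrative gpt-oss LoRA starting point; prefer the shipped example configs.
    base_model: openai/gpt-oss-20b
    adapter: lora
    lora_r: 16
    lora_alpha: 32
    datasets:
      - path: tatsu-lab/alpaca   # placeholder dataset
        type: alpaca
    sequence_len: 4096
    micro_batch_size: 1
    gradient_accumulation_steps: 4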

Experimental FP8 Mixed-Precision Training with torchao

Check out experimental FP8 mixed-precision training! By leveraging the torchao library, you can train with FP8 data types and perform all-gather ops in FP8, leading to significant memory savings and potential speedups. Read the docs to learn how to enable it.

  • Contributed by @djsaunde in #2926.
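
If you want to try it, the config change is small. The flag names below are assumptions based on the torchao integration described here (fp8 plus an FSDP float8 all-gather toggle); the FP8 docs are the authoritative reference.

    # Experimental FP8 mixed precision via torchao; key names are assumptions.
    fp8: true
    fp8_enable_fsdp_float8_all_gather: true   # do FSDP all-gathers in FP8 for extra memory savings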

Improved Slurm Support

We've fixed some issues that could freeze tasks during preprocessing and added an easy-to-use Slurm example for your large-cluster needs. Check out the README and example.

DeepSpeed Auto Tensor Parallelism (AutoTP)

You can now leverage DeepSpeed's Auto Tensor Parallelism to automatically shard your model's layers across multiple GPUs. This dramatically reduces the VRAM requirement for each GPU, enabling you to fine-tune much larger models than previously possible on the same hardware. Enable it in your YAML config by setting tensor_parallel_size to an integer value.

  • Contributed by @winglian in #2574.
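
As a minimal sketch, assuming you already launch with a DeepSpeed config, AutoTP should only require the option named above (the DeepSpeed config path is illustrative):

    deepspeed: deepspeed_configs/zero1.json   # illustrative path to your DeepSpeed JSON config
    tensor_parallel_size: 4                   # number of GPUs to shard each layer across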

TiledMLP Now Supports FSDP2 and Single GPU

TiledMLP, which reduces activation memory for long sequences, is now more versatile. It's fully compatible with our new FSDP2 implementation and can now be used on single-GPU setups, making it accessible for a wider range of training scenarios.
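
As a sketch, enabling it should be a one-line config change; tiled_mlp is our assumed key name for the feature described here, and the sequence length is only illustrative.

    tiled_mlp: true       # assumed key name; tiles the MLP block to cut activation memory
    sequence_len: 32768   # long-context setting where tiling pays off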

Dion Optimizer Support

We've added support for the Dion optimizer, a scalable and communication-efficient optimizer designed to speed up training across parallelism strategies, giving you another tool to fine-tune larger models on your hardware.

  • Contributed by @winglian in #3014.

Enabled LoRA kernels with FSDP2

FSDP2 and LoRA training is now significantly faster thanks to the integration of optimized kernels, reducing training overhead and speeding up your workflows.

  • Contributed by @djsaunde in #2992.
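
A hedged sketch of combining the two: the lora_*_kernel flags are our best guess at the names of the optimized-kernel toggles, and fsdp_version: 2 selects the FSDP2 path discussed in the deprecation section below (fsdp_config omitted for brevity).

    adapter: lora
    lora_r: 16
    lora_alpha: 32
    lora_qkv_kernel: true   # assumed flag names for the fused LoRA kernels
    lora_o_kernel: true
    lora_mlp_kernel: true
    fsdp_version: 2         # use the FSDP2 code path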

Quality-of-Life & Developer Experience Improvements

  • CLI Autocompletion: Speed up your workflow with new tab-completion for the Axolotl CLI. Simply run axolotl -h to see how to install it for your shell. (by @winglian in #2955)
  • Mid-Training Profiling: You can now start the PyTorch profiler in the middle of a training run, making it easier to debug performance bottlenecks without needing to restart. (by @winglian in #2899)
  • Generic Fused Kernels: Applied generic fused CCE and TiledMLP implementations from Liger to support a wider range of arbitrary models automatically. (by @winglian in #2908)
  • Activation Offloading with CUDA Streams: Reduced VRAM usage by offloading activations to CPU RAM using non-blocking CUDA streams for better GPU utilization. (by @winglian in #2900, fixed for LoRA in #2928)
  • New CLI Launcher: Added a --launcher option with support for launcher-specific arguments, plus cleanup and refactoring. (by @djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2924)
  • Support for lora_target_parameters: Allows targeting parameter names for LoRA, which is useful when targeting a module name isn't possible, as with MoE expert weights (see the sketch after this list). (by @winglian in #3006)
  • Cut Cross Entropy support for SmollM3, Granite, and GraniteMoE: (by @NanoCode012 in [#2993])
  • LoRA Kernels Now Support Biases: Our optimized QLoRA kernels can now be applied to bias terms, increasing the flexibility of your low-rank adaptations. (by @djsaunde in [#3025])
  • Custom Trainer via Module Path: As an alternative to a plugin, you can now define a trainer_cls in your YAML config by pointing to it as a module path (e.g., my_project.trainers.CustomTrainer); see the sketch after this list. (by @winglian in #3024)
  • Prime Intellect Integration: Added support for running jobs on the Prime Intellect platform. (by @winglian in [#3021])
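
A combined sketch of the two config-level additions above. The expert parameter names are illustrative (typical of MoE blocks that store experts as fused weights), and my_project.trainers.CustomTrainer is the hypothetical module path from the item above.

    # Target MoE expert weights directly, since they are parameters rather than separate modules.
    adapter: lora
    lora_target_parameters:
      - mlp.experts.gate_up_proj   # illustrative parameter name patterns
      - mlp.experts.down_proj

    # Swap in a custom trainer by module path instead of writing a plugin.
    trainer_cls: my_project.trainers.CustomTrainer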

📦 Dependency Updates

  • peft upgraded to 0.17.0 (#3006) and datasets to 4.0.0. (#2917)
  • trl upgraded to 0.20.0. (#2892, #2987)
  • accelerate upgraded to 1.9.0. (#2936)
  • liger upgraded to 0.6.1. (#2893, #2987)
  • torchao upgraded to 0.12.0. (#2968)
  • modal upgraded to 1.0.2. (#2925)
  • transformers upgraded to 4.55.0. (#2984, #3018)
  • bitsandbytes upgraded to 0.46.1. (#2992)

🚨 Upcoming deprecations

Upgrading from FSDP1 → FSDP2

Axolotl now recommends PyTorch's native FSDP2 instead of FSDP1. This brings performance improvements, better stability, additional features, and compatibility with the latest fine-tuning techniques.

For migration guidance, please refer to our FSDP documentation and the official PyTorch FSDP guide.

  • Contributed by @SalmanMohammadi and @winglian in #2760, #2910.
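
As a rough sketch of the migration, the FSDP2 path is selected via fsdp_version; the nested option names below are assumptions drawn from common FSDP settings, so defer to the FSDP documentation linked above for the exact schema.

    fsdp_version: 2
    fsdp_config:
      offload_params: false
      auto_wrap_policy: TRANSFORMER_BASED_WRAP
      transformer_layer_cls_to_wrap: LlamaDecoderLayer   # set to your model's decoder block class
      state_dict_type: FULL_STATE_DICT
      reshard_after_forward: true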

Rename of Sequence Parallel config

We have renamed sequence_parallel_degree to context_parallel_size for consistency with naming across the ecosystem. (by @SalmanMohammadi in #2977)
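
Concretely, the only change in your YAML is the key name:

    # Before (deprecated)
    # sequence_parallel_degree: 4

    # After
    context_parallel_size: 4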

🔧 Fixes & Improvements

Dataset & Preprocessing

  • Improved Dataset Processing: Significantly improved performance for dataset processing, sharding, and multiprocessing, resulting in faster startup times. (by @VarunGumma in #2918)
  • Smarter Defaults:
    • The warmup_ratio is now used as a better default over warmup_steps, as it adapts to your dataset size. (by @winglian in #2897)
    • pad_to_sequence_len now defaults to True if sample_packing is True, for more consistent and intuitive behavior (see the sketch after this list). (by @winglian in #2941)
  • SimPO Fix: Fixed an issue with using customized datasets with the SimPO trainer. (by @ganler in #2894)
  • Tool Usage Fix: Prevented the incorrect merging of tool arguments during data preprocessing. (by @greenhestu in #2909)
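
If you prefer being explicit about what the new defaults imply, the relevant knobs look like this (values are illustrative):

    warmup_ratio: 0.05          # scales warmup with dataset size, instead of a fixed warmup_steps
    sample_packing: true
    pad_to_sequence_len: true   # now the default whenever sample_packing is enabled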

Distributed Training & Memory

  • DDP & DeepSpeed Fixes: Addressed an issue causing incorrect step calculation with DDP (#2915) and prevented distributed initialization during preprocessing with DeepSpeed for faster startups (#2920).
  • Checkpoint Memory: Added garbage collection before saving a checkpoint to reduce peak memory usage and prevent OOM errors. (by @winglian in #2971)
  • Plugin Registration: Ensured plugins are correctly registered in Ray workers for more robust distributed setups. (by @drikster80 in #2901)
  • torch.compile: Removed an extra, unnecessary torch.compile call to streamline execution. (by @djsaunde in #2904)

Optimizers & Schedulers

  • RexLR Fix: Fixed a bug in the RexLR scheduler where the learning rate was not being correctly deep-copied. (by @nyxkrage in #3012)
  • Optimizer Validation: Added validation to prevent using low-bit torchao optimizers with unsupported configurations like parameter groups. (by @winglian in #3003)

Distributed Training & Kernels

  • Tensor Parallelism Guardrails: Added validation to prevent using Tensor Parallelism (TP) with models that have tied embeddings, which is an unsupported configuration. (by @winglian in #2999)
  • Multi-Node Improvements: The base Docker image now includes IB/RDMA libraries for improved multi-node performance out-of-the-box. (by @winglian in #3002)

Other Improvements

New Contributors

Full Changelog: https://github.com/axolotl-ai-cloud/axolotl/compare/v0.11.0.post1...v0.12.0
