We're introducing a major upgrade to our distributed training feature set, including ND-Parallelism for training at scale, DeepSpeed's Auto Tensor Parallelism, and FP8 training. We're also excited to announce support for fine-tuning the latest gpt-oss models (and many more!), along with a host of fixes and dependency updates.
🎉 New features
ND-Parallel for Advanced Parallelism Strategies
Together with Accelerate, we've introduced ND-Parallel support, allowing you to compose parallelism techniques such as Context Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism to fine-tune large models at scale. Check out the official Hugging Face blog post for more details!
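As a rough illustration of composing these dimensions in a YAML config (a minimal sketch: `tensor_parallel_size` and `context_parallel_size` are the keys named in these notes, while `dp_shard_size` is an assumed name, so check the ND-Parallel docs for your version):

```yaml
# Sketch: compose TP x CP x sharded DP across 8 GPUs (2 x 2 x 2).
# `dp_shard_size` is an assumed key name; consult the ND-Parallel docs.
tensor_parallel_size: 2   # shard weights across GPUs within a node
context_parallel_size: 2  # split long sequences across GPUs
dp_shard_size: 2          # FSDP-style sharded data parallelism
```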
Expanded Model Support
We've added support for a new wave of powerful models:
- GPT-OSS (#3020 by @winglian) - get up and running right away with our example configs!
- Gemma 3 (#2852 by @NanoCode012)
- Liquid Foundation Model 2 (#2905 by @winglian)
- Voxtral & Magistral Small 1.1 (#2979 by @NanoCode012)
- Devstral (#2896 by @NanoCode012)
Experimental FP8 Mixed-Precision Training with torchao
Check out experimental FP8 mixed-precision training! By leveraging the `torchao` library, you can train with FP8 data types and perform FSDP all-gather operations in FP8, leading to significant memory savings and potential speedups. Read the docs to enable it.
- Contributed by @djsaunde in #2926.
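As a rough sketch of what enabling this could look like in a config (both key names below are illustrative assumptions; the FP8 docs are authoritative):

```yaml
# Sketch: FP8 mixed-precision training via torchao.
# Both keys are assumed names; verify them against the FP8 docs.
fp8: true                               # train linear layers with FP8 dtypes
fp8_enable_fsdp_float8_all_gather: true # perform FSDP all-gathers in FP8
```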
Improved Slurm Support
We've fixed issues that could freeze tasks during preprocessing and added an easy-to-use Slurm example for your large-cluster needs. Check out the README and example.
- Contributed by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3038
DeepSpeed Auto Tensor Parallelism (AutoTP)
You can now leverage DeepSpeed's Auto Tensor Parallelism to automatically shard your model's layers across multiple GPUs. This dramatically reduces the VRAM requirement for each GPU, enabling you to fine-tune much larger models than previously possible on the same hardware. Enable it in your YAML config by setting `tensor_parallel_size`.
- Contributed by @winglian in #2574.
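For example, a config enabling AutoTP across two GPUs might look roughly like this (the DeepSpeed JSON path is illustrative; point it at whichever DeepSpeed config you use):

```yaml
# Sketch: DeepSpeed AutoTP with 2-way tensor parallelism.
tensor_parallel_size: 2
deepspeed: deepspeed_configs/zero1.json  # illustrative path; use your own DeepSpeed config
```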
TiledMLP Now Supports FSDP2 and Single GPU
TiledMLP, which reduces activation memory for long sequences, is now more versatile. It's fully compatible with our new FSDP2 implementation and can now be used on single-GPU setups, making it accessible for a wider range of training scenarios.
Dion Optimizer Support
We've added support for the Dion optimizer, a scalable, communication-efficient optimizer designed to speed up training under parallelism, giving you another tool to fine-tune larger models on your hardware.
- Contributed by @winglian in #3014.
Enabled LoRA kernels with FSDP2
FSDP2 and LoRA training is now significantly faster thanks to the integration of optimized kernels, reducing training overhead and speeding up your workflows.
- Contributed by @djsaunde in #2992.
Quality-of-Life & Developer Experience Improvements
- CLI Autocompletion: Speed up your workflow with new tab-completion for the Axolotl CLI. Simply run `axolotl -h` to see how to install it for your shell. (by @winglian in #2955)
- Mid-Training Profiling: You can now start the PyTorch profiler in the middle of a training run, making it easier to debug performance bottlenecks without needing to restart. (by @winglian in #2899)
- Generic Fused Kernels: Applied generic fused CCE and TiledMLP implementations from Liger to support a wider range of arbitrary models automatically. (by @winglian in #2908)
- Activation Offloading with CUDA Streams: Reduced VRAM usage by offloading activations to CPU RAM using non-blocking CUDA streams for better GPU utilization. (by @winglian in #2900, fixed for LoRA in #2928)
- New CLI Launcher: Added a `--launcher` option with support for launcher args, plus cleanup and refactoring. (by @djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2924)
- Support for `lora_target_parameters`: Allows targeting parameter names for LoRA, useful when targeting a module name is not possible, as with MoE (see the sketch after this list). (by @winglian in #3006)
- Cut Cross Entropy support for SmolLM3, Granite, and GraniteMoE. (by @NanoCode012 in #2993)
- LoRA Kernels Now Support Biases: Our optimized QLoRA kernels can now be applied to bias terms, increasing the flexibility of your low-rank adaptations. (by @djsaunde in [#3025])
- Custom Trainer via Module Path: As an alternative to a plugin, you can now define a `trainer_cls` in your YAML config by pointing to it as a module path (e.g., `my_project.trainers.CustomTrainer`); see the sketch after this list. (by @winglian in #3024)
- Prime Intellect Integration: Added support for running jobs on the Prime Intellect platform. (by @winglian in #3021)
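To illustrate the two config-level additions above, here is a minimal sketch; the MoE parameter name is hypothetical, and `my_project.trainers.CustomTrainer` is the placeholder module path from the example:

```yaml
# Sketch of the new config options.
adapter: lora
lora_target_parameters:
  - mlp.experts.gate_up_proj  # hypothetical parameter name; use names from your model
trainer_cls: my_project.trainers.CustomTrainer  # dotted module path to your trainer class
```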
📦 Dependency Updates
- `peft` upgraded to 0.17.0 (#3006)
- `datasets` upgraded to 4.0.0 (#2917)
- `trl` upgraded to 0.20.0 (#2892, #2987)
- `accelerate` upgraded to 1.9.0 (#2936)
- `liger` upgraded to 0.6.1 (#2893, #2987)
- `torchao` upgraded to 0.12.0 (#2968)
- `modal` upgraded to 1.0.2 (#2925)
- `transformers` upgraded to 4.55.0 (#2984, #3018)
- `bitsandbytes` upgraded to 0.46.1 (#2992)
🚨 Upcoming deprecations
Upgrading from FSDP1 → FSDP2
Axolotl now recommends PyTorch's native FSDP2 instead of FSDP1. This brings performance improvements, better stability, additional features, and compatibility with the latest fine-tuning techniques.
For migration guidance, please refer to our FSDP documentation and the official PyTorch FSDP guide.
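As a loose illustration of the direction of the change (the key names below are assumptions; the FSDP documentation is authoritative):

```yaml
# Sketch: opting into FSDP2. Key names are assumptions; see the FSDP docs.
fsdp_version: 2
fsdp_config:
  reshard_after_forward: true       # FSDP2 analogue of full resharding after forward
  state_dict_type: FULL_STATE_DICT  # gather full weights when saving checkpoints
```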
Rename of Sequence Parallel config
We have renamed `sequence_parallel_degree` to `context_parallel_size` for consistency with ecosystem naming. (by @SalmanMohammadi in #2977)
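For example, an existing config would change like this:

```yaml
# Before (deprecated)
# sequence_parallel_degree: 4

# After
context_parallel_size: 4
```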
🔧 Fixes & Improvements
Dataset & Preprocessing
- Improved Dataset Processing: Significantly improved performance for dataset processing, sharding, and multiprocessing, resulting in faster startup times. (by @VarunGumma in #2918)
- Smarter Defaults:
- SimPO Fix: Fixed an issue with using customized datasets with the SimPO trainer. (by @ganler in #2894)
- Tool Usage Fix: Prevented the incorrect merging of tool arguments during data preprocessing. (by @greenhestu in #2909)
Distributed Training & Memory
- DDP & DeepSpeed Fixes: Addressed an issue causing incorrect step calculation with DDP (#2915) and prevented distributed initialization during preprocessing with DeepSpeed for faster startups (#2920).
- Checkpoint Memory: Added garbage collection before saving a checkpoint to reduce peak memory usage and prevent OOM errors. (by @winglian in #2971)
- Plugin Registration: Ensured plugins are correctly registered in Ray workers for more robust distributed setups. (by @drikster80 in #2901)
- `torch.compile`: Removed an extra, unnecessary `torch.compile` call to streamline execution. (by @djsaunde in #2904)
Optimizers & Schedulers
- RexLR Fix: Fixed a bug in the RexLR scheduler where the learning rate was not being correctly deep-copied. (by @nyxkrage in #3012)
- Optimizer Validation: Added validation to prevent using low-bit `torchao` optimizers with unsupported configurations like parameter groups. (by @winglian in #3003)
Distributed Training & Kernels
- Tensor Parallelism Guardrails: Added validation to prevent using Tensor Parallelism (TP) with models that have tied embeddings, which is an unsupported configuration. (by @winglian in #2999)
- Multi-Node Improvements: The base Docker image now includes IB/RDMA libraries for improved multi-node performance out-of-the-box. (by @winglian in #3002)
Other Improvements
- fix: return proper attention for llama4 lora kernel and fsdp2 llama4 example fix by @NanoCode012 in #2943
- fix: make the initial call to tokenizer.pad not spam the console by @winglian in #2946
- feat: add call method to mistral tokenizer wrapper by @NanoCode012 in #2898
- feat(doc): add all providers to readme by @NanoCode012 in #2972
- fix: limit num_proc when saving datasets to disk by @winglian in #2948
- chore: update pre-commit hooks by @github-actions[bot] in #2954
- fix: upstream fixes in cce for dora and tensor parallel support by @winglian in #2960
- fix: handle refactor upstream for flash attention by @winglian in #2966
- fix: don't check dataset labels during preprocess for GRPO by @winglian in #2952
- fix: revert changing default optimizer to muon by @NanoCode012 in #2965
- jagged LR restart scheduler by @winglian in #1680
- don't publish to netlify on contributor submissions since it requires… by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2985
- don't create a reference model if grpo beta is 0.0 by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2983
- feat(dataset): Individual shuffling of datasets before merging by @Nicolas-BZRD in https://github.com/axolotl-ai-cloud/axolotl/pull/2981
- Use '<|finetune_right_pad|>' as padding token for Llama4 by @v-dicicco in #2988
- feat(docs): Added documentation for N-D Parallelism by @NanoCode012 in #2989
- fix(plugins): Ensure plugin registration method is correctly called by @winglian in #2991
- chore: update pre-commit hooks by @github-actions[bot] in #3009
- fix: Resolved issues with spinning up vllm service for GRPO by @winglian in #3001
- fix: Added validation for Tensor Parallelism with tied embeddings by @winglian in #2999
- fix: Prevent using torchao low-bit optimizers with unsupported parameter groups by @winglian in #3003
- fix: deepcopy lr in RexLR scheduler. by @nyxkrage in #3012
- fix: Move memory usage log to trainer.log and format to 2 decimal places by @NanoCode012 in #2996, #3011
- fix: use skip_move_to_device for all cases by @winglian in #3015
- fix: kd_distillation KeyError on logprobs by @ved1beta in #2990
- chore: drop old patches and code that are no longer needed by @winglian in #3007
- fix: KeyError on bitsandbytes quantization config access by @winglian in #3023
- fix: lora kernels for mistral3 by @NanoCode012 in #3027
- fix: clear cache before clean up by @ved1beta in #3031
- feat(doc): add complete optimizer docs and update gpt-oss readme by @NanoCode012 in #3017, #3029
- add kernels for gpt oss models by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3020
- Lora kernels bias support by @djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/3025
- allow custom trainer_cls to be defined as a module reference in the YAML by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3024
- ND Parallel Doc Nits by @SalmanMohammadi in https://github.com/axolotl-ai-cloud/axolotl/pull/3032
- Add support for Accelerate CP, ND examples, and fix for parallel config w fsdp by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3019
- Add 2.8.0 base images and uv images by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3034
- feat: update nd parallelism readme by @NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/3039
- feat(doc): standardize the axolotl install to a release by @NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/3040
- add 120b and deepspeed zero3 examples by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3035
- Feat: add arcee by @NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/3028
- use nanmean for loss aggregation (CP fix) by @djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/3033
- feat(doc): add links to new features on README by @NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2980
- tag for v0.12.0 release by @winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/3041
New Contributors
- @drikster80 made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2901
- @ganler made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2894
- @greenhestu made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2909
- @VarunGumma made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2918
- @Nicolas-BZRD made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2981
- @ved1beta made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2990
Full Changelog: https://github.com/axolotl-ai-cloud/axolotl/compare/v0.11.0.post1...v0.12.0