| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| transformer_engine_torch-2.12.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-02-24 | 666.1 kB | |
| transformer_engine_torch-2.12.0+cu13torch25.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-02-24 | 756.0 kB | |
| transformer_engine_torch-2.12.0+cu13torch25.11cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-02-24 | 766.2 kB | |
| transformer_engine_torch-2.12.0+cu13torch25.12cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-02-24 | 766.5 kB | |
| transformer_engine_torch-2.12.0+cu13torch26.01cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-02-24 | 796.8 kB | |
| README.md | 2026-01-28 | 3.5 kB | |
| v2.12 source code.tar.gz | 2026-01-28 | 3.9 MB | |
| v2.12 source code.zip | 2026-01-28 | 4.4 MB | |
# Transformer Engine v2.12 Release Notes

## Key Features and Enhancements
- Made miscellaneous improvements and fixes to the documentation.
- [C] Improved performance of NVFP4 quantization kernels. (#2412)
- [C] Documented environment variables. (#2552)
- [PyTorch] Added fused permute+pad and unpermute+unpad operations for FP8 optimization. (#1921)
- [PyTorch] Improved performance in CPU-limited scenarios.
- [PyTorch] Added support for Sliding Window Attention (left, right) with fused attention. (#2477)
- [PyTorch] Improved the performance of MXFP8 and NVFP4 by fusing the swizzling into the quantization. (#2486)
- [PyTorch] Added cudagraph support for activation recomputation. (#2518)
- [JAX] Added a tutorial for integrating TE/JAX quantization into existing frameworks. (#2423)
- [JAX] Added custom partitioning for permutation primitives. (#2591)
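The sliding-window attention feature listed above restricts each query token to a bounded `(left, right)` window of key positions. As a conceptual illustration only (this is not Transformer Engine's fused kernel, and the function name is hypothetical), the attention mask such a window implies can be sketched in plain NumPy:

```python
import numpy as np

def sliding_window_mask(seq_len, left, right):
    """Boolean mask where entry (i, j) is True if query i may attend key j,
    i.e. i - left <= j <= i + right. Purely illustrative; TE applies the
    window inside its fused attention kernel rather than materializing a mask."""
    i = np.arange(seq_len)[:, None]  # query positions, column vector
    j = np.arange(seq_len)[None, :]  # key positions, row vector
    return (j >= i - left) & (j <= i + right)

# A causal sliding window of size 2 looks back at most 2 tokens, none forward.
mask = sliding_window_mask(5, left=2, right=0)
```

A `(left, 0)` window recovers causal local attention; `(left, right)` with `right > 0` allows limited lookahead, matching the `(left, right)` notation in the feature note above.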
## Fixed Issues
- [C] Fixed SM120 compilation with CUDA 12. (#2482)
- [C] Fixed overflow in padding and unpadding kernels. (#2548)
- [C] Fixed a numerical issue in `sort_chunks_by_index`. (#2566)
- [C] Fixed a numerical issue in swizzling blockwise E8 scales. (#2589)
- [PyTorch] Fixed an AttributeError issue when checkpointing the model with MXFP8 parameters. (#2427)
- [PyTorch] Fixed cross-entropy loss calculation when some tokens are ignored. (#2476)
- [PyTorch] Fixed `Float8Tensor.contiguous` autograd support. (#2533)
- [PyTorch] Fixed multiple CPU offloading issues. (#2535)
- [PyTorch] Fixed uninitialized `permuted_scale` values. (#2547)
- [PyTorch] Fixed FP8 quantization for the second MLP in `LayerNormMLP`. (#2577)
- [PyTorch] Fixed ONNX tests and added FP8 attention export support. (#2598)
- [JAX] Removed unused TE DPA dtype handling to improve cuDNN backend dtype detection. (#2485)
- [JAX] Fixed segment-position calculation from segment IDs in the `SequenceDescriptor` class. (#2523)
- [JAX] Fixed bugs in permutation custom partitioning. (#2617)
- [JAX] Fixed issue in encoder and MNIST examples due to dataset path moving. (#2625)
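One of the fixes above (#2476) concerns cross-entropy loss when some tokens are excluded from the loss. Conceptually, ignored tokens (conventionally labeled `-100` in PyTorch-style losses) must contribute nothing to either the sum or the denominator of the mean. The sketch below is a minimal dependency-free illustration of that contract, not Transformer Engine's kernel; the function name is hypothetical:

```python
import numpy as np

IGNORE_INDEX = -100  # conventional "ignore" label in PyTorch-style losses

def cross_entropy_ignoring(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over non-ignored tokens only."""
    keep = labels != ignore_index
    logits, labels = logits[keep], labels[keep]
    # Numerically stabilized log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

The key invariant, which the fix restores, is that appending an ignored token to a batch leaves the loss unchanged.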
## Breaking Changes in This Release
No breaking changes in this release.
## Deprecated Features
No features deprecated in this release.