| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| transformer_engine_torch-2.13.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-03-31 | 668.4 kB | |
| transformer_engine_torch-2.13.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_aarch64.whl | 2026-03-31 | 619.8 kB | |
| transformer_engine_torch-2.13.0+cu13torch26.02cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-03-31 | 800.4 kB | |
| transformer_engine_torch-2.13.0+cu13torch26.01cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-03-31 | 800.4 kB | |
| transformer_engine_torch-2.13.0+cu13torch25.11cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-03-31 | 768.7 kB | |
| transformer_engine_torch-2.13.0+cu13torch25.12cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-03-31 | 768.8 kB | |
| transformer_engine_torch-2.13.0+cu13torch26.03cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 2026-03-31 | 798.7 kB | |
| README.md | 2026-02-28 | 5.6 kB | |
| v2.13 source code.tar.gz | 2026-02-28 | 4.1 MB | |
| v2.13 source code.zip | 2026-02-28 | 4.6 MB | |
# Transformer Engine v2.13 Release Notes

## Key Features and Enhancements
- Added detailed documentation for low precision training with Transformer Engine, covering FP8, MXFP8, NVFP4, and other quantization recipes with examples for both PyTorch and JAX. (#2343)
- [Build] Added `NVTE_BUILD_USE_NVIDIA_WHEELS` environment variable to allow building TE using CUDA headers from PyPI NVIDIA wheels instead of a system CUDA installation. (#2623)
- [C] Enabled deterministic FP8 fused attention on Blackwell (SM100) GPUs. (#2621)
- [C] Updated cuBLASMp integration to version 0.8.0, replacing the nvshmem dependency with NCCL-based symmetric memory. (#2661)
- [C] Added MXFP8 quantization kernels for grouped tensors used in MoE, with fused scale-factor swizzling for improved performance. (#2586, #2630)
- [C] Added NVFP4 quantization kernels for grouped tensors used in MoE models. (#2655)
- [C] Reduced cuDNN graph recompilations in THD fused attention by rounding large batch sizes to 512-element increments. (#2653)
- [C] Added `sqrtsoftplus` scoring function to the fused MoE router and improved router kernel performance on Blackwell GPUs. (#2633, #2683)
- [PyTorch] Introduced `GroupedTensor`, enabling MoE expert weights to be stored as a single contiguous allocation while remaining individually addressable. (#2654)
- [PyTorch] Added fusible `GroupedLinear` and `ScaledSwiGLU` ops for building fully fused MoE grouped MLP pipelines. (#2664)
- [PyTorch] Added `register_forward_fusion` and `register_backward_fusion` APIs, allowing users to define and register custom operator fusion patterns. (#2597)
- [PyTorch] Added `get_backward_dw_params` API to TE modules, fixing weight gradient hook management when using wgrad CUDA Graphs with Megatron-LM. (#2614)
- [PyTorch] Fixed fused attention bias dimension handling and extended `dbias` support to additional bias shapes (`b1ss`, `bhss`, `11ss`, `111s`). (#2537)
- [PyTorch] Reduced peak memory usage in fused Adam optimizer by fusing BF16 momentum scaling directly into CUDA kernels, also enabling CUDA Graph capture for this path. (#2632)
- [PyTorch] Added the sigmoid-gated GLU activation (`activation="glu"`) to `LayerNormMLP` and `TransformerLayer`. (#2656)
- [PyTorch] Extended debug statistics tracking to NVFP4 quantization (underflow and MSE metrics), and gracefully skipped stat logging for layers not using quantization. (#2296, #2652)
- [PyTorch] Fixed CUDA Graph capture for Megatron-Core vision encoder models. (#2657)
- [JAX] Added experimental `inspect_array` debugging utility for dumping tensor snapshots during multi-GPU execution. (#2651)
- [JAX] Fixed MoE permutation to correctly mask padding tokens and handle tensor sizes under expert parallelism. (#2672)
- [JAX] MoE permutation now always returns `tokens_per_expert`, required for ragged all-to-all communication in expert parallelism. (#2613)
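The cuDNN recompilation fix above (#2653) relies on a simple bucketing idea: round large batch sizes up to a fixed increment so that many distinct sizes share one cached execution graph. A minimal sketch of that arithmetic, with a hypothetical helper name (this is not TE's actual implementation, and the 512-element increment is taken from the note above):

```python
def round_up_batch(batch_size: int, increment: int = 512) -> int:
    """Round a batch size up to the next multiple of `increment`.

    Bucketing variable batch sizes this way lets many distinct sizes
    reuse a single cached cuDNN execution graph instead of triggering
    a recompilation for every new size seen at runtime.
    """
    # Ceiling division without floats: -(-a // b) == ceil(a / b).
    return -(-batch_size // increment) * increment

# Batch sizes 513 through 1024 all map to the same bucket (1024),
# so they can share one cached attention graph.
assert round_up_batch(513) == 1024
assert round_up_batch(1024) == 1024
assert round_up_batch(1025) == 1536
```

The trade-off is a small amount of wasted padding work per step in exchange for far fewer graph compilations when batch sizes fluctuate.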
## Fixed Issues
- [C] Fixed incorrect results from the `exp2f_rcp` fast-math helper when inputs are NaN or have biased exponent 254. (#2647)
- [C] Fixed a race condition in Randomized Hadamard Transform amax kernels where a missing memory fence could cause incorrect amax values. (#2695)
- [PyTorch] Fixed the TE Llama example to work with HuggingFace Transformers 4.57+, which changed decoder layer output conventions. (#2572)
- [Build] Fixed `TypeError` during build when NCCL is installed from PyPI as a namespace package without a `__file__` attribute. (#2580)
- [Build] Fixed `ModuleNotFoundError` when installing from cached source distributions (e.g., via `uv`) by including `build_tools` in `MANIFEST.in`. (#2684)
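The namespace-package fix (#2580) addresses a general Python packaging pitfall: PEP 420 namespace packages may have `__file__` absent or set to `None`, so build code that calls `os.path.dirname(module.__file__)` raises. A hedged sketch of the defensive pattern (the helper name is illustrative, not TE's actual build code):

```python
import importlib
import os

def locate_package_dir(name):
    """Return a directory for package `name`, tolerating namespace packages.

    PEP 420 namespace packages can lack a usable __file__, so a naive
    os.path.dirname(mod.__file__) raises TypeError when __file__ is None
    (or AttributeError when it is absent entirely).
    """
    try:
        mod = importlib.import_module(name)
    except ImportError:
        return None
    file = getattr(mod, "__file__", None)
    if file is not None:
        return os.path.dirname(file)
    # Fall back to the namespace package's search locations, if any.
    paths = list(getattr(mod, "__path__", []))
    return paths[0] if paths else None

# Regular stdlib package: resolved via __file__.
assert locate_package_dir("json") is not None
# Missing package: handled gracefully instead of raising.
assert locate_package_dir("definitely_not_installed_xyz") is None
```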
## Breaking Changes in This Release
- [C] Removed the deprecated packed fused attention C APIs (`nvte_fused_attn_{fwd,bwd}_{qkvpacked,kvpacked}`); users must migrate to the non-packed API variants. (#2696)
- Versions of cuBLASMp prior to 0.8.0 are no longer supported.
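For users migrating off the packed APIs, the change is essentially a layout question: a QKV-packed buffer interleaves the three projections along one axis, while the non-packed variants take separate q, k, and v tensors. A NumPy sketch of the split (the `[tokens, 3, heads, head_dim]` layout here is an illustrative assumption, not the exact C API contract):

```python
import numpy as np

# Hypothetical QKV-packed buffer: [tokens, 3, heads, head_dim].
tokens, heads, head_dim = 8, 4, 16
qkv_packed = np.random.rand(tokens, 3, heads, head_dim).astype(np.float32)

# Non-packed APIs take q, k, v separately; split along the packing axis
# and drop the now-singleton dimension.
q, k, v = (np.squeeze(t, axis=1) for t in np.split(qkv_packed, 3, axis=1))

assert q.shape == k.shape == v.shape == (tokens, heads, head_dim)
assert np.array_equal(q, qkv_packed[:, 0])
```

If the packed buffer is contiguous, such splits can often be expressed as zero-copy views rather than copies, which keeps the migration cost low.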
## Deprecated Features
No features deprecated in this release.