Release Files

Name    Modified    Size
transformer_engine_torch-2.13.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl    2026-03-31    668.4 kB
transformer_engine_torch-2.13.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_aarch64.whl    2026-03-31    619.8 kB
transformer_engine_torch-2.13.0+cu13torch26.02cxx11abiTRUE-cp312-cp312-linux_x86_64.whl    2026-03-31    800.4 kB
transformer_engine_torch-2.13.0+cu13torch26.01cxx11abiTRUE-cp312-cp312-linux_x86_64.whl    2026-03-31    800.4 kB
transformer_engine_torch-2.13.0+cu13torch25.11cxx11abiTRUE-cp312-cp312-linux_x86_64.whl    2026-03-31    768.7 kB
transformer_engine_torch-2.13.0+cu13torch25.12cxx11abiTRUE-cp312-cp312-linux_x86_64.whl    2026-03-31    768.8 kB
transformer_engine_torch-2.13.0+cu13torch26.03cxx11abiTRUE-cp312-cp312-linux_x86_64.whl    2026-03-31    798.7 kB
README.md    2026-02-28    5.6 kB
v2.13 source code.tar.gz    2026-02-28    4.1 MB
v2.13 source code.zip    2026-02-28    4.6 MB

Totals: 10 items, 13.9 MB

Transformer Engine v2.13 Release Notes

Key Features and Enhancements

  • Added detailed documentation for low-precision training with Transformer Engine, covering FP8, MXFP8, NVFP4, and other quantization recipes, with examples for both PyTorch and JAX. (#2343)
  • [Build] Added NVTE_BUILD_USE_NVIDIA_WHEELS environment variable to allow building TE using CUDA headers from PyPI NVIDIA wheels instead of a system CUDA installation. (#2623)
  • [C] Enabled deterministic FP8 fused attention on Blackwell (SM100) GPUs. (#2621)
  • [C] Updated cuBLASMp integration to version 0.8.0, replacing the nvshmem dependency with NCCL-based symmetric memory. (#2661)
  • [C] Added MXFP8 quantization kernels for grouped tensors used in MoE, with fused scale-factor swizzling for improved performance. (#2586, #2630)
  • [C] Added NVFP4 quantization kernels for grouped tensors used in MoE models. (#2655)
  • [C] Reduced cuDNN graph recompilations in THD fused attention by rounding large batch sizes to 512-element increments. (#2653)
  • [C] Added sqrtsoftplus scoring function to the fused MoE router and improved router kernel performance on Blackwell GPUs. (#2633, #2683)
  • [PyTorch] Introduced GroupedTensor, enabling MoE expert weights to be stored as a single contiguous allocation while remaining individually addressable. (#2654)
  • [PyTorch] Added fusible GroupedLinear and ScaledSwiGLU ops for building fully fused MoE grouped MLP pipelines. (#2664)
  • [PyTorch] Added register_forward_fusion and register_backward_fusion APIs, allowing users to define and register custom operator fusion patterns. (#2597)
  • [PyTorch] Added get_backward_dw_params API to TE modules, fixing weight gradient hook management when using wgrad CUDA Graphs with Megatron-LM. (#2614)
  • [PyTorch] Fixed fused attention bias dimension handling and extended dbias support to additional bias shapes (b1ss, bhss, 11ss, 111s). (#2537)
  • [PyTorch] Reduced peak memory usage in the fused Adam optimizer by fusing BF16 momentum scaling directly into its CUDA kernels, which also enables CUDA Graph capture for this path. (#2632)
  • [PyTorch] Added the sigmoid-gated GLU activation (activation="glu") to LayerNormMLP and TransformerLayer. (#2656)
  • [PyTorch] Extended debug statistics tracking to NVFP4 quantization (underflow and MSE metrics), and gracefully skipped stat logging for layers not using quantization. (#2296, #2652)
  • [PyTorch] Fixed CUDA Graph capture for Megatron-Core vision encoder models. (#2657)
  • [JAX] Added experimental inspect_array debugging utility for dumping tensor snapshots during multi-GPU execution. (#2651)
  • [JAX] Fixed MoE permutation to correctly mask padding tokens and handle tensor sizes under expert parallelism. (#2672)
  • [JAX] MoE permutation now always returns tokens_per_expert, required for ragged all-to-all communication in expert parallelism. (#2613)
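The batch-size rounding behind the THD fused-attention change above works because cuDNN compiles one graph per distinct problem shape: rounding the batch size up to the next 512-element increment lets many nearby batch sizes share a single compiled graph. A minimal sketch of that idea (the function name and the decision to round every size, not just large ones, are illustrative, not TE's actual implementation):

```python
def round_up_batch(batch_size: int, multiple: int = 512) -> int:
    """Round batch_size up to the next multiple of `multiple`, so that
    nearby batch sizes map onto the same compiled cuDNN graph instead of
    each triggering a fresh recompilation. Illustrative sketch only."""
    return ((batch_size + multiple - 1) // multiple) * multiple
```

The trade-off is some wasted padding work per step in exchange for far fewer graph compilations when batch sizes fluctuate.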
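The "single contiguous allocation, individually addressable" property that GroupedTensor gives MoE expert weights can be illustrated with plain NumPy views: one flat buffer holds all expert weights, and each expert gets a zero-copy view into it. The shapes and variable names below are illustrative, not TE's actual layout or API:

```python
import numpy as np

# One contiguous buffer for all expert weights.
num_experts, in_features, out_features = 4, 8, 16
flat = np.zeros(num_experts * in_features * out_features, dtype=np.float32)

# Per-expert zero-copy views into the shared buffer.
per_expert = in_features * out_features
expert_weights = [
    flat[i * per_expert:(i + 1) * per_expert].reshape(in_features, out_features)
    for i in range(num_experts)
]

# Writing through a view mutates the shared buffer; no copies are made.
expert_weights[2][0, 0] = 1.0
```

A single allocation like this lets grouped GEMMs and collectives operate on one buffer while each expert's weight matrix remains addressable on its own.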

Fixed Issues

  • [C] Fixed incorrect results from the exp2f_rcp fast-math helper when inputs are NaN or have biased exponent 254. (#2647)
  • [C] Fixed a race condition in Randomized Hadamard Transform amax kernels where a missing memory fence could cause incorrect amax values. (#2695)
  • [PyTorch] Fixed the TE Llama example to work with HuggingFace Transformers 4.57+, which changed decoder layer output conventions. (#2572)
  • [Build] Fixed TypeError during build when NCCL is installed from PyPI as a namespace package without a __file__ attribute. (#2580)
  • [Build] Fixed ModuleNotFoundError when installing from cached source distributions (e.g., via uv) by including build_tools in MANIFEST.in. (#2684)
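For context on the exp2f_rcp fix above: in the IEEE 754 float32 format the biased exponent occupies bits 23–30, value 255 marks Inf/NaN, and 254 is the largest finite exponent, which is why those two cases are the edge cases for a fast-math reciprocal/exp2 helper. A small sketch of extracting the biased exponent (the helper name is mine, not TE's):

```python
import struct

def biased_exponent(x: float) -> int:
    """Return the 8-bit biased exponent of x when stored as float32."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return (bits >> 23) & 0xFF

# NaN carries biased exponent 255; the largest finite float32
# (~3.4e38) carries 254 -- the inputs the #2647 fix guards against.
```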

Breaking Changes in This Release

  • [C] Removed the deprecated packed fused attention C APIs (nvte_fused_attn_{fwd,bwd}_{qkvpacked,kvpacked}); users must migrate to the non-packed API variants. (#2696)
  • Versions of cuBLASMp prior to 0.8.0 are no longer supported.

Deprecated Features

No features deprecated in this release.

Source: README.md, updated 2026-02-28