Name                                                                                           Modified    Size
transformer_engine_torch-2.12.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl  2026-02-24  666.1 kB
transformer_engine_torch-2.12.0+cu13torch25.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl        2026-02-24  756.0 kB
transformer_engine_torch-2.12.0+cu13torch25.11cxx11abiTRUE-cp312-cp312-linux_x86_64.whl        2026-02-24  766.2 kB
transformer_engine_torch-2.12.0+cu13torch25.12cxx11abiTRUE-cp312-cp312-linux_x86_64.whl        2026-02-24  766.5 kB
transformer_engine_torch-2.12.0+cu13torch26.01cxx11abiTRUE-cp312-cp312-linux_x86_64.whl        2026-02-24  796.8 kB
README.md                                                                                      2026-01-28  3.5 kB
v2.12 source code.tar.gz                                                                       2026-01-28  3.9 MB
v2.12 source code.zip                                                                          2026-01-28  4.4 MB
Totals: 8 items, 12.1 MB

Transformer Engine v2.12 Release Notes

Key Features and Enhancements

  • Made miscellaneous improvements and fixes to the documentation.
  • [C] Improved performance of NVFP4 quantization kernels. (#2412)
  • [C] Documented environment variables. (#2552)
  • [PyTorch] Added fused permute+pad and unpermute+unpad operations for FP8 optimization. (#1921)
  • [PyTorch] Improved performance in CPU-limited scenarios.
  • [PyTorch] Added support for Sliding Window Attention (left, right) with fused attention. (#2477)
  • [PyTorch] Improved the performance of MXFP8 and NVFP4 by fusing swizzling into quantization. (#2486)
  • [PyTorch] Added CUDA graph support for activation recomputation. (#2518)
  • [JAX] Added a tutorial for integrating TE/JAX quantization into existing frameworks. (#2423)
  • [JAX] Added custom partitioning for permutation primitives. (#2591)
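
The (left, right) window in the Sliding Window Attention item can be pictured as a boolean mask. The sketch below is plain Python and does not call Transformer Engine. It assumes the common convention that query i may attend to key j when i - left <= j <= i + right, with -1 meaning unbounded on that side; the function name `sliding_window_mask` is hypothetical, and the release notes do not spell out the convention.

```python
def sliding_window_mask(seq_len, left, right):
    """Return mask[i][j] == True where query i may attend to key j.

    Assumed convention (not taken from the release notes):
    key j is visible to query i when i - left <= j <= i + right,
    and a value of -1 means "no limit" on that side.
    """
    mask = []
    for i in range(seq_len):
        lo = 0 if left < 0 else max(0, i - left)
        hi = seq_len - 1 if right < 0 else min(seq_len - 1, i + right)
        mask.append([lo <= j <= hi for j in range(seq_len)])
    return mask


# Two tokens of left context and one token of right context.
for row in sliding_window_mask(5, left=2, right=1):
    print("".join("x" if visible else "." for visible in row))
```

Under this convention, `(-1, 0)` reproduces an ordinary causal mask, and `(-1, -1)` disables the window entirely.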

Fixed Issues

  • [C] Fixed SM120 compilation with CUDA 12. (#2482)
  • [C] Fixed overflow in padding and unpadding kernels. (#2548)
  • [C] Fixed a numerical issue in sort_chunks_by_index. (#2566)
  • [C] Fixed a numerical issue in swizzling blockwise E8 scales. (#2589)
  • [PyTorch] Fixed an AttributeError when checkpointing a model with MXFP8 parameters. (#2427)
  • [PyTorch] Fixed cross-entropy loss calculation when some tokens are ignored. (#2476)
  • [PyTorch] Fixed Float8Tensor.contiguous autograd support. (#2533)
  • [PyTorch] Fixed multiple CPU offloading issues. (#2535)
  • [PyTorch] Fixed uninitialized permuted_scale values. (#2547)
  • [PyTorch] Fixed FP8 quantization for the second MLP in LayerNormMLP. (#2577)
  • [PyTorch] Fixed ONNX tests and added FP8 attention export support. (#2598)
  • [JAX] Removed unused TE DPA dtype handling to improve cuDNN backend dtype detection. (#2485)
  • [JAX] Fixed segment-position calculation from segment IDs in SequenceDescriptor class. (#2523)
  • [JAX] Fixed bugs in permutation custom partitioning. (#2617)
  • [JAX] Fixed an issue in the encoder and MNIST examples caused by the dataset path moving. (#2625)
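
For the cross-entropy item, the notes do not say what was wrong, so the following is only a reference sketch of the behavior usually expected when tokens are ignored: the mean is taken over the tokens that count, not over all tokens. It is plain Python, not Transformer Engine code. The sentinel `-100` mirrors PyTorch's default `ignore_index`, and the function name is hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel, same value as PyTorch's default ignore_index


def cross_entropy_mean(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the tokens whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored tokens add nothing to the sum or to the count
        row_max = max(row)
        # Numerically stable log-sum-exp of the logits for this token.
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
        count += 1
    # Choice made for this sketch: return 0.0 when every token is ignored.
    return total / count if count else 0.0


logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
targets = [0, IGNORE_INDEX, 1]
print(cross_entropy_mean(logits, targets))
```

Dividing by the total token count (3 here) instead of the number of kept tokens (2) would understate the loss by a factor of 2/3.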

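The segment-position item concerns deriving per-token positions from segment IDs in packed sequences. A minimal sketch of that idea follows: positions restart at 0 whenever the segment ID changes. It is plain Python with a hypothetical function name. How `SequenceDescriptor` treats padding is not stated in the notes, so padding IDs are simply handled here like any other run of equal IDs.

```python
def segment_positions(segment_ids):
    """Return the position of each token within its run of equal segment IDs."""
    positions = []
    prev = None
    pos = 0
    for seg in segment_ids:
        # Continue counting inside a segment; restart at 0 on a new segment.
        pos = pos + 1 if seg == prev else 0
        positions.append(pos)
        prev = seg
    return positions


# Three sequences of lengths 3, 2, and 1 packed into one row.
print(segment_positions([1, 1, 1, 2, 2, 3]))
```
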
Breaking Changes in This Release

No breaking changes in this release.

Deprecated Features

No features deprecated in this release.

Source: README.md, updated 2026-01-28