Name	Modified	Size	Downloads / Week
transformer_engine_torch-2.12.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl	2026-02-24	666.1 kB	0
transformer_engine_torch-2.12.0+cu13torch25.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl	2026-02-24	756.0 kB	0
transformer_engine_torch-2.12.0+cu13torch25.11cxx11abiTRUE-cp312-cp312-linux_x86_64.whl	2026-02-24	766.2 kB	0
transformer_engine_torch-2.12.0+cu13torch25.12cxx11abiTRUE-cp312-cp312-linux_x86_64.whl	2026-02-24	766.5 kB	0
transformer_engine_torch-2.12.0+cu13torch26.01cxx11abiTRUE-cp312-cp312-linux_x86_64.whl	2026-02-24	796.8 kB	0
README.md	2026-01-28	3.5 kB	0
v2.12 source code.tar.gz	2026-01-28	3.9 MB	0
v2.12 source code.zip	2026-01-28	4.4 MB	0
Totals: 8 Items		12.1 MB	0

Transformer Engine v2.12 Release Notes

Key Features and Enhancements

Made miscellaneous improvements and fixes to the documentation.
[C] Improved performance of NVFP4 quantization kernels. (#2412)
[C] Documented environment variables. (#2552)
[PyTorch] Added fused permute+pad and unpermute+unpad operations for FP8 optimization. (#1921)
[PyTorch] Improved performance in CPU-limited scenarios.
[PyTorch] Added support for Sliding Window Attention (left, right) with fused attention. (#2477)
[PyTorch] Improved the performance of MXFP8 and NVFP4 by fusing the swizzling into the quantization. (#2486)
[PyTorch] Added CUDA graph support for activation recomputation. (#2518)
[JAX] Added a tutorial for integrating TE/JAX quantization into existing frameworks. (#2423)
[JAX] Added custom partitioning for permutation primitives. (#2591)
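
As a rough illustration of what a (left, right) sliding window means for attention, the sketch below builds the corresponding boolean mask in plain Python. It is not Transformer Engine code: the helper name `sliding_window_mask` is hypothetical, and it assumes the common convention that query position i may attend to key positions from i - left through i + right, that a negative bound means "unlimited" on that side, and that queries and keys have the same length.

```python
def sliding_window_mask(seq_len, left, right):
    """Return mask[i][j] == True where query i may attend to key j.

    Assumed convention (not taken from the release notes): the window
    covers keys i - left .. i + right, and a negative bound disables
    the limit on that side.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            left_ok = left < 0 or j >= i - left
            right_ok = right < 0 or j <= i + right
            row.append(left_ok and right_ok)
        mask.append(row)
    return mask


# One token of left context, no look-ahead.
for row in sliding_window_mask(4, left=1, right=0):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# .xx.
# ..xx
```

Under the same assumption, `left=-1, right=0` reduces to an ordinary causal mask. A fused attention kernel would not materialize this mask; the sketch only shows which positions are visible.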

Fixed Issues

[C] Fixed SM120 compilation with CUDA 12. (#2482)
[C] Fixed overflow in padding and unpadding kernels. (#2548)
[C] Fixed a numerical issue in sort_chunks_by_index. (#2566)
[C] Fixed a numerical issue in swizzling blockwise E8 scales. (#2589)
[PyTorch] Fixed an AttributeError raised when checkpointing a model with MXFP8 parameters. (#2427)
[PyTorch] Fixed cross-entropy loss calculation when some tokens are ignored. (#2476)
[PyTorch] Fixed Float8Tensor.contiguous autograd support. (#2533)
[PyTorch] Fixed multiple CPU offloading issues. (#2535)
[PyTorch] Fixed uninitialized permuted_scale values. (#2547)
[PyTorch] Fixed FP8 quantization for the second MLP in LayerNormMLP. (#2577)
[PyTorch] Fixed ONNX tests and added FP8 attention export support. (#2598)
[JAX] Removed unused TE DPA dtype handling to improve cuDNN backend dtype detection. (#2485)
[JAX] Fixed segment-position calculation from segment IDs in SequenceDescriptor class. (#2523)
[JAX] Fixed bugs in permutation custom partitioning. (#2617)
[JAX] Fixed an issue in the encoder and MNIST examples caused by a moved dataset path. (#2625)
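
The cross-entropy entry above does not describe the bug itself, so the following is only a reference sketch of the usual convention for ignored tokens, not the Transformer Engine implementation. It assumes a mean reduction over the tokens that are not ignored and an `ignore_index` of -100, as in `torch.nn.functional.cross_entropy`; the function name and the choice to return 0.0 when every token is ignored are decisions made for this sketch.

```python
import math


def cross_entropy_ignoring(logits, targets, ignore_index=-100):
    """Mean cross-entropy over the tokens whose target is not ignore_index."""
    total = 0.0
    counted = 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored tokens add nothing to the sum or the count
        # Numerically stable log-sum-exp of the logits for this token.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
        counted += 1
    # Sketch-specific choice: report 0.0 when no token is counted.
    return total / counted if counted else 0.0


logits = [[0.0, 0.0], [5.0, 1.0]]
targets = [0, -100]  # the second token is ignored
loss = cross_entropy_ignoring(logits, targets)
print(loss)
```

The point of the sketch is the denominator: dividing by the number of counted tokens rather than by the full sequence length keeps the loss independent of how many tokens are ignored.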

No breaking changes in this release.

No features deprecated in this release.

Source: README.md, updated 2026-01-28