Triton 3.4.0 Release Notes
Release files (2025-07-30):

  • triton-3.4.0.tar.gz (6.0 MB)
  • README.md (17.8 kB)
  • Triton 3.4.0 Release source code.tar.gz (6.1 MB)
  • Triton 3.4.0 Release source code.zip (6.7 MB)

Highlights

Comprehensive Gluon Framework Enhancements

The Gluon framework has received major enhancements across all areas including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)

Hardware Support Expansion

  • AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architecture-specific features (#7175, #7171, #7127, #6744, #6594)
  • Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
  • Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)

Performance Optimizations

  • Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
  • MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
  • TMA Operations Enhancement - Improved Tensor Memory Accelerator (TMA) operations with better layout support and reduced register pressure (#6725, #6238, #6580)

New Features

Language and Frontend

  • Aggregate Type Support - Added @tl.aggregate decorator for autogenerating Triton types from Python classes (#6970)
  • JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
  • Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
  • Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
  • Masked Histograms - Added support for masked histogram operations (#6695)
  • Syntactic Sugar Additions - Added .item() as syntactic sugar for .reshape([]) (#6873)
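For context on the masked histogram addition (#6695): the usual masking convention is that elements whose mask entry is False simply do not contribute to any bin. A pure-Python reference sketch of that semantics (this is not Triton's implementation; tl.histogram itself runs on-device):

```python
def masked_histogram(values, mask, num_bins):
    """Reference semantics for a masked histogram: count each value
    into its bin, skipping elements whose mask entry is False.
    Values are assumed to be integers in [0, num_bins)."""
    bins = [0] * num_bins
    for v, keep in zip(values, mask):
        if keep:
            bins[v] += 1
    return bins

# Masked-out elements (mask == False) leave the histogram unchanged.
counts = masked_histogram([0, 1, 1, 3, 3], [True, True, False, True, False], 4)
# counts == [1, 1, 0, 1]: the masked-out 1 and 3 are not counted.
```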

Backend and Compilation

  • Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
  • Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
  • TMA Reduce Operations - Added TMA reduce operations for descriptor-based reduction stores (#6580)
  • Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
  • BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
  • Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
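For readers unfamiliar with swizzling: a common scheme (shown below as a hypothetical sketch, not the exact algorithm of #6982) XORs part of the row index into the column index, so elements that share a logical column across rows land in different shared-memory banks while each row remains a permutation of its columns:

```python
def xor_swizzle(row, col, cols_per_row=32):
    """Map a logical (row, col) element to a swizzled column by XORing
    the row index into the column index. The mapping within each row
    is a permutation (XOR with a constant is invertible), but the same
    logical column maps to different physical columns across rows."""
    return col ^ (row % cols_per_row)

# Logical column 0 of successive rows maps to distinct physical columns,
# avoiding the bank conflicts a straight column-major access would hit:
spread = [xor_swizzle(r, 0) for r in range(4)]
# spread == [0, 1, 2, 3]
```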

Hardware-Specific Features

  • AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
  • AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
  • AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
  • NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
  • AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)

Improvements

Performance

  • Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
  • Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
  • SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
  • Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)

Developer Experience

  • Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
  • Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
  • Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
  • Build System Improvements - Better CMake support and dependency management (#6330, #6903)

Code Quality

  • Type System Enhancements - Improved type checking with mypy integration (#6596, #6704)
  • Layout System Improvements - Better layout handling with LinearLayout-based implementations (#6252, #6169, #6170)
  • Code Organization - Extensive refactoring and cleanup for better maintainability (#6500, #6285)

Bug Fixes

Critical Fixes

  • AST Parsing Regression - Fixed parsing failures for float("inf") and float("-inf") expressions (#6344)
  • Memory Allocation Issues - Fixed tensor memory allocation boundary collisions and use-after-free errors (#6318, #6433)
  • TMA Layout Consistency - Fixed layout assignment from rank-reducing loads (#6362)
  • Dot Operation Fixes - Fixed bug where passing None as accumulator caused errors (#7130)
  • Version Detection - Fixed version detection when using source tarballs (#7164, #6381)
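The float("inf") regression (#6344) is easiest to see at the AST level, where Triton's frontend operates: float("inf") parses as a plain call with a string-constant argument, while the equivalent -float("inf") adds a unary-minus layer on top of the same call, so negative infinity appears in more than one AST shape. A stdlib-only illustration:

```python
import ast

# float("inf") is a Call node whose argument is the string constant "inf";
# a frontend walking the AST must recognize and fold it, not reject it.
node = ast.parse('float("inf")', mode="eval").body
assert isinstance(node, ast.Call) and node.func.id == "float"

# -float("inf") wraps the same Call in UnaryOp(USub), a second shape
# that also has to be handled when folding constant expressions.
neg = ast.parse('-float("inf")', mode="eval").body
assert isinstance(neg, ast.UnaryOp) and isinstance(neg.op, ast.USub)
```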

Hardware-Specific Fixes

  • AMD Range Analysis - Improved range analysis for persistent kernels and loop bounds (#6390, #6133)
  • AMD Buffer Operations - Fixed vector size computation and alignment issues (#6114, #6126)
  • AMD Atomic Operations - Fixed f16/bf16 buffer atomic operations (#6090, #6139)
  • NVIDIA Register Pressure - Fixed register allocation issues in warp specialization (#6403)
  • NVIDIA TMEM Operations - Fixed various tensor memory access issues (#6888)

Stability Improvements

  • Test Reliability - Resolved intermittent test failures across various components (#6861, #6889)
  • Memory Usage - Fixed memory leaks and reduced peak memory consumption (#6796)
  • Error Handling - Improved error messages and crash prevention (#6865)

Deprecations and Breaking Changes

Breaking Changes

  • Cumsum Type Promotion - Boolean inputs to cumsum are now upcast to uint32 so that running counts are correct (#6927)
  • Experimental API Cleanup - Removed outdated experimental descriptor APIs (#6488)
  • Python Support - Dropped Python 3.8 support, minimum version now 3.9 (#6649)
  • Tensor Descriptor APIs - Removed experimental prefix from tensor descriptor operations (#6194)
  • Register Spilling Performance Regression - An unfavorable interaction between recent LLVM changes and ptxas optimizations can increase register spilling in some kernels (#7138)
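The cumsum change (#6927) matters because accumulating in a boolean type saturates: True plus True is still True, so every prefix after the first True collapses to 1 and the running counts are lost. Upcasting to a 32-bit integer first preserves them. A pure-Python illustration of the two behaviors:

```python
from itertools import accumulate

flags = [True, False, True, True]

# Accumulating in bool saturates: once the running value is True,
# it stays True, so the scan no longer counts anything.
bool_scan = list(accumulate(flags, lambda a, b: bool(a + b)))
# bool_scan == [True, True, True, True]

# Upcasting to an integer first (what #6927 does, to uint32) keeps
# the actual running count of True elements.
int_scan = list(accumulate(int(f) for f in flags))
# int_scan == [1, 1, 2, 3]
```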

Deprecations

  • FP8 Format Warnings - Enhanced warnings for deprecated FP8 formats (#6931)
  • Configuration Module - Renamed config.py to knobs.py to avoid confusion (#6641)

Performance

Benchmark Results

  • Matrix Multiplication - Up to 15% speedup in dense 8k x 8k x 8k operations (#6804)
  • Attention Kernels - Achieved 700+ TFLOPS with DHEAD=64 and 960-1080 TFLOPS with DHEAD=128 (#6660)
  • Routing Operations - 5% runtime reduction with optimized kernels (#6866)
  • MoE Kernels - Up to 30% performance boost with optimized TMA layouts (#7123)
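For reference on the attention figures above, the conventional FLOP count for a forward attention pass is 4 * seqlen^2 * head_dim per head (two matmuls, Q @ K^T and P @ V, at 2 * N^2 * d each), and TFLOPS is that count divided by runtime. A small sketch of the arithmetic; the sizes below are illustrative, not the benchmark's actual configuration:

```python
def attention_fwd_tflops(batch, heads, seqlen, head_dim, seconds):
    """Conventional forward-attention FLOP count: two N x N x d
    matmuls (Q @ K^T and P @ V), each costing 2 * N^2 * d FLOPs
    per head, converted to TFLOPS for a measured runtime."""
    flops = 4 * batch * heads * seqlen**2 * head_dim
    return flops / seconds / 1e12

# Illustrative sizes only: batch 4, 32 heads, seqlen 4096, head_dim 64,
# at a hypothetical 2 ms runtime.
rate = attention_fwd_tflops(4, 32, 4096, 64, seconds=0.002)
# rate is about 274.9 TFLOPS for these made-up numbers.
```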

Memory Optimizations

  • Register Usage - Reduced register pressure in various operations (#6817)
  • Shared Memory - Improved shared memory utilization with better swizzling (#6982)
  • Cache Efficiency - Enhanced cache utilization with L2 cache hints (#6278)

Documentation

New Guides

  • Community Meetups - Added documentation for running Triton Community Meetups (#7103)
  • Installation Instructions - Updated with better memory management guidance (#6235)
  • Hardware Support - Updated PyTorch installation for Blackwell support (#6778)

API Documentation

  • Tensor Descriptors - Comprehensive documentation for tensor descriptor APIs (#6911, #7028)
  • Cache Modifiers - Updated tl.load documentation with correct cache modifier usage (#6214)
  • Scan Operations - Enhanced docstrings with appropriate parameters (#6946)

Developers

Build System

  • LLVM Integration - Multiple LLVM version bumps with latest upstream changes (#7138, #7129, #6754, #6361)
  • CMake Updates - Improved build configuration and parallel building support (#6830, #6953)
  • Dependency Management - Better handling of external dependencies (#7078)

Testing Infrastructure

  • Lit Tests - Enhanced lit test coverage and organization (#6855, #6661)
  • Benchmarking - Enhanced benchmarking infrastructure with roofline analysis (#6703)
  • CI/CD Improvements - Better hardware support and workflow organization (#6582)

Code Organization

  • Module Structure - Better organization of modules and passes (#6500)
  • Type System - Enhanced type checking and inference (#6285, #6231)
  • Error Handling - Improved error messages and debugging support throughout the codebase
Source: README.md, updated 2025-07-30