Triton 3.4.0 Release Notes
Release files (2025-07-30):

  • triton-3.4.0.tar.gz (6.0 MB)
  • README.md (17.8 kB)
  • Triton 3.4.0 Release source code.tar.gz (6.1 MB)
  • Triton 3.4.0 Release source code.zip (6.7 MB)

Highlights

Comprehensive Gluon Framework Enhancements

The Gluon framework has received major enhancements across all areas including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)

Hardware Support Expansion

  • AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architecture-specific features (#7175, #7171, #7127, #6744, #6594)
  • Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
  • Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)

Performance Optimizations

  • Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
  • MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
  • TMA Operations Enhancement - Improved Tensor Memory Accelerator (TMA) operations with better layout support and reduced register pressure (#6725, #6238, #6580)

New Features

Language and Frontend

  • Aggregate Type Support - Added @tl.aggregate decorator for autogenerating Triton types from Python classes (#6970)
  • JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
  • Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
  • Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
  • Masked Histograms - Added support for masked histogram operations (#6695)
  • Syntactic Sugar Additions - Added .item() as syntactic sugar for .reshape([]) (#6873)
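For context on the masked histogram addition (#6695): the usual masking convention is that elements whose mask entry is False simply do not contribute to any bin. A pure-Python reference sketch of that semantics (this is not Triton's implementation; tl.histogram itself runs on-device):

```python
def masked_histogram(values, mask, num_bins):
    """Reference semantics for a masked histogram: count each value
    into its bin, skipping elements whose mask entry is False.
    Values are assumed to be integers in [0, num_bins)."""
    bins = [0] * num_bins
    for v, keep in zip(values, mask):
        if keep:
            bins[v] += 1
    return bins

# Masked-out elements (mask == False) leave the histogram unchanged.
counts = masked_histogram([0, 1, 1, 3, 3], [True, True, False, True, False], 4)
# counts == [1, 1, 0, 1]: the masked-out 1 and 3 are not counted.
```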

Backend and Compilation

  • Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
  • Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
  • TMA Reduce Operations - Added TMA reduce operations for descriptor-based reduction stores (#6580)
  • Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
  • BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
  • Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
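For readers unfamiliar with swizzling: a common scheme (shown below as a hypothetical sketch, not the exact algorithm of #6982) XORs part of the row index into the column index, so elements that share a logical column across rows land in different shared-memory banks while each row remains a permutation of its columns:

```python
def xor_swizzle(row, col, cols_per_row=32):
    """Map a logical (row, col) element to a swizzled column by XORing
    the row index into the column index. The mapping within each row
    is a permutation (XOR with a constant is invertible), but the same
    logical column maps to different physical columns across rows."""
    return col ^ (row % cols_per_row)

# Logical column 0 of successive rows maps to distinct physical columns,
# avoiding the bank conflicts a straight column-major access would hit:
spread = [xor_swizzle(r, 0) for r in range(4)]
# spread == [0, 1, 2, 3]
```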

Hardware-Specific Features

  • AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
  • AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
  • AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
  • NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
  • AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)

Improvements

Performance

  • Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
  • Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
  • SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
  • Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)

Developer Experience

  • Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
  • Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
  • Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
  • Build System Improvements - Better CMake support and dependency management (#6330, #6903)

Code Quality

  • Type System Enhancements - Improved type checking with mypy integration (#6596, #6704)
  • Layout System Improvements - Better layout handling with LinearLayout-based implementations (#6252, #6169, #6170)
  • Code Organization - Extensive refactoring and cleanup for better maintainability (#6500, #6285)

Bug Fixes

Critical Fixes

  • AST Parsing Regression - Fixed parsing failures for float("inf") and float("-inf") expressions (#6344)
  • Memory Allocation Issues - Fixed tensor memory allocation boundary collisions and use-after-free errors (#6318, #6433)
  • TMA Layout Consistency - Fixed layout assignment from rank-reducing loads (#6362)
  • Dot Operation Fixes - Fixed bug where passing None as accumulator caused errors (#7130)
  • Version Detection - Fixed version detection when using source tarballs (#7164, #6381)
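The float("inf") regression (#6344) is easiest to see at the AST level, where Triton's frontend operates: float("inf") parses as a plain call with a string-constant argument, while the equivalent -float("inf") adds a unary-minus layer on top of the same call, so negative infinity appears in more than one AST shape. A stdlib-only illustration:

```python
import ast

# float("inf") is a Call node whose argument is the string constant "inf";
# a frontend walking the AST must recognize and fold it, not reject it.
node = ast.parse('float("inf")', mode="eval").body
assert isinstance(node, ast.Call) and node.func.id == "float"

# -float("inf") wraps the same Call in UnaryOp(USub), a second shape
# that also has to be handled when folding constant expressions.
neg = ast.parse('-float("inf")', mode="eval").body
assert isinstance(neg, ast.UnaryOp) and isinstance(neg.op, ast.USub)
```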

Hardware-Specific Fixes

  • AMD Range Analysis - Improved range analysis for persistent kernels and loop bounds (#6390, #6133)
  • AMD Buffer Operations - Fixed vector size computation and alignment issues (#6114, #6126)
  • AMD Atomic Operations - Fixed f16/bf16 buffer atomic operations (#6090, #6139)
  • NVIDIA Register Pressure - Fixed register allocation issues in warp specialization (#6403)
  • NVIDIA TMEM Operations - Fixed various tensor memory access issues (#6888)

Stability Improvements

  • Test Reliability - Resolved intermittent test failures across various components (#6861, #6889)
  • Memory Usage - Fixed memory leaks and reduced peak memory consumption (#6796)
  • Error Handling - Improved error messages and crash prevention (#6865)

Deprecations and Breaking Changes

Breaking Changes

  • Cumsum Type Promotion - Boolean inputs to cumsum are now upcast to uint32 so that running counts are correct (#6927)
  • Experimental API Cleanup - Removed outdated experimental descriptor APIs (#6488)
  • Python Support - Dropped Python 3.8 support, minimum version now 3.9 (#6649)
  • Tensor Descriptor APIs - Removed experimental prefix from tensor descriptor operations (#6194)
  • Register Spilling Performance Regression - An unfavorable interaction between recent LLVM changes and ptxas optimizations can increase register spilling in some kernels (#7138)
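The cumsum change (#6927) matters because accumulating in a boolean type saturates: True plus True is still True, so every prefix after the first True collapses to 1 and the running counts are lost. Upcasting to a 32-bit integer first preserves them. A pure-Python illustration of the two behaviors:

```python
from itertools import accumulate

flags = [True, False, True, True]

# Accumulating in bool saturates: once the running value is True,
# it stays True, so the scan no longer counts anything.
bool_scan = list(accumulate(flags, lambda a, b: bool(a + b)))
# bool_scan == [True, True, True, True]

# Upcasting to an integer first (what #6927 does, to uint32) keeps
# the actual running count of True elements.
int_scan = list(accumulate(int(f) for f in flags))
# int_scan == [1, 1, 2, 3]
```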

Deprecations

  • FP8 Format Warnings - Enhanced warnings for deprecated FP8 formats (#6931)
  • Configuration Module - Renamed config.py to knobs.py to avoid confusion (#6641)

Performance

Benchmark Results

  • Matrix Multiplication - Up to 15% speedup in dense 8k x 8k x 8k operations (#6804)
  • Attention Kernels - Achieved 700+ TFLOPS with DHEAD=64 and 960-1080 TFLOPS with DHEAD=128 (#6660)
  • Routing Operations - 5% runtime reduction with optimized kernels (#6866)
  • MoE Kernels - Up to 30% performance boost with optimized TMA layouts (#7123)
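For reference on the attention figures above, the conventional FLOP count for a forward attention pass is 4 * seqlen^2 * head_dim per head (two matmuls, Q @ K^T and P @ V, at 2 * N^2 * d each), and TFLOPS is that count divided by runtime. A small sketch of the arithmetic; the sizes below are illustrative, not the benchmark's actual configuration:

```python
def attention_fwd_tflops(batch, heads, seqlen, head_dim, seconds):
    """Conventional forward-attention FLOP count: two N x N x d
    matmuls (Q @ K^T and P @ V), each costing 2 * N^2 * d FLOPs
    per head, converted to TFLOPS for a measured runtime."""
    flops = 4 * batch * heads * seqlen**2 * head_dim
    return flops / seconds / 1e12

# Illustrative sizes only: batch 4, 32 heads, seqlen 4096, head_dim 64,
# at a hypothetical 2 ms runtime.
rate = attention_fwd_tflops(4, 32, 4096, 64, seconds=0.002)
# rate is about 274.9 TFLOPS for these made-up numbers.
```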

Memory Optimizations

  • Register Usage - Reduced register pressure in various operations (#6817)
  • Shared Memory - Improved shared memory utilization with better swizzling (#6982)
  • Cache Efficiency - Enhanced cache utilization with L2 cache hints (#6278)

Documentation

New Guides

  • Community Meetups - Added documentation for running Triton Community Meetups (#7103)
  • Installation Instructions - Updated with better memory management guidance (#6235)
  • Hardware Support - Updated PyTorch installation for Blackwell support (#6778)

API Documentation

  • Tensor Descriptors - Comprehensive documentation for tensor descriptor APIs (#6911, #7028)
  • Cache Modifiers - Updated tl.load documentation with correct cache modifier usage (#6214)
  • Scan Operations - Enhanced docstrings with appropriate parameters (#6946)

Developers

Build System

  • LLVM Integration - Multiple LLVM version bumps with latest upstream changes (#7138, #7129, #6754, #6361)
  • CMake Updates - Improved build configuration and parallel building support (#6830, #6953)
  • Dependency Management - Better handling of external dependencies (#7078)

Testing Infrastructure

  • Lit Tests - Enhanced lit test coverage and organization (#6855, #6661)
  • Benchmarking - Enhanced benchmarking infrastructure with roofline analysis (#6703)
  • CI/CD Improvements - Better hardware support and workflow organization (#6582)

Code Organization

  • Module Structure - Better organization of modules and passes (#6500)
  • Type System - Enhanced type checking and inference (#6285, #6231)
  • Error Handling - Improved error messages and debugging support throughout the codebase
Source: README.md, updated 2025-07-30