CUTLASS 4.2.0
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64, including
  - Python 3.10, 3.11, 3.12, and 3.13
- Added a new example and updated the getting-started notebook for CuTe DSL
  - Call kernels with DLPack bypassed
  - Updates to the TensorSSA demonstration
  - Added a section introducing broadcast
- API updates
  - Please refer to the DSL API changelog for details
- Bug fixes and improvements (a small usage sketch follows this list)
  - Fixed `cute.print_tensor` for coordinate tensors
  - Fixed `cute.print` for tuples of layouts
  - Fixed frozen objects not being updated properly after a full assignment in dynamic control flow
  - Fixed a compilation failure when assigning to a tuple/list element in dynamic control flow
  - Improved the error message when the CUDA context is not initialized
  - Improved the docstrings of `congruent` and `weakly_congruent`
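As a quick illustration of the `cute.print_tensor` fix above, here is a minimal sketch (not taken from the release) of printing a coordinate tensor from the CuTe DSL. It assumes `cute.make_identity_tensor` and `cute.print_tensor` are callable from a host-side `@cute.jit` function in the 4.2.0 package; the shape is arbitrary.

```python
# Minimal sketch: print a coordinate tensor with the CuTe DSL.
# Assumes cute.make_identity_tensor and cute.print_tensor are usable from
# host-side @cute.jit code in cutlass 4.2.0.
import cutlass.cute as cute


@cute.jit
def show_coordinate_tensor():
    # A coordinate tensor has no backing memory; each element is its own coordinate.
    coords = cute.make_identity_tensor((2, 3))
    cute.print_tensor(coords)  # printing coordinate tensors is fixed in this release


if __name__ == "__main__":
    show_coordinate_tensor()
```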
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
  - Collective mainloop code: blockscaled data types with support for the dense GEMM mainloop.
  - New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
  - Kernel code: blockscaled data types with support for the dense GEMM kernel.
  - A set of examples demonstrating usage of the 3.x API for targeting the Blackwell SM103 architecture.
  - A set of unit tests demonstrating Blackwell SM103 blockscaled GEMM:
    - Unit test files with the prefix `sm103_` under the GEMM device unit tests.
- Support for Blackwell SM121 kernels for DGX Spark GPUs.
  - Shares most of its code with the Blackwell SM120 kernels.
- Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
  - For details, please refer to the heuristics doc.
- Further enhance Blackwell SM100 attention kernels in example 77.
  - Add fused reduction kernel support for CUTLASS MLA.
  - Add softmax skip correction.
  - Support for GQA in the FMHA backward kernel.
  - Fix an issue where `get_unmasked_trip_count` may return a negative value.
  - Fix an issue where mbarriers are initialized with a zero arrival count.
  - Fix a corner case where the sequence length of q is not a multiple of tile_q.
  - Remove TMA padding for forward kernel inputs.
- Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load the input matrices and allows only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
- Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell.
  - On Blackwell SM120, a blockwise GEMM kernel is added: example 87.
  - On Hopper, add K-major scale factor support for SM90 blockwise kernels.
  - On Hopper, relax the restriction that the k dimension of the problem size must be a multiple of the k dimension of the tile size.
  - On Hopper, the grouped version supports the case where k = 0.
- Support for Blackwell SM100 fp4 GEMV kernels.
  - Kernel code: GEMV kernel.
  - Example code: example 91.
- Support for Blackwell SM100 legacy mixed input GEMM kernels.
  - Collective mainloop code: mixed input mainloop.
  - Kernel code: mixed input kernel.
  - Example code: example 86.
- Support for Blackwell SM100 cpasync kernels.
  - Collective mainloop code: cpasync mainloop.
  - Kernel code: cpasync kernel.
- Support for Blackwell SM120 mixed input blockscaled grouped GEMM.
- Instantiate more Blackwell kernels in the profiler (a build-configuration sketch appears at the end of these notes).
  - Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
    - To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty. The profiler combines `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
    - For details, please check the profiler doc.
- Fix some profiler issues:
  - Modify the default cluster fallback values to be non-zero, avoiding profiler failures when these values are not set on the command line.
  - Fix some no-output and timeout issues.
  - Fix Pingpong Blockwise Hopper library generation.
- From CUDA 13.0, the Blackwell SM101 target for Thor GPUs is renamed to SM110 (see the architecture-selection sketch at the end of these notes).
  - For CUDA toolkit versions < 13.0, SM101 is still used for Thor GPUs.
  - For CUDA toolkit versions >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
- Rename the legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface (a short usage sketch appears at the end of these notes).
  - Restructured the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's `EpilogueDescriptors`.
  - Added a Blackwell SM100 EVT emitter on the Python side and routed most emission through the Hopper SM90 emitter.
  - Added some support for running SM100 kernels via the Python interface.
- CuTe changes:
  - Fix an inaccurate GridDim calculation in the CuTe tutorial.
  - Add movmatrix support.
  - Fix the smallest MMA-N allowed for Blackwell fp8 and fp16 GEMM kernels.
  - Support fp16 accumulator for SM89 fp8 MMA.
  - Shorten the `nullspace` implementation.
  - Isolate and comment on `cosize` hacks.
  - Important documentation correction: `E<0,1> == 1@0@1`.
- Fix some kernel issues:
  - Fix the Hopper SM90 grouped GEMM kernel to only use commit group and wait group instead of also waiting on mbarriers.
  - Fix a small bug when K is large for the Blackwell SM103 fp4 grouped GEMM kernel.
- Add further unit tests.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.
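Configuration and usage sketches

For the profiler instantiation-level item above, the following is a hedged build-configuration sketch. The CMake option names come from the CUTLASS build system; the kernel filter string and the `max` level value are illustrative only, so check the profiler doc for the exact level encoding and kernel naming.

```bash
# Illustrative sketch: expand the set of instantiated Blackwell kernels in the profiler.
# CUTLASS_LIBRARY_KERNELS must be non-empty for CUTLASS_LIBRARY_INSTANTIATION_LEVEL to apply.
cmake .. \
  -DCUTLASS_NVCC_ARCHS="100a" \
  -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_*gemm*" \
  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max"   # illustrative value; see the profiler doc
make cutlass_profiler -j
```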
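For the Thor SM101 → SM110 rename above, a hedged sketch of selecting the architecture at configure time. The `CUTLASS_NVCC_ARCHS` values are assumptions based on the usual `<sm>a` naming and should be checked against your toolkit.

```bash
# Illustrative sketch: pick the Thor architecture name based on the CUDA toolkit version.
# CUDA toolkits older than 13.0 use SM101 for Thor:
cmake .. -DCUTLASS_NVCC_ARCHS="101a"
# CUDA 13.0 and newer use SM110 for Thor (SM101 is no longer valid):
cmake .. -DCUTLASS_NVCC_ARCHS="110a"
```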
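For the `cutlass` → `cutlass_cppgen` package rename above, here is a minimal sketch of the legacy Python interface under its new name. It assumes the pre-rename `op.Gemm` plan API carries over unchanged apart from the package name; shapes and dtypes are arbitrary.

```python
# Illustrative sketch: legacy Python interface after the rename to cutlass_cppgen.
# Assumes the pre-rename plan API (cutlass.op.Gemm / plan.run) is unchanged.
import numpy as np
import cutlass_cppgen as cutlass  # formerly: import cutlass

M, N, K = 256, 256, 128
A = np.random.rand(M, K).astype(np.float16)
B = np.random.rand(K, N).astype(np.float16)
C = np.zeros((M, N), dtype=np.float16)
D = np.zeros((M, N), dtype=np.float16)

# Declarative GEMM plan; the builder selects a kernel for the active GPU.
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
plan.run(A, B, C, D)  # launch the GEMM (defaults to D = 1.0 * A @ B + 0.0 * C)
```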