## CuTe DSL
- New features
  - New Block API: `block_copy()` simplifies TMA and S2T copies. With `block_copy()`, users can ignore multicast and 2-CTA partitioning details for TMA and no longer need to invoke `tma_partition()`; most of the S2T initialization boilerplate can also be removed.
  - MXF8F6F4 mixed-precision support
    - BlockScaled MMA now supports MXF8MXF4 or MXF8MXF6
    - BlockScaled MMA for SM120 now works on Spark
  - EFC broadcast semantics support
    - EFC epilogue functions can now broadcast and remap tensor modes via `C.remap_modes[:, 0, 1]` subscript syntax, where `:` marks a broadcast dimension and integers select source mode indices. This covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose); the PyTorch reference evaluator mirrors the same transformations. A plain-PyTorch sketch of these semantics appears after this section's list.
  - Initial linter support: improved type hints on CuTe DSL APIs to support static type checkers like MyPy
  - `dataclasses.dataclass` is now supported for JIT compilation and `cute.compile`, on both the plain and tvm-ffi paths (see the minimal sketch after this section's list)
  - `cute.copy` now supports user-specified loop unrolling
- Bug fixes and improvements
  - Improved source code correlation for profiling/debugging
  - Fixed an aarch64 segfault issue with tvm-ffi
  - Re-organized the CuTe DSL examples/tutorials for better discoverability
  - More examples of authoring peak-performance kernels
    - MOE examples
      - A new style of grouped GEMM that aligns with torch's `grouped_mm` and `scaled_grouped_mm` interfaces (a plain-PyTorch sketch of the grouped interface appears after this section's list).
      - Expert-wise tensormap descriptors are set up by a cheap helper kernel (~2 µs) to avoid long latency when switching tiles, so the kernel structure is much closer to a normal GEMM.
      - Compared to the torch_210_cu13 baseline, very few problems show worse performance on B200:
        - mxfp8_2dx3d: avg 1.29x speedup
        - mxfp8_2dx2d: avg 1.41x speedup
        - nvfp4_2dx3d: avg 1.11x speedup
        - nvfp4_2dx2d: avg 1.12x speedup (worst case 0.98x)
        - bf16_2dx3d: avg 1.15x speedup (worst case 0.98x)
        - bf16_2dx2d: avg 1.17x speedup (worst case 0.96x)
      - Note: performance is measured with the torch profiler; this implementation includes the helper kernel + main kernel, while torch's includes its setup kernel and the cutlass_cpp main kernel.
- API changes
  - `ab_dtype` is deprecated in `make_trivial_tiled_mma` and `make_blockscaled_trivial_tiled_mma` from blackwell_helpers.py. Please specify `a_dtype` and `b_dtype` separately instead.
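For readers unfamiliar with the new EFC subscript syntax, here is a minimal plain-PyTorch sketch of the described semantics (`:` broadcasts, integers pick source modes). The helper name and the destination-shape argument are illustrative assumptions, not the CuTe DSL API.

```python
import torch

def remap_modes_reference(src, spec, dst_shape):
    """Illustrative reference only: `spec` has one entry per destination mode.
    An int selects the source mode that feeds that position; slice(None)
    (written ':' in subscript syntax) marks a broadcast dimension."""
    kept = [s for s in spec if isinstance(s, int)]
    permuted = src.permute(*kept)                  # arbitrary mode permutation
    sizes = iter(permuted.shape)
    shape = [1 if s == slice(None) else next(sizes) for s in spec]
    return permuted.reshape(shape).expand(dst_shape)

acc = torch.zeros(4, 8, 16)               # destination modes (L, M, N)
col = torch.arange(8.0)                   # one value per M: row/column broadcast
mat = torch.arange(128.0).reshape(16, 8)  # (N, M) source: transpose + broadcast over L

print(remap_modes_reference(col, [slice(None), 0, slice(None)], acc.shape).shape)  # (4, 8, 16)
print(remap_modes_reference(mat, [slice(None), 1, 0], acc.shape).shape)            # (4, 8, 16)
```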
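A minimal sketch of the dataclass support, assuming the usual CuTe DSL entry points (`@cute.jit`, `cute.compile`, `cute.printf`); the dataclass name and fields here are hypothetical and only illustrate passing a dataclass instance through JIT compilation.

```python
from dataclasses import dataclass

import cutlass.cute as cute


@dataclass
class TileConfig:          # hypothetical config object, not a CUTLASS type
    tile_m: int
    tile_n: int


@cute.jit
def report(cfg: TileConfig):
    # Fields of the dataclass are visible inside the JIT-compiled function.
    cute.printf("tile = {} x {}", cfg.tile_m, cfg.tile_n)


cfg = TileConfig(tile_m=128, tile_n=256)
report(cfg)                              # JIT-compile and run directly ...
compiled = cute.compile(report, cfg)     # ... or compile once and reuse
compiled(cfg)
```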
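For orientation on the MOE grouped-GEMM item above, this is a plain-PyTorch reference for what a "2dx3d" grouped GEMM computes. The offsets convention (cumulative token counts per expert) and the function name are assumptions for illustration, not the example's actual entry point.

```python
import torch

def grouped_mm_reference(a_2d, b_3d, offsets):
    """Reference semantics for a '2dx3d' grouped GEMM: a_2d stacks every
    expert's tokens along dim 0 (grouped by cumulative `offsets`), b_3d holds
    one (K, N) weight matrix per expert, and each token group is multiplied
    by its own expert's weights."""
    outs, start = [], 0
    for expert, end in enumerate(offsets.tolist()):
        outs.append(a_2d[start:end] @ b_3d[expert])   # (tokens_e, K) @ (K, N)
        start = end
    return torch.cat(outs, dim=0)

a = torch.randn(6, 4)                 # 6 tokens total, K = 4
b = torch.randn(3, 4, 5)              # 3 experts, each with a (K=4, N=5) weight
offsets = torch.tensor([2, 5, 6])     # cumulative token counts per expert
print(grouped_mm_reference(a, b, offsets).shape)   # torch.Size([6, 5])
```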
## CUTLASS C++
- Add 2SM MMA instruction support to the mixed TMA+CpAsync SM100 vanilla GEMM kernels.
  - Mixed TMA+CpAsync can now accept static, but non-trivial, cluster shapes.
  - Uses TMA multicast for the A tile when the cluster size along the N mode is non-trivial.
  - Uses an additional barrier (`mma_trampoline_barrier`) to track cp.async arrivals in both CTAs.
  - Changes are included in example 92.
- Add support for 128x32xK and 128x64xK tile sizes for SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement on Blackwell SM121-related kernels.
- Add support for static loads to tensor memory, included in example 77.
- Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
- Add example 95 to support green context SM partitioning.
  - Enables launching GEMM on a stream with a partial SM allocation.
- Add a Snake activation functor for EVT (the usual Snake formula is sketched at the end of these notes).
- Fix some kernel issues:
  - Fix `l2_capacity=0` handling in Blackwell SM100/SM120 kernel templates
  - Fix CUTLASS clang build issues
  - Fix atomicCAS read-modify-write loop in `ConstSubbyteReference`
  - Replace `__nv_atomic_load_n` with `volatile` for CUDA 11.4 compatibility in the subbyte reference
  - Remove `PipelineStorage` shadowing in the SM100 complex epilogue
  - Fix a build issue in the SM90 epilogue fusion visitor (TMA warp-specialized)
- Fix some profiler issues:
  - Add missing reference kernels for the blockwise GEMM profiler.
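For reference on the Snake functor added to EVT above: the commonly cited Snake activation is x + sin²(αx)/α. Whether the EVT functor exposes the α parameter in exactly this form is not stated in these notes, so treat this as a sketch of the usual definition rather than the functor's exact contract.

```python
import torch

def snake(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Snake activation as commonly defined: x + sin^2(alpha * x) / alpha."""
    return x + torch.sin(alpha * x).pow(2) / alpha

print(snake(torch.linspace(-2.0, 2.0, 5)))
```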