CUTLASS 4.5.0

CuTe DSL

  • New features
    • New Block API block_copy() to simplify TMA and S2T copies. With block_copy(), users can ignore the details of multicast and 2-CTA partitioning for TMA and no longer need to invoke tma_partition(); it also removes the bulk of S2T initialization.
    • MXF8F6F4 mixed-precision support
      • BlockScaled MMA now supports MXF8×MXF4 or MXF8×MXF6
    • Block Scaled MMA for SM120 now works on Spark
    • EFC broadcast semantics support
      • EFC epilogue functions can now broadcast and remap tensor modes via the C.remap_modes[:, 0, 1] subscript syntax (where : marks a broadcast dimension and integers select source mode indices). This covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations; a plain-PyTorch sketch of these semantics follows this list.
    • Initial linter support: improved type hints on CuTe DSL APIs to support static type checkers such as mypy
    • dataclasses.dataclass is now supported for JIT compilation and cute.compile, for both the plain and tvm-ffi paths (see the sketch after this list)
    • cute.copy now supports user-specified loop unrolling
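To make the remap_modes semantics concrete, here is a small plain-PyTorch sketch. Only the indexing convention (: marks a broadcast dimension, integers select source modes) comes from the release note; the remap_modes_reference helper below is hypothetical, not the CuTe DSL implementation.

```python
import torch

def remap_modes_reference(src: torch.Tensor, spec) -> torch.Tensor:
    """Hypothetical mirror of remap_modes: each spec entry is either
    slice(None) (a broadcast dimension) or an int (source mode index)."""
    perm = [s for s in spec if isinstance(s, int)]
    out = src.permute(*perm)                 # reorder the selected modes
    for i, s in enumerate(spec):
        if isinstance(s, slice):             # ':' -> insert a broadcast dim
            out = out.unsqueeze(i)
    return out

x = torch.randn(4, 8)
y = remap_modes_reference(x, (slice(None), 0, 1))  # like C.remap_modes[:, 0, 1]
assert y.shape == (1, 4, 8)                        # broadcastable over mode 0
t = remap_modes_reference(x, (1, 0))               # a transpose via mode permutation
assert t.shape == (8, 4)
```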
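And a minimal sketch of the dataclass support, assuming the standard cutlass.cute import path, the @cute.jit decorator, cute.compile, and cute.printf; the GemmTuning class and its fields are purely illustrative.

```python
from dataclasses import dataclass

import cutlass.cute as cute

@dataclass
class GemmTuning:
    # hypothetical tuning knobs; any plain dataclass should work
    tile_m: int
    tile_n: int

@cute.jit
def print_tuning(cfg: GemmTuning):
    # dataclass fields are accessible inside the JIT-compiled function
    cute.printf("tile: {} x {}", cfg.tile_m, cfg.tile_n)

# Compile once for these argument types, then invoke the compiled function.
compiled = cute.compile(print_tuning, GemmTuning(tile_m=128, tile_n=256))
compiled(GemmTuning(tile_m=128, tile_n=256))
```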

  • Bug fixes and improvements
    • Improved source-code correlation for profiling/debugging
    • Fixed an aarch64 segfault with tvm-ffi
    • Reorganized CuTe DSL examples/tutorials for better discoverability

  • More examples of authoring peak-performance kernels

    • MoE examples
      • A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interfaces (a plain-PyTorch reference sketch follows this list).
      • Per-expert tensormap descriptors are set up by a cheap helper kernel (~2 µs) to avoid the long latency of tile switching, so the kernel structure is much closer to a normal GEMM.
      • Compared to torch_210_cu13, very few problems show worse performance on B200:
        • mxfp8_2dx3d: avg 1.29× speedup
        • mxfp8_2dx2d: avg 1.41× speedup
        • nvfp4_2dx3d: avg 1.11× speedup
        • nvfp4_2dx2d: avg 1.12× speedup (worst case 0.98×)
        • bf16_2dx3d: avg 1.15× speedup (worst case 0.98×)
        • bf16_2dx2d: avg 1.17× speedup (worst case 0.96×)
        • Note: performance is measured with the torch profiler; this implementation includes the helper kernel plus the main kernel, while torch's includes its setup kernel and the cutlass_cpp main kernel.
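For readers unfamiliar with the layout names above, "2dx3d" refers to a 2D activation tensor holding all experts' tokens back to back and a 3D weight tensor with one matrix per expert. The following plain-PyTorch reference (the grouped_mm_2dx3d_ref helper and its offsets convention are hypothetical, not the CUTLASS kernel or torch's grouped_mm) shows the intended semantics:

```python
import torch

def grouped_mm_2dx3d_ref(a: torch.Tensor,       # (total_tokens, K)
                         b: torch.Tensor,       # (num_experts, K, N)
                         offsets: torch.Tensor  # (num_experts,) end offset per expert
                         ) -> torch.Tensor:
    out = a.new_empty(a.shape[0], b.shape[2])
    start = 0
    for e in range(b.shape[0]):
        end = int(offsets[e])
        out[start:end] = a[start:end] @ b[e]    # one GEMM per expert group
        start = end
    return out

a = torch.randn(10, 4)                          # 10 tokens routed across 3 experts
b = torch.randn(3, 4, 8)                        # one (K, N) weight matrix per expert
offsets = torch.tensor([3, 7, 10])              # expert group boundaries
c = grouped_mm_2dx3d_ref(a, b, offsets)
assert c.shape == (10, 8)
```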
  • API changes
    • ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma in blackwell_helpers.py. Specify a_dtype and b_dtype separately instead (see the migration sketch below).
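A hedged migration sketch follows; the function name and the a_dtype/b_dtype keywords come from the note above, while the import path and the remaining (elided) arguments are assumptions and placeholders.

```python
import cutlass
import cutlass.utils.blackwell_helpers as helpers  # import path is an assumption

# Before (deprecated): one ab_dtype covered both operands.
# tiled_mma = helpers.make_trivial_tiled_mma(ab_dtype=cutlass.Float16, ...)

# After: specify the operand dtypes separately.
# tiled_mma = helpers.make_trivial_tiled_mma(
#     a_dtype=cutlass.Float16,
#     b_dtype=cutlass.Float16,
#     ...,  # remaining arguments unchanged
# )
```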

CUTLASS C++

  • Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels.
    • Mixed TMA+CpAsync can now accept static, but non-trivial, cluster shapes.
    • Uses TMA multicast for the A tile when the cluster size along the N mode is non-trivial.
    • Uses an additional barrier (mma_trampoline_barrier) to track cp.async arrivals in both CTAs.
    • Changes are included in example 92.
  • Add support for 128x32xK and 128x64xK tile sizes in the SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement for Blackwell SM121-related kernels.
  • Add support for static loads to tensor memory; included in example 77.
  • Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
  • Add example 95 to demonstrate green-context SM partitioning.
    • Enables launching GEMM on a stream with a partial SM allocation.
  • Add Snake activation functor for EVT.
  • Fix some kernel issues:
    • Fix l2_capacity=0 handling in Blackwell SM100/SM120 kernel templates
    • Fix CUTLASS clang build issues
    • Fix the atomicCAS read-modify-write loop in ConstSubbyteReference
    • Replace __nv_atomic_load_n with volatile loads for CUDA 11.4 compatibility in subbyte references
    • Remove PipelineStorage shadowing in the SM100 complex epilogue
    • Fix a build issue in the SM90 TMA warp-specialized epilogue fusion visitor
  • Fix some profiler issues:
    • Add missing reference kernels for the blockwise GEMM profiler.