CubeCL v0.5.0 Release Notes

Features

  • Autotune Rework: Enhanced autotuning with type magic and persistent cache support. (#430, #567, #598, #604, #630, #635)
  • Fast Float Math: Added fast floating-point math operations for SPIR-V. (#432)
  • Tensor Memory Accelerator (TMA): Introduced TMA for faster matmul and im2col convolution. (#533, #584, #572)
  • Uniformity Analysis: Implemented for SPIR-V to optimize kernel execution. (#460)
  • Full Atomic Sum: Added support for full atomic sum operations. (#448)
  • Pipeline API for CUDA: New API to streamline CUDA pipeline operations. (#422)
  • Block-Wise Quantization: Initial support for per-tensor and block-wise quantization in matmul. (#536, #578)
  • Clustering Support: Basic clustering with metadata for distributed workloads. (#560)
  • Min/Max Reduction: Added min/max reduction operations. (#594)
  • Double Buffering Multi-Tasks: Enhanced double buffering for multi-task matmul. (#626)
  • CubeCL Standard Library: Introduced cubecl-std for common utilities. (#431)
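
To illustrate the difference between per-tensor and block-wise quantization mentioned above, here is a minimal sketch in plain Rust. It does not use the CubeCL API: the function names, the symmetric int8 scheme, and the `block_size` parameter are all assumptions made for illustration, and the actual matmul implementation may differ.

```rust
/// Symmetric int8 quantization of one block (illustrative scheme, not CubeCL's).
/// Returns the scale and the quantized values.
fn quantize_block(block: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // An all-zero block gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = block
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, q)
}

/// Block-wise quantization: one scale per `block_size` consecutive values.
/// Per-tensor quantization is the special case `block_size == data.len()`.
fn quantize_blockwise(data: &[f32], block_size: usize) -> (Vec<f32>, Vec<i8>) {
    assert!(block_size > 0, "block_size must be non-zero");
    let mut scales = Vec::new();
    let mut values = Vec::with_capacity(data.len());
    for block in data.chunks(block_size) {
        let (scale, q) = quantize_block(block);
        scales.push(scale);
        values.extend(q);
    }
    (scales, values)
}

/// Inverse mapping: multiply each quantized value by the scale of its block.
fn dequantize_blockwise(scales: &[f32], values: &[i8], block_size: usize) -> Vec<f32> {
    values
        .chunks(block_size)
        .zip(scales)
        .flat_map(|(block, &scale)| block.iter().map(move |&q| q as f32 * scale))
        .collect()
}
```

With a single scale for the whole tensor, one large value forces a coarse step size on every element. Smaller blocks keep the step size local, at the cost of storing more scales.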

Performance Improvements

  • Matmul Optimizations:
      • Async buffer loading and multi-row selection. (#535, #616, #623, #638)
      • Refactored loaders, stage buffering, and tiling layout for efficiency. (#528, #573, #575, #577, #583, #586, #587, #593, #597, #609, #611, #613, #632)
      • Simplified configuration and quantized test metadata. (#469, #481, #538)
      • Double buffering fragments and precision type support. (#547, #550, #636)
  • Convolution: Refactored for Burn and added conv2d benchmark. (#500, #531, #631)
  • Reduce Operations: Optimized reduce kernels with stride 0 and bound checks. (#534, #580, #594)
  • Memory Management: Streamlined memory handling and ExclusivePages allocator improvements. (#419, #445, #512, #529)
  • Fusion: Improved kernel fusion for better performance. (#463, #484, #499)

Bug Fixes

  • Matmul:
      • Fixed naive kernel, lower precision, and compilation issues. (#546, #553, #557, #636)
      • Corrected cyclic loading and strided loader bugs. (#440, #444, #482, #507, #509)
  • Reduce: Fixed shared sum test and general reduce issues. (#467, #554)
  • SPIR-V: Resolved spirv-dump and mixed kernel feature registration. (#466)
  • WASM: Fixed compilation and arc-related issues. (#454, #559, #592)
  • HIP: Corrected shuffle intrinsics and bf16 reduce, and updated for ROCm 6.4.0. (#450, #601, #614, #617, #627)
  • Metal: Fixed simdgroup instructions, mulhi, ffs, and cmma synchronization. (#540, #566, #591, #606, #607, #612, #624)
  • Reinterpret Operations: Fixed slice and read/write issues. (#561, #568, #569, #570, #603)
  • Cache and Autotune: Addressed cache file issues and autotune timing/locking. (#517, #521, #598, #604, #630)
  • Miscellaneous: Fixed bitwise unary ops, path issues on Arch Linux, and debug print macro. (#421, #428, #462, #475)

Platform Support

  • ARM64 Compilation: Fixed compilation issues for ARM64. (#413)
  • HIP Bindings: Improved bindings and documentation. (#427, #588)
  • WGPU: Upgraded to versions 24 and 25, with dynamic compiler selection. (#436, #470, #589)
  • Metal MSL CPP Compiler: Added support with WGPU runtime. (#540)
  • Rust: Updated to Rust 1.85.1 and edition 2024. (#532)

Refactorings

  • IR Refactor:
      • Separated IR into its own crate with reflection and semantic categories. (#435, #442)
      • Made IR compatible with no_std. (#456)
  • Matmul:
      • Unified multi-buffer and single-buffer algorithms. (#587)
      • Refactored loaders, stage matmul, and job configurations. (#528, #548, #593, #597, #613)
  • Memory Management:
      • Merged CubeContext and Scope. (#452)
      • Replaced Arcs and improved deallocation. (#443, #454, #512)
  • Error Handling: Consolidated error types into a single type. (#453)
  • CubeLaunch and CubeType: Major refactor of derive macros. (#530)
  • Runtime: Refactored CUDA, backend arguments, and binding passing. (#522, #526, #543)

Developer Experience

  • Debugging: Improved debug symbols, print macro, and general debug tools. (#462, #474, #562)
  • Testing: Added flexible matmul tests, TMA tests, and conv2d benchmarks. (#476, #572, #631)
  • Documentation: Updated README example and added pull request template. (#490, #605)
  • Dependencies:
      • Upgraded rand to 0.9.0 and cudarc to 0.13.9. (#473, #503)
      • Fixed getrandom for no_std. (#477)
  • Macros: Replaced return with terminate macro and added CubeOption. (#449, #494)

Miscellaneous

  • New Operations: Added leading_zeros, find_first_set, plane_ballot, and inclusive/exclusive sum/prod. (#446, #461)
  • Type System: Moved to custom typehash implementation. (#455)
  • Hardware Properties: Added max cube count and dimension to hardware metadata. (#515)
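
For readers unfamiliar with the new operations listed above, the following plain-Rust sketch shows the semantics they usually have on the CPU. It is not the CubeCL kernel API. In particular, the 1-based `find_first_set` convention (0 when no bit is set, as in the common `ffs` intrinsic) and the slice-based scan helpers are assumptions for illustration; the notes do not say how the sum/prod variants are scoped on the GPU.

```rust
/// Number of leading zero bits in a 32-bit value (32 for an input of 0).
fn leading_zeros(x: u32) -> u32 {
    x.leading_zeros()
}

/// 1-based position of the least significant set bit, or 0 if no bit is set.
/// This follows the common `ffs` convention, assumed here for illustration.
fn find_first_set(x: u32) -> u32 {
    if x == 0 { 0 } else { x.trailing_zeros() + 1 }
}

/// Inclusive sum: element i holds xs[0] + ... + xs[i].
fn inclusive_sum(xs: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    xs.iter().map(|&x| { acc += x; acc }).collect()
}

/// Exclusive sum: element i holds xs[0] + ... + xs[i - 1], starting from 0.
fn exclusive_sum(xs: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    xs.iter().map(|&x| { let before = acc; acc += x; before }).collect()
}

/// Inclusive product: element i holds xs[0] * ... * xs[i].
fn inclusive_prod(xs: &[u32]) -> Vec<u32> {
    let mut acc = 1;
    xs.iter().map(|&x| { acc *= x; acc }).collect()
}

/// Exclusive product: element i holds xs[0] * ... * xs[i - 1], starting from 1.
fn exclusive_prod(xs: &[u32]) -> Vec<u32> {
    let mut acc = 1;
    xs.iter().map(|&x| { let before = acc; acc *= x; before }).collect()
}
```

The inclusive and exclusive variants differ only in whether the current element is included, so the exclusive result is the inclusive result shifted right by one with the identity (0 for sum, 1 for prod) in front.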
Source: README.md, updated 2025-04-23