| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.md | 2025-04-23 | 5.1 kB | |
| v0.5.0 source code.tar.gz | 2025-04-23 | 1.4 MB | |
| v0.5.0 source code.zip | 2025-04-23 | 1.7 MB | |
| Totals: 3 Items | | 3.1 MB | 0 |
CubeCL Release Notes
Features
- Autotune Rework: Enhanced autotuning with type magic and persistent cache support. (#430, [#567], [#598], [#604], [#630], [#635])
- Fast Float Math: Added fast floating-point math operations for SPIR-V. (#432)
- Tensor Memory Accelerator (TMA): Introduced TMA for faster matmul and im2col convolution. (#533, [#584], [#572])
- Uniformity Analysis: Implemented for SPIR-V to optimize kernel execution. (#460)
- Full Atomic Sum: Added support for full atomic sum operations. (#448)
- Pipeline API for CUDA: New API to streamline CUDA pipeline operations. (#422)
- Block-Wise Quantization: Initial support for per-tensor and block-wise quantization in matmul. (#536, [#578])
- Clustering Support: Basic clustering with metadata for distributed workloads. (#560)
- Min/Max Reduction: Added min/max reduction operations. (#594)
- Double Buffering Multi-Tasks: Enhanced double buffering for multi-task matmul. (#626)
- CubeCL Standard Library: Introduced `cubecl-std` for common utilities. (#431)
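
The block-wise quantization entry above distinguishes per-tensor scaling from block-wise scaling. The following is a minimal CPU-side sketch of that difference in plain Rust, assuming symmetric int8 quantization with `scale = max|x| / 127`. It is not CubeCL's API and may not match the scheme the matmul kernels use; all function names are hypothetical.

```rust
/// Symmetric int8 quantization of one block (assumed scheme):
/// scale = max|x| / 127. Returns the quantized values and the scale.
fn quantize_block(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero for an all-zero block.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Per-tensor: a single scale shared by every element.
fn quantize_per_tensor(values: &[f32]) -> (Vec<i8>, f32) {
    quantize_block(values)
}

/// Block-wise: one scale per chunk of `block_size` elements.
fn quantize_block_wise(values: &[f32], block_size: usize) -> (Vec<i8>, Vec<f32>) {
    assert!(block_size > 0, "block_size must be non-zero");
    let mut quantized = Vec::with_capacity(values.len());
    let mut scales = Vec::new();
    for block in values.chunks(block_size) {
        let (q, scale) = quantize_block(block);
        quantized.extend(q);
        scales.push(scale);
    }
    (quantized, scales)
}

/// Reverses `quantize_block_wise` up to rounding error.
fn dequantize_block_wise(quantized: &[i8], scales: &[f32], block_size: usize) -> Vec<f32> {
    quantized
        .chunks(block_size)
        .zip(scales)
        .flat_map(|(block, scale)| block.iter().map(move |q| *q as f32 * scale))
        .collect()
}
```

With a tensor that mixes small and large magnitudes, a single per-tensor scale rounds the small values to zero, while per-block scales keep them representable at the cost of storing one extra scale per block.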
Performance Improvements
- Matmul Optimizations:
  - Async buffer loading and multi-row selection. (#535, [#616], [#623], [#638])
  - Refactored loaders, stage buffering, and tiling layout for efficiency. (#528, [#573], [#575], [#577], [#583], [#586], [#587], [#593], [#597], [#609], [#611], [#613], [#632])
  - Simplified configuration and quantized test metadata. (#469, [#481], [#538])
  - Double buffering fragments and precision type support. (#547, [#550], [#636])
- Convolution: Refactored for Burn and added conv2d benchmark. (#500, [#531], [#631])
- Reduce Operations: Optimized reduce kernels with stride 0 and bound checks. (#534, [#580], [#594])
- Memory Management: Streamlined memory handling and ExclusivePages allocator improvements. (#419, [#445], [#512], [#529])
- Fusion: Improved kernel fusion for better performance. (#463, [#484], [#499])
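
The reduce entry above mentions stride 0 and bound checks. As a general illustration (not CubeCL's kernel code, and not a statement of what the linked changes do internally), the sketch below shows how a stride of 0 makes every index along a dimension read the same element, and how a checked read turns an out-of-range offset into an error instead of undefined behaviour. The function names are hypothetical.

```rust
/// Linear offset of a multi-dimensional index given per-dimension strides.
fn offset(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}

/// Sums each row of a 2D view described by `shape` and `strides`.
/// Every read is bounds-checked; `None` is returned if any offset
/// falls outside `data`.
fn sum_rows(data: &[f32], shape: [usize; 2], strides: [usize; 2]) -> Option<Vec<f32>> {
    let mut out = Vec::with_capacity(shape[0]);
    for row in 0..shape[0] {
        let mut acc = 0.0;
        for col in 0..shape[1] {
            // `get` performs the bounds check; `?` propagates a failure.
            acc += *data.get(offset(&[row, col], &strides))?;
        }
        out.push(acc);
    }
    Some(out)
}
```

Calling `sum_rows(&[1.0, 2.0, 3.0], [2, 3], [0, 1])` treats the three stored values as a broadcast 2x3 view, so both rows reduce over the same data, whereas strides of `[3, 1]` on the same buffer would step past its end and yield `None`.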
Bug Fixes
- Matmul:
  - Fixed naive kernel, lower precision, and compilation issues. (#546, [#553], [#557], [#636])
  - Corrected cyclic loading and strided loader bugs. (#440, [#444], [#482], [#507], [#509])
- Reduce: Fixed shared sum test and general reduce issues. (#467, [#554])
- SPIR-V: Resolved `spirv-dump` and mixed kernel feature registration. (#466)
- WASM: Fixed compilation and arc-related issues. (#454, [#559], [#592])
- HIP: Corrected shuffle intrinsics, bf16 reduce, and ROCm 6.4.0 updates. (#450, [#601], [#614], [#617], [#627])
- Metal: Fixed simdgroup instructions, mulhi, ffs, and cmma synchronization. (#540, [#566], [#591], [#606], [#607], [#612], [#624])
- Reinterpret Operations: Fixed slice and read/write issues. (#561, [#568], [#569], [#570], [#603])
- Cache and Autotune: Addressed cache file issues and autotune timing/locking. (#517, [#521], [#598], [#604], [#630])
- Miscellaneous: Fixed bitwise unary ops, path issues on Arch Linux, and debug print macro. (#421, [#428], [#462], [#475])
Platform Support
- ARM64 Compilation: Fixed compilation issues for ARM64. (#413)
- HIP Bindings: Improved bindings and documentation. (#427, [#588])
- WGPU: Upgraded to versions 24 and 25, with dynamic compiler selection. (#436, [#470], [#589])
- Metal MSL CPP Compiler: Added support with WGPU runtime. (#540)
- Rust: Updated to Rust 1.85.1 and edition 2024. (#532)
Refactorings
- IR Refactor:
  - Separated IR into its own crate with reflection and semantic categories. (#435, [#442])
  - Made IR compatible with `no_std`. (#456)
- Matmul:
  - Unified multi-buffer and single-buffer algorithms. (#587)
  - Refactored loaders, stage matmul, and job configurations. (#528, [#548], [#593], [#597], [#613])
- Memory Management:
  - Merged `CubeContext` and `Scope`. (#452)
  - Replaced Arcs and improved deallocation. (#443, [#454], [#512])
- Error Handling: Consolidated error types into a single type. (#453)
- CubeLaunch and CubeType: Major refactor of derive macros. (#530)
- Runtime: Refactored CUDA, backend arguments, and binding passing. (#522, [#526], [#543])
Developer Experience
- Debugging: Improved debug symbols, print macro, and general debug tools. (#462, [#474], [#562])
- Testing: Added flexible matmul tests, TMA tests, and conv2d benchmarks. (#476, [#572], [#631])
- Documentation: Updated README example and added pull request template. (#490, [#605])
- Dependencies:
  - Upgraded `rand` to 0.9.0 and `cudarc` to 0.13.9. (#473, [#503])
  - Fixed `getrandom` for `no_std`. (#477)
- Macros: Replaced `return` with the `terminate` macro and added `CubeOption`. (#449, [#494])
Miscellaneous
- New Operations: Added `leading_zeros`, `find_first_set`, `plane_ballot`, and inclusive/exclusive sum/prod. (#446, [#461])
- Type System: Moved to custom `typehash` implementation. (#455)
- Hardware Properties: Added max cube count and dimension to hardware metadata. (#515)
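
For readers unfamiliar with the new operations listed above, the sketch below shows the usual meaning of each in plain Rust on the CPU. It illustrates common conventions only and does not use CubeCL's API: the 1-based `find_first_set` result (as in POSIX `ffs`), the 32-lane ballot mask, and all function names here are assumptions.

```rust
/// Number of zero bits above the highest set bit (std provides this directly).
fn leading_zeros(x: u32) -> u32 {
    x.leading_zeros()
}

/// 1-based position of the lowest set bit, or 0 when no bit is set
/// (assumed convention, matching POSIX `ffs`).
fn find_first_set(x: u32) -> u32 {
    if x == 0 { 0 } else { x.trailing_zeros() + 1 }
}

/// Ballot over up to 32 lanes: bit `i` is set when lane `i` votes true.
fn ballot(votes: &[bool]) -> u32 {
    assert!(votes.len() <= 32, "this sketch packs at most 32 lanes");
    votes
        .iter()
        .enumerate()
        .fold(0, |mask, (lane, vote)| mask | ((*vote as u32) << lane))
}

/// Inclusive prefix sum: element `i` holds values[0] + ... + values[i].
fn inclusive_sum(values: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    values.iter().map(|v| { acc += v; acc }).collect()
}

/// Exclusive prefix sum: element `i` holds values[0] + ... + values[i - 1],
/// so the first element is always 0.
fn exclusive_sum(values: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    values
        .iter()
        .map(|v| {
            let before = acc;
            acc += v;
            before
        })
        .collect()
}
```

The product variants follow the same pattern with multiplication and an initial accumulator of 1.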