CubeCL v0.5.0 Release Notes

Features

Autotune Rework: Enhanced autotuning with type magic and persistent cache support. (#430, #567, #598, #604, #630, #635)
Fast Float Math: Added fast floating-point math operations for SPIR-V. (#432)
Tensor Memory Accelerator (TMA): Introduced TMA for faster matmul and im2col convolution. (#533, #584, #572)
Uniformity Analysis: Implemented for SPIR-V to optimize kernel execution. (#460)
Full Atomic Sum: Added support for full atomic sum operations. (#448)
Pipeline API for CUDA: New API to streamline CUDA pipeline operations. (#422)
Block-Wise Quantization: Initial support for per-tensor and block-wise quantization in matmul. (#536, #578)
Clustering Support: Basic clustering with metadata for distributed workloads. (#560)
Min/Max Reduction: Added min/max reduction operations. (#594)
Double Buffering Multi-Tasks: Enhanced double buffering for multi-task matmul. (#626)
CubeCL Standard Library: Introduced cubecl-std for common utilities. (#431)
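
To make the difference between per-tensor and block-wise quantization concrete, here is a minimal sketch in plain Rust. It is not the CubeCL API: the function names, the symmetric int8 scheme, and the `block_size` parameter are assumptions chosen for illustration only.

```rust
/// Symmetric int8 quantization of one block (assumed scheme, for illustration).
/// Returns the block's scale and its quantized values.
fn quantize_block(values: &[f32]) -> (f32, Vec<i8>) {
    // The largest magnitude in the block determines the scale.
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero when the block is all zeros.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, quantized)
}

/// Block-wise quantization: one scale per `block_size` consecutive values.
/// Passing `block_size == values.len()` degenerates to per-tensor quantization.
fn quantize_blockwise(values: &[f32], block_size: usize) -> Vec<(f32, Vec<i8>)> {
    assert!(block_size > 0, "block_size must be non-zero");
    values.chunks(block_size).map(quantize_block).collect()
}

/// Reverses the quantization by multiplying each value by its block's scale.
fn dequantize(blocks: &[(f32, Vec<i8>)]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|(scale, q)| q.iter().map(move |&x| x as f32 * scale))
        .collect()
}
```

With a single scale for the whole tensor, one large value forces small values into a handful of integer levels. Giving each block its own scale keeps more resolution for small values, at the cost of storing one scale per block.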

Matmul Optimizations:
  Async buffer loading and multi-row selection. (#535, #616, #623, #638)
  Refactored loaders, stage buffering, and tiling layout for efficiency. (#528, #573, #575, #577, #583, #586, #587, #593, #597, #609, #611, #613, #632)
  Simplified configuration and quantized test metadata. (#469, #481, #538)
  Double buffering fragments and precision type support. (#547, #550, #636)
Convolution: Refactored for Burn and added conv2d benchmark. (#500, #531, #631)
Reduce Operations: Optimized reduce kernels with stride 0 and bound checks. (#534, #580, #594)
Memory Management: Streamlined memory handling and ExclusivePages allocator improvements. (#419, #445, #512, #529)
Fusion: Improved kernel fusion for better performance. (#463, #484, #499)

Matmul:
  Fixed naive kernel, lower precision, and compilation issues. (#546, #553, #557, #636)
  Corrected cyclic loading and strided loader bugs. (#440, #444, #482, #507, #509)
Reduce: Fixed shared sum test and general reduce issues. (#467, #554)
SPIR-V: Resolved spirv-dump and mixed kernel feature registration. (#466)
WASM: Fixed compilation and arc-related issues. (#454, #559, #592)
HIP: Corrected shuffle intrinsics, bf16 reduce, and ROCm 6.4.0 updates. (#450, #601, #614, #617, #627)
Metal: Fixed simdgroup instructions, mulhi, ffs, and cmma synchronization. (#540, #566, #591, #606, #607, #612, #624)
Reinterpret Operations: Fixed slice and read/write issues. (#561, #568, #569, #570, #603)
Cache and Autotune: Addressed cache file issues and autotune timing/locking. (#517, #521, #598, #604, #630)
Miscellaneous: Fixed bitwise unary ops, path issues on Arch Linux, and debug print macro. (#421, #428, #462, #475)

ARM64 Compilation: Fixed compilation issues for ARM64. (#413)
HIP Bindings: Improved bindings and documentation. (#427, #588)
WGPU: Upgraded to versions 24 and 25, with dynamic compiler selection. (#436, #470, #589)
Metal MSL CPP Compiler: Added support with WGPU runtime. (#540)
Rust: Updated to Rust 1.85.1 and edition 2024. (#532)

IR Refactor:
  Separated IR into its own crate with reflection and semantic categories. (#435, #442)
  Made IR compatible with no_std. (#456)
Matmul:
  Unified multi-buffer and single-buffer algorithms. (#587)
  Refactored loaders, stage matmul, and job configurations. (#528, #548, #593, #597, #613)
Memory Management:
  Merged CubeContext and Scope. (#452)
  Replaced Arcs and improved deallocation. (#443, #454, #512)
Error Handling: Consolidated error types into a single type. (#453)
CubeLaunch and CubeType: Major refactor of derive macros. (#530)
Runtime: Refactored CUDA, backend arguments, and binding passing. (#522, #526, #543)

Debugging: Improved debug symbols, print macro, and general debug tools. (#462, #474, #562)
Testing: Added flexible matmul tests, TMA tests, and conv2d benchmarks. (#476, #572, #631)
Documentation: Updated README example and added pull request template. (#490, #605)
Dependencies:
  Upgraded rand to 0.9.0 and cudarc to 0.13.9. (#473, #503)
  Fixed getrandom for no_std. (#477)
Macros: Replaced return with terminate macro and added CubeOption. (#449, #494)

New Operations: Added leading_zeros, find_first_set, plane_ballot, and inclusive/exclusive sum/prod. (#446, #461)
Type System: Moved to custom typehash implementation. (#455)
Hardware Properties: Added max cube count and dimension to hardware metadata. (#515)
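
For readers unfamiliar with the operations named under New Operations, the sketch below models their usual semantics on the CPU in plain Rust. It does not call CubeCL. It assumes the classic `ffs` convention (1-based index, 0 for no set bit) and models a ballot as a 32-lane bitmask. The helper names are hypothetical, and the real kernels may differ in types and return shapes.

```rust
/// 1-based index of the least significant set bit, or 0 when no bit is set
/// (the classic `ffs` convention, assumed here).
fn find_first_set(x: u32) -> u32 {
    if x == 0 { 0 } else { x.trailing_zeros() + 1 }
}

/// Inclusive prefix sum: element i holds the sum of input[0..=i].
fn inclusive_sum(input: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    input
        .iter()
        .map(|&v| {
            acc += v;
            acc
        })
        .collect()
}

/// Exclusive prefix sum: element i holds the sum of input[0..i].
fn exclusive_sum(input: &[u32]) -> Vec<u32> {
    let mut acc = 0;
    input
        .iter()
        .map(|&v| {
            let before = acc;
            acc += v;
            before
        })
        .collect()
}

/// Ballot modeled on the CPU: bit i of the result is set when lane i's
/// predicate is true. Limited to 32 lanes in this sketch.
fn ballot(predicates: &[bool]) -> u32 {
    assert!(predicates.len() <= 32, "this sketch models at most 32 lanes");
    predicates
        .iter()
        .enumerate()
        .fold(0, |mask, (lane, &p)| mask | ((p as u32) << lane))
}
```

Leading zeros is covered by the standard library: `40u32.leading_zeros()` counts the zero bits above the highest set bit. The product variants of the scans follow the same pattern as the sums, with multiplication and an initial value of 1.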

Source: README.md, updated 2025-04-23