Summary
CubeCL 0.8.0 introduces major enhancements to quantization and matrix operations, a near-complete flash attention implementation, and comprehensive matmul refactoring built on a new views and layouts system. This release also brings a new MLIR-based CPU backend with LLVM, improved memory management with multi-stream support, and persistent storage capabilities.
What's New
Features
- Flash Attention: Full implementation with masking support, partitions, row-wise reductions, and multi-plane operations (@louisfd, [#845], [#962], [#902], [#920], [#907])
- MLIR CPU Backend: Initial implementation providing CPU runtime support for non-Linux systems (@marcantoinem, [#698], [#790])
- Advanced Quantization: Block-scaled MMA, global quantization for matmul, quantized views, and support for FP4/FP2 formats (@wingertge, @nathanielsimard, [#815], [#960], [#954], [#836], [#809]); see the FP4 sketch after this list
- Persistent Memory: Added persistent storage capabilities for artifacts (@nathanielsimard, [#947])
- Multi-Stream Support: Implemented multi-stream processing for WGPU and CUDA (@nathanielsimard, [#914], [#896])
- Tensor Memory Arrays (TMA): Added TMA views for optimized memory access (@wingertge, [#943])
- Pinned Memory: Support for pinned memory allocations (@nathanielsimard, [#885])
- Manual MMA Operations: Added manually managed MMA operations with custom tile support (@wingertge, [#935], [#810])
- Stacked and Tensor Layouts: New layout system for matmul and advanced tensor operations (@wingertge, [#855], [#835], [#839])
- Saturating Arithmetic: Added saturating add/sub operations (@wingertge, [#898])
- Shuffle Operations: Basic support for shuffle operations (@huy209vn, [#968])
- Additional Ops: `Trunc`, `IsNan`, `IsInf`, and `powi` for CUDA/HIP (@mooori, @laggui, @wingertge, [#956], [#937], [#857])
- Partition Scheduler: New scheduling system for shared memory reads in matmul (@louisfd, [#837])
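For context on the FP4 support listed under Advanced Quantization above: 4-bit floats in the microscaling family typically use the E2M1 layout (one sign bit, two exponent bits, one mantissa bit), which can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The sketch below decodes that layout in plain Rust; it is an illustration of the format under that assumption, not CubeCL's implementation, and the nibble packing order shown is likewise assumed.

```rust
/// Decode one FP4 value in the E2M1 layout (assumed here):
/// bit 3 = sign, bits 2..1 = exponent, bit 0 = mantissa.
fn fp4_e2m1_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0b1000 != 0 { -1.0 } else { 1.0 };
    let exp = (bits >> 1) & 0b11;
    let man = (bits & 0b1) as f32;
    let magnitude = if exp == 0 {
        man * 0.5 // subnormal: 0.0 or 0.5
    } else {
        (1.0 + man * 0.5) * 2.0f32.powi(exp as i32 - 1)
    };
    sign * magnitude
}

/// Two FP4 values packed per byte, low nibble first (packing order
/// assumed for illustration; the real layout is backend-defined).
fn unpack_fp4_pair(byte: u8) -> (f32, f32) {
    (fp4_e2m1_to_f32(byte & 0x0F), fp4_e2m1_to_f32(byte >> 4))
}

fn main() {
    // 0b0111 = +6.0 (largest normal), 0b1001 = -0.5 (negative subnormal).
    assert_eq!(unpack_fp4_pair(0b1001_0111), (6.0, -0.5));
    println!("FP4 decode checks passed");
}
```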
Performance Improvements
- Optimized Line Sizes: Unrolled line sizes for matmul, convolution, reduce, and attention operations (@wingertge, [#918]); see the line-size sketch after this list
- Memory Management: Refactored memory management API and static memory pool (@wingertge, @nathanielsimard, [#800], [#787])
- Device Locking: Improved device management and CUDA device change optimization (@nathanielsimard, [#959], [#864])
- Reusable Shared Memory: Enhanced shared memory management (@wingertge, [#931])
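On the line-size work above: a line size in CubeCL is the vectorization width a kernel loads and stores per access. The main constraint is that the width must be supported by the target and must evenly divide the contiguous innermost dimension; the helper below is a hypothetical sketch of that selection rule, not the library's actual API.

```rust
/// Pick the widest supported line size (vector width) that evenly
/// divides the innermost, contiguous dimension of a tensor.
/// Hypothetical helper for illustration, not CubeCL's API.
fn pick_line_size(supported: &[u8], inner_dim: usize) -> u8 {
    supported
        .iter()
        .copied()
        .filter(|&width| inner_dim % width as usize == 0)
        .max()
        .unwrap_or(1)
}

fn main() {
    // A 12-wide innermost dim cannot use vec8, so vec4 is chosen.
    assert_eq!(pick_line_size(&[1, 2, 4, 8], 12), 4);
    // A 16-wide dim can use the full vec8.
    assert_eq!(pick_line_size(&[1, 2, 4, 8], 16), 8);
}
```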
Breaking Changes
- CUDA 12.8 Default: Bumped default CUDA version to 12.8 with new feature implementations (@wingertge, [#820])
- Item Rework: Refactored item handling system (@wingertge, [#844])
Refactoring
- Matmul Restructuring: Extensive refactoring of matmul components including inputs, tile operations, generics, and stage memory configuration (@wingertge, @louisfd, [#949], [#886], [#819], [#795], [#794]); see the layout sketch after this list
- Launch System: Refactored launch mechanism (@wingertge, [#944])
- Stage and Global Writers: Improved writer architecture (@wingertge, [#924])
- Runtime Features: Split and reorganized runtime traits (@wingertge, [#883], [#868])
- Convolution: Refactored convolution implementation (@wingertge, [#822])
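The views-and-layouts system underpinning the matmul restructuring separates a tensor's logical coordinates from its physical memory offsets, so the same kernel can read row-major, transposed, or otherwise strided data. The core mechanism is plain stride arithmetic; below is a minimal sketch with hypothetical names, not CubeCL's actual types.

```rust
/// A strided 2D layout: maps a logical (row, col) coordinate to a
/// linear memory offset. Illustrative sketch only.
struct Strided2d {
    row_stride: usize,
    col_stride: usize,
}

impl Strided2d {
    fn row_major(cols: usize) -> Self {
        Self { row_stride: cols, col_stride: 1 }
    }

    /// A transposed view is the same data with swapped strides: no copy.
    fn transposed(&self) -> Self {
        Self { row_stride: self.col_stride, col_stride: self.row_stride }
    }

    fn offset(&self, row: usize, col: usize) -> usize {
        row * self.row_stride + col * self.col_stride
    }
}

fn main() {
    let a = Strided2d::row_major(4); // a 3x4 row-major matrix
    assert_eq!(a.offset(2, 1), 9);
    // Same element reached through the transposed view of the same buffer.
    assert_eq!(a.transposed().offset(1, 2), 9);
}
```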
Bug Fixes
- Quantized Matmul: Fixed quant matmul line sizes and packed matmul issues (@wingertge, [#978], [#967])
- Tensor Operations: Corrected tensor shapes in reduce operations and fixed reverse sequence mutation (@TsaoLun, @wingertge, [#976], [#957])
- Metal Backend: Fixed plane operations on Metal (@louisfd, [#964])
- WGPU Improvements: Fixed async readback, multi-stream support, and out-of-bounds writes (@ArthurBrussee, @nathanielsimard, [#925], [#912], [#961])
- Broadcasting: Fixed broadcasting issues in compare ops and binary operations (@wingertge, [#916], [#895]); see the broadcast-rule sketch after this list
- WGSL Fixes: Corrected scalar declarations, vec-to-scalar casts, and boolean logic (@wingertge, @Cielbird, [#818], [#808], [#840])
- Profiling: Resolved profiling deadlock (@nathanielsimard, [#963])
- Type Conversions: Fixed packed FP4 casting and comparison vectorization (@wingertge, @laggui, [#890], [#858])
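On the broadcasting fixes: binary and compare ops follow the usual trailing-dimension broadcast rule, where shapes are aligned from the last dimension and each pair of dims must be equal or have one side equal to 1. The function below is a generic illustration of that rule, not CubeCL's implementation.

```rust
/// Standard trailing-dimension broadcast: align shapes from the last dim;
/// each dim pair must match or one side must be 1. Generic illustration.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let (mut ia, mut ib) = (a.iter().rev(), b.iter().rev());
    let mut out = Vec::new();
    loop {
        match (ia.next(), ib.next()) {
            (None, None) => break,
            (da, db) => {
                // A missing dimension broadcasts as if it were 1.
                let (da, db) = (*da.unwrap_or(&1), *db.unwrap_or(&1));
                if da == db || da == 1 || db == 1 {
                    out.push(da.max(db));
                } else {
                    return None; // incompatible shapes
                }
            }
        }
    }
    out.reverse();
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shapes(&[8, 1, 6], &[7, 6]), Some(vec![8, 7, 6]));
    assert_eq!(broadcast_shapes(&[3, 4], &[2, 4]), None); // 3 vs 2: invalid
}
```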
Infrastructure
- WGPU 26: Upgraded to wgpu version 26 (@janhohenheim, [#850])
- Vulkan/rspirv Fork: Forked and integrated Vulkan/rspirv (@wingertge, [#880])
- SPIRV Dump: Auto-enable spirv-dump when output path is set during build (@wingertge, [#928])
- Deterministic Hashing: Made hash generation deterministic (@wingertge, [#948]); see the hashing sketch after this list
- No-std Support: Added no-std compatibility for cubecl-quant (@laggui, [#911], [#812])
- Streaming Logger: Added streaming logger and configuration (@nathanielsimard, [#917])
- Build Improvements: Enhanced CUDA version selection with build scripts (@wingertge, [#856])
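On deterministic hashing: Rust's HashMap randomizes its seed (and therefore its iteration order) per process, so naively hashing map-backed state such as compile options produces a different digest on every run, which defeats caching keyed on that digest. The usual fix is to impose an order before hashing; the sketch below is a generic illustration under that assumption, not the actual cubecl change.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Feed map entries to the hasher in sorted key order, so the digest
/// does not depend on HashMap's randomized iteration order.
/// Generic sketch, not the actual cubecl change.
fn hash_map_deterministically<H: Hasher>(map: &HashMap<String, String>, state: &mut H) {
    let mut entries: Vec<_> = map.iter().collect();
    entries.sort(); // (&String, &String) pairs order by key, then value
    for (key, value) in entries {
        key.hash(state);
        value.hash(state);
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;

    let mut options = HashMap::new();
    options.insert("unroll".to_string(), "true".to_string());
    options.insert("line_size".to_string(), "4".to_string());

    let mut hasher = DefaultHasher::new();
    hash_map_deterministically(&options, &mut hasher);
    // Stable across runs of the same binary, regardless of insertion order.
    println!("kernel config digest: {:x}", hasher.finish());
}
```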
Documentation
- Book Updates: Various improvements to documentation (@louisfd, [#977])
- Getting Started: Fixed GpuTensor examples (@ChosunOne, [#852])
Platform Support
- CPU on All OSes: Enabled cubecl-cpu on all operating systems (@syl20bnr, [#897])
- WebGPU/WASM: Fixed WebGPU and WASM support (@ArthurBrussee, [#824], [#908])
- HIP Updates: Updated HIP backend with wmma compiler refactoring (@nathanielsimard, [#975], [#789])