Summary
CubeCL 0.8.0 introduces major enhancements to quantization and matrix operations, a near-complete flash attention implementation, and comprehensive matmul refactoring built on a new views and layouts system. This release also brings a new MLIR-based CPU backend with LLVM, improved memory management with multi-stream support, and persistent storage capabilities.
What's New
Features
- Flash Attention: Full implementation with masking support, partitions, row-wise reductions, and multi-plane operations (@louisfd, [#845], [#962], [#902], [#920], [#907])
- MLIR CPU Backend: Initial implementation providing CPU runtime support for non-Linux systems (@marcantoinem, [#698], [#790])
- Advanced Quantization: Block-scaled MMA, global quantization for matmul, quantized views, and support for FP4/FP2 formats (@wingertge, @nathanielsimard, [#815], [#960], [#954], [#836], [#809]); see the FP4 sketch after this list
- Persistent Memory: Added persistent storage capabilities for artifacts (@nathanielsimard, [#947])
- Multi-Stream Support: Implemented multi-stream processing for WGPU and CUDA (@nathanielsimard, [#914], [#896])
- Tensor Memory Arrays (TMA): Added TMA views for optimized memory access (@wingertge, [#943])
- Pinned Memory: Support for pinned memory allocations (@nathanielsimard, [#885])
- Manual MMA Operations: Added manually managed MMA operations with custom tile support (@wingertge, [#935], [#810])
- Stacked and Tensor Layouts: New layout system for matmul and advanced tensor operations (@wingertge, [#855], [#835], [#839])
- Saturating Arithmetic: Added saturating add/sub operations (@wingertge, [#898])
- Shuffle Operations: Basic support for shuffle operations (@huy209vn, [#968])
- Additional Ops: `Trunc`, `IsNan`, `IsInf`, and `powi` for CUDA/HIP (@mooori, @laggui, @wingertge, [#956], [#937], [#857])
- Partition Scheduler: New scheduling system for shared memory reads in matmul (@louisfd, [#837])
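For context on the FP4 support listed under Advanced Quantization above: 4-bit floats in the microscaling family typically use the E2M1 layout (one sign bit, two exponent bits, one mantissa bit), which can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The sketch below decodes that layout in plain Rust; it is an illustration of the format under that assumption, not CubeCL's implementation, and the nibble packing order shown is likewise assumed.

```rust
/// Decode one FP4 value in the E2M1 layout (assumed here):
/// bit 3 = sign, bits 2..1 = exponent, bit 0 = mantissa.
fn fp4_e2m1_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0b1000 != 0 { -1.0 } else { 1.0 };
    let exp = (bits >> 1) & 0b11;
    let man = (bits & 0b1) as f32;
    let magnitude = if exp == 0 {
        man * 0.5 // subnormal: 0.0 or 0.5
    } else {
        (1.0 + man * 0.5) * 2.0f32.powi(exp as i32 - 1)
    };
    sign * magnitude
}

/// Two FP4 values packed per byte, low nibble first (packing order
/// assumed for illustration; the real layout is backend-defined).
fn unpack_fp4_pair(byte: u8) -> (f32, f32) {
    (fp4_e2m1_to_f32(byte & 0x0F), fp4_e2m1_to_f32(byte >> 4))
}

fn main() {
    // 0b0111 = +6.0 (largest normal), 0b1001 = -0.5 (negative subnormal).
    assert_eq!(unpack_fp4_pair(0b1001_0111), (6.0, -0.5));
    println!("FP4 decode checks passed");
}
```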
Performance Improvements
- Optimized Line Sizes: Unrolled line sizes for matmul, convolution, reduce, and attention operations (@wingertge, [#918]); see the line-size sketch after this list
- Memory Management: Refactored memory management API and static memory pool (@wingertge, @nathanielsimard, [#800], [#787])
- Device Locking: Improved device management and CUDA device change optimization (@nathanielsimard, [#959], [#864])
- Reusable Shared Memory: Enhanced shared memory management (@wingertge, [#931])
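On the line-size work above: a line size in CubeCL is the vectorization width a kernel loads and stores per access. The main constraint is that the width must be supported by the target and must evenly divide the contiguous innermost dimension; the helper below is a hypothetical sketch of that selection rule, not the library's actual API.

```rust
/// Pick the widest supported line size (vector width) that evenly
/// divides the innermost, contiguous dimension of a tensor.
/// Hypothetical helper for illustration, not CubeCL's API.
fn pick_line_size(supported: &[u8], inner_dim: usize) -> u8 {
    supported
        .iter()
        .copied()
        .filter(|&width| inner_dim % width as usize == 0)
        .max()
        .unwrap_or(1)
}

fn main() {
    // A 12-wide innermost dim cannot use vec8, so vec4 is chosen.
    assert_eq!(pick_line_size(&[1, 2, 4, 8], 12), 4);
    // A 16-wide dim can use the full vec8.
    assert_eq!(pick_line_size(&[1, 2, 4, 8], 16), 8);
}
```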
Breaking Changes
- CUDA 12.8 Default: Bumped default CUDA version to 12.8 with new feature implementations (@wingertge, [#820])
- Item Rework: Refactored item handling system (@wingertge, [#844])
Refactoring
- Matmul Restructuring: Extensive refactoring of matmul components including inputs, tile operations, generics, and stage memory configuration (@wingertge, @louisfd, [#949], [#886], [#819], [#795], [#794]); see the layout sketch after this list
- Launch System: Refactored launch mechanism (@wingertge, [#944])
- Stage and Global Writers: Improved writer architecture (@wingertge, [#924])
- Runtime Features: Split and reorganized runtime traits (@wingertge, [#883], [#868])
- Convolution: Refactored convolution implementation (@wingertge, [#822])
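The views-and-layouts system underpinning the matmul restructuring separates a tensor's logical coordinates from its physical memory offsets, so the same kernel can read row-major, transposed, or otherwise strided data. The core mechanism is plain stride arithmetic; below is a minimal sketch with hypothetical names, not CubeCL's actual types.

```rust
/// A strided 2D layout: maps a logical (row, col) coordinate to a
/// linear memory offset. Illustrative sketch only.
struct Strided2d {
    row_stride: usize,
    col_stride: usize,
}

impl Strided2d {
    fn row_major(cols: usize) -> Self {
        Self { row_stride: cols, col_stride: 1 }
    }

    /// A transposed view is the same data with swapped strides: no copy.
    fn transposed(&self) -> Self {
        Self { row_stride: self.col_stride, col_stride: self.row_stride }
    }

    fn offset(&self, row: usize, col: usize) -> usize {
        row * self.row_stride + col * self.col_stride
    }
}

fn main() {
    let a = Strided2d::row_major(4); // a 3x4 row-major matrix
    assert_eq!(a.offset(2, 1), 9);
    // Same element reached through the transposed view of the same buffer.
    assert_eq!(a.transposed().offset(1, 2), 9);
}
```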
Bug Fixes
- Quantized Matmul: Fixed quant matmul line sizes and packed matmul issues (@wingertge, [#978], [#967])
- Tensor Operations: Corrected tensor shapes in reduce operations and fixed reverse sequence mutation (@TsaoLun, @wingertge, [#976], [#957])
- Metal Backend: Fixed plane operations on Metal (@louisfd, [#964])
- WGPU Improvements: Fixed async readback, multi-stream support, and out-of-bounds writes (@ArthurBrussee, @nathanielsimard, [#925], [#912], [#961])
- Broadcasting: Fixed broadcasting issues in compare ops and binary operations (@wingertge, [#916], [#895]); see the broadcast-rule sketch after this list
- WGSL Fixes: Corrected scalar declarations, vec-to-scalar casts, and boolean logic (@wingertge, @Cielbird, [#818], [#808], [#840])
- Profiling: Resolved profiling deadlock (@nathanielsimard, [#963])
- Type Conversions: Fixed packed FP4 casting and comparison vectorization (@wingertge, @laggui, [#890], [#858])
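On the broadcasting fixes: binary and compare ops follow the usual trailing-dimension broadcast rule, where shapes are aligned from the last dimension and each pair of dims must be equal or have one side equal to 1. The function below is a generic illustration of that rule, not CubeCL's implementation.

```rust
/// Standard trailing-dimension broadcast: align shapes from the last dim;
/// each dim pair must match or one side must be 1. Generic illustration.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let (mut ia, mut ib) = (a.iter().rev(), b.iter().rev());
    let mut out = Vec::new();
    loop {
        match (ia.next(), ib.next()) {
            (None, None) => break,
            (da, db) => {
                // A missing dimension broadcasts as if it were 1.
                let (da, db) = (*da.unwrap_or(&1), *db.unwrap_or(&1));
                if da == db || da == 1 || db == 1 {
                    out.push(da.max(db));
                } else {
                    return None; // incompatible shapes
                }
            }
        }
    }
    out.reverse();
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shapes(&[8, 1, 6], &[7, 6]), Some(vec![8, 7, 6]));
    assert_eq!(broadcast_shapes(&[3, 4], &[2, 4]), None); // 3 vs 2: invalid
}
```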
Infrastructure
- WGPU 26: Upgraded to wgpu version 26 (@janhohenheim, [#850])
- Vulkan/rspirv Fork: Forked and integrated Vulkan/rspirv (@wingertge, [#880])
- SPIRV Dump: Auto-enable spirv-dump when output path is set during build (@wingertge, [#928])
- Deterministic Hashing: Made hash generation deterministic (@wingertge, [#948]); see the hashing sketch after this list
- No-std Support: Added no-std compatibility for cubecl-quant (@laggui, [#911], [#812])
- Streaming Logger: Added streaming logger and configuration (@nathanielsimard, [#917])
- Build Improvements: Enhanced CUDA version selection with build scripts (@wingertge, [#856])
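On deterministic hashing: Rust's HashMap randomizes its seed (and therefore its iteration order) per process, so naively hashing map-backed state such as compile options produces a different digest on every run, which defeats caching keyed on that digest. The usual fix is to impose an order before hashing; the sketch below is a generic illustration under that assumption, not the actual cubecl change.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Feed map entries to the hasher in sorted key order, so the digest
/// does not depend on HashMap's randomized iteration order.
/// Generic sketch, not the actual cubecl change.
fn hash_map_deterministically<H: Hasher>(map: &HashMap<String, String>, state: &mut H) {
    let mut entries: Vec<_> = map.iter().collect();
    entries.sort(); // (&String, &String) pairs order by key, then value
    for (key, value) in entries {
        key.hash(state);
        value.hash(state);
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;

    let mut options = HashMap::new();
    options.insert("unroll".to_string(), "true".to_string());
    options.insert("line_size".to_string(), "4".to_string());

    let mut hasher = DefaultHasher::new();
    hash_map_deterministically(&options, &mut hasher);
    // Stable across runs of the same binary, regardless of insertion order.
    println!("kernel config digest: {:x}", hasher.finish());
}
```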
Documentation
- Book Updates: Various improvements to documentation (@louisfd, [#977])
- Getting Started: Fixed GpuTensor examples (@ChosunOne, [#852])
Platform Support
- CPU on All OSes: Enabled cubecl-cpu on all operating systems (@syl20bnr, [#897])
- WebGPU/WASM: Fixed WebGPU and WASM support (@ArthurBrussee, [#824], [#908])
- HIP Updates: Updated HIP backend with wmma compiler refactoring (@nathanielsimard, [#975], [#789])