CubeCL v0.8.0

Name                        Modified     Size
README.md                   2025-10-28   5.3 kB
v0.8.0 source code.tar.gz   2025-10-28   1.6 MB
v0.8.0 source code.zip      2025-10-28   2.2 MB
Total: 3 items, 3.8 MB

Summary

CubeCL 0.8.0 introduces major enhancements to quantization and matrix operations, a near-complete flash attention implementation, and a comprehensive matmul refactoring built on a new views and layouts system. The release also brings a new MLIR-based CPU backend built on LLVM, improved memory management with multi-stream support, and persistent storage capabilities.

What's New

Features

  • Flash Attention: Full implementation with masking support, partitions, row-wise reductions, and multi-plane operations; the row-wise reduction trick is sketched after this list (@louisfd, [#845], [#962], [#902], [#920], [#907])
  • MLIR CPU Backend: Initial implementation providing CPU runtime support for non-Linux systems (@marcantoinem, [#698], [#790])
  • Advanced Quantization: Block-scaled MMA, global quantization for matmul, quantized views, and support for FP4/FP2 formats; block scaling is sketched after this list (@wingertge, @nathanielsimard, [#815], [#960], [#954], [#836], [#809])
  • Persistent Memory: Added persistent storage capabilities for artifacts (@nathanielsimard, [#947])
  • Multi-Stream Support: Implemented multi-stream processing for WGPU and CUDA (@nathanielsimard, [#914], [#896])
  • Tensor Memory Arrays (TMA): Added TMA views for optimized memory access (@wingertge, [#943])
  • Pinned Memory: Support for pinned memory allocations (@nathanielsimard, [#885])
  • Manual MMA Operations: Added manually managed MMA operations with custom tile support (@wingertge, [#935], [#810])
  • Stacked and Tensor Layouts: New layout system for matmul and advanced tensor operations (@wingertge, [#855], [#835], [#839])
  • Saturating Arithmetic: Added saturating add/sub operations (@wingertge, [#898])
  • Shuffle Operations: Basic shuffle operations support (@huy209vn, [#968])
  • Additional Ops: Trunc, IsNan, IsInf, and powi for CUDA/HIP (@mooori, @laggui, @wingertge, [#956], [#937], [#857])
  • Partition Scheduler: New scheduling system for shared memory reads in Matmul (@louisfd, [#837])
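
The row-wise reductions behind flash attention come down to an online softmax: a running maximum and a running normalizer are updated one tile at a time, and everything accumulated so far is rescaled whenever a new tile raises the maximum. The sketch below shows that update rule in plain Rust; it is not CubeCL code, and the function names and tile size are made up for the example.

    // Streaming ("online") softmax over a row of attention scores, processed
    // tile by tile. `m` is the running row maximum and `l` the running
    // normalizer; both are corrected by exp(m_old - m_new) whenever a new
    // tile raises the maximum.
    fn online_softmax(scores: &[f32], tile: usize) -> Vec<f32> {
        let mut m = f32::NEG_INFINITY; // running maximum
        let mut l = 0.0f32;            // running sum of exp(score - m)
        let mut out: Vec<f32> = Vec::with_capacity(scores.len());

        for chunk in scores.chunks(tile) {
            let m_new = chunk.iter().copied().fold(m, f32::max);
            let correction = (m - m_new).exp();
            // Rescale everything accumulated so far to the new maximum.
            l *= correction;
            out.iter_mut().for_each(|v| *v *= correction);
            for &s in chunk {
                let e = (s - m_new).exp();
                l += e;
                out.push(e);
            }
            m = m_new;
        }
        // Single normalization at the end; matches a two-pass softmax.
        out.into_iter().map(|v| v / l).collect()
    }

    fn main() {
        let scores: [f32; 7] = [1.0, 3.0, -2.0, 0.5, 4.0, 4.0, -1.0];
        let p = online_softmax(&scores, 3);
        println!("{p:?} (sum = {})", p.iter().sum::<f32>());
    }

In the real kernel the same correction factor is applied to a fixed-size output accumulator rather than to the full probability vector, which is what lets the scores stay in on-chip memory.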
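
Block-scaled quantization stores one scale per fixed-size block of values rather than a single scale for the whole tensor, so an outlier only affects the precision of its own block. Below is a minimal plain-Rust sketch of symmetric int8 block quantization; the names and block size are illustrative and none of this is CubeCL API.

    // Symmetric int8 quantization with one scale per fixed-size block.
    fn quantize_blocks(values: &[f32], block: usize) -> (Vec<i8>, Vec<f32>) {
        let mut q = Vec::with_capacity(values.len());
        let mut scales = Vec::new();
        for chunk in values.chunks(block) {
            let max_abs = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            scales.push(scale);
            q.extend(chunk.iter().map(|&v| (v / scale).round() as i8));
        }
        (q, scales)
    }

    // Reverse mapping: each block is rescaled by its own factor.
    fn dequantize_blocks(q: &[i8], scales: &[f32], block: usize) -> Vec<f32> {
        q.chunks(block)
            .zip(scales)
            .flat_map(|(chunk, &s)| chunk.iter().map(move |&v| v as f32 * s))
            .collect()
    }

    fn main() {
        let x: [f32; 8] = [0.02, -1.5, 0.75, 3.2, -0.1, 0.0, 2.0, -2.0];
        let (q, scales) = quantize_blocks(&x, 4);
        println!("q = {q:?}, scales = {scales:?}");
        println!("reconstructed = {:?}", dequantize_blocks(&q, &scales, 4));
    }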

Performance Improvements

  • Optimized Line Sizes: Unrolled line sizes for matmul, convolution, reduce, and attention operations (@wingertge, [#918])
  • Memory Management: Refactored memory management API and static memory pool (@wingertge, @nathanielsimard, [#800], [#787])
  • Device Locking: Improved device management and CUDA device change optimization (@nathanielsimard, [#959], [#864])
  • Reusable Shared Memory: Enhanced shared memory management (@wingertge, [#931])

Breaking Changes

  • CUDA 12.8 Default: Bumped default CUDA version to 12.8 with new feature implementations (@wingertge, [#820])
  • Item Rework: Refactored item handling system (@wingertge, [#844])

Refactoring

  • Matmul Restructuring: Extensive refactoring of matmul components including inputs, tile operations, generics, and stage memory configuration (@wingertge, @louisfd, [#949], [#886], [#819], [#795], [#794])
  • Launch System: Refactored launch mechanism (@wingertge, [#944])
  • Stage and Global Writers: Improved writer architecture (@wingertge, [#924])
  • Runtime Features: Split and reorganized runtime traits (@wingertge, [#883], [#868])
  • Convolution: Refactored convolution implementation (@wingertge, [#822])

Bug Fixes

  • Quantized Matmul: Fixed quant matmul line sizes and packed matmul issues (@wingertge, [#978], [#967])
  • Tensor Operations: Corrected tensor shapes in reduce operations and fixed reverse sequence mutation (@TsaoLun, @wingertge, [#976], [#957])
  • Metal Backend: Fixed plane operations on Metal (@louisfd, [#964])
  • WGPU Improvements: Fixed async readback, multi-stream support, and out-of-bounds writes (@ArthurBrussee, @nathanielsimard, [#925], [#912], [#961])
  • Broadcasting: Fixed broadcasting issues in compare ops and binary operations (@wingertge, [#916], [#895])
  • WGSL Fixes: Corrected scalar declarations, vec-to-scalar casts, and boolean logic (@wingertge, @Cielbird, [#818], [#808], [#840])
  • Profiling: Resolved profiling deadlock (@nathanielsimard, [#963])
  • Type Conversions: Fixed packed FP4 casting and comparison vectorization (@wingertge, @laggui, [#890], [#858])

Infrastructure

  • WGPU 26: Upgraded to wgpu version 26 (@janhohenheim, [#850])
  • Vulkan/rspirv Fork: Forked and integrated Vulkan/rspirv (@wingertge, [#880])
  • SPIRV Dump: Auto-enable spirv-dump when output path is set during build (@wingertge, [#928])
  • Deterministic Hashing: Made hash generation deterministic; see the sketch after this list (@wingertge, [#948])
  • No-std Support: Added no-std compatibility for cubecl-quant (@laggui, [#911], [#812])
  • Streaming Logger: Added streaming logger and configuration (@nathanielsimard, [#917])
  • Build Improvements: Enhanced CUDA version selection with build scripts (@wingertge, [#856])
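
Deterministic hashing matters wherever a hash doubles as a stable identifier, for example when a cache of compiled kernels is keyed by a hash of the kernel source: std's default HashMap hasher is seeded randomly per process, so the same input can hash differently on every run. As a general illustration only (not necessarily the scheme adopted in [#948]), a fixed-algorithm hash such as FNV-1a yields the same ID for the same bytes every time.

    // FNV-1a maps the same bytes to the same 64-bit value in every process,
    // unlike std's randomly seeded default HashMap hasher.
    fn fnv1a_64(bytes: &[u8]) -> u64 {
        let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
        for &b in bytes {
            hash ^= b as u64;
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
        hash
    }

    fn main() {
        let kernel_source = "fn scale(x: f32) -> f32 { x * 2.0 }";
        // The same source yields the same ID on every run.
        println!("kernel id: {:016x}", fnv1a_64(kernel_source.as_bytes()));
    }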

Documentation

  • Book Updates: Various improvements to documentation (@louisfd, [#977])
  • Getting Started: Fixed GpuTensor examples (@ChosunOne, [#852])

Platform Support

  • CPU on All OSes: Enabled cubecl-cpu on all operating systems (@syl20bnr, [#897])
  • WebGPU/WASM: Fixed WebGPU and WASM support (@ArthurBrussee, [#824], [#908])
  • HIP Updates: Updated HIP backend with wmma compiler refactoring (@nathanielsimard, [#975], [#789])

Source: README.md, updated 2025-10-28