DeepGEMM
Clean and efficient FP8 GEMM kernels with fine-grained scaling
...It supports both standard and “grouped” GEMMs; the latter are useful for architectures such as Mixture of Experts (MoE), which multiply separate segments of the input by different weight matrices. One distinguishing aspect is that DeepGEMM compiles its kernels at runtime through a lightweight Just-In-Time (JIT) module, so users do not need to precompile CUDA kernels at installation time. Despite its lean design, it includes fine-grained scaling and optimizations that draw on ideas from CUTLASS and CuTe, in a more streamlined form.
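To make the two ideas above concrete, here is a minimal pure-Python sketch of their reference semantics. It is not DeepGEMM's API and does not run on a GPU. The function names (`grouped_gemm`, `quantize_blockwise`, `scaled_dot`), the block size, and the choice of the E4M3 maximum of 448 as the scaling target are assumptions made for illustration. No real FP8 cast is performed, so only the structure of the computation is shown.

```python
# Illustrative reference semantics only; all names here are hypothetical.

FP8_E4M3_MAX = 448.0  # assumed target range: largest finite FP8 E4M3 value


def matmul(a, b):
    """Plain (rows x k) @ (k x n) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm(lhs, group_sizes, weights):
    """Multiply each contiguous row segment of lhs by its own weight matrix.

    lhs:         rows already sorted by group (e.g. tokens sorted by expert).
    group_sizes: number of rows in each group.
    weights:     one (k x n) matrix per group.
    """
    assert sum(group_sizes) == len(lhs)
    assert len(group_sizes) == len(weights)
    out, start = [], 0
    for size, w in zip(group_sizes, weights):
        out.extend(matmul(lhs[start:start + size], w))
        start += size
    return out


def quantize_blockwise(row, block):
    """Give every block of `block` values its own scale factor.

    Each block is divided by a scale chosen so that its largest magnitude
    maps to FP8_E4M3_MAX. A real kernel would cast the result to FP8 here;
    this sketch keeps plain floats.
    """
    q, scales = [], []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        q.extend(v / scale for v in chunk)
    return q, scales


def scaled_dot(qa, sa, qb, sb, block):
    """Dot product that rescales each block's partial sum by both scales."""
    total = 0.0
    for j, i in enumerate(range(0, len(qa), block)):
        partial = sum(x * y for x, y in zip(qa[i:i + block], qb[i:i + block]))
        total += sa[j] * sb[j] * partial
    return total


# Two "experts": the first two rows use the identity, the last row uses 2 * identity.
rows = [[1, 2], [3, 4], [5, 6]]
experts = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
grouped = grouped_gemm(rows, [2, 1], experts)

a = [1.0, -2.0, 4.0, 0.5]
b = [0.5, 0.25, 1.0, 8.0]
qa, sa = quantize_blockwise(a, block=2)
qb, sb = quantize_blockwise(b, block=2)
dot = scaled_dot(qa, sa, qb, sb, block=2)
```

Because each block carries its own scale, a single large value only reduces the precision available to its own block rather than to the whole row, which is the usual motivation for fine-grained scaling over one per-tensor scale.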