FlashMLA is a high-performance decoding kernel library built for Multi-Head Latent Attention (MLA) workloads on NVIDIA Hopper GPUs. It provides optimized MLA decoding kernels with support for variable-length sequences, which reduces latency and increases throughput in inference systems that use MLA. The library supports BF16 and FP16 data types and includes a paged KV cache with a block size of 64 to manage memory efficiently during decoding. On H800 SXM5 hardware it can reach up to ~660 TFLOPS in compute-bound configurations and ~3000 GB/s of memory throughput in memory-bound configurations. The maintainers update it regularly with performance improvements; for example, a 2025 update claims gains of 5% to 15% on compute-bound workloads while maintaining API compatibility.

Features

  • Decoding kernel optimized for MLA (Multi-Head Latent Attention) modules
  • Support for BF16 and FP16 precision, letting users choose between BF16's wider dynamic range and FP16's finer precision
  • Paged KV cache with a block size of 64 to handle varying sequence lengths efficiently
  • GPU-native implementation targeting NVIDIA Hopper architecture
  • Python / PyTorch integration via functions like flash_mla_with_kvcache
  • Regular performance improvements over time (e.g. a 5–15% uplift on compute-bound workloads in a 2025 update)
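
The paged KV cache in the feature list can be illustrated with a little block arithmetic. The following is a minimal pure-Python sketch, not FlashMLA's implementation: it assumes a cache split into fixed 64-token blocks and a per-sequence block table that maps logical blocks to physical ones. The helper names (blocks_needed, build_block_table, locate_token), the sequential block allocation, and the -1 padding value are all hypothetical choices made for illustration.

```python
BLOCK_SIZE = 64  # block size stated in the project description


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks required to hold seq_len tokens."""
    return (seq_len + block_size - 1) // block_size  # ceiling division


def build_block_table(seq_lens, block_size: int = BLOCK_SIZE):
    """Build one row of physical block ids per sequence.

    Assumption for this sketch: physical blocks are handed out
    sequentially, and rows are padded with -1 to a common width.
    """
    counts = [blocks_needed(n, block_size) for n in seq_lens]
    width = max(counts, default=0)
    table, next_free = [], 0
    for count in counts:
        row = list(range(next_free, next_free + count))
        next_free += count
        table.append(row + [-1] * (width - count))
    return table


def locate_token(block_table, seq_idx: int, pos: int,
                 block_size: int = BLOCK_SIZE):
    """Translate a token position into (physical block id, offset in block)."""
    physical_block = block_table[seq_idx][pos // block_size]
    return physical_block, pos % block_size


if __name__ == "__main__":
    table = build_block_table([100, 64, 1])
    print(table)
    print(locate_token(table, seq_idx=0, pos=70))
```

For sequence lengths of 100, 64, and 1 tokens, this sketch allocates 2, 1, and 1 blocks respectively, so a short sequence does not reserve space sized for the longest one.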

Categories

Libraries, AI Models

License

MIT License

Additional Project Details

Programming Language

C++

Related Categories

C++ Libraries, C++ AI Models

Registered

2025-10-03