Mooncake is an open-source infrastructure platform that optimizes large language model (LLM) serving through efficient management and transfer of model data and KV cache. It was originally developed as part of the serving infrastructure for the Kimi LLM system.

Its architecture centers on a high-performance transfer engine that provides unified data transfer across heterogeneous storage and networking technologies, enabling efficient movement of tensors and model data between GPU memory, system memory, and distributed storage systems. Mooncake also provides a distributed key-value cache store that lets inference systems reuse previously computed attention states, significantly improving throughput in large-scale deployments. The system supports high-speed cluster interconnects such as RDMA and NVMe over Fabrics (NVMe-oF).
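To illustrate the reuse of previously computed attention states, here is a minimal sketch of prefix-keyed KV cache lookup. This is a toy in-memory model of the idea, not Mooncake's actual API: the class name `PrefixKVCache` and its `put`/`longest_prefix_hit` methods are hypothetical, and the "KV state" values are plain placeholders standing in for attention tensors.

```python
import hashlib


def _prefix_key(token_ids):
    """Hash a token prefix into a stable cache key (illustrative scheme)."""
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()


class PrefixKVCache:
    """Toy in-memory stand-in for a distributed KV cache store.

    Keys are hashes of token-ID prefixes; values stand in for the
    attention KV tensors computed for that prefix.
    """

    def __init__(self):
        self._store = {}

    def put(self, token_ids, kv_state):
        """Cache the KV state computed for this exact token prefix."""
        self._store[_prefix_key(token_ids)] = kv_state

    def longest_prefix_hit(self, token_ids):
        """Return (matched_length, kv_state) for the longest cached prefix,
        or (0, None) when no prefix of the request has been seen before."""
        for end in range(len(token_ids), 0, -1):
            hit = self._store.get(_prefix_key(token_ids[:end]))
            if hit is not None:
                return end, hit
        return 0, None
```

With a cache like this, a request sharing a prompt prefix with an earlier request can skip recomputing attention for the matched tokens and only prefill the remainder, which is the source of the throughput gains described above.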
Features
- High-performance transfer engine for moving tensor data across storage layers
- Distributed KV cache storage for improving LLM inference efficiency
- Support for RDMA, TCP, and NVMe over Fabrics (NVMe-oF) data transfer protocols
- Cluster-level data sharing for checkpoints and intermediate tensors
- Infrastructure designed for large-scale LLM serving environments
- Integration with inference frameworks such as vLLM and TensorRT-LLM
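Supporting several transports means the transfer layer must decide which path to use for a given transfer. The sketch below shows one plausible selection policy; it is a hypothetical illustration, not Mooncake's actual logic, and the function `pick_transport` and its threshold parameter are invented for this example.

```python
RDMA = "rdma"
TCP = "tcp"
NVMEOF = "nvmeof"


def pick_transport(available, payload_bytes, large_threshold=1 << 20):
    """Choose a transport for one transfer (hypothetical policy).

    available: set of transport names reachable between the two endpoints.
    payload_bytes: size of the data to move.
    """
    # Prefer RDMA when a path exists: lowest latency, zero-copy.
    if RDMA in available:
        return RDMA
    # Without RDMA, stage large payloads through NVMe-oF storage.
    if NVMEOF in available and payload_bytes >= large_threshold:
        return NVMEOF
    # TCP is the universally available fallback.
    return TCP
```

A real engine would also weigh link congestion and topology, but even this simple ordering captures the idea of a unified interface hiding heterogeneous transports from the inference framework.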