A high-throughput and memory-efficient inference and serving engine
Running large language models on a single GPU
MiMo-V2-Flash: Efficient Reasoning, Coding, and Agentic Foundation
C++ library for high performance inference on NVIDIA GPUs
950-line, minimal, extensible LLM inference engine built from scratch
AI memory OS for LLM and Agent systems
Deep learning optimization library: makes distributed training easy
Fast and memory-efficient exact attention
Minimal Python framework for scalable AI inference servers fast
High-performance inference server for text embeddings models API layer
Parallax is a distributed model serving framework
Let's make video diffusion practical
Alibaba's high-performance LLM inference engine for diverse apps
Low-latency REST API for serving text-embeddings
FlashMLA: Efficient Multi-head Latent Attention Kernels
MII makes low-latency and high-throughput inference possible
Rust async runtime based on io_uring
Lemonade helps users run local LLMs with the highest performance
Large Language Model Text Generation Inference
Next Generation Agentic Proxy for AI Agents and MCP servers
Open-source large language model family from Tencent Hunyuan
Block Diffusion for Ultra-Fast Speculative Decoding
Built for demanding AI workflows
C++-based high-performance parallel environment execution engine
TensorRT LLM provides users with an easy-to-use Python API