A high-throughput and memory-efficient inference and serving engine
Running large language models on a single GPU
950-line, minimal, extensible LLM inference engine built from scratch
AI memory OS for LLM and Agent systems
Deep learning optimization library: makes distributed training easy
Fast and memory-efficient exact attention
Minimal Python framework for fast, scalable AI inference servers
High-performance inference server and API layer for text embedding models
Parallax is a distributed model serving framework
Let's make video diffusion practical
Supercharge Your LLM with the Fastest KV Cache Layer
Low-latency REST API for serving text embeddings
MII makes low-latency and high-throughput inference possible
Lemonade helps users run local LLMs with the highest performance
Large Language Model Text Generation Inference
Open-source large language model family from Tencent Hunyuan
Block Diffusion for Ultra-Fast Speculative Decoding
The Modular Platform (includes MAX & Mojo)
TensorRT LLM provides users with an easy-to-use Python API
A TTS that fits in your CPU (and pocket)
Stable Diffusion WebUI Forge is a platform on top of Stable Diffusion
Document content and metadata extraction microservice
A simple, performant and scalable Jax LLM
Official inference framework for 1-bit LLMs
Accurate × Fast × Comprehensive