A high-throughput and memory-efficient inference and serving engine
Running large language models on a single GPU
950-line, minimal, extensible LLM inference engine built from scratch
AI memory OS for LLM and Agent systems
A new kind of Progress Bar, with real-time throughput and ETA
Deep learning optimization library: makes distributed training easy
Fast and memory-efficient exact attention
Minimal Python framework for fast, scalable AI inference servers
High-performance inference server and API layer for text-embedding models
Parallax is a distributed model serving framework
The async Python driver for MongoDB and Tornado or asyncio
CoreNet: A library for training deep neural networks
Let's make video diffusion practical
Supercharge Your LLM with the Fastest KV Cache Layer
DeepEP: an efficient expert-parallel communication library
Low-latency REST API for serving text-embeddings
MII makes low-latency and high-throughput inference possible
A Python tool that lets you search and download torrents
Lemonade helps users run local LLMs with the highest performance
Large Language Model Text Generation Inference
Open-source large language model family from Tencent Hunyuan
Block Diffusion for Ultra-Fast Speculative Decoding
The Modular Platform (includes MAX & Mojo)
TensorRT LLM provides users with an easy-to-use Python API
A TTS that fits in your CPU (and pocket)