Redundancy-aware KV Cache Compression for Reasoning Models
Semantic cache for LLMs. Fully integrated with LangChain
Unified KV Cache Compression Methods for Auto-Regressive Models
Cache-Augmented Generation: A Simple, Efficient Alternative to RAG
Supercharge Your LLM with the Fastest KV Cache Layer
Mooncake is the serving platform for Kimi
UCCL is an efficient communication library for GPUs
High-performance Inference and Deployment Toolkit for LLMs and VLMs
FlashMLA: Efficient Multi-head Latent Attention Kernels
Java wrapper for the popular chat & VoIP service
An open-source AI agent that brings the power of Grok
RGBD video generation model conditioned on camera input
Bring the notion of Model-as-a-Service to life
Claude Code, but it runs on your Mac for free
A course on LLM inference serving on Apple Silicon
TensorRT-LLM provides users with an easy-to-use Python API
Fully private LLM chatbot that runs entirely in the browser
Unofficial .NET client for ChatGPT
A Model Context Protocol (MCP) Gateway & Registry
A simple AI program for playing tic-tac-toe
VITS2 backbone with multilingual-bert
Calculate token/s & GPU memory requirement for any LLM
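The kind of back-of-envelope arithmetic such a calculator performs can be sketched as follows. This is a generic weights-only estimate under stated assumptions, not the tool's actual implementation; the function name and parameters are illustrative:

```python
def estimate_weight_memory_gb(num_params_b: float, bytes_per_param: float = 2) -> float:
    """Rough GPU memory (GB) needed just to hold model weights.

    num_params_b: parameter count in billions (e.g. 7 for a 7B model).
    bytes_per_param: 2 for fp16/bf16, 1 for int8, 0.5 for 4-bit quantization.

    Billions of parameters times bytes per parameter gives gigabytes directly.
    A fuller estimator would also add KV cache and activation overhead,
    which grow with batch size and context length.
    """
    return num_params_b * bytes_per_param

# A 7B model in fp16 needs roughly 14 GB for the weights alone.
print(estimate_weight_memory_gb(7))        # fp16
print(estimate_weight_memory_gb(7, 0.5))   # 4-bit quantized
```

Quantization scales this linearly: the same 7B model at 4 bits per weight fits in roughly 3.5 GB, which is why quantized variants run on consumer GPUs.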
A reactive runtime for building durable AI agents
A web UI for various audio-related neural networks
Snipe Chan is a Discord Bot that snipes deleted/edited messages