A high-throughput and memory-efficient inference and serving engine
950-line, minimal, extensible LLM inference engine built from scratch
A high-performance inference engine for AI models
TokenSpeed is a speed-of-light LLM inference engine
Jlama is a modern LLM inference engine for Java
A lightweight vLLM implementation built from scratch
Alibaba's high-performance LLM inference engine for diverse apps
High-performance inference framework for large language models
LLM inference in C/C++
Fast Multimodal LLM on Mobile Devices
Universal LLM Deployment Engine with ML Compilation
Mooncake is the serving platform for Kimi
Fast, flexible LLM inference
LightLLM is a Python-based LLM (Large Language Model) inference
Run a 1-billion parameter LLM on a $10 board with 256MB RAM
WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
Parallax is a distributed model serving framework
High-speed Large Language Model Serving for Local Deployment
Tensor search for humans
Inference Llama 2 in one file of pure C
Fully private LLM chatbot that runs entirely in a browser
Masks sensitive data and secrets before they reach AI
Run AI models locally on your machine with Node.js bindings for llama
Llama 2 Everywhere (L2E)