Replace OpenAI GPT with another LLM in your app
A high-throughput and memory-efficient inference and serving engine
Port of Facebook's LLaMA model in C/C++
C#/.NET binding of llama.cpp, including LLaMa/GPT model inference
A 950-line, minimal, extensible LLM inference engine built from scratch
C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)
A lightweight vLLM implementation built from scratch
AirLLM: 70B model inference on a single 4GB GPU
Inference Llama 2 in one file of pure C
A high-performance inference engine for AI models
High-performance inference framework for large language models
High-performance inference and deployment toolkit for LLMs and VLMs
Low-latency REST API for serving text-embeddings
A course on LLM inference serving on Apple Silicon
An Easy-to-Use and High-Performance AI Deployment Framework
GLM-4.5: Open-source LLM for intelligent agents by Z.ai
Ling is a MoE LLM provided and open-sourced by InclusionAI
Qwen3 is the large language model series developed by Qwen team
State-of-the-art Parameter-Efficient Fine-Tuning
Jlama is a modern LLM inference engine for Java
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework
Accelerate local LLM inference and finetuning
WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
Operating LLMs in production