Replace OpenAI GPT with another LLM in your app
A high-throughput and memory-efficient inference and serving engine for LLMs (see the vLLM sketch after this list)
High-performance inference framework for large language models
High-performance Inference and Deployment Toolkit for LLMs and VLMs
Low-latency REST API for serving text embeddings
Inference Llama 2 in one file of pure C
AirLLM 70B inference with a single 4GB GPU
A 950-line, minimal, extensible LLM inference engine built from scratch
Performance-optimized AI inference on your GPUs
Parallax is a distributed model serving framework
Ling is a MoE LLM provided and open-sourced by InclusionAI
GLM-4.5: Open-source LLM for intelligent agents by Z.ai
Operating LLMs in production
Accessible large language models via k-bit quantization for PyTorch (see the 4-bit loading sketch after this list)
A lightweight vLLM implementation built from scratch
Accelerate local LLM inference and finetuning
Phi-3.5 for Mac: Locally-run Vision and Language Models
PyTorch library of curated Transformer models and their components
Qwen3 is the large language model series developed by the Qwen team
Run Local LLMs on Any Device. Open-source and available for commercial use
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance
A high-performance ML model serving framework that offers dynamic batching
State-of-the-art Parameter-Efficient Fine-Tuning (see the LoRA sketch after this list)
Synthetic data curation for post-training and data extraction
Technical principles related to large models
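
As a concrete illustration of the serving engines listed above, here is a minimal sketch of offline batch inference with vLLM's Python API. The checkpoint name is a placeholder chosen for illustration; any Hugging Face causal LM that vLLM supports should work.

```python
# Minimal offline-inference sketch using vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # placeholder checkpoint, small for illustration
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, out.outputs[0].text)
```

vLLM batches and schedules the prompts internally, which is where the high-throughput, memory-efficient behavior in the tagline comes from.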
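The k-bit quantization entry refers to loading model weights at reduced precision. A minimal sketch, assuming the bitsandbytes integration in Hugging Face Transformers and a CUDA GPU; the checkpoint name is a placeholder.

```python
# Sketch: load a causal LM with 4-bit (NF4) quantized weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for compute
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```

Quantizing to 4 bits cuts weight memory by roughly 4x versus fp16, which is what makes large models accessible on a single consumer GPU.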
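For the Parameter-Efficient Fine-Tuning entry, a minimal LoRA sketch using the PEFT library; the checkpoint and target module names are placeholders typical of Llama-style attention layers and should be adjusted per architecture.

```python
# Sketch: wrap a base model with a LoRA adapter so only adapter weights train.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
lora = LoraConfig(
    r=8,                                  # low-rank dimension of the adapter
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because the base weights stay frozen, the trainable parameter count drops to a fraction of a percent of the full model, which is the core idea behind parameter-efficient fine-tuning.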