Replace OpenAI GPT with another LLM in your app
Official inference library for Mistral models
The Triton Inference Server provides an optimized cloud and edge inferencing solution
Large Language Model Text Generation Inference
Library for serving Transformers models on Amazon SageMaker
A high-throughput and memory-efficient inference and serving engine for LLMs
C++ library for high-performance inference on NVIDIA GPUs
Optimizing inference proxy for LLMs
FlashInfer: Kernel Library for LLM Serving
Port of Facebook's LLaMA model in C/C++
A general-purpose probabilistic programming system
ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator
Bayesian inference with probabilistic programming
Port of OpenAI's Whisper model in C/C++
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models
Lightweight, standalone C++ inference engine for Google's Gemma models
Deep learning optimization library: makes distributed training easy, efficient, and effective
C#/.NET binding of llama.cpp, including LLaMa/GPT model inference
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Standardized Serverless ML Inference Platform on Kubernetes
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed
A high-performance inference system for large language models
Ready-to-use OCR with 80+ supported languages
On-device Speech Recognition for Apple Silicon
Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
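Several of the serving engines above (vLLM, Text Generation Inference, and OpenAI-replacement proxies) can expose an OpenAI-compatible `/v1/chat/completions` HTTP API, which is what makes "replace OpenAI GPT with another LLM" a drop-in change. A minimal client sketch, using only the standard library; the base URL and model name are assumptions for a locally running server, not guaranteed defaults:

```python
import json
import urllib.request

# Assumed local endpoint; the actual host/port depends on how the
# serving engine was launched.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(payload: dict) -> dict:
    """POST the payload to the server (requires a running server)."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Model name is illustrative; use whatever model the server loaded.
payload = build_chat_request("mistralai/Mistral-7B-Instruct-v0.2", "Hello!")
print(payload["messages"][0]["role"])  # → user
```

Because the payload shape matches the OpenAI API, swapping providers typically means changing only `BASE_URL` and the model identifier.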