Replace OpenAI GPT with another LLM in your app (a client sketch for this pattern follows this list)
Official inference library for Mistral models
Large Language Model Text Generation Inference
High-performance inference server and API layer for text embedding models
Library for serving Transformers models on Amazon SageMaker
A high-throughput and memory-efficient inference and serving engine for LLMs
C++ library for high-performance inference on NVIDIA GPUs
FlashInfer: Kernel Library for LLM Serving
Optimizing inference proxy for LLMs
Port of Facebook's LLaMA model in C/C++
A general-purpose probabilistic programming system
Deep learning optimization library that makes distributed training and inference easy
ONNX Runtime: cross-platform, high performance ML inferencing
Bayesian inference with probabilistic programming
Lightweight, standalone C++ inference engine for Google's Gemma models
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
MII makes low-latency and high-throughput inference possible
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models
High-performance inference framework for large language models
Port of OpenAI's Whisper model in C/C++
AlphaFold 3 inference pipeline
C#/.NET binding of llama.cpp, including LLaMA/GPT model inference
Standardized Serverless ML Inference Platform on Kubernetes
C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)
A high-performance inference system for large language models
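Several of the servers listed above (vLLM, Text Generation Inference, LoRAX, and optillm, among others) expose OpenAI-compatible HTTP endpoints, which is what makes the drop-in replacement described in the first entry possible. The snippet below is a minimal sketch of that pattern using the openai Python client; the base URL, port, and model name are placeholder assumptions rather than defaults guaranteed by any particular project.

```python
# Minimal sketch: point the standard OpenAI client at a locally running,
# OpenAI-compatible inference server. The URL, port, and model name below
# are assumptions for illustration, not values required by any listed project.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server endpoint
    api_key="not-needed",                 # many local servers ignore the key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what an inference server does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```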