Replace OpenAI GPT with another LLM in your app
Official inference library for Mistral models
High-performance inference server and API layer for text embedding models
Large Language Model Text Generation Inference
The Triton Inference Server provides an optimized cloud
Port of Facebook's LLaMA model in C/C++
C++ library for high performance inference on NVIDIA GPUs
Library for serving Transformers models on Amazon SageMaker
A high-throughput and memory-efficient inference and serving engine
AlphaFold 3 inference pipeline
Deep learning optimization library: makes distributed training easy
Standardized Serverless ML Inference Platform on Kubernetes
High-performance reactive message-passing based Bayesian engine
Optimizing inference proxy for LLMs
A general-purpose probabilistic programming system
Port of OpenAI's Whisper model in C/C++
ONNX Runtime: cross-platform, high performance ML inferencing
AirLLM 70B inference with a single 4GB GPU
LLM inference in C/C++
A scalable inference server for models optimized with OpenVINO
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
A high-performance inference system for large language models
DeepSeek 4 Flash local inference engine for Metal
Bayesian inference with probabilistic programming
C#/.NET binding of llama.cpp, including LLaMa/GPT model inference