Replace OpenAI GPT with another LLM in your app
Official inference library for Mistral models
The Triton Inference Server provides an optimized cloud
High-performance inference server and API layer for text embedding models
Large Language Model Text Generation Inference
A high-throughput and memory-efficient inference and serving engine
Library for serving Transformers models on Amazon SageMaker
C++ library for high performance inference on NVIDIA GPUs
AlphaFold 3 inference pipeline
Bayesian inference with probabilistic programming
High-performance reactive message-passing based Bayesian engine
Optimizing inference proxy for LLMs
Port of Facebook's LLaMA model in C/C++
FlashInfer: Kernel Library for LLM Serving
C#/.NET binding of llama.cpp, including LLaMa/GPT model inference
A general-purpose probabilistic programming system
Deep learning optimization library: makes distributed training easy
ONNX Runtime: cross-platform, high performance ML inferencing
Port of OpenAI's Whisper model in C/C++
Lightweight, standalone C++ inference engine for Google's Gemma models
A high-performance inference system for large language models
Single-cell analysis in Python
950-line, minimal, extensible LLM inference engine built from scratch
Trainable, memory-efficient, and GPU-friendly PyTorch reproduction
Ready-to-use OCR with 80+ supported languages