A scalable inference server for models optimized with OpenVINO
A toolkit to optimize Keras & TensorFlow ML models for deployment
Port of Facebook's LLaMA model in C/C++
Port of OpenAI's Whisper model in C/C++
User-friendly AI Interface
Ready-to-use OCR with 80+ supported languages
AIMET is a library that provides advanced quantization and compression
Bring the notion of Model-as-a-Service to life
Uncover insights, surface problems, monitor, and fine-tune your LLM
A high-performance ML model serving framework that offers dynamic batching
Everything you need to build state-of-the-art foundation models
INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model
Unified Model Serving Framework
The free, Open Source alternative to OpenAI, Claude and others
Trainable models and NN optimization tools
Serve, optimize and scale PyTorch models in production
C#/.NET binding of llama.cpp, including LLaMA/GPT model inference
Unofficial Go bindings for the Hugging Face Inference API
Private OpenAI on Kubernetes
Neural Network Compression Framework for enhanced OpenVINO
High-performance neural network inference framework for mobile
Simplifies the local serving of AI models from any source
The Triton Inference Server provides an optimized cloud
ONNX Runtime: cross-platform, high-performance ML inferencing
Library for serving Transformers models on Amazon SageMaker