Page 3 | Best Open Source LLM Inference Tools 2026

LLM Inference Tools

View 137 business solutions

LLM Inference Clear Filters

Try Google Cloud Risk-Free With $300 in Credit
No hidden charges. No surprise bills. Cancel anytime.

Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.

Start Free
Enterprise-grade ITSM, for every business
Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity.

Freshservice is an intuitive, AI-powered platform that helps IT, operations, and business teams deliver exceptional service without the usual complexity. Automate repetitive tasks, resolve issues faster, and provide seamless support across the organization. From managing incidents and assets to driving smarter decisions, Freshservice makes it easy to stay efficient and scale with confidence.

Try it Free
1

OpenVINO Model Server

A scalable inference server for models optimized with OpenVINO

OpenVINO™ Model Server is a high-performance inference serving system designed to host and serve machine learning models that have been optimized with the OpenVINO toolkit. It’s implemented in C++ for scalability and efficiency, making it suitable for both edge and cloud deployments where inference workloads must be reliable and high throughput. The server exposes model inference via standard network protocols like REST and gRPC, allowing any client that speaks those protocols to request predictions remotely, abstracting away the complexity of where and how the model runs. It supports model deployment in diverse environments including Docker, bare-metal machines, and Kubernetes clusters, and is especially useful in microservices architectures where AI services need to scale independently. The system supports a wide range of model sources, letting you host models from local storage, remote object storage, or even pull from model hubs.

Downloads: 2 This Week

Last Update: 2026-04-08
See Project
2

Superduper

Superduper: Integrate AI models and machine learning workflows

Superduper is a Python-based framework for building end-2-end AI-data workflows and applications on your own data, integrating with major databases. It supports the latest technologies and techniques, including LLMs, vector-search, RAG, and multimodality as well as classical AI and ML paradigms. Developers may leverage Superduper by building compositional and declarative objects that out-source the details of deployment, orchestration versioning, and more to the Superduper engine. This allows developers to completely avoid implementing MLOps, ETL pipelines, model deployment, data migration, and synchronization. Using Superduper is simply "CAPE": Connect to your data, apply arbitrary AI to that data, package and reuse the application on arbitrary data, and execute AI-database queries and predictions on the resulting AI outputs and data.

Downloads: 2 This Week

Last Update: 2025-08-26
See Project
3

spaGO

Self-contained Machine Learning and Natural Language Processing lib

A Machine Learning library written in pure Go designed to support relevant neural architectures in Natural Language Processing. Spago is self-contained, in that it uses its own lightweight computational graph both for training and inference, easy to understand from start to finish. The core module of Spago relies only on testify for unit testing. In other words, it has "zero dependencies", and we are committed to keeping it that way as much as possible. Spago uses a multi-module workspace to ensure that additional dependencies are downloaded only when specific features (e.g. persistent embeddings) are used. A good place to start is by looking at the implementation of built-in neural models, such as the LSTM. Except for a few linear algebra operations written in assembly for optimal performance (a bit of copying from Gonum), it's straightforward Go code, so you don't have to worry.

Downloads: 2 This Week

Last Update: 2023-10-30
See Project
4

Arch Gateway

The AI-native (edge and LLM) proxy for agents

Arch is an AI-native proxy designed to facilitate the development of agentic applications by handling complex tasks such as input clarification, agent routing, and seamless integration of prompts with tools for common tasks. It provides unified access and observability of Large Language Models (LLMs), enabling developers to build applications more efficiently. Arch supports both edge and LLM deployments, offering flexibility in various environments.

Downloads: 1 This Week

Last Update: 1 day ago
See Project
Full-stack observability with actually useful AI | Grafana Cloud
Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.

Create free account
5

Artificial Intelligence Controller

AICI: Prompts as (Wasm) Programs

AICI is a framework that allows developers to build controllers that constrain and direct the output of Large Language Models (LLMs). By treating prompts as WebAssembly (Wasm) programs, AICI enables more precise and controlled interactions with LLMs, enhancing their utility in various applications. This approach allows for the creation of more reliable and predictable AI-driven systems.

Downloads: 1 This Week

Last Update: 2025-03-19
See Project
6

Autodistill

Images to inference with no labeling

Autodistill uses big, slower foundation models to train small, faster supervised models. Using autodistill, you can go from unlabeled images to inference on a custom model running at the edge with no human intervention in between. You can use Autodistill on your own hardware, or use the Roboflow hosted version of Autodistill to label images in the cloud.

Downloads: 1 This Week

Last Update: 2024-08-14
See Project
7

Bolt NLP

Bolt is a deep learning library with high performance

Bolt is a high-performance deep learning inference framework developed by Huawei Noah's Ark Lab. It is designed to optimize and accelerate the deployment of deep learning models across various hardware platforms. Bolt is a light-weight library for deep learning. Bolt, as a universal deployment tool for all kinds of neural networks, aims to automate the deployment pipeline and achieve extreme acceleration. Bolt has been widely deployed and used in many departments of HUAWEI company, such as 2012 Laboratory, CBG and HUAWEI Product Lines. If you have questions or suggestions, you can submit issue.

Downloads: 1 This Week

Last Update: 2025-01-30
See Project
8

DeepDetect

Deep Learning API and Server in C++14 support for Caffe, PyTorch

The core idea is to remove the error sources and difficulties of Deep Learning applications by providing a safe haven of commoditized practices, all available as a single core. While the Open Source Deep Learning Server is the core element, with REST API, and multi-platform support that allows training & inference everywhere, the Deep Learning Platform allows higher level management for training neural network models and using them as if they were simple code snippets. Ready for applications of image tagging, object detection, segmentation, OCR, Audio, Video, Text classification, CSV for tabular data and time series. Neural network templates for the most effective architectures for GPU, CPU, and Embedded devices. Training in a few hours and with small data thanks to 25+ pre-trained models. Full Open Source, with an ecosystem of tools (API clients, video, annotation, ...) Fast Server written in pure C++, a single codebase for Cloud, Desktop & Embedded.

Downloads: 1 This Week

Last Update: 2025-07-19
See Project
9

DocTR

Library for OCR-related tasks powered by Deep Learning

DocTR provides an easy and powerful way to extract valuable information from your documents. Seemlessly process documents for Natural Language Understanding tasks: we provide OCR predictors to parse textual information (localize and identify each word) from your documents. Robust 2-stage (detection + recognition) OCR predictors with pretrained parameters. User-friendly, 3 lines of code to load a document and extract text with a predictor. State-of-the-art performances on public document datasets, comparable with GoogleVision/AWS Textract. Easy integration (available templates for browser demo & API deployment). End-to-End OCR is achieved in docTR using a two-stage approach: text detection (localizing words), then text recognition (identify all characters in the word). As such, you can select the architecture used for text detection, and the one for text recognition from the list of available implementations.

Downloads: 1 This Week

Last Update: 2026-02-04
See Project
Go From AI Idea to AI App Fast
One platform to build, fine-tune, and deploy ML models. No MLOps team required.

Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.

Try Free
10

KubeAI

Private Open AI on Kubernetes

Get inferencing running on Kubernetes: LLMs, Embeddings, Speech-to-Text. KubeAI serves an OpenAI compatible HTTP API. Admins can configure ML models by using the Model Kubernetes Custom Resources. KubeAI can be thought of as a Model Operator (See Operator Pattern) that manages vLLM and Ollama servers.

Downloads: 1 This Week

Last Update: 2026-03-31
See Project
11

Mistral Inference

Official inference library for Mistral models

Open and portable generative AI for devs and businesses. We release open-weight models for everyone to customize and deploy where they want it. Our super-efficient model Mistral Nemo is available under Apache 2.0, while Mistral Large 2 is available through both a free non-commercial license, and a commercial license.

Downloads: 1 This Week

Last Update: 2025-03-20
See Project
12

Mosec

A high-performance ML model serving framework, offers dynamic batching

Mosec is a high-performance and flexible model-serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and the efficient online service API.

Downloads: 1 This Week

Last Update: 2026-04-15
See Project
13

OpenFold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction

OpenFold carefully reproduces (almost) all of the features of the original open source inference code (v2.0.1). The sole exception is model ensembling, which fared poorly in DeepMind's own ablation testing and is being phased out in future DeepMind experiments. It is omitted here for the sake of reducing clutter. In cases where the Nature paper differs from the source, we always defer to the latter. OpenFold is trainable in full precision, half precision, or bfloat16 with or without DeepSpeed, and we've trained it from scratch, matching the performance of the original. We've publicly released model weights and our training data — some 400,000 MSAs and PDB70 template hit files — under a permissive license. Model weights are available via scripts in this repository while the MSAs are hosted by the Registry of Open Data on AWS (RODA).

Downloads: 1 This Week

Last Update: 2025-04-26
See Project
14

PaddlePaddle

PArallel Distributed Deep LEarning: Machine Learning Framework

PaddlePaddle is an open source deep learning industrial platform with advanced technologies and a rich set of features that make innovation and application of deep learning easier. It is the only independent R&D deep learning platform in China, and has been widely adopted in various sectors including manufacturing, agriculture and enterprise service. PaddlePaddle covers core deep learning frameworks, basic model libraries, end-to-end development kits and more, with support for both dynamic and static graphs.

Downloads: 1 This Week

Last Update: 2026-01-31
See Project
15

Petals

Run 100B+ language models at home, BitTorrent-style

Run 100B+ language models at home, BitTorrent‑style. Run large language models like BLOOM-176B collaboratively — you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning. Single-batch inference runs at ≈ 1 sec per step (token) — up to 10x faster than offloading, enough for chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec. Beyond classic language model APIs — you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch. You can also host BLOOMZ, a version of BLOOM fine-tuned to follow human instructions in the zero-shot regime — just replace bloom-petals with bloomz-petals. Petals runs large language models like BLOOM-176B collaboratively — you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning.

Downloads: 1 This Week

Last Update: 2023-09-06
See Project
16

Phi-3-MLX

Phi-3.5 for Mac: Locally-run Vision and Language Models

Phi-3-Vision-MLX is an Apple MLX (machine learning on Apple silicon) implementation of Phi-3 Vision, a lightweight multi-modal model designed for vision and language tasks. It focuses on running vision-language AI efficiently on Apple hardware like M1 and M2 chips.

Downloads: 1 This Week

Last Update: 2025-03-13
See Project
17

Prem AI

Prem provides a unified environment to develop AI applications

An intuitive desktop application designed to effortlessly deploy and self-host Open-Source AI models without exposing sensitive data to third-party. Prem provides a unified environment to develop AI applications and deploy AI models on your infrastructure. Abstracting away all technical complexities for AI deployment and ushering in a new era of privacy-centric AI applications - users can finally retain control and ownership of their models. The AI services expose an HTTP API interface, standardized for their interface type. For example, all models of type Chat expose the OpenAI API for easy of integration of existing tools and AI app ecosystem. Each service we support it's published on the Prem Registry.

Downloads: 1 This Week

Last Update: 2023-11-16
See Project
18

RWKV Runner

A RWKV management and startup tool, full automation, only 8MB

RWKV (pronounced as RwaKuv) is an RNN with GPT-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, fast training, saves VRAM, "infinite" ctxlen, and free text embedding. Moreover it's 100% attention-free. Default configs has enabled custom CUDA kernel acceleration, which is much faster and consumes much less VRAM. If you encounter possible compatibility issues, go to the Configs page and turn off Use Custom CUDA kernel to Accelerate.

Downloads: 1 This Week

Last Update: 2026-02-01
See Project
19

RamaLama

Simplifies the local serving of AI models from any source

RamaLama is an open-source developer tool that simplifies working with and serving AI models locally or in production by leveraging container technologies like Docker, Podman, and OCI registries, allowing AI inference workflows to be treated like standard container deployments. It abstracts away much of the complexity of configuring AI runtimes, dependencies, and hardware optimizations by detecting available GPUs (or falling back to CPU) and automatically pulling a container image pre-configured for the detected hardware environment. Developers can use familiar container commands to pull, run, and interact with AI models from any source, treating models similarly to how container images are handled in OCI workflows. RamaLama supports multiple model registries and offers a REST API or chatbot interface for interacting with running models, making it flexible for local development, testing, or integration into larger systems.

Downloads: 1 This Week

Last Update: 2026-04-17
See Project
20

SageMaker Python SDK

Training and deploying machine learning models on Amazon SageMaker

SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker. With the SDK, you can train and deploy models using popular deep learning frameworks Apache MXNet and TensorFlow. You can also train and deploy models with Amazon algorithms, which are scalable implementations of core machine learning algorithms that are optimized for SageMaker and GPU training. If you have your own algorithms built into SageMaker-compatible Docker containers, you can train and host models using these as well.

Downloads: 1 This Week

Last Update: 2 days ago
See Project
21

TorchAudio

Data manipulation and transformation for audio signal processing

The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the same philosophy of providing strong GPU acceleration, having a focus on trainable features through the autograd system, and having consistent style (tensor names and dimension names). Therefore, it is primarily a machine learning library and not a general signal processing library. The benefits of PyTorch can be seen in torchaudio through having all the computations be through PyTorch operations which makes it easy to use and feel like a natural extension.

Downloads: 1 This Week

Last Update: 2026-02-17
See Project
22

TorchServe

Serve, optimize and scale PyTorch models in production

TorchServe is a performant, flexible and easy-to-use tool for serving PyTorch eager mode and torschripted models. Multi-model management with the optimized worker to model allocation. REST and gRPC support for batched inference. Export your model for optimized inference. Torchscript out of the box, ORT, IPEX, TensorRT, FasterTransformer. Performance Guide: built-in support to optimize, benchmark and profile PyTorch and TorchServe performance. Expressive handlers: An expressive handler architecture that makes it trivial to support inferencing for your use case with many supported out of the box. Out-of-box support for system-level metrics with Prometheus exports, custom metrics and PyTorch profiler support.

Downloads: 1 This Week

Last Update: 2024-09-30
See Project
23

gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models

Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.

Downloads: 1 This Week

Last Update: 2025-03-25
See Project
24

optillm

Optimizing inference proxy for LLMs

OptiLLM is an optimizing inference proxy for Large Language Models (LLMs) that implements state-of-the-art techniques to enhance performance and efficiency. It serves as an OpenAI API-compatible proxy, allowing for seamless integration into existing workflows while optimizing inference processes. OptiLLM aims to reduce latency and resource consumption during LLM inference.

Downloads: 1 This Week

Last Update: 2026-03-19
See Project
25

ollama_manager_gui

A graphical manager for ollama that can manage your LLMs

This app will help install ollama and LLMs using the gui provided by this app. It checks for ollama when launched and if it doesn't exist it will help by bringing you to the ollama site for download. This app is heavily upgraded and now also works properly on Linux. It now has progress bars and many many many improvements. It can launch the LLM by clicking the link. it can launch multiple LLMs in separate windows. It can also remove an installed LLM. There is a confirmation dialog when removing or hard resetting an LLM. It can also install LLMs. Reset deletes all data for LLM of your choosing and automatically reinstalls that LLM of your choosing by redownloading. You can also install new LLMs by simply clicking the install button. warning... you can remove custom LLMs with this... Remember with great power comes greater responsibility! I hope this is of use to you, but it has no warranty it has a very nice improved dark mode as of version 1.2 aka 1

Downloads: 11 This Week

Last Update: 2025-08-14
See Project