LLM Inference Tools for Linux

View 15 business solutions
  • Try Google Cloud Risk-Free With $300 in Credit Icon
    Try Google Cloud Risk-Free With $300 in Credit

    No hidden charges. No surprise bills. Cancel anytime.

    Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.
    Start Free
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 1
    ChatGLM.cpp

    ChatGLM.cpp

    C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)

    ChatGLM.cpp is a C++ implementation of the ChatGLM-6B model, enabling efficient local inference without requiring a Python environment. It is optimized for running on consumer hardware.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 2
    Coqui STT

    Coqui STT

    The deep learning toolkit for speech-to-text

    Coqui STT is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models. Coqui STT is battle-tested in both production and research. Multiple possible transcripts, each with an associated confidence score. Experience the immediacy of script-to-performance. With Coqui text-to-speech, production times go from months to minutes. With Coqui, the post is a pleasure. Effortlessly clone the voices of your talent and have the clone handle the problems in post. With Coqui, dubbing is a delight. Effortlessly clone the voice of your talent into another language and let the clone do the dub. With text-to-speech, experience the immediacy of script-to-performance. Cast from a wide selection of high-quality, directable, emotive voices or clone a voice to suit your needs. With Coqui text-to-speech, production times go from months to minutes.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 3
    LLM Foundry

    LLM Foundry

    LLM training code for MosaicML foundation models

    Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Large language models (LLMs) are changing the world, but for those outside well-resourced industry labs, it can be extremely difficult to train and deploy these models. This has led to a flurry of activity centered on open-source LLMs, such as the LLaMA series from Meta, the Pythia series from EleutherAI, the StableLM series from StabilityAI, and the OpenLLaMA model from Berkeley AI Research.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 4
    LLamaSharp

    LLamaSharp

    C#/.NET binding of llama.cpp, including LLaMa/GPT model inference

    The C#/.NET binding of llama.cpp. It provides APIs to infer the LLaMa Models and deploy it on the local environment. It works on both Windows, Linux and MAC without the requirement for compiling llama.cpp yourself. Its performance is close to llama.cpp. Furthermore, it provides integrations with other projects such as BotSharp to provide higher-level applications and UI.
    Downloads: 4 This Week
    Last Update:
    See Project
  • Go From AI Idea to AI App Fast Icon
    Go From AI Idea to AI App Fast

    One platform to build, fine-tune, and deploy ML models. No MLOps team required.

    Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.
    Try Free
  • 5
    PEFT

    PEFT

    State-of-the-art Parameter-Efficient Fine-Tuning

    Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly. In this regard, PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. Recent State-of-the-Art PEFT techniques achieve performance comparable to that of full fine-tuning.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 6
    Pandas Profiling

    Pandas Profiling

    Create HTML profiling reports from pandas DataFrame objects

    pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding. High correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik). Most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrilic). File sizes, creation dates, dimensions, indication of truncated images and existance of EXIF metadata. Mostly global details about the dataset (number of records, number of variables, overall missigness and duplicates, memory footprint). Comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, between others).
    Downloads: 4 This Week
    Last Update:
    See Project
  • 7
    Scanpy

    Scanpy

    Single-cell analysis in Python

    Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 8
    Xorbits Inference

    Xorbits Inference

    Replace OpenAI GPT with another LLM in your app

    Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop. Xorbits Inference(Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 9
    Arize Phoenix

    Arize Phoenix

    Uncover insights, surface problems, monitor, and fine tune your LLM

    Phoenix provides ML insights at lightning speed with zero-config observability for model drift, performance, and data quality. Phoenix is an Open Source ML Observability library designed for the Notebook. The toolset is designed to ingest model inference data for LLMs, CV, NLP and tabular datasets. It allows Data Scientists to quickly visualize their model data, monitor performance, track down issues & insights, and easily export to improve. Deep Learning Models (CV, LLM, and Generative) are an amazing technology that will power many of future ML use cases. A large set of these technologies are being deployed into businesses (the real world) in what we consider a production setting.
    Downloads: 3 This Week
    Last Update:
    See Project
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • 10
    AutoGen

    AutoGen

    An Open-Source Programming Framework for Agentic AI

    AutoGen is an open-source programming framework for building AI agents and facilitating cooperation among multiple agents to solve tasks. AutoGen aims to provide an easy-to-use and flexible framework for accelerating development and research on agentic AI, like PyTorch for Deep Learning. It offers features such as agents that can converse with other agents, LLM and tool use support, autonomous and human-in-the-loop workflows, and multi-agent conversation patterns. AutoGen provides multi-agent conversation framework as a high-level abstraction. With this framework, one can conveniently build LLM workflows. AutoGen offers a collection of working systems spanning a wide range of applications from various domains and complexities. AutoGen supports enhanced LLM inference APIs, which can be used to improve inference performance and reduce cost.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 11
    CTranslate2

    CTranslate2

    Fast inference engine for Transformer models

    CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The project implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering, etc., to accelerate and reduce the memory usage of Transformer models on CPU and GPU. The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc. The model serialization and computation support weights with reduced precision: 16-bit floating points (FP16), 16-bit integers (INT16), and 8-bit integers (INT8). The project supports x86-64 and AArch64/ARM64 processors and integrates multiple backends that are optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 12
    Eko

    Eko

    Build Production-ready Agentic Workflow with Natural Language

    Eko (Eko Keeps Operating) is a JavaScript framework designed for building production-ready agent-based workflows using natural language commands. It allows developers to create automated agents that can handle complex workflows in both computer and browser environments. With a focus on high development efficiency, Eko simplifies the creation of multi-step workflows, enabling users to integrate and automate tasks across platforms. It provides a unified interface for managing agents, offering features such as web resource access and high task complexity handling. Eko is open-source and can be used to execute tasks like browser automation, system operations, and software testing.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 13
    Lepton AI

    Lepton AI

    A Pythonic framework to simplify AI service building

    A Pythonic framework to simplify AI service building. Cutting-edge AI inference and training, unmatched cloud-native experience, and top-tier GPU infrastructure. Ensure 99.9% uptime with comprehensive health checks and automatic repairs.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 14
    Prem AI

    Prem AI

    Prem provides a unified environment to develop AI applications

    An intuitive desktop application designed to effortlessly deploy and self-host Open-Source AI models without exposing sensitive data to third-party. Prem provides a unified environment to develop AI applications and deploy AI models on your infrastructure. Abstracting away all technical complexities for AI deployment and ushering in a new era of privacy-centric AI applications - users can finally retain control and ownership of their models. The AI services expose an HTTP API interface, standardized for their interface type. For example, all models of type Chat expose the OpenAI API for easy of integration of existing tools and AI app ecosystem. Each service we support it's published on the Prem Registry.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 15
    RWKV Runner

    RWKV Runner

    A RWKV management and startup tool, full automation, only 8MB

    RWKV (pronounced as RwaKuv) is an RNN with GPT-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, fast training, saves VRAM, "infinite" ctxlen, and free text embedding. Moreover it's 100% attention-free. Default configs has enabled custom CUDA kernel acceleration, which is much faster and consumes much less VRAM. If you encounter possible compatibility issues, go to the Configs page and turn off Use Custom CUDA kernel to Accelerate.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 16
    Transformer Engine

    Transformer Engine

    A library for accelerating Transformer models on NVIDIA GPUs

    Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API that can be used seamlessly with your framework-specific code. TE also includes a framework-agnostic C++ API that can be integrated with other deep-learning libraries to enable FP8 support for Transformers. As the number of parameters in Transformer models continues to grow, training and inference for architectures such as BERT, GPT, and T5 become very memory and compute-intensive. Most deep learning frameworks train with FP32 by default. This is not essential, however, to achieve full accuracy for many deep learning models.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 17
    Diffusers

    Diffusers

    State-of-the-art diffusion models for image and audio generation

    Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, Diffusers is a modular toolbox that supports both. Our library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions. State-of-the-art diffusion pipelines that can be run in inference with just a few lines of code. Interchangeable noise schedulers for different diffusion speeds and output quality. Pretrained models that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. We recommend installing Diffusers in a virtual environment from PyPi or Conda. For more details about installing PyTorch and Flax, please refer to their official documentation.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 18
    GPT-NeoX

    GPT-NeoX

    Implementation of model parallel autoregressive transformers on GPUs

    This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training. For those looking for a TPU-centric codebase, we recommend Mesh Transformer JAX. If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the Hugging Face transformers library instead which supports GPT-NeoX models.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 19
    Hugging Face Transformer

    Hugging Face Transformer

    CPU/GPU inference server for Hugging Face transformer models

    Optimize and deploy in production Hugging Face Transformer models in a single command line. At Lefebvre Dalloz we run in-production semantic search engines in the legal domain, in the non-marketing language it's a re-ranker, and we based ours on Transformer. In that setup, latency is key to providing a good user experience, and relevancy inference is done online for hundreds of snippets per user query. Most tutorials on Transformer deployment in production are built over Pytorch and FastAPI. Both are great tools but not very performant in inference. Then, if you spend some time, you can build something over ONNX Runtime and Triton inference server. You will usually get from 2X to 4X faster inference compared to vanilla Pytorch. It's cool! However, if you want the best in class performances on GPU, there is only a single possible combination: Nvidia TensorRT and Triton. You will usually get 5X faster inference compared to vanilla Pytorch.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 20
    LLaVA

    LLaVA

    Visual Instruction Tuning: Large Language-and-Vision Assistant

    Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    LitGPT

    LitGPT

    20+ high-performance LLMs with recipes to pretrain, finetune at scale

    LitGPT is a collection of over 20 high-performance large language models (LLMs) accompanied by recipes to pretrain, finetune, and deploy them at scale. It provides implementations without abstractions, making it beginner-friendly while offering advanced features like flash attention and support for various precision levels. LitGPT is designed to run efficiently across multiple GPUs or TPUs, catering to both small-scale and large-scale deployments.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 22
    OpenFold

    OpenFold

    Trainable, memory-efficient, and GPU-friendly PyTorch reproduction

    OpenFold carefully reproduces (almost) all of the features of the original open source inference code (v2.0.1). The sole exception is model ensembling, which fared poorly in DeepMind's own ablation testing and is being phased out in future DeepMind experiments. It is omitted here for the sake of reducing clutter. In cases where the Nature paper differs from the source, we always defer to the latter. OpenFold is trainable in full precision, half precision, or bfloat16 with or without DeepSpeed, and we've trained it from scratch, matching the performance of the original. We've publicly released model weights and our training data — some 400,000 MSAs and PDB70 template hit files — under a permissive license. Model weights are available via scripts in this repository while the MSAs are hosted by the Registry of Open Data on AWS (RODA).
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    OpenLLM

    OpenLLM

    Operating LLMs in production

    An open platform for operating large language models (LLMs) in production. Fine-tune, serve, deploy, and monitor any LLMs with ease. With OpenLLM, you can run inference with any open-source large-language models, deploy to the cloud or on-premises, and build powerful AI apps. Built-in supports a wide range of open-source LLMs and model runtime, including Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, StarCoder, and more. Serve LLMs over RESTful API or gRPC with one command, query via WebUI, CLI, our Python/Javascript client, or any HTTP client.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    OpenVINO Model Server

    OpenVINO Model Server

    A scalable inference server for models optimized with OpenVINO

    OpenVINO™ Model Server is a high-performance inference serving system designed to host and serve machine learning models that have been optimized with the OpenVINO toolkit. It’s implemented in C++ for scalability and efficiency, making it suitable for both edge and cloud deployments where inference workloads must be reliable and high throughput. The server exposes model inference via standard network protocols like REST and gRPC, allowing any client that speaks those protocols to request predictions remotely, abstracting away the complexity of where and how the model runs. It supports model deployment in diverse environments including Docker, bare-metal machines, and Kubernetes clusters, and is especially useful in microservices architectures where AI services need to scale independently. The system supports a wide range of model sources, letting you host models from local storage, remote object storage, or even pull from model hubs.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25
    Superduper

    Superduper

    Superduper: Integrate AI models and machine learning workflows

    Superduper is a Python-based framework for building end-2-end AI-data workflows and applications on your own data, integrating with major databases. It supports the latest technologies and techniques, including LLMs, vector-search, RAG, and multimodality as well as classical AI and ML paradigms. Developers may leverage Superduper by building compositional and declarative objects that out-source the details of deployment, orchestration versioning, and more to the Superduper engine. This allows developers to completely avoid implementing MLOps, ETL pipelines, model deployment, data migration, and synchronization. Using Superduper is simply "CAPE": Connect to your data, apply arbitrary AI to that data, package and reuse the application on arbitrary data, and execute AI-database queries and predictions on the resulting AI outputs and data.
    Downloads: 2 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB