Showing 70 open source projects for "inference engine"

  • 1
    Transformer Engine

    A library for accelerating Transformer models on NVIDIA GPUs

    Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API that can be used seamlessly with your framework-specific code.
    Downloads: 14 This Week
  • 2
    MLX Engine

    LM Studio Apple MLX engine

    MLX Engine is the Apple MLX-based inference backend used by LM Studio to run large language models efficiently on Apple Silicon hardware. Built on top of the mlx-lm and mlx-vlm ecosystems, the engine provides a unified architecture capable of supporting both text-only and multimodal models. Its design focuses on high-performance on-device inference, leveraging Apple’s MLX stack to accelerate computation on M-series chips.
    Downloads: 2 This Week
  • 3
    Temporal Inference Engine

    A real-time inference engine for temporal logical specifications

    A real-time inference engine for temporal logical specifications that can acquire, process, and generate any binary or real signal through POSIX IPC, files, or UNIX sockets. Specifications of signals and dynamic systems are represented as special graphs and executed in real time, with a predictable sampling time of a few milliseconds. Fields of application include real-time signal processing, dynamic system control, state machine modeling, and logical property verification. ...
    Downloads: 0 This Week
  • 4
    ds4.c

    DeepSeek 4 Flash local inference engine for Metal

    ds4.c is a specialized local inference engine created by antirez for running DeepSeek V4 Flash models directly on Apple Silicon hardware using Metal acceleration. Unlike general-purpose inference runtimes, the project is intentionally optimized for a specific model family, enabling highly efficient execution and simplified architecture. The engine includes DS4-specific model loading, KV cache management, prompt rendering, and OpenAI-compatible server APIs for local deployment workflows. ...
    Downloads: 4 This Week
  • 5
    vLLM

    A high-throughput and memory-efficient inference and serving engine

    vLLM is a fast and easy-to-use library for LLM inference and serving. It offers high-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.
    Downloads: 14 This Week
  • 6
    SimpleLLM

    950-line, minimal, extensible LLM inference engine built from scratch

    SimpleLLM is a minimal, extensible large language model inference engine implemented in roughly 950 lines of code, built from scratch to serve both as a learning tool and a research platform for novel inference techniques. It provides the core components of an LLM runtime—such as tokenization, batching, and asynchronous execution—without the abstraction overhead of more complex engines, making it easier for developers and researchers to understand and modify. ...
    Downloads: 0 This Week
  • 7
    Jlama

    Jlama is a modern LLM inference engine for Java

    Jlama is a modern inference engine written entirely in Java that enables developers to run large language models locally within Java applications. Unlike frameworks that require external APIs or remote services, Jlama performs inference directly on a machine using pre-trained models. This allows organizations to integrate generative AI features into their systems while maintaining full control over data privacy and infrastructure.
    Downloads: 2 This Week
  • 8
    RTP-LLM

    Alibaba's high-performance LLM inference engine for diverse apps

    RTP-LLM is an open-source large language model inference acceleration engine developed by Alibaba to provide high-performance serving infrastructure for modern LLM deployments. The system focuses on improving throughput, latency, and resource utilization when running large models in production environments. It achieves this by implementing optimized GPU kernels, batching strategies, and memory management techniques tailored for transformer inference workloads. ...
    Downloads: 7 This Week
  • 9
    uzu

    A high-performance inference engine for AI models

    uzu is a high-performance inference engine designed to run artificial intelligence models efficiently on Apple Silicon hardware. Written primarily in Rust and leveraging Apple’s Metal framework, the project focuses on maximizing performance when executing large language models and other AI workloads on devices such as Mac computers with M-series chips. The engine implements a hybrid architecture in which model layers can be executed either as custom GPU kernels or through Apple’s MPSGraph API, allowing it to balance performance and compatibility depending on the workload. ...
    Downloads: 0 This Week
  • 10
    Chitu

    High-performance inference framework for large language models

    Chitu is a high-performance inference engine designed to deploy and run large language models efficiently in production environments. The framework focuses on improving efficiency, flexibility, and scalability for organizations that need to run LLM inference workloads across different hardware platforms. It supports heterogeneous computing environments, including CPUs, GPUs, and various specialized AI accelerators, allowing models to run across a wide range of infrastructure configurations. ...
    Downloads: 3 This Week
  • 11
    Nano-vLLM

    A lightweight vLLM implementation built from scratch

    Nano-vLLM is a lightweight implementation of the vLLM inference engine designed to run large language models efficiently while maintaining a minimal and readable codebase. The project recreates the core functionality of vLLM in a simplified architecture written in approximately a thousand lines of Python, making it easier for developers and researchers to understand how modern LLM inference systems work.
    Downloads: 0 This Week
  • 12
    TokenSpeed

    TokenSpeed is a speed-of-light LLM inference engine

    TokenSpeed is an LLM inference engine designed for high-performance production agent workloads. It aims to combine TensorRT-LLM-level speed with vLLM-level usability, making it relevant for teams that need fast generation without sacrificing developer ergonomics. The project is focused on the specific needs of agentic systems, where latency, throughput, and efficient scheduling matter across many short or tool-heavy requests.
    Downloads: 0 This Week
  • 13
    gemma.cpp

    Lightweight, standalone C++ inference engine for Google's Gemma models

    Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.
    Downloads: 1 This Week
  • 14
    Mooncake

    Mooncake is the serving platform for Kimi

    ...The platform was originally developed as part of the serving infrastructure for the Kimi large language model system. Its architecture centers on a high-performance transfer engine that provides unified data transfer across different storage and networking technologies. This engine enables efficient movement of tensors and model data across heterogeneous environments such as GPU memory, system memory, and distributed storage systems. Mooncake also introduces distributed key-value cache storage that allows inference systems to reuse previously computed attention states, significantly improving throughput in large-scale deployments. ...
    Downloads: 17 This Week
  • 15
    SAM 3

    Code for running inference and finetuning with SAM 3 model

    SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an...
    Downloads: 26 This Week
  • 16
    Pruna AI

    Pruna is a model optimization framework built for developers

    Pruna is an open-source, self-hostable AI inference engine designed to help teams deploy and manage large language models (LLMs) efficiently across private or hybrid infrastructures. Built with performance and developer ergonomics in mind, Pruna simplifies inference workflows by enabling multi-model orchestration, autoscaling, GPU resource allocation, and compatibility with popular open-source models.
    Downloads: 2 This Week
  • 17
    AI Runner

    Offline inference engine for art, real-time voice conversations

    AI Runner is an offline inference engine designed to run a collection of AI workloads on your own machine, including image generation for art, real-time voice conversations, LLM-powered chatbots and automated workflows. It is implemented as a desktop-oriented Python application and emphasizes privacy and self-hosting, allowing users to work with text-to-speech, speech-to-text, text-to-image and multimodal models without sending data to external services.
    Downloads: 5 This Week
  • 18
    mistral.rs

    Fast, flexible LLM inference

    mistral.rs is a fast and flexible LLM inference engine implemented in Rust, designed to run and serve modern language models with an emphasis on performance and practical deployment. It provides multiple entry points for developers, including a CLI for running models locally and an HTTP server that exposes an OpenAI-compatible API surface for easy integration with existing clients.
    Downloads: 5 This Week
  • 19
    CTranslate2

    Fast inference engine for Transformer models

    CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The project implements a custom runtime that applies many performance optimization techniques, such as weight quantization, layer fusion, and batch reordering, to accelerate Transformer models and reduce their memory usage on CPU and GPU. On supported models and tasks, execution is significantly faster and requires fewer resources than general-purpose deep learning frameworks thanks to many...
    Downloads: 8 This Week
  • 20
    mllm

    Fast Multimodal LLM on Mobile Devices

    mllm is an open-source inference engine designed to run multimodal large language models efficiently on mobile devices and edge computing environments. The framework focuses on delivering high-performance AI inference in resource-constrained systems such as smartphones, embedded hardware, and lightweight computing platforms. Implemented primarily in C and C++, it is designed to operate with minimal external dependencies while taking advantage of hardware-specific acceleration technologies such as ARM NEON and x86 AVX2 instructions. ...
    Downloads: 0 This Week
  • 21
    Parallax

    Parallax is a distributed model serving framework

    Parallax is a decentralized inference framework designed to run large language models across distributed computing resources. Instead of relying on centralized GPU clusters in data centers, the system allows multiple heterogeneous machines to collaborate in serving AI inference workloads. Parallax divides model layers across different nodes and dynamically coordinates them to form a complete inference pipeline. A two-stage scheduling architecture determines how model layers are allocated to...
    Downloads: 3 This Week
  • 22
    LightLLM

    LightLLM is a Python-based LLM (Large Language Model) inference

    LightLLM is a high-performance inference and serving framework designed specifically for large language models, focusing on lightweight architecture, scalability, and efficient deployment. The framework enables developers to run and serve modern language models with significantly improved speed and resource efficiency compared to many traditional inference systems. Built primarily in Python, the project integrates optimization techniques and ideas from several leading open-source...
    Downloads: 1 This Week
  • 23
    HunyuanWorld-Voyager

    RGBD video generation model conditioned on camera input

    ...The system jointly produces aligned RGB and depth video sequences, making it directly applicable to 3D reconstruction tasks. At its core, Voyager integrates a world-consistent video diffusion model with an efficient long-range world exploration engine powered by auto-regressive inference. To support training, the team built a scalable data engine that automatically curates large video datasets with camera pose estimation and metric depth prediction. As a result, Voyager delivers state-of-the-art performance on world exploration benchmarks while maintaining photometric, style, and 3D consistency.
    Downloads: 1 This Week
  • 24
    wllama

    WebAssembly binding for llama.cpp - Enabling on-browser LLM inference

    wllama is a WebAssembly-based library that enables large language model inference directly inside a web browser. Built as a binding for the llama.cpp inference engine, the project allows developers to run LLM models locally without requiring a server backend or dedicated GPU hardware. The library leverages WebAssembly SIMD capabilities to achieve efficient execution within modern browsers while maintaining compatibility across platforms.
    Downloads: 2 This Week
  • 25
    Open WebUI

    User-friendly AI Interface

    Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with a built-in inference engine for Retrieval Augmented Generation (RAG), making it a powerful AI deployment solution. Key features include effortless setup via Docker or Kubernetes, seamless integration with OpenAI-compatible APIs, granular permissions and user groups for enhanced security, responsive design across devices, and full Markdown and LaTeX support for enriched interactions. ...
    Downloads: 184 This Week