Showing 62 open source projects for "vllm"

  • 1
    vLLM

    A high-throughput and memory-efficient inference and serving engine

    vLLM is a fast and easy-to-use library for LLM inference and serving. It provides high-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.
    Downloads: 43 This Week
  • 2
    Nano-vLLM

    A lightweight vLLM implementation built from scratch

    ...Its API closely mirrors that of the original vLLM framework, allowing developers familiar with vLLM to adopt the tool with minimal changes.
    Downloads: 2 This Week
  • 3
    vLLM Semantic Router

    System Level Intelligent Router for Mixture-of-Models at Cloud

    Semantic Router is an open-source system designed to intelligently route requests across multiple large language models based on the semantic meaning and complexity of user queries. Instead of sending every prompt to the same model, the system analyzes the intent and reasoning requirements of the request and dynamically selects the most appropriate model to process it. This approach allows developers to combine multiple models with different strengths, such as lightweight models for simple...
    Downloads: 0 This Week
  • 4
    OpenJarvis

    Personal AI, On Personal Devices

    ...The framework provides shared primitives for building local-first agents, along with evaluation tools that measure performance using metrics such as energy consumption, latency, cost, and accuracy. OpenJarvis integrates with local inference engines like Ollama, vLLM, SGLang, and llama.cpp to run language models directly on personal hardware. It also includes a learning loop that allows models to improve over time using locally generated interaction traces. By prioritizing local execution and efficiency, OpenJarvis aims to provide a foundation for privacy-preserving personal AI assistants.
    Downloads: 177 This Week
  • 5
    DeepSeek-OCR 2

    Visual Causal Flow

    ...The repository provides model code and inference scripts that let researchers and developers run and benchmark the system on both images and PDFs, with support for batch evaluation and optimized pipelines leveraging vLLM and transformers.
    Downloads: 14 This Week
  • 6
    NVIDIA Model Optimizer

    A unified library of SOTA model optimization techniques

    ...It supports a wide range of model types, including large language models, diffusion models, and vision-language models, and integrates with deployment frameworks such as TensorRT and vLLM. By providing standardized workflows and APIs, it enables developers to experiment with different optimization strategies and select the best approach for their use case.
    Downloads: 2 This Week
  • 7
    Kimi K2.5

    Moonshot's most powerful AI model

    ...With a 256K context length and MoonViT vision encoder, the model excels across reasoning, coding, long-context comprehension, image, and video benchmarks. Kimi K2.5 is available via Moonshot’s API (OpenAI/Anthropic-compatible) and supports deployment through vLLM, SGLang, and KTransformers.
    Downloads: 40 This Week
  • 8
    KubeAI

    Private Open AI on Kubernetes

    ...KubeAI serves an OpenAI-compatible HTTP API. Admins can configure ML models by using the Model Kubernetes Custom Resources. KubeAI can be thought of as a Model Operator (see the Operator pattern) that manages vLLM and Ollama servers.
    Downloads: 0 This Week
  • 9
    NemoClaw

    NVIDIA plugin for secure installation of OpenClaw

    ...NemoClaw enables users to launch sandboxed agent environments that control network access, file permissions, and inference requests through policy-based security. The platform integrates with AI models such as NVIDIA Nemotron and supports multiple inference backends including cloud APIs, local NIM deployments, and vLLM. Through its command-line interface, developers can deploy, monitor, and manage AI assistants running inside isolated sandboxes. By combining sandbox orchestration, agent management, and AI model integration, NemoClaw provides a secure foundation for building and operating autonomous AI assistants.
    Downloads: 7 This Week
  • 10
    Harbor LLM

    Run a full local LLM stack with one command using Docker

    ...With a single command, users can start preconfigured tools like Ollama and Open WebUI, enabling chat, workflows, and integrations immediately. Harbor supports multiple inference engines, including llama.cpp and vLLM, and connects them seamlessly to user interfaces. It also includes tools for web retrieval, image generation, voice interaction, and workflow automation. Built on Docker, Harbor allows services to run in isolated containers while communicating over a local network. It is intended for local development and experimentation rather than production deployment, giving developers a flexible way to explore AI systems, test configurations, and manage complex LLM stacks without manual wiring or setup overhead.
    Downloads: 3 This Week
  • 11
    GLM-4.5

    GLM-4.5: Open-source LLM for intelligent agents by Z.ai

    ...GLM-4.5 achieves strong performance on 12 industry-standard benchmarks, ranking 3rd overall, while GLM-4.5-Air balances competitive results with greater efficiency. The models support FP8 and BF16 precision, and can handle very large context windows of up to 128K tokens. Flexible inference is supported through frameworks like vLLM and SGLang with tool-call and reasoning parsers included.
    Downloads: 69 This Week
  • 12
    GLM-5

    From Vibe Coding to Agentic Engineering

    GLM-5 is a next-generation open-source large language model (LLM) developed by the Z.ai team under the zai-org organization that pushes the boundaries of reasoning, coding, and long-horizon agentic intelligence. Building on earlier GLM series models, GLM-5 dramatically scales the parameter count (to roughly 744 billion) and expands pre-training data to significantly improve performance on complex tasks such as multi-step reasoning, software engineering workflows, and agent orchestration...
    Downloads: 207 This Week
  • 13
    Orpheus TTS

    Towards Human-Sounding Speech

    ...The project ships both pretrained and finetuned English models, as well as a family of multilingual models released as a research preview, and includes data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation, including streaming examples that show how to generate audio chunks in real time. The maintainers provide Colab notebooks, a standardized prompting format, and one-click deployment via Baseten for production-grade, FP8/FP16 optimized inference with ~200 ms streaming latency.
    Downloads: 2 This Week
  • 14
    Intel LLM Library for PyTorch

    Accelerate local LLM inference and finetuning

    ...IPEX-LLM supports a wide range of popular models, including architectures such as LLaMA, Mistral, Qwen, and other transformer-based systems. The library can integrate with common AI frameworks and serving tools such as Hugging Face Transformers, LangChain, and vLLM, allowing developers to incorporate optimized inference into existing pipelines.
    Downloads: 0 This Week
  • 15
    OuteTTS

    Interface for OuteTTS models

    ...It provides a high-level Interface API that wraps model configuration, speaker handling, and audio generation so you can focus on integrating speech into your application rather than wiring up low-level engines. The project supports multiple backends including llama.cpp (Python bindings and server), Hugging Face Transformers, ExLlamaV2, vLLM, and a JavaScript interface via Transformers.js, allowing it to run on CPUs, NVIDIA CUDA GPUs, AMD ROCm, Vulkan-capable GPUs, and Apple Metal. It also includes a notion of speaker profiles: you can create a speaker from a short audio sample, save it as JSON, and reuse it for consistent voice identity across generations and sessions. ...
    Downloads: 0 This Week
  • 16
    LocalAI

    The free, Open Source alternative to OpenAI, Claude and others

    LocalAI is an open-source platform that allows users to run large language models and other AI systems locally on their own hardware. It acts as a drop-in replacement for APIs such as OpenAI, enabling developers to build AI-powered applications without relying on external cloud services. The platform supports a wide range of model types, including text generation, image creation, speech processing, and embeddings. LocalAI can run on consumer-grade hardware and does not necessarily require a...
    Downloads: 55 This Week
  • 17
    GLM-4.7

    Advanced language and coding AI model

    GLM-4.7 is an advanced agent-oriented large language model designed as a high-performance coding and reasoning partner. It delivers significant gains over GLM-4.6 in multilingual agentic coding, terminal-based workflows, and real-world developer benchmarks such as SWE-bench and Terminal Bench 2.0. The model introduces stronger “thinking before acting” behavior, improving stability and accuracy in complex agent frameworks like Claude Code, Cline, and Roo Code. GLM-4.7 also advances “vibe...
    Downloads: 82 This Week
  • 18
    Qwen3

    Qwen3 is the large language model series developed by Qwen team

    Qwen3 is a cutting-edge large language model (LLM) series developed by the Qwen team at Alibaba Cloud. The latest updated version, Qwen3-235B-A22B-Instruct-2507, features significant improvements in instruction-following, reasoning, knowledge coverage, and long-context understanding up to 256K tokens. It delivers higher quality and more helpful text generation across multiple languages and domains, including mathematics, coding, science, and tool usage. Various quantized versions,...
    Downloads: 25 This Week
  • 19
    SWIFT LLM

    Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs

    ...The platform provides a full machine learning pipeline that supports tasks ranging from model pre-training to reinforcement learning alignment techniques. It integrates with popular inference engines such as vLLM and LMDeploy to accelerate deployment and runtime performance. The framework also includes support for many modern training strategies, including preference learning methods and parameter-efficient fine-tuning techniques. ms-swift is designed to work with hundreds of language and multimodal models, providing a unified environment for experimentation and production deployment.
    Downloads: 3 This Week
  • 20
    GLM-OCR

    Accurate × Fast × Comprehensive

    GLM-OCR is an open-source multimodal optical character recognition (OCR) model built on a GLM-V encoder–decoder foundation that brings robust, accurate document understanding to complex real-world layouts and modalities. Designed to handle text recognition, table parsing, formula extraction, and general information retrieval from documents containing mixed content, GLM-OCR excels across major benchmarks while remaining highly efficient with a relatively compact parameter size (~0.9B),...
    Downloads: 10 This Week
  • 21
    Void Editor

    Open source AI IDE and Cursor alternative

    Void is an open-source, AI-powered code editor built as a fork of Visual Studio Code. Designed as a fully transparent and privacy-focused alternative to Cursor or GitHub Copilot, it lets you use AI models locally or via APIs (OpenAI, Claude, Gemini, Ollama, etc.)—without routing data through proprietary servers. Developed by YC-backed startup Glass Devtools, it supports traditional coding features inherited from VS Code, enhanced with in-editor LLM capabilities—autocomplete, inline quick...
    Downloads: 6 This Week
  • 22
    MiniCPM4

    Ultra-Efficient LLMs on End Device

    MiniCPM4 is part of the MiniCPM family of ultra-efficient large language models designed specifically for high performance on edge devices and resource-constrained environments. Unlike traditional large-scale models that require extensive computational resources, MiniCPM4 focuses on delivering competitive reasoning and language capabilities while maintaining significantly lower latency and higher efficiency. It achieves this through optimized architectures, scalable training strategies, and...
    Downloads: 4 This Week
  • 23
    LoLLMs Hub Fortress

    A proxy server for multiple Ollama instances with key security

    LoLLMs Hub Fortress is a high-performance AI orchestration platform designed to unify multiple large language model backends into a single, secure, and scalable API layer. It acts as a central gateway that connects different inference engines such as Ollama, llama.cpp, vLLM, and OpenAI-compatible services, allowing them to function as interchangeable compute nodes within one system. The architecture is built around a hierarchical “master and slave” hub model, enabling distributed deployments where multiple machines or clusters can be managed through a single entry point. This design allows organizations to scale horizontally, combining local hardware, cloud resources, and specialized inference servers into a unified infrastructure. ...
    Downloads: 1 This Week
  • 24
    dots.ocr

    Multilingual Document Layout Parsing in a Single Vision-Language Model

    dots.ocr is a cutting-edge multilingual document parsing system built on a unified vision-language model that combines layout detection, text recognition, and structural understanding into a single architecture. Unlike traditional OCR pipelines that rely on multiple specialized components, dots.ocr integrates these processes end-to-end, reducing error propagation and improving consistency across tasks. The model is designed to recognize virtually any human script, making it highly effective...
    Downloads: 1 This Week
  • 25
    Tencent-Hunyuan-Large

    Open-source large language model family from Tencent Hunyuan

    Tencent-Hunyuan-Large is the flagship open-source large language model family from Tencent Hunyuan, offering both pre-trained and instruct (fine-tuned) variants. It is designed with long-context capabilities, quantization support, and high performance on benchmarks across general reasoning, mathematics, language understanding, and Chinese / multilingual tasks. It aims to provide competitive capability with efficient deployment and inference. FP8 quantization support to reduce memory usage...
    Downloads: 2 This Week