vllm free download - SourceForge

43 projects for "vllm" with 1 filter applied:

Filter applied: BSD

1

Nano-vLLM

A lightweight vLLM implementation built from scratch

...Its API closely mirrors that of the original vLLM framework, allowing developers familiar with vLLM to adopt the tool with minimal changes.

Downloads: 0 This Week

Last Update: 2026-04-26
See Project
2

vLLM Semantic Router

System Level Intelligent Router for Mixture-of-Models at Cloud

Semantic Router is an open-source system designed to intelligently route requests across multiple large language models based on the semantic meaning and complexity of user queries. Instead of sending every prompt to the same model, the system analyzes the intent and reasoning requirements of the request and dynamically selects the most appropriate model to process it. This approach allows developers to combine multiple models with different strengths, such as lightweight models for simple...

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
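
The routing idea in the vLLM Semantic Router description above (score a request, then pick the cheapest model likely to handle it) can be illustrated with a toy sketch. This is not the project's implementation: the tier names, the thresholds, and the keyword-based `estimate_complexity` heuristic are all hypothetical stand-ins for whatever semantic analysis the real system performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str              # hypothetical model tier name
    max_complexity: float  # highest score this tier is trusted with


# Assumed tiers, ordered from cheapest to most capable.
ROUTES = [
    Route("small-model", 0.3),
    Route("medium-model", 0.7),
    Route("large-model", 1.0),
]

# Assumed keyword hints that a prompt needs multi-step reasoning.
REASONING_HINTS = ("prove", "derive", "step by step", "debug", "why")


def estimate_complexity(prompt: str) -> float:
    """Crude stand-in for semantic analysis: prompt length plus keyword hints."""
    text = prompt.lower()
    length_score = min(len(text.split()) / 100.0, 1.0) * 0.5
    hint_score = 0.5 if any(hint in text for hint in REASONING_HINTS) else 0.0
    return length_score + hint_score


def select_model(prompt: str) -> str:
    """Return the first (cheapest) tier whose threshold covers the score."""
    score = estimate_complexity(prompt)
    for route in ROUTES:
        if score <= route.max_complexity:
            return route.name
    return ROUTES[-1].name


if __name__ == "__main__":
    for prompt in (
        "What is the capital of France?",
        "Prove step by step that sqrt(2) is irrational",
    ):
        print(select_model(prompt), "<-", prompt)
```

Ordering the tiers from cheapest to most capable means the loop always settles on the least expensive model that clears the threshold, which is the cost-saving behaviour the description alludes to.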
3

DeepSeek-OCR 2

Visual Causal Flow

...The repository provides model code and inference scripts that let researchers and developers run and benchmark the system on both images and PDFs, with support for batch evaluation and optimized pipelines leveraging vLLM and transformers.

Downloads: 8 This Week

Last Update: 2026-02-03
See Project
4

Orpheus TTS

Towards Human-Sounding Speech

...The project ships both pretrained and finetuned English models, as well as a family of multilingual models released as a research preview, and includes data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation, including streaming examples that show how to generate audio chunks in real time. The maintainers provide Colab notebooks, a standardized prompting format, and one-click deployment via Baseten for production-grade, FP8/FP16 optimized inference with ~200 ms streaming latency.

Downloads: 7 This Week

Last Update: 2025-12-05
See Project
5

Harbor LLM

Run a full local LLM stack with one command using Docker

...With a single command, users can start preconfigured tools like Ollama and Open WebUI, enabling chat, workflows, and integrations immediately. Harbor supports multiple inference engines, including llama.cpp and vLLM, and connects them seamlessly to user interfaces. It also includes tools for web retrieval, image generation, voice interaction, and workflow automation. Built on Docker, Harbor allows services to run in isolated containers while communicating over a local network. It is intended for local development and experimentation rather than production deployment, giving developers a flexible way to explore AI systems, test configurations, and manage complex LLM stacks without manual wiring or setup overhead.

Downloads: 2 This Week

Last Update: 2 days ago
See Project
6

GLM-4.5

GLM-4.5: Open-source LLM for intelligent agents by Z.ai

...GLM-4.5 achieves strong performance on 12 industry-standard benchmarks, ranking 3rd overall, while GLM-4.5-Air balances competitive results with greater efficiency. The models support FP8 and BF16 precision, and can handle very large context windows of up to 128K tokens. Flexible inference is supported through frameworks like vLLM and SGLang with tool-call and reasoning parsers included.

1 Review

Downloads: 45 This Week

Last Update: 2026-02-01
See Project
7

GLM-5

From Vibe Coding to Agentic Engineering

GLM-5 is a next-generation open-source large language model (LLM) developed by the Z.ai team under the zai-org organization that pushes the boundaries of reasoning, coding, and long-horizon agentic intelligence. Building on earlier GLM series models, GLM-5 dramatically scales the parameter count (to roughly 744 billion) and expands pre-training data to significantly improve performance on complex tasks such as multi-step reasoning, software engineering workflows, and agent orchestration...

Downloads: 91 This Week

Last Update: 2026-04-17
See Project
8

OuteTTS

Interface for OuteTTS models

...It provides a high-level Interface API that wraps model configuration, speaker handling, and audio generation so you can focus on integrating speech into your application rather than wiring up low-level engines. The project supports multiple backends including llama.cpp (Python bindings and server), Hugging Face Transformers, ExLlamaV2, vLLM, and a JavaScript interface via Transformers.js, allowing it to run on CPUs, NVIDIA CUDA GPUs, AMD ROCm, Vulkan-capable GPUs, and Apple Metal. It also includes a notion of speaker profiles: you can create a speaker from a short audio sample, save it as JSON, and reuse it for consistent voice identity across generations and sessions. ...

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
9

GLM-4.7

Advanced language and coding AI model

GLM-4.7 is an advanced agent-oriented large language model designed as a high-performance coding and reasoning partner. It delivers significant gains over GLM-4.6 in multilingual agentic coding, terminal-based workflows, and real-world developer benchmarks such as SWE-bench and Terminal Bench 2.0. The model introduces stronger “thinking before acting” behavior, improving stability and accuracy in complex agent frameworks like Claude Code, Cline, and Roo Code. GLM-4.7 also advances “vibe...

Downloads: 57 This Week

Last Update: 2 days ago
See Project
10

GLM-OCR

Accurate × Fast × Comprehensive

GLM-OCR is an open-source multimodal optical character recognition (OCR) model built on a GLM-V encoder–decoder foundation that brings robust, accurate document understanding to complex real-world layouts and modalities. Designed to handle text recognition, table parsing, formula extraction, and general information retrieval from documents containing mixed content, GLM-OCR excels across major benchmarks while remaining highly efficient with a relatively compact parameter size (~0.9B),...

Downloads: 5 This Week

Last Update: 2026-04-08
See Project
11

SWIFT LLM

Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs

...The platform provides a full machine learning pipeline that supports tasks ranging from model pre-training to reinforcement learning alignment techniques. It integrates with popular inference engines such as vLLM and LMDeploy to accelerate deployment and runtime performance. The framework also includes support for many modern training strategies, including preference learning methods and parameter-efficient fine-tuning techniques. ms-swift is designed to work with hundreds of language and multimodal models, providing a unified environment for experimentation and production deployment.

Downloads: 0 This Week

Last Update: 2026-04-25
See Project
12

OpenAI Harmony

Renderer for the harmony response format to be used with gpt-oss

...The format is essential for ensuring gpt-oss models operate correctly, as they are trained to rely on this structure for generating and organizing their responses. For users accessing gpt-oss through third-party providers like HuggingFace, Ollama, or vLLM, Harmony formatting is handled automatically, but developers building custom inference setups must implement it directly. With its flexible design, Harmony serves as the foundation for creating more interpretable, controlled, and extensible interactions with open-weight language models.

Downloads: 3 This Week

Last Update: 14 hours ago
See Project
13

Tencent-Hunyuan-Large

Open-source large language model family from Tencent Hunyuan

Tencent-Hunyuan-Large is the flagship open-source large language model family from Tencent Hunyuan, offering both pre-trained and instruct (fine-tuned) variants. It is designed with long-context capabilities, quantization support, and high performance on benchmarks across general reasoning, mathematics, language understanding, and Chinese / multilingual tasks. It aims to provide competitive capability with efficient deployment and inference. FP8 quantization support to reduce memory usage...

Downloads: 1 This Week

Last Update: 2025-09-24
See Project
14

dots.ocr

Multilingual Document Layout Parsing in a Single Vision-Language Model

dots.ocr is a cutting-edge multilingual document parsing system built on a unified vision-language model that combines layout detection, text recognition, and structural understanding into a single architecture. Unlike traditional OCR pipelines that rely on multiple specialized components, dots.ocr integrates these processes end-to-end, reducing error propagation and improving consistency across tasks. The model is designed to recognize virtually any human script, making it highly effective...

Downloads: 0 This Week

Last Update: 2026-03-24
See Project
15

MiniMax-M2.1

MiniMax M2.1, a SOTA model for real-world dev & agents.

MiniMax-M2.1 is an open-source, state-of-the-art agentic language model released to democratize high-performance AI capabilities. It goes beyond a simple parameter upgrade, delivering major gains in coding, tool use, instruction following, and long-horizon planning. The model is designed to be transparent, controllable, and accessible, enabling developers to build autonomous systems without relying on closed platforms. MiniMax-M2.1 excels in real-world software engineering tasks, including...

Downloads: 3 This Week

Last Update: 2026-01-28
See Project
16

MiniCPM4

Ultra-Efficient LLMs on End Device

MiniCPM4 is part of the MiniCPM family of ultra-efficient large language models designed specifically for high performance on edge devices and resource-constrained environments. Unlike traditional large-scale models that require extensive computational resources, MiniCPM4 focuses on delivering competitive reasoning and language capabilities while maintaining significantly lower latency and higher efficiency. It achieves this through optimized architectures, scalable training strategies, and...

Downloads: 0 This Week

Last Update: 2026-04-13
See Project
17

FastDeploy

High-performance Inference and Deployment Toolkit for LLMs and VLMs

...FastDeploy includes advanced acceleration technologies such as speculative decoding, multi-token prediction, and efficient KV cache management to improve throughput and latency during inference. It also offers compatibility with OpenAI-style APIs and vLLM-like interfaces, allowing developers to integrate deployed models easily into existing applications and services.

Downloads: 0 This Week

Last Update: 2026-04-08
See Project
18

LightLLM

LightLLM is a Python-based LLM (Large Language Model) inference

...The framework enables developers to run and serve modern language models with significantly improved speed and resource efficiency compared to many traditional inference systems. Built primarily in Python, the project integrates optimization techniques and ideas from several leading open-source implementations, including FasterTransformer, vLLM, and FlashAttention, to accelerate token generation and reduce latency. LightLLM is designed to handle large-scale model workloads in production environments, supporting efficient batching and GPU utilization for fast inference across multiple requests. Its architecture allows models to be deployed with minimal overhead while maintaining compatibility with popular transformer-based model families such as LLaMA and GPT-style architectures.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
19

tiny-llm

A course of learning LLM inference serving on Apple Silicon

...The project demonstrates how to load and run models such as Qwen-style architectures while progressively implementing performance improvements like KV caching, request batching, and optimized attention mechanisms. It also introduces concepts behind modern LLM serving systems that resemble simplified versions of production inference engines such as vLLM.

Downloads: 0 This Week

Last Update: 2026-04-24
See Project
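
KV caching, one of the optimizations named in the tiny-llm description above, can be shown with a minimal single-head attention sketch in plain Python. This is an illustration only, not code from the course: the `KVCache` class and `decode_step` helper are hypothetical, and for brevity each token vector serves as its own query, key, and value, whereas a real model would derive them through learned projections.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all stored keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical cache: keeps keys and values of all previously seen tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Process one new token: store its key/value, then attend over the cache."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
    cache = KVCache()
    for step, tok in enumerate(tokens):
        cached = decode_step(cache, tok, tok, tok)
        # Without a cache, keys/values for the whole prefix are rebuilt each step.
        recomputed = attend(tok, tokens[: step + 1], tokens[: step + 1])
        assert cached == recomputed
        print(step, cached)
```

The cached path gives the same output as recomputing over the full prefix, but each step only adds one key/value pair instead of rebuilding all of them.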
20

Mooncake

Mooncake is the serving platform for Kimi

Mooncake is an open-source infrastructure platform designed to optimize large language model serving by focusing on efficient management and transfer of model data and KV cache. The platform was originally developed as part of the serving infrastructure for the Kimi large language model system. Its architecture centers on a high-performance transfer engine that provides unified data transfer across different storage and networking technologies. This engine enables efficient movement of...

Downloads: 0 This Week

Last Update: 2026-04-22
See Project
21

IQuest-Coder-V1 Model Family

New family of code large language models (LLMs)

IQuest-Coder-V1 is a cutting-edge family of open-source large language models specifically engineered for code generation, deep code understanding, and autonomous software engineering tasks. These models range from tens of billions to smaller footprints and are trained on a novel code-flow multi-stage paradigm that captures how real software evolves over time — not just static code snapshots — giving them a deeper semantic understanding of programming logic. They support native long contexts...

Downloads: 0 This Week

Last Update: 2026-03-02
See Project
22

MiniMax-M2

MiniMax-M2, a model built for Max coding & agentic workflows

MiniMax-M2 is an open-weight large language model designed specifically for high-end coding and agentic workflows while staying compact and efficient. It uses a Mixture-of-Experts (MoE) architecture with 230 billion total parameters but only 10 billion activated per token, giving it the behavior of a very large model at a fraction of the runtime cost. The model is tuned for end-to-end developer flows such as multi-file edits, compile–run–fix loops, and test-validated repairs across real...

Downloads: 1 This Week

Last Update: 2025-12-01
See Project
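
The MiniMax-M2 figures above (230 billion total parameters, 10 billion activated per token) can be turned into a quick back-of-the-envelope calculation. The two parameter counts come from the listing; the "about 2 FLOPs per active parameter per token" estimate is a common rule of thumb assumed here, not a number published for this model.

```python
TOTAL_PARAMS = 230e9   # from the listing: total parameters
ACTIVE_PARAMS = 10e9   # from the listing: parameters activated per token

# Share of the model's weights that take part in any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS


def approx_flops_per_token(params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter used (assumption)."""
    return 2.0 * params


if __name__ == "__main__":
    print(f"active fraction: {active_fraction:.1%}")
    dense = approx_flops_per_token(TOTAL_PARAMS)
    sparse = approx_flops_per_token(ACTIVE_PARAMS)
    print(f"estimated per-token compute vs. a dense 230B model: {sparse / dense:.1%}")
```

Under these assumptions only about 4% of the weights are exercised per token, which is the basis for the "fraction of the runtime cost" claim. All 230 billion parameters still have to be held in memory.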
23

Sa2VA

Official Repo For "Sa2VA: Marrying SAM2 with LLaVA

Sa2VA is a cutting-edge open-source multi-modal large language model (MLLM) developed by ByteDance that unifies dense segmentation, visual understanding, and language-based reasoning across both images and videos. It merges the segmentation power of a state-of-the-art video segmentation model (based on SAM‑2) with the vision-language reasoning capabilities of a strong LLM backbone (derived from models like InternVL2.5 / Qwen-VL series), yielding a system that can answer questions about...

Downloads: 0 This Week

Last Update: 2025-12-02
See Project
24

gpt-oss-20b

OpenAI’s compact 20B open model for fast, agentic, and local use

...Like its larger sibling (gpt-oss-120b), it offers adjustable reasoning depth and full chain-of-thought visibility for better interpretability. It’s released under a permissive Apache 2.0 license, allowing unrestricted commercial and research use. GPT-OSS-20B is compatible with Transformers, vLLM, Ollama, PyTorch, and other tools. It is ideal for developers building lightweight AI agents or experimenting with fine-tuning on consumer-grade hardware.

Downloads: 0 This Week

Last Update: 2025-08-05
See Project
25

gpt-oss-120b

OpenAI’s open-weight 120B model optimized for reasoning and tooling

...The model supports fine-tuning, chain-of-thought reasoning, and structured outputs, making it ideal for complex workflows. It operates in OpenAI’s Harmony response format and can be deployed via Transformers, vLLM, Ollama, LM Studio, and PyTorch. Developers can control the reasoning level (low, medium, high) to balance speed and depth depending on the task. Released under the Apache 2.0 license, it enables both commercial and research applications. The model supports function calling, web browsing, and code execution, streamlining intelligent agent development.

Downloads: 0 This Week

Last Update: 2025-08-05
See Project