vLLM Files

A high-throughput and memory-efficient inference and serving engine

This is an exact mirror of the vLLM project, hosted at https://github.com/vllm-project/vllm. SourceForge is not affiliated with vLLM.

Name	Modified	Size	Downloads / Week
vllm-0.20.2.tar.gz	2026-05-10	33.5 MB	0
vllm-0.20.2-cp38-abi3-manylinux_2_35_x86_64.whl	2026-05-10	244.4 MB	0
vllm-0.20.2+cpu-cp38-abi3-manylinux_2_35_aarch64.whl	2026-05-10	36.0 MB	0
vllm-0.20.2+cpu-cp38-abi3-manylinux_2_35_x86_64.whl	2026-05-10	75.8 MB	0
vllm-0.20.2+cu129-cp38-abi3-manylinux_2_31_aarch64.whl	2026-05-10	422.2 MB	0
vllm-0.20.2+cu129-cp38-abi3-manylinux_2_31_x86_64.whl	2026-05-10	455.1 MB	0
vllm-0.20.2-cp38-abi3-manylinux_2_35_aarch64.whl	2026-05-10	235.8 MB	0
README.md	2026-05-08	922 Bytes	0
v0.20.2 source code.tar.gz	2026-05-08	33.4 MB	0
v0.20.2 source code.zip	2026-05-08	36.5 MB	0
Totals: 10 Items		1.6 GB	0
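
The wheel names in the listing above follow the standard wheel naming pattern (distribution, version, Python tag, ABI tag, platform tag), with an optional local version label after a "+" such as cu129 or cpu. The following is a minimal sketch for splitting such a name into its parts, for example to pick the file matching a given platform. The WheelInfo type and parse_wheel_filename helper are hypothetical names introduced for illustration, not part of vLLM, and the sketch assumes no optional build tag is present, which holds for the files listed here.

```python
from typing import NamedTuple, Optional


class WheelInfo(NamedTuple):
    """Components of a wheel filename (hypothetical helper type)."""
    name: str
    version: str
    local: Optional[str]  # local version label, e.g. "cu129" or "cpu"; None if absent
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelInfo:
    """Split a wheel filename into its dash-separated components.

    Assumption: the filename has exactly five components, i.e. it does not
    carry the optional build tag that the wheel format also allows.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected number of components in {filename!r}")
    name, full_version, python_tag, abi_tag, platform_tag = parts
    # The local version label, if any, follows a "+" in the version string.
    version, _, local = full_version.partition("+")
    return WheelInfo(name, version, local or None, python_tag, abi_tag, platform_tag)


if __name__ == "__main__":
    info = parse_wheel_filename(
        "vllm-0.20.2+cu129-cp38-abi3-manylinux_2_31_x86_64.whl"
    )
    print(info.version, info.local, info.platform_tag)
```

Filtering the listing on platform_tag and local in this way narrows the seven wheels down to the one matching a given architecture and build variant.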

vLLM v0.20.2

Highlights

This release features 6 commits from 6 contributors (0 new)!

This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL.

Bug Fixes

DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of max_seq_len, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605).
DeepSeek V4 KV cache: Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager (#41282).
gpt-oss MXFP4 + torch.compile: Plumbed hidden_dim_unpadded through the moe_forward fake op so MXFP4 works under torch.compile on v0.20.x (#42002, backport of #41646).
Qwen3-VL: Removed an invalid deepstack boundary check that could fail under heavy load (#40932).

Contributors

@ywang96, @zyongye, @stecasta, @wzhao18, @Isotr0py, @khluu

Source: README.md, updated 2026-05-08
