
vLLM v0.19.0

Highlights

This release features 448 commits from 197 contributors (54 new)!

  • Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires transformers>=5.5.0.
  • Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput (#32951).
  • Model Runner V2 maturation: MRV2 gains piecewise CUDA graphs for pipeline parallelism (#35162), spec decode rejection sampler with greedy/logprobs support (#37238, #37237), multi-modal embeddings for spec decode (#36097), streaming inputs (#37028), and EPLB support (#37488).
  • ViT Full CUDA Graphs: Vision encoders (ViT) now support full CUDA graph capture for reduced overhead (#35963).
  • General CPU KV cache offloading: A simple yet general CPU KV cache offloading mechanism for V1, with pluggable cache policy and block-level preemption handling (#37160, #37874, #34805, #36642, #37853).
  • DBO (Dual-Batch Overlap) generalization: The microbatch optimization (DBO) now works with general models, not just specific architectures (#37926).
  • NVIDIA B300/GB300 (SM 10.3) support: Allreduce fusion enabled by default with tuned all-reduce communicator (#37755, #37756).
  • Transformers v5 compatibility: Broad compatibility fixes across many models for HuggingFace Transformers v5 (#37681, #38127, #38090, #38247, #38410).
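The Gemma 4 bullet above pins a minimum transformers version. A minimal stdlib-only sketch of that version gate (`parse_version` and `supports_gemma4` are illustrative helpers, not vLLM APIs; real code should prefer `packaging.version.Version`):

```python
# Illustrative version gate for the transformers>=5.5.0 requirement.
# parse_version is a simplified stand-in that only handles plain
# "X.Y.Z" strings, which is enough for this sketch.

def parse_version(v: str) -> tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version string into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def supports_gemma4(transformers_version: str) -> bool:
    """Gemma 4 support in this release requires transformers>=5.5.0."""
    return parse_version(transformers_version) >= (5, 5, 0)

print(supports_gemma4("5.5.0"))  # True
print(supports_gemma4("5.4.2"))  # False
```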

Model Support

  • New architectures: Gemma 4 (#38826), Cohere ASR (#35809), Cohere Transcribe (#38120), ColQwen3.5 4.5B (#36887), LFM2-ColBERT-350M (#37528), Granite 4.0 1B Speech (#38019), Qwen3-ForcedAligner (#35367).
  • Speculative decoding: Eagle3 for Pixtral (#37182), EagleMistralLarge3 fix (#37232).
  • LoRA expansion: H2OVL tower/connector LoRA (#31696), --lora-target-modules to restrict LoRA to specific modules (#34984), language_model_only respected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181).
  • Model fixes: NemotronH MTP + Chunked Prefill (#35447), Qwen3-VL video timestamps (#37439), Qwen3.5 GDN quantized models (#37448), Qwen3Next A_log FP32 (#37810), JAIS ALiBi (#37820), RoBERTa CUDA graph position IDs (#37873), AudioFlamingo3/MusicFlamingo (#37643), Music Flamingo loading (#35535), bge-m3 task selection (#37632), Nemotron Parse loading (#37407), GLM OCR patch merger (#37962), PaddleOCR checkpoint compat (#38232), DeepSeek v3.2 params (#33703), MiniMax NVFP4 weight loading (#37214), gated model HF token (#37920), Parakeet OOM on long audio (#36671).
  • Features: Temporal compression for Nemotron-3-VL videos (#36808), NemotronH Puzzle + MTP (#37803), torch.compile for InternVL vision encoder (#38049), multiple embedding types in single call (#35829).
  • Performance: GLM-4.xv ViT optimization (#37779).
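The new `--lora-target-modules` flag (#34984) restricts LoRA to specific modules. A hedged sketch of assembling a server launch with it; the module names and the comma-separated value format are assumptions for illustration, not the documented syntax:

```python
# Hypothetical sketch: restricting LoRA to specific modules via the new
# --lora-target-modules flag (#34984). Whether the flag takes a
# comma-separated list is an assumption; check the flag's help text.

def build_serve_args(model: str, target_modules: list[str]) -> list[str]:
    """Assemble a `vllm serve` argument vector with LoRA restricted
    to the given modules (illustrative only)."""
    return [
        "vllm", "serve", model,
        "--enable-lora",
        "--lora-target-modules", ",".join(target_modules),
    ]

args = build_serve_args("my-org/my-model", ["q_proj", "v_proj"])
print(" ".join(args))
```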

Engine Core

  • Zero-bubble async scheduling + speculative decoding (#32951).
  • Model Runner V2: PP CUDA graphs (#35162), spec decode rejection sampler greedy (#37238) + logprobs (#37237), multimodal embeddings for spec decode (#36097), streaming inputs (#37028), configurable acceptance rate (#38045), FP32 draft logits (#37526), FP64 Gumbel noise (#37798), warmup with spec decode (#37812).
  • ViT Full CUDA Graph capture (#35963).
  • General CPU KV cache offloading with pluggable CachePolicy (#37160, #37874), block-level preemption (#34805), multiple KV groups (#36642), hybrid model support (#37853).
  • DBO for general models: Microbatch optimization generalized beyond specific architectures (#37926).
  • Compilation: Mega AOT artifact for torch 2.12+ (#37198), lazy graph module to defer recompile (#37609), remove model tag requirement for compile cache (#37345), Triton autotuning disk cache enabled by default (#37188), inductor runtime asserts disabled by default (#37485).
  • FlexAttention: Custom mask modification support (#37692).
  • Attention: Distinguish short extends vs decodes (#37303), allow qk_nope_head_dim=192 in FlashInfer MLA (#37475), skip sliding window attention layers with FP8 KV cache (#33695).
  • Scheduling: Schedule requests based on full input sequence length (#37307).
  • Spec decode: Per-draft-model MoE backend via --speculative-config (#37880), Eagle3 drafter quant_config propagation (#37280), Eagle3 norm_before_fc propagation (#38111).
  • Extensibility: PluggableLayer for CustomQwen2Decoder (#37293), tensor IPC transfer for multimodal data (#32104).
  • Performance: Optimize top-k in Triton sampler (#37225), optimize token_embed for pooling models with 1% improvement (#37347), fix slow hasattr in CUDAGraphWrapper (#37425), NFS prefetch auto-enabled with RAM guard (#37673), pybase64 replacement (#37290), optimize swap_states for hybrid models (#34733).
  • Bugfixes: Fix gibberish from FP8 MLA KV scale inconsistency (#37054), Mamba state corruption (#37728), deadlock with pause/resume (#37024), FlashInfer MNNVL socket collisions (#36674), multimodal prefix cache key collisions (#36708), DP coordinator ZMQ TOCTOU (#37452), CUDA graph memory double-counting (#37426), pooling non-determinism (#37775), AllReduce Fusion shutdown crash (#36955), FlashInfer allreduce workspace (#37461), async spec decoding with hybrid models (#38556), MLA sparse indexer prefill chunking (#36178), KV offloading + MLA (#37536), async scheduling extra CUDA context (#37449), DP MTP dummy run (#35243), offloading+prefetch for GLM-4.7-FP8 (#37178), max memory for multiple KV-cache groups (#36030).
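The per-draft-model MoE backend is selected through `--speculative-config` (#37880), which takes a JSON value on the command line. A hedged sketch of building that value; `"moe_backend"` and the backend name are assumptions based on the release note, not a documented schema:

```python
import json

# Hedged sketch of passing a per-draft-model MoE backend through
# --speculative-config (#37880). The "moe_backend" key and its value
# are assumptions for illustration; "model" and
# "num_speculative_tokens" follow common speculative-config usage.

spec_config = {
    "model": "my-org/draft-model",   # illustrative draft model
    "num_speculative_tokens": 4,
    "moe_backend": "triton",         # backend chosen for the draft only
}

# Serialized and passed on the command line, e.g.:
#   vllm serve my-org/target-model --speculative-config '<json>'
flag_value = json.dumps(spec_config)
print(flag_value)
```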

Hardware & Performance

  • NVIDIA:
      • B300/GB300 (SM 10.3): Allreduce fusion enabled by default (#37755), tuned all-reduce communicator (#37756).
      • Blackwell: Optimized SM120 CUTLASS blockwise FP8 GEMM (#37970), fix NVFP4 NaN on desktop Blackwell (#37725), fix DeepGEMM E8M0 accuracy for Qwen3.5 FP8 (#38083), restore FP8 FlashMLA CUDA graph persistent buffers (#35175), DGX Spark fix (#38126).
      • FlashInfer sparse MLA as default for FP8 KV cache (#37252).
      • Tuned prefill configs for FP8 FA3 (#36265), tuned Triton MoE config for Qwen3.5 on H200 with 9.9% E2E improvement (#37340), H800 MoE configs (#31201).
      • GPT-OSS: Router GEMM kernel (#37205), eliminate padding with FlashInfer MXFP4/MXFP8 MoE (#30647), reduce redundant SparseMatrix creation (#37683).
      • NVFP4 CUTLASS MoE non-gated support (#37320), fuse pack topk in TRTLLM MoE via torch.compile (#37695).
      • Non-contiguous KV cache in TRTLLM FP8 dequant kernel (#36867), Qwen3 dual stream input projection (#36795).
  • AMD ROCm:
      • ROCm 7.2.1, torch 2.10, triton 3.6 (#38252).
      • DeepEP as all2all backend (#34692).
      • Persistent MLA kernel from AITER (#36574), FP8xFP8 attention in AITER (#36927).
      • AWQ Marlin support (#36505), wvSplitK skinny GEMM for RDNA4/gfx1x (#34709).
      • Nightly Docker image and wheel releases (#37283).
      • Bugfixes: Sleep mode memory leak (#37533), hybrid model stride (#37228), qwen3_next crash (#36795).
  • Intel XPU: MLA model support (#37143), CompressedTensor W4A8 (#37207), auto-detect XPU build platform (#37634).
  • TPU: Async scheduling interface (#36924), Qwen3.5 FP8 weight loading fix (#37348).
  • CPU: Enable tcmalloc by default (#37607), graceful degradation without tcmalloc/libiomp (#37561), 48.9% throughput improvement for pooling models (#38139), OpenMP thread fix for torch.compile (#37538), structured output crash fix (#37706), KV cache block zeroing crash fix (#37550), slot mapping kernel (#37987), W4A16 compressed tensors (#38219).
  • Performance fixes: FP8 DeepGEMM batch invariance (#37718), Triton autotuning for Qwen3.5 (#37338), TRTLLM NVFP4 routing precision (#36725).

Large Scale Serving

  • Disaggregated serving: PD kv_transfer_params for Anthropic Messages (#37535) and Responses API (#37424), Mooncake heterogeneous TP (#36869), Mamba N-1 prefill for P/D (#37310).
  • EPLB: MRV2 support (#37488), improved responsiveness (#36271), EP weight filter fix (#37322).
  • Elastic EP: Fix repeated scale up/down cycles (#37131), fix stateless group port races (#36330).
  • DBO: Generalized to work with all models (#37926).
  • Multi-node: Fix allreduce fusion (#38136).
  • KV connector: Plugin-overridable metadata build (#37336).
  • Constraints: Cap API servers to 1 with Elastic EP (#37466).

Quantization

  • Online MXFP8 quantization for MoE and dense models (#35448).
  • FP8: WoQ kernel abstraction (#32929), Marlin FP8 for compressed tensors fix (#38092).
  • NVFP4: Rescale weight scales to fix BF16 dequant underflow (#34577), fix Marlin NaN/Inf with float16 (#33972).
  • QeRL: Online quantization composed with quantized reloading for RLHF (#38032).
  • CPU: W4A16 compressed tensors (#38219).
  • XPU: CompressedTensor W4A8 (#37207).
  • ROCm: AWQ Marlin support (#36505).
  • MXFP8 + DeepGEMM: Fix crash when both are active (#37358).
  • Removals: Per-tensor-per-channel FP8 removed (#32700), Sparse24 integration and kernels removed (#36799).

API & Frontend

  • New endpoints: /v1/chat/completions/batch for batched chat completions (#38011).
  • Features: Limit thinking tokens (hard limit) (#20859), multiple embedding types in single call (#35829), numpy array embeddings for multimodal (#38119), --lora-target-modules (#34984), -sc shorthand for --speculative-config (#38380).
  • Tool parsing: GigaChat 3.1 parser (#36664), Kimi-K2.5 reasoning/tool parser (#37438), Gemma 4 tool parser (#38847), tools passed to parser constructor (#38029), fix Mistral parser (#37209), fix DeepSeek v3.2 streaming (#36056), fix GLM-4.7 parsing (#37386), fix Hermes streaming (#38168), fix OpenAI tool parser IndexError (#37958), fix Anthropic streaming (#37510).
  • Responses API: Fix crash with tool_choice=required exceeding max_output_tokens (#37258), fix TTFT recording (#37498), fix Anthropic serving template kwargs (#37899).
  • Performance: Offload blocking tokenizer ops to thread pool (#34789).
  • Deprecations & removals: --calculate-kv-scales deprecated (#37201), score task deprecated (#37537), pooling multi-task support deprecated (#37956), reasoning_content message field removed (#37480).
  • Bugfixes: Embed/classify task routing (#37573), Cohere embed task instruction (#38362), renderer workers restricted to 1 with MM cache (#38418).
  • UX: Log once per node by default (#37568), torch profiler with stack enabled (#37571).
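The new `/v1/chat/completions/batch` endpoint (#38011) accepts multiple chat completions in one call. A hedged sketch of a request body; the top-level `"requests"` list is an assumed shape, not the documented schema, with each entry reusing familiar OpenAI-style chat fields:

```python
import json

# Hedged sketch of a request body for /v1/chat/completions/batch
# (#38011). The top-level "requests" wrapper is an assumption; each
# entry mirrors a standard chat-completions body.

batch_body = {
    "requests": [
        {
            "model": "my-org/my-model",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
        {
            "model": "my-org/my-model",
            "messages": [{"role": "user", "content": "Summarize vLLM."}],
        },
    ]
}

payload = json.dumps(batch_body)
print(payload)
```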

Security

  • Add VLLM_MAX_N_SEQUENCES environment variable to enforce sequence limits (#37952).
  • Enforce frame limit in VideoMediaIO to prevent resource exhaustion (#38636).
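A minimal sketch of reading the new `VLLM_MAX_N_SEQUENCES` limit (#37952) the way a launcher script might, assuming an unset variable means no cap; the helper is illustrative, not a vLLM API:

```python
import os

# Illustrative reader for the new VLLM_MAX_N_SEQUENCES env var
# (#37952). Treating an unset variable as "no cap" (None) is an
# assumption about the default behavior.

def max_sequences_limit(env=os.environ):
    """Return the configured sequence cap as an int, or None when unset."""
    raw = env.get("VLLM_MAX_N_SEQUENCES")
    return int(raw) if raw is not None else None

print(max_sequences_limit({"VLLM_MAX_N_SEQUENCES": "256"}))  # 256
print(max_sequences_limit({}))                               # None
```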

Dependencies

  • Transformers v5 compatibility across many models (#37681, #38127, #38247, #38410, #38090).
  • ROCm 7.2.1, torch 2.10, triton 3.6 for ROCm builds (#38252).
  • compressed-tensors bumped to 0.14.0.1 (#36988).
  • Python OpenAI package bumped (#32316).
  • flashinfer-cubin added as default CUDA dependency (#37233).
  • librosa removed from audio dependencies (#37058).

V0 Deprecation

  • Deprecate virtual engine (#37195).
  • Deprecate --disable-frontend-multiprocessing (#37612).
  • Refactor KV cache from list to element (#37487).

New Contributors

  • @aaab8b made their first contribution in [#37533]
  • @aasgaonkar made their first contribution in [#35386]
  • @allgather made their first contribution in [#38410]
  • @avinashsingh77 made their first contribution in [#37100]
  • @b-mu made their first contribution in [#35963]
  • @bongwoobak made their first contribution in [#37424]
  • @brandonpelfrey made their first contribution in [#32104]
  • @ccrhx4 made their first contribution in [#37634]
  • @cdpath made their first contribution in [#37510]
  • @cemigo114 made their first contribution in [#37064]
  • @cnyvfang made their first contribution in [#37439]
  • @DanBlanaru made their first contribution in [#37307]
  • @DorBernsohn made their first contribution in [#37438]
  • @dsingal0 made their first contribution in [#37923]
  • @fxdawnn made their first contribution in [#36038]
  • @grYe99 made their first contribution in [#38074]
  • @guillaumeguy made their first contribution in [#38119]
  • @gxd3 made their first contribution in [#36924]
  • @he-yufeng made their first contribution in [#37301]
  • @javierdejesusda made their first contribution in [#37920]
  • @jetxa made their first contribution in [#37899]
  • @jhsmith409 made their first contribution in [#37448]
  • @jrplatin made their first contribution in [#37348]
  • @kjiang249 made their first contribution in [#37475]
  • @laudney made their first contribution in [#34709]
  • @lcskrishna made their first contribution in [#34692]
  • @li-liwen made their first contribution in [#38108]
  • @Liangyx2 made their first contribution in [#37523]
  • @MatejRojec made their first contribution in [#38011]
  • @Nekofish-L made their first contribution in [#37970]
  • @pjo256 made their first contribution in [#34733]
  • @r266-tech made their first contribution in [#37820]
  • @RobTand made their first contribution in [#37725]
  • @scyyh11 made their first contribution in [#34789]
  • @SherryC41 made their first contribution in [#37519]
  • @shwetha-s-poojary made their first contribution in [#31696]
  • @siewcapital made their first contribution in [#36955]
  • @SKPsanjeevi made their first contribution in [#36574]
  • @thillai-c made their first contribution in [#37231]
  • @tianrengao made their first contribution in [#34389]
  • @tmm77 made their first contribution in [#37694]
  • @utsumi-fj made their first contribution in [#38328]
  • @vineetatiwari27 made their first contribution in [#37998]
  • @Wangbei25 made their first contribution in [#37293]
  • @WindChimeRan made their first contribution in [#35007]
  • @wjhrdy made their first contribution in [#37706]
  • @XLiu-2000 made their first contribution in [#37371]
  • @xueliangyang-oeuler made their first contribution in [#37536]
  • @yanghui1-arch made their first contribution in [#37873]
  • @yassha made their first contribution in [#37369]
  • @yeahdongcn made their first contribution in [#37840]
  • @Young-Leo made their first contribution in [#37565]
  • @ZeldaHuang made their first contribution in [#37425]
  • @zhejiangxiaomai made their first contribution in [#37259]
Source: README.md, updated 2026-04-02