
vLLM v0.19.0

Highlights

This release features 448 commits from 197 contributors (54 new)!

  • Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires transformers>=5.5.0.
  • Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput (#32951).
  • Model Runner V2 maturation: MRV2 gains piecewise CUDA graphs for pipeline parallelism (#35162), spec decode rejection sampler with greedy/logprobs support (#37238, #37237), multi-modal embeddings for spec decode (#36097), streaming inputs (#37028), and EPLB support (#37488).
  • ViT Full CUDA Graphs: Vision encoders (ViT) now support full CUDA graph capture for reduced overhead (#35963).
  • General CPU KV cache offloading: A simple yet general CPU KV cache offloading mechanism for V1, with pluggable cache policy and block-level preemption handling (#37160, #37874, #34805, #36642, #37853).
  • DBO (Dual-Batch Overlap) generalization: The microbatch optimization (DBO) now works with general models, not just specific architectures (#37926).
  • NVIDIA B300/GB300 (SM 10.3) support: Allreduce fusion enabled by default with tuned all-reduce communicator (#37755, #37756).
  • Transformers v5 compatibility: Broad compatibility fixes across many models for HuggingFace Transformers v5 (#37681, #38127, #38090, #38247, #38410).
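The Gemma 4 bullet above pins a minimum transformers version. A minimal stdlib-only sketch of that version gate (`parse_version` and `supports_gemma4` are illustrative helpers, not vLLM APIs; real code should prefer `packaging.version.Version`):

```python
# Illustrative version gate for the transformers>=5.5.0 requirement.
# parse_version is a simplified stand-in that only handles plain
# "X.Y.Z" strings, which is enough for this sketch.

def parse_version(v: str) -> tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version string into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def supports_gemma4(transformers_version: str) -> bool:
    """Gemma 4 support in this release requires transformers>=5.5.0."""
    return parse_version(transformers_version) >= (5, 5, 0)

print(supports_gemma4("5.5.0"))  # True
print(supports_gemma4("5.4.2"))  # False
```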

Model Support

  • New architectures: Gemma 4 (#38826), Cohere ASR (#35809), Cohere Transcribe (#38120), ColQwen3.5 4.5B (#36887), LFM2-ColBERT-350M (#37528), Granite 4.0 1B Speech (#38019), Qwen3-ForcedAligner (#35367).
  • Speculative decoding: Eagle3 for Pixtral (#37182), EagleMistralLarge3 fix (#37232).
  • LoRA expansion: H2OVL tower/connector LoRA (#31696), --lora-target-modules to restrict LoRA to specific modules (#34984), language_model_only respected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181).
  • Model fixes: NemotronH MTP + Chunked Prefill (#35447), Qwen3-VL video timestamps (#37439), Qwen3.5 GDN quantized models (#37448), Qwen3Next A_log FP32 (#37810), JAIS ALiBi (#37820), RoBERTa CUDA graph position IDs (#37873), AudioFlamingo3/MusicFlamingo (#37643), Music Flamingo loading (#35535), bge-m3 task selection (#37632), Nemotron Parse loading (#37407), GLM OCR patch merger (#37962), PaddleOCR checkpoint compat (#38232), DeepSeek v3.2 params (#33703), MiniMax NVFP4 weight loading (#37214), gated model HF token (#37920), Parakeet OOM on long audio (#36671).
  • Features: Temporal compression for Nemotron-3-VL videos (#36808), NemotronH Puzzle + MTP (#37803), torch.compile for InternVL vision encoder (#38049), multiple embedding types in single call (#35829).
  • Performance: GLM-4.xv ViT optimization (#37779).
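The new `--lora-target-modules` flag (#34984) restricts LoRA to specific modules. A hedged sketch of assembling a server launch with it; the module names and the comma-separated value format are assumptions for illustration, not the documented syntax:

```python
# Hypothetical sketch: restricting LoRA to specific modules via the new
# --lora-target-modules flag (#34984). Whether the flag takes a
# comma-separated list is an assumption; check the flag's help text.

def build_serve_args(model: str, target_modules: list[str]) -> list[str]:
    """Assemble a `vllm serve` argument vector with LoRA restricted
    to the given modules (illustrative only)."""
    return [
        "vllm", "serve", model,
        "--enable-lora",
        "--lora-target-modules", ",".join(target_modules),
    ]

args = build_serve_args("my-org/my-model", ["q_proj", "v_proj"])
print(" ".join(args))
```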

Engine Core

  • Zero-bubble async scheduling + speculative decoding (#32951).
  • Model Runner V2: PP CUDA graphs (#35162), spec decode rejection sampler greedy (#37238) + logprobs (#37237), multimodal embeddings for spec decode (#36097), streaming inputs (#37028), configurable acceptance rate (#38045), FP32 draft logits (#37526), FP64 Gumbel noise (#37798), warmup with spec decode (#37812).
  • ViT Full CUDA Graph capture (#35963).
  • General CPU KV cache offloading with pluggable CachePolicy (#37160, #37874), block-level preemption (#34805), multiple KV groups (#36642), hybrid model support (#37853).
  • DBO for general models: Microbatch optimization generalized beyond specific architectures (#37926).
  • Compilation: Mega AOT artifact for torch 2.12+ (#37198), lazy graph module to defer recompile (#37609), remove model tag requirement for compile cache (#37345), Triton autotuning disk cache enabled by default (#37188), inductor runtime asserts disabled by default (#37485).
  • FlexAttention: Custom mask modification support (#37692).
  • Attention: Distinguish short extends vs decodes (#37303), allow qk_nope_head_dim=192 in FlashInfer MLA (#37475), skip sliding window attention layers with FP8 KV cache (#33695).
  • Scheduling: Schedule requests based on full input sequence length (#37307).
  • Spec decode: Per-draft-model MoE backend via --speculative-config (#37880), Eagle3 drafter quant_config propagation (#37280), Eagle3 norm_before_fc propagation (#38111).
  • Extensibility: PluggableLayer for CustomQwen2Decoder (#37293), tensor IPC transfer for multimodal data (#32104).
  • Performance: Optimize top-k in Triton sampler (#37225), optimize token_embed for pooling models with 1% improvement (#37347), fix slow hasattr in CUDAGraphWrapper (#37425), NFS prefetch auto-enabled with RAM guard (#37673), pybase64 replacement (#37290), optimize swap_states for hybrid models (#34733).
  • Bugfixes: Fix gibberish from FP8 MLA KV scale inconsistency (#37054), Mamba state corruption (#37728), deadlock with pause/resume (#37024), FlashInfer MNNVL socket collisions (#36674), multimodal prefix cache key collisions (#36708), DP coordinator ZMQ TOCTOU (#37452), CUDA graph memory double-counting (#37426), pooling non-determinism (#37775), AllReduce Fusion shutdown crash (#36955), FlashInfer allreduce workspace (#37461), async spec decoding with hybrid models (#38556), MLA sparse indexer prefill chunking (#36178), KV offloading + MLA (#37536), async scheduling extra CUDA context (#37449), DP MTP dummy run (#35243), offloading+prefetch for GLM-4.7-FP8 (#37178), max memory for multiple KV-cache groups (#36030).
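The per-draft-model MoE backend is selected through `--speculative-config` (#37880), which takes a JSON value on the command line. A hedged sketch of building that value; `"moe_backend"` and the backend name are assumptions based on the release note, not a documented schema:

```python
import json

# Hedged sketch of passing a per-draft-model MoE backend through
# --speculative-config (#37880). The "moe_backend" key and its value
# are assumptions for illustration; "model" and
# "num_speculative_tokens" follow common speculative-config usage.

spec_config = {
    "model": "my-org/draft-model",   # illustrative draft model
    "num_speculative_tokens": 4,
    "moe_backend": "triton",         # backend chosen for the draft only
}

# Serialized and passed on the command line, e.g.:
#   vllm serve my-org/target-model --speculative-config '<json>'
flag_value = json.dumps(spec_config)
print(flag_value)
```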

Hardware & Performance

  • NVIDIA:
      • B300/GB300 (SM 10.3): Allreduce fusion enabled by default (#37755), tuned all-reduce communicator (#37756).
      • Blackwell: Optimized SM120 CUTLASS blockwise FP8 GEMM (#37970), fix NVFP4 NaN on desktop Blackwell (#37725), fix DeepGEMM E8M0 accuracy for Qwen3.5 FP8 (#38083), restore FP8 FlashMLA CUDA graph persistent buffers (#35175), DGX Spark fix (#38126).
      • FlashInfer sparse MLA as default for FP8 KV cache (#37252).
      • Tuned prefill configs for FP8 FA3 (#36265), tuned Triton MoE config for Qwen3.5 on H200 with 9.9% E2E improvement (#37340), H800 MoE configs (#31201).
      • GPT-OSS: Router GEMM kernel (#37205), eliminate padding with FlashInfer MXFP4/MXFP8 MoE (#30647), reduce redundant SparseMatrix creation (#37683).
      • NVFP4 CUTLASS MoE non-gated support (#37320), fuse pack topk in TRTLLM MoE via torch.compile (#37695).
      • Non-contiguous KV cache in TRTLLM FP8 dequant kernel (#36867), Qwen3 dual stream input projection (#36795).
  • AMD ROCm:
      • ROCm 7.2.1, torch 2.10, triton 3.6 (#38252).
      • DeepEP as all2all backend (#34692).
      • Persistent MLA kernel from AITER (#36574), FP8xFP8 attention in AITER (#36927).
      • AWQ Marlin support (#36505), wvSplitK skinny GEMM for RDNA4/gfx1x (#34709).
      • Nightly Docker image and wheel releases (#37283).
      • Bugfixes: Sleep mode memory leak (#37533), hybrid model stride (#37228), qwen3_next crash (#36795).
  • Intel XPU: MLA model support (#37143), CompressedTensor W4A8 (#37207), auto-detect XPU build platform (#37634).
  • TPU: Async scheduling interface (#36924), Qwen3.5 FP8 weight loading fix (#37348).
  • CPU: Enable tcmalloc by default (#37607), graceful degradation without tcmalloc/libiomp (#37561), 48.9% throughput improvement for pooling models (#38139), OpenMP thread fix for torch.compile (#37538), structured output crash fix (#37706), KV cache block zeroing crash fix (#37550), slot mapping kernel (#37987), W4A16 compressed tensors (#38219).
  • Performance fixes: FP8 DeepGEMM batch invariance (#37718), Triton autotuning for Qwen3.5 (#37338), TRTLLM NVFP4 routing precision (#36725).

Large Scale Serving

  • Disaggregated serving: PD kv_transfer_params for Anthropic Messages (#37535) and Responses API (#37424), Mooncake heterogeneous TP (#36869), Mamba N-1 prefill for P/D (#37310).
  • EPLB: MRV2 support (#37488), improved responsiveness (#36271), EP weight filter fix (#37322).
  • Elastic EP: Fix repeated scale up/down cycles (#37131), fix stateless group port races (#36330).
  • DBO: Generalized to work with all models (#37926).
  • Multi-node: Fix allreduce fusion (#38136).
  • KV connector: Plugin-overridable metadata build (#37336).
  • Constraints: Cap API servers to 1 with Elastic EP (#37466).

Quantization

  • Online MXFP8 quantization for MoE and dense models (#35448).
  • FP8: WoQ kernel abstraction (#32929), Marlin FP8 for compressed tensors fix (#38092).
  • NVFP4: Rescale weight scales to fix BF16 dequant underflow (#34577), fix Marlin NaN/Inf with float16 (#33972).
  • QeRL: Online quantization composed with quantized reloading for RLHF (#38032).
  • CPU: W4A16 compressed tensors (#38219).
  • XPU: CompressedTensor W4A8 (#37207).
  • ROCm: AWQ Marlin support (#36505).
  • MXFP8 + DeepGEMM: Fix crash when both are active (#37358).
  • Removals: Per-tensor-per-channel FP8 removed (#32700), Sparse24 integration and kernels removed (#36799).

API & Frontend

  • New endpoints: /v1/chat/completions/batch for batched chat completions (#38011).
  • Features: Limit thinking tokens (hard limit) (#20859), multiple embedding types in single call (#35829), numpy array embeddings for multimodal (#38119), --lora-target-modules (#34984), -sc shorthand for --speculative-config (#38380).
  • Tool parsing: GigaChat 3.1 parser (#36664), Kimi-K2.5 reasoning/tool parser (#37438), Gemma 4 tool parser (#38847), tools passed to parser constructor (#38029), fix Mistral parser (#37209), fix DeepSeek v3.2 streaming (#36056), fix GLM-4.7 parsing (#37386), fix Hermes streaming (#38168), fix OpenAI tool parser IndexError (#37958), fix Anthropic streaming (#37510).
  • Responses API: Fix crash with tool_choice=required exceeding max_output_tokens (#37258), fix TTFT recording (#37498), fix Anthropic serving template kwargs (#37899).
  • Performance: Offload blocking tokenizer ops to thread pool (#34789).
  • Deprecations & removals: --calculate-kv-scales deprecated (#37201), score task deprecated (#37537), pooling multi-task support deprecated (#37956), reasoning_content message field removed (#37480).
  • Bugfixes: Embed/classify task routing (#37573), Cohere embed task instruction (#38362), renderer workers restricted to 1 with MM cache (#38418).
  • UX: Log once per node by default (#37568), torch profiler with stack enabled (#37571).
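The new `/v1/chat/completions/batch` endpoint (#38011) accepts multiple chat completions in one call. A hedged sketch of a request body; the top-level `"requests"` list is an assumed shape, not the documented schema, with each entry reusing familiar OpenAI-style chat fields:

```python
import json

# Hedged sketch of a request body for /v1/chat/completions/batch
# (#38011). The top-level "requests" wrapper is an assumption; each
# entry mirrors a standard chat-completions body.

batch_body = {
    "requests": [
        {
            "model": "my-org/my-model",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
        {
            "model": "my-org/my-model",
            "messages": [{"role": "user", "content": "Summarize vLLM."}],
        },
    ]
}

payload = json.dumps(batch_body)
print(payload)
```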

Security

  • Add VLLM_MAX_N_SEQUENCES environment variable to enforce sequence limits (#37952).
  • Enforce frame limit in VideoMediaIO to prevent resource exhaustion (#38636).
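A minimal sketch of reading the new `VLLM_MAX_N_SEQUENCES` limit (#37952) the way a launcher script might, assuming an unset variable means no cap; the helper is illustrative, not a vLLM API:

```python
import os

# Illustrative reader for the new VLLM_MAX_N_SEQUENCES env var
# (#37952). Treating an unset variable as "no cap" (None) is an
# assumption about the default behavior.

def max_sequences_limit(env=os.environ):
    """Return the configured sequence cap as an int, or None when unset."""
    raw = env.get("VLLM_MAX_N_SEQUENCES")
    return int(raw) if raw is not None else None

print(max_sequences_limit({"VLLM_MAX_N_SEQUENCES": "256"}))  # 256
print(max_sequences_limit({}))                               # None
```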

Dependencies

  • Transformers v5 compatibility across many models (#37681, #38127, #38247, #38410, #38090).
  • ROCm 7.2.1, torch 2.10, triton 3.6 for ROCm builds (#38252).
  • compressed-tensors bumped to 0.14.0.1 (#36988).
  • Python OpenAI package bumped (#32316).
  • flashinfer-cubin added as default CUDA dependency (#37233).
  • librosa removed from audio dependencies (#37058).

V0 Deprecation

  • Deprecate virtual engine (#37195).
  • Deprecate --disable-frontend-multiprocessing (#37612).
  • Refactor KV cache from list to element (#37487).

New Contributors

  • @aaab8b made their first contribution in [#37533]
  • @aasgaonkar made their first contribution in [#35386]
  • @allgather made their first contribution in [#38410]
  • @avinashsingh77 made their first contribution in [#37100]
  • @b-mu made their first contribution in [#35963]
  • @bongwoobak made their first contribution in [#37424]
  • @brandonpelfrey made their first contribution in [#32104]
  • @ccrhx4 made their first contribution in [#37634]
  • @cdpath made their first contribution in [#37510]
  • @cemigo114 made their first contribution in [#37064]
  • @cnyvfang made their first contribution in [#37439]
  • @DanBlanaru made their first contribution in [#37307]
  • @DorBernsohn made their first contribution in [#37438]
  • @dsingal0 made their first contribution in [#37923]
  • @fxdawnn made their first contribution in [#36038]
  • @grYe99 made their first contribution in [#38074]
  • @guillaumeguy made their first contribution in [#38119]
  • @gxd3 made their first contribution in [#36924]
  • @he-yufeng made their first contribution in [#37301]
  • @javierdejesusda made their first contribution in [#37920]
  • @jetxa made their first contribution in [#37899]
  • @jhsmith409 made their first contribution in [#37448]
  • @jrplatin made their first contribution in [#37348]
  • @kjiang249 made their first contribution in [#37475]
  • @laudney made their first contribution in [#34709]
  • @lcskrishna made their first contribution in [#34692]
  • @li-liwen made their first contribution in [#38108]
  • @Liangyx2 made their first contribution in [#37523]
  • @MatejRojec made their first contribution in [#38011]
  • @Nekofish-L made their first contribution in [#37970]
  • @pjo256 made their first contribution in [#34733]
  • @r266-tech made their first contribution in [#37820]
  • @RobTand made their first contribution in [#37725]
  • @scyyh11 made their first contribution in [#34789]
  • @SherryC41 made their first contribution in [#37519]
  • @shwetha-s-poojary made their first contribution in [#31696]
  • @siewcapital made their first contribution in [#36955]
  • @SKPsanjeevi made their first contribution in [#36574]
  • @thillai-c made their first contribution in [#37231]
  • @tianrengao made their first contribution in [#34389]
  • @tmm77 made their first contribution in [#37694]
  • @utsumi-fj made their first contribution in [#38328]
  • @vineetatiwari27 made their first contribution in [#37998]
  • @Wangbei25 made their first contribution in [#37293]
  • @WindChimeRan made their first contribution in [#35007]
  • @wjhrdy made their first contribution in [#37706]
  • @XLiu-2000 made their first contribution in [#37371]
  • @xueliangyang-oeuler made their first contribution in [#37536]
  • @yanghui1-arch made their first contribution in [#37873]
  • @yassha made their first contribution in [#37369]
  • @yeahdongcn made their first contribution in [#37840]
  • @Young-Leo made their first contribution in [#37565]
  • @ZeldaHuang made their first contribution in [#37425]
  • @zhejiangxiaomai made their first contribution in [#37259]
Source: README.md, updated 2026-04-02