| Name | Modified | Size |
|---|---|---|
| vllm-0.11.0+cu129-cp38-abi3-manylinux1_x86_64.whl | 2025-10-04 | 433.0 MB |
| vllm-0.11.0-cp38-abi3-manylinux1_x86_64.whl | 2025-10-04 | 438.2 MB |
| vllm-0.11.0-cp38-abi3-manylinux2014_aarch64.whl | 2025-10-04 | 401.0 MB |
| vllm-0.11.0.tar.gz | 2025-10-04 | 10.8 MB |
| README.md | 2025-10-03 | 78.7 kB |
| v0.11.0 source code.tar.gz | 2025-10-03 | 10.7 MB |
| v0.11.0 source code.zip | 2025-10-03 | 12.7 MB |
Highlights
This release features 538 commits from 207 contributors (65 of them new)!
- This release completes the removal of the V0 engine. All V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
- This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support PIECEWISE mode; a configuration sketch follows below.
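The new CUDA graph default can be overridden per deployment. A minimal sketch, assuming the `compilation_config` engine argument and its `cudagraph_mode` field (present in recent vLLM releases) accept a plain dict with the mode name as a string; the model ID is illustrative:

```python
from vllm import LLM, SamplingParams

# FULL_AND_PIECEWISE is now the default CUDA graph mode. Models that only
# work with piecewise capture can opt back into the previous behavior.
# Passing the mode as a string inside a dict is an assumption; check the
# CompilationConfig docs for the exact accepted forms.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```

The same override should be expressible on the CLI via `--compilation-config`, though the exact JSON shape is best confirmed against the engine-arguments documentation.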
Model Support
- New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
- Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
- Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
- Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
- Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
- Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
- Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
- Reasoning: SeedOSS reason parser (#24263).
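The newly supported architectures are used through the same offline API as before. A short sketch, where the checkpoint name (assumed to be the public Qwen3-Next instruct release) and the parallelism setting are illustrative rather than recommendations:

```python
from vllm import LLM, SamplingParams

# Qwen3-Next (#24526) is one of the architectures added in this release.
# Adjust the checkpoint and tensor_parallel_size to match your hardware.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
)
outputs = llm.generate(
    ["Summarize the vLLM v0.11.0 release in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```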
Engine Core
- KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
- V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465, illustrated after this list).
- Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
- Async scheduling: Uniprocessor executor support (#24219).
- Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
- Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
- Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
- LoRA: Optimized weight loading (#25403).
- Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
- torch.compile: CUDA graph Inductor partition integration (#24281).
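`LLM.apply_model` (#18465), now available on V1 and listed above, runs a user callable against the underlying `torch.nn.Module`, which is convenient for quick inspection. A small sketch; the return convention (a single value vs. one value per worker) is worth confirming against the API docs:

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")

def count_params(model) -> int:
    # The callable receives the loaded torch.nn.Module.
    return sum(p.numel() for p in model.parameters())

print(llm.apply_model(count_params))
```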
Hardware & Performance
- NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
- DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
- New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
- AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
- Intel XPU: MoE DP accuracy fix (#25465).
Large Scale Serving & Performance
- Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
- Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
- EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
- Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
- MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
- Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).
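For the torchrun data-parallel launcher (#24899), the existing external-launcher pattern is the likely entry point: every torchrun rank builds the same `LLM`, and vLLM derives its placement from the torchrun environment. The sketch below is an assumption-heavy illustration; in particular, passing `data_parallel_size` directly as an engine argument and combining it with `distributed_executor_backend="external_launcher"` should be checked against the torchrun example in the repository. Launch with something like `torchrun --nproc-per-node=4 dp_script.py`:

```python
# dp_script.py, run under torchrun (hypothetical file name).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,                      # 2 GPUs per replica
    data_parallel_size=2,                        # assumption: 2 DP replicas
    distributed_executor_backend="external_launcher",
)
outputs = llm.generate(
    ["Hello from a data-parallel replica"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```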
Quantization
- FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
- FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
- W4A8: Faster preprocessing (#23972).
- Compressed tensors: Blocked FP8 for MoE (#25219).
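The FP8 items above typically surface through two long-standing engine arguments, on-the-fly FP8 weight quantization and an FP8 KV cache; a brief sketch (the model ID is illustrative, and FP8 requires a GPU with hardware FP8 support):

```python
from vllm import LLM, SamplingParams

# Dynamic FP8 weight quantization plus an FP8 KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
out = llm.generate(["FP8 smoke test"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```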
API & Frontend
- OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897); a client example follows this list.
- Multimodal: Media UUID caching (#23950), image path format (#25081).
- Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
- CLI: --enable-logging (#25610), improved --help (#24903).
- Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
- Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
- UX: Removed misleading quantization warning (#25012).
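A client-side sketch of the expanded logprobs support, using the OpenAI Python client against a local `vllm serve` instance; per #25031, `logprobs=-1` requests logprobs over the full vocabulary, but the exact server-side semantics and limits are worth confirming in the docs (model name and port are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Completions-style request; -1 is the new "full vocabulary" setting (#25031).
resp = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The capital of France is",
    max_tokens=1,
    logprobs=-1,
)
print(resp.choices[0].logprobs)
```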
Security
Dependencies
- PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
- Build requirements: C++17 now enforced globally (#24823).
- TPU: Deprecated `xm.mark_step` in favor of `torch_xla.sync` (#25254); a short migration sketch follows this list.
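For TPU integration code, the deprecation above is a one-line rename; a sketch assuming the `torch_xla.sync()` entry point that recent `torch_xla` releases provide as the replacement for `xm.mark_step()`:

```python
import torch_xla

# Before (deprecated): torch_xla.core.xla_model.mark_step()
# After:
torch_xla.sync()
```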
V0 Deprecation
- Engines: AsyncLLMEngine (#25025), LLMEngine (#25033), MQLLMEngine (#25019), core (#25321), model runner (#25328), MP executor (#25329).
- Components: Attention backends (#25351), encoder-decoder (#24907), output processor (#25320), sampling metadata (#25345), Sequence/Sampler (#25332).
- Interfaces: LoRA (#25686), async output processor (#25334), MultiModalPlaceholderMap (#25366), seq group methods (#25330), placeholder attention (#25510), input embeddings (#25242), multimodal registry (#25362), max_seq_len_to_capture (#25543), attention classes (#25541), hybrid models (#25400), backend suffixes (#25489), compilation fallbacks (#25675), default args (#25409).
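Code that built on the removed V0 classes has two paths forward: the high-level `LLM` API and the `vllm serve` entrypoint are unchanged, and direct async use goes through the V1 `AsyncLLM`. The import path and `from_engine_args` constructor below mirror how vLLM's own entrypoints build the V1 engine, but treat them as assumptions and check the current docs before migrating:

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM  # assumed V1 replacement for AsyncLLMEngine


async def main() -> None:
    engine = AsyncLLM.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    final = None
    # The async generator yields cumulative outputs, as the old AsyncLLMEngine did.
    async for output in engine.generate(
        "Hello, my name is",
        SamplingParams(max_tokens=16),
        request_id="demo-0",
    ):
        final = output
    print(final.outputs[0].text)


asyncio.run(main())
```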
What's Changed
- [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 by @jeejeelee in https://github.com/vllm-project/vllm/pull/24707
- [DOCs] Update ROCm installation docs section by @gshtras in https://github.com/vllm-project/vllm/pull/24691
- Enable conversion of multimodal models to pooling tasks by @maxdebayser in https://github.com/vllm-project/vllm/pull/24451
- Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24686
- [Bugfix] Fix MRoPE dispatch on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24712
- [BugFix] Fix Qwen3-Next PP by @njhill in https://github.com/vllm-project/vllm/pull/24709
- [CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order by @heheda12345 in https://github.com/vllm-project/vllm/pull/24640
- [CI] Add ci_envs for convenient local testing by @noooop in https://github.com/vllm-project/vllm/pull/24630
- [CI/Build] Skip prompt embeddings tests on V1-only CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24721
- [Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call by @heheda12345 in https://github.com/vllm-project/vllm/pull/24717
- [Bugfix] Fix BNB name match by @jeejeelee in https://github.com/vllm-project/vllm/pull/24735
- [Kernel] [CPU] refactor `cpu_attn.py:_run_sdpa_forward` for better memory access by @ignaciosica in https://github.com/vllm-project/vllm/pull/24701
- [sleep mode] save memory for on-the-fly quantization by @youkaichao in https://github.com/vllm-project/vllm/pull/24731
- [Multi Modal] Add FA3 in VIT by @wwl2755 in https://github.com/vllm-project/vllm/pull/24347
- [Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec by @sfeng33 in https://github.com/vllm-project/vllm/pull/24548
- [Doc]: fix typos in various files by @didier-durand in https://github.com/vllm-project/vllm/pull/24726
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24740
- [Bugfix] Fix MRoPE dispatch on XPU by @yma11 in https://github.com/vllm-project/vllm/pull/24724
- [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP by @elvircrn in https://github.com/vllm-project/vllm/pull/24739
- [Core] Shared memory based object store for Multimodal data caching and IPC by @dongluw in https://github.com/vllm-project/vllm/pull/20452
- [Bugfix][Frontend] Fix `--enable-log-outputs` does not match the documentation by @kebe7jun in https://github.com/vllm-project/vllm/pull/24626
- [Models] Optimise and simplify `_validate_and_reshape_mm_tensor` by @lgeiger in https://github.com/vllm-project/vllm/pull/24742
- [Models] Prevent CUDA sync in Qwen2.5-VL by @lgeiger in https://github.com/vllm-project/vllm/pull/24741
- [Model] Switch to Fused RMSNorm in GLM-4.1V model by @SamitHuang in https://github.com/vllm-project/vllm/pull/24733
- [UX] Remove AsyncLLM torch profiler disabled log by @mgoin in https://github.com/vllm-project/vllm/pull/24609
- [CI] Speed up model unit tests in CI by @afeldman-nm in https://github.com/vllm-project/vllm/pull/24253
- [Bugfix] Fix incompatibility between [#20452] and [#24548] by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24754
- [CI] Trigger BC Linter when labels are added/removed by @zhewenl in https://github.com/vllm-project/vllm/pull/24767
- [Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints by @smarterclayton in https://github.com/vllm-project/vllm/pull/23937
- [Compilation Bug] Fix Inductor Graph Output with Shape Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24772
- Invert pattern order to make sure that out_proj layers are identified by @anmarques in https://github.com/vllm-project/vllm/pull/24781
- [Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24705
- Add FLASHINFER_MLA to backend selector test by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24753
- [Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) by @sighingnow in https://github.com/vllm-project/vllm/pull/24667
- [Core] Support async scheduling with uniproc executor by @njhill in https://github.com/vllm-project/vllm/pull/24219
- [Frontend][Multimodal] Allow skipping media data when UUIDs are provided. by @huachenheli in https://github.com/vllm-project/vllm/pull/23950
- [Model] Add Olmo3 model implementation by @2015aroras in https://github.com/vllm-project/vllm/pull/24534
- [Bugfix] Fix GPUModelRunner has no attribute lora_manager by @jeejeelee in https://github.com/vllm-project/vllm/pull/24762
- [Chore] Remove unused batched RoPE op & kernel by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24789
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24791
- [Docs] Remove Neuron install doc as backend no longer exists by @hmellor in https://github.com/vllm-project/vllm/pull/24396
- [Doc]: Remove 404 hyperlinks by @rozeappletree in https://github.com/vllm-project/vllm/pull/24785
- [Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization by @elvischenv in https://github.com/vllm-project/vllm/pull/24757
- [Kernels][DP/EP] Optimize Silu Kernel for R1 by @elvircrn in https://github.com/vllm-project/vllm/pull/24054
- [Core][Multimodal] Cache `supports_kw` by @lgeiger in https://github.com/vllm-project/vllm/pull/24773
- [CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe by @mgoin in https://github.com/vllm-project/vllm/pull/24750
- [Misc] Correct an outdated comment. by @russellb in https://github.com/vllm-project/vllm/pull/24765
- [Doc]: fix typos in various files by @didier-durand in https://github.com/vllm-project/vllm/pull/24798
- [CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again by @wwl2755 in https://github.com/vllm-project/vllm/pull/24771
- Remove redundant assignment in xfer_buffers, This is a little fix by @ChenTaoyu-SJTU in https://github.com/vllm-project/vllm/pull/24732
- [Minor] Simplify duplicative device check for cuda by @ziliangpeng in https://github.com/vllm-project/vllm/pull/24793
- [Chore] Minor simplification for non-PP path by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24810
- [Multi Modal][Performance] Fused Q,K's apply_rope into one by @wwl2755 in https://github.com/vllm-project/vllm/pull/24511
- [Misc] Improve `s3_utils` type hints with `BaseClient` by @Zerohertz in https://github.com/vllm-project/vllm/pull/24825
- [Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement by @yewentao256 in https://github.com/vllm-project/vllm/pull/24783
- fix type of sampling rate for encode_base64 by @co63oc in https://github.com/vllm-project/vllm/pull/24826
- [Benchmarks] Throw usage error when using dataset-name random and dataset-path together by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24819
- Force use C++17 globally to avoid compilation error by @chenfengjin in https://github.com/vllm-project/vllm/pull/24823
- [Chore] Remove ipex_ops warning by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/24835
- [Spec Decoding]Support Spec Decoding Metrics in DP Mode by @wuhang2014 in https://github.com/vllm-project/vllm/pull/24049
- [Hybrid Allocator] Support Pipeline Parallel by @heheda12345 in https://github.com/vllm-project/vllm/pull/23974
- [Docs] Have a try to improve frameworks/streamlit.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24841
- [kv cache] update num_free_blocks in the end by @andyxning in https://github.com/vllm-project/vllm/pull/24228
- [Frontend] Skip `stop` in reasoning content by @gaocegege in https://github.com/vllm-project/vllm/pull/14550
- [Bugfix] MiDashengLM model contact error under concurrent testing by @bingchen-mi in https://github.com/vllm-project/vllm/pull/24738
- [Doc]: fix typos in various files by @didier-durand in https://github.com/vllm-project/vllm/pull/24821
- [Misc] rename interval to max_recent_requests by @andyxning in https://github.com/vllm-project/vllm/pull/24229
- [Misc] Own KVConnectors installation by @NickLucche in https://github.com/vllm-project/vllm/pull/24867
- [P/D] `kv_output_aggregator` support heterogeneous by @LCAIZJ in https://github.com/vllm-project/vllm/pull/23917
- [UT] enhance free kv cache block queue popleft_n by @andyxning in https://github.com/vllm-project/vllm/pull/24220
- [XPU] Set consistent default KV cache layout by @NickLucche in https://github.com/vllm-project/vllm/pull/24745
- [Misc] Fix examples openai_pooling_client.py by @noooop in https://github.com/vllm-project/vllm/pull/24853
- [Model]: support Ling2.0 by @ant-yy in https://github.com/vllm-project/vllm/pull/24627
- [Bugfix] Fix GLM4.1V multimodal processor with compatability for Transformers v4.56 by @Isotr0py in https://github.com/vllm-project/vllm/pull/24822
- Fp8 paged attention update by @xiao-llm in https://github.com/vllm-project/vllm/pull/22222
- Reinstate existing torch script by @hmellor in https://github.com/vllm-project/vllm/pull/24729
- [USAGE] Improve error handling for weight initialization in Unquantized… by @koiker in https://github.com/vllm-project/vllm/pull/20321
- Move `MultiModalConfig` from `config/__init__.py` to `config/multimodal.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24659
- [Transform] Deterministic Hadacore Transforms by @kylesayrs in https://github.com/vllm-project/vllm/pull/24106
- Update num_tokens_across_dp to use nccl instead of gloo by @SageMoore in https://github.com/vllm-project/vllm/pull/24105
- Bump Flashinfer to 0.3.1 by @bbartels in https://github.com/vllm-project/vllm/pull/24868
- [gpt-oss] Add IncompleteDetails to ResponsesRepsonse by @qandrew in https://github.com/vllm-project/vllm/pull/24561
- [gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still by @qandrew in https://github.com/vllm-project/vllm/pull/24759
- [Performance] Remove redundant clone() calls in cutlass_mla by @alexm-redhat in https://github.com/vllm-project/vllm/pull/24891
- [Bug] Fix Cutlass Scaled MM Compilation Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24887
- [ci] fix wheel names for arm wheels by @simon-mo in https://github.com/vllm-project/vllm/pull/24898
- [Tests] fix initialization of kv hash in tests by @mickaelseznec in https://github.com/vllm-project/vllm/pull/24273
- [Compile] Fix noop_elimination pass and add tests for noop_elimination by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24880
- `HuggingFace` -> `Hugging Face` in `Integration with Hugging Face` docs by @sergiopaniego in https://github.com/vllm-project/vllm/pull/24889
- Updated CODEOWNERS for flashinfer, mla, fused_moe by @mgoin in https://github.com/vllm-project/vllm/pull/24906
- [Deprecation] Remove DeepGEMM Old Symbol Wrapper by @yewentao256 in https://github.com/vllm-project/vllm/pull/24902
- [ROCm][Bugfix] Fix the case where there's bias by @gshtras in https://github.com/vllm-project/vllm/pull/24895
- Add pytest-cov and .coveragerc by @rzabarazesh in https://github.com/vllm-project/vllm/pull/24778
- [Bug] Fix `is_flashmla_supported` Check Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24774
- [CI] Small Accuracy Eval Test for Deepseek Model by @yewentao256 in https://github.com/vllm-project/vllm/pull/24259
- [Metrics] Hide deprecated metrics with gpu_ prefix by @markmc in https://github.com/vllm-project/vllm/pull/24245
- [Docs] Update instructions for how to using existing torch binary by @zou3519 in https://github.com/vllm-project/vllm/pull/24892
- Upgrade flashinfer to 0.3.1 by @houseroad in https://github.com/vllm-project/vllm/pull/24470
- [XPU] Fix circular import error. by @jikunshang in https://github.com/vllm-project/vllm/pull/24927
- Remove V0 Encoder-Decoder Support by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24907
- [Bugfix] Fix sequence parallelism bug when enable pipeline parallelism by @cascade812 in https://github.com/vllm-project/vllm/pull/24021
- [Bug] [Spec Dec]: Fix kv_cache dtype mismatch for Eagle3 drafter on FP8 target by @vllmellm in https://github.com/vllm-project/vllm/pull/24505
- [QWEN NEXT] Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24924
- [benchmark] Add triton version in the moe tuned config by @jeejeelee in https://github.com/vllm-project/vllm/pull/24769
- [Bugfix] remove duplicate tokens streamed in required tool choice streaming by @Jason-CKY in https://github.com/vllm-project/vllm/pull/23312
- [Mamba] Support TP>1 with quantization for mamba2 mixer in case `n_groups % tp_size == 0` by @tomeras91 in https://github.com/vllm-project/vllm/pull/24593
- [Feat][EPLB] A novel static EPLB placement strategy for MoE models. by @cboss6 in https://github.com/vllm-project/vllm/pull/23745
- Move `SpeculativeConfig` from `config/__init__.py` to `config/speculative.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24904
- [Docs] move benchmarks README to contributing guides by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24820
- feat: Add Grafana and Perces monitoring dashboards for vLLM by @liangwen12year in https://github.com/vllm-project/vllm/pull/23498
- (doc): set cmake c++ compatible standard when building on MacOS CPU. by @teekenl in https://github.com/vllm-project/vllm/pull/23483
- [CI] Add Decode Context Parallelism (DCP) test to CI by @minosfuture in https://github.com/vllm-project/vllm/pull/24487
- [Model] Clean up and simplify Mamba2 Metadata Usage in both V0 and V1 by @cyang49 in https://github.com/vllm-project/vllm/pull/24331
- [Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing by @lgeiger in https://github.com/vllm-project/vllm/pull/24925
- [Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM by @SageMoore in https://github.com/vllm-project/vllm/pull/23693
- [Bugfix] Fix unable to run encoder model when disable_hybrid_kv_cache_manager is true by @lianyiibo in https://github.com/vllm-project/vllm/pull/24571
- [Misc] Add removed encoder-decoder models to previously supported models list by @Isotr0py in https://github.com/vllm-project/vllm/pull/24961
- Directly get max encoder len from VLLM config in V1 by @Sugar-zsg in https://github.com/vllm-project/vllm/pull/24866
- [gpt-oss][1b] streaming add item id, content id by @qandrew in https://github.com/vllm-project/vllm/pull/24788
- [MISC] Add code owners of vllm/v1 to vllm/v1/core by @heheda12345 in https://github.com/vllm-project/vllm/pull/24928
- [ROCm] Add dependencies for ROCm by @Concurrensee in https://github.com/vllm-project/vllm/pull/24900
- [gpt-oss][1][bugfix] fix streaming final output by @qandrew in https://github.com/vllm-project/vllm/pull/24466
- Use kwargs for long lists of `EngineCoreRequest` arguments in tests and fix extra kwargs by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24987
- fp8 kv cache support fix for torch.compile by @maleksan85 in https://github.com/vllm-project/vllm/pull/22758
- [Perf] Reuse workspace for FP8+FP4 Marlin MoE by @mgoin in https://github.com/vllm-project/vllm/pull/20500
- [CI][Bugfix] Fix failing Blackwell test by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24993
- [CI] GPT-OSS GPQA eval test for Blackwell by @mgoin in https://github.com/vllm-project/vllm/pull/24920
- [FP8] Extend per-token-group quantization support to QuantFP8 by @tahsintunan in https://github.com/vllm-project/vllm/pull/24342
- Removes source compilation of nixl dependency by @bbartels in https://github.com/vllm-project/vllm/pull/24874
- [Doc] Add --force-overwrite option to generate_cmake_presets.py by @elvischenv in https://github.com/vllm-project/vllm/pull/24375
- [Core] Use `CpuGpuBuffer` for block table tensors by @njhill in https://github.com/vllm-project/vllm/pull/24795
- [Benchmarks] Add MMVU video dataset support and clean up deprecated datasets by @Isotr0py in https://github.com/vllm-project/vllm/pull/24719
- [UX] Enforce valid choices for envs like VLLM_ATTENTION_BACKEND, etc by @mgoin in https://github.com/vllm-project/vllm/pull/24761
- [Docs] fix invalid doc link by @yyzxw in https://github.com/vllm-project/vllm/pull/25017
- [UX] Remove "quantization is not fully optimized yet" log by @mgoin in https://github.com/vllm-project/vllm/pull/25012
- [misc] fix typo in value error by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/24995
- [Core] Get num_encoder_tokens from scheduler config by @russellb in https://github.com/vllm-project/vllm/pull/24989
- [V0 Deprecation] Remove MQLLMEngine by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25019
- [Model] Support Qwen3-VL Model Series by @ywang96 in https://github.com/vllm-project/vllm/pull/24727
- [Rocm] [quantization] Fix quark ptpc moe and add test case by @haoyangli-amd in https://github.com/vllm-project/vllm/pull/24649
- Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) by @pliops-daniels in https://github.com/vllm-project/vllm/pull/23255
- [XPU] Fix xpu model runner call torch.cuda APIs by @jikunshang in https://github.com/vllm-project/vllm/pull/25011
- [EPLB] Support EPLB for Mixtral Model by @rouchenzi in https://github.com/vllm-project/vllm/pull/22842
- [Core][MultiModalHasher] Hash images without converting image mode by @lgeiger in https://github.com/vllm-project/vllm/pull/24969
- [Model] Pass param prefix to LLMHead by @whx-sjtu in https://github.com/vllm-project/vllm/pull/24862
- [Model] Apply SharedFusedMoE to glm4_moe. by @whx-sjtu in https://github.com/vllm-project/vllm/pull/24849
- [Core] Remove tokenizer group in vLLM by @zhuohan123 in https://github.com/vllm-project/vllm/pull/24078
- [Docs] Fix griffe warning in base_static_graph.py by @windsonsea in https://github.com/vllm-project/vllm/pull/25018
- [DP] Create placement groups by ray_device_key by @xinyu-intel in https://github.com/vllm-project/vllm/pull/25026
- [Frontend] Support returning all prompt logprobs by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24956
- [BugFix] enable DOTALL to match multi-line tool_call parameters in extract_tool_call_required_streaming by @shijun-yin in https://github.com/vllm-project/vllm/pull/24668
- [Misc] Avoid use of deprecated `AutoModelForVision2Seq` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25065
- Add RADIO Vision Encoder Support to vLLM by @danielafrimi in https://github.com/vllm-project/vllm/pull/24595
- [Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25046
- Apply fixes for CUDA 13 by @Aidyn-A in https://github.com/vllm-project/vllm/pull/24599
- [fix] lora benchmarks pass no_lora_flag_cpu by @dolpm in https://github.com/vllm-project/vllm/pull/23774
- [Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. by @sighingnow in https://github.com/vllm-project/vllm/pull/24957
- [Docs] improve code formatting and comments for eliminate griffe build warning. by @samzong in https://github.com/vllm-project/vllm/pull/25010
- Remove old cutlass mla by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23961
- [Docs] vllm/benchmarks/datasets.py fix docstring param format. by @samzong in https://github.com/vllm-project/vllm/pull/24970
- [CI Bugfix] Fix failing test_invalid_env by @mgoin in https://github.com/vllm-project/vllm/pull/25078
- [V0 Deprecation] Remove V0 Core tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25082
- cleanup: remove adapter commons by @simon-mo in https://github.com/vllm-project/vllm/pull/25045
- Remove unused find_cuda_init helper script by @simon-mo in https://github.com/vllm-project/vllm/pull/25044
- [V0 Deprecation] Remove unused output processor util by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25023
- Change log level from info to debug for IOProcessor by @mgoin in https://github.com/vllm-project/vllm/pull/24999
- [CI] Revert back prepare_prompts and check_answers by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25087
- [V0 Deprecation] Remove V0 tests in test_sequence.py by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25088
- [CI Bugfix] Fix failing test_model_load_with_params tests due to tokenizer refactor by @mgoin in https://github.com/vllm-project/vllm/pull/25086
- [V1] Logits processor docs by @afeldman-nm in https://github.com/vllm-project/vllm/pull/22919
- [Misc] Update owners for KV connector and V1 offloading by @ApostaC in https://github.com/vllm-project/vllm/pull/25041
- [Bugfix] Update import path for bc_linter_include by @mmangkad in https://github.com/vllm-project/vllm/pull/24766
- [BUG] Exclude .pth files when pulling remote files by @ahao-anyscale in https://github.com/vllm-project/vllm/pull/25092
- [Kernel] Faster pre-processing time for W4A8 by @czhu-cohere in https://github.com/vllm-project/vllm/pull/23972
- [gpt-oss][2] fix types for streaming by @qandrew in https://github.com/vllm-project/vllm/pull/24556
- [Bugfix][B200] Fix `cutlass_mla` hang by @alexm-redhat in https://github.com/vllm-project/vllm/pull/24966
- [ROCm][Bugfix] Aiter mha fp8 fix by @dllehr-amd in https://github.com/vllm-project/vllm/pull/24991
- Disable failing GPT-OSS Eval (Blackwell) for now by @mgoin in https://github.com/vllm-project/vllm/pull/25107
- [Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic by @elvischenv in https://github.com/vllm-project/vllm/pull/24600
- Add a batched auto tune script by @karan in https://github.com/vllm-project/vllm/pull/25076
- [Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/24833
- [Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses by @bnellnm in https://github.com/vllm-project/vllm/pull/22537
- [V0 Deprecation] Remove V0 Engine tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25114
- [V0 Deprecation] Remove V0 Tracing & Metrics tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25115
- [V0 Deprecation] Remove misc V0 tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25118
- [V0 Deprecation] Skip PP test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25128
- [Kernels] Enable DeepGEMM by default by @bnellnm in https://github.com/vllm-project/vllm/pull/24462
- [MM Encoder] Apply DP ViT for Qwen3-VL model series by @ywang96 in https://github.com/vllm-project/vllm/pull/24955
- [Docs] Clean up the contributing README by @hmellor in https://github.com/vllm-project/vllm/pull/25099
- [Core][MM] Cleanup `MultiModalCache` by @lgeiger in https://github.com/vllm-project/vllm/pull/25006
- [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models by @toncao in https://github.com/vllm-project/vllm/pull/24960
- [Kernels] Overlap shared experts with combine instead of dispatch by @bnellnm in https://github.com/vllm-project/vllm/pull/24254
- [Model] enable data parallel for InternVL vision encoder by @666even666 in https://github.com/vllm-project/vllm/pull/23909
- Mark prompt logprobs as incompatible with prompt embeds at API level by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25077
- [XPU] Whisper model support on XPU Platform by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/25123
- [EPLB] Add EPLB support for hunyuan_v1 by @666even666 in https://github.com/vllm-project/vllm/pull/23078
- [V0 Deprecation] Remove more V0 tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25117
- [Spec Decode] Efficient padded speculation by @benchislett in https://github.com/vllm-project/vllm/pull/24539
- [benchmark] add peak throughput metrics and plot by @simon-mo in https://github.com/vllm-project/vllm/pull/23867
- [CLI] Use streaming in CLI chat and completion commands by @simon-mo in https://github.com/vllm-project/vllm/pull/23769
- [Kernel] Better inf handling for grouped topk cu by @lumina37 in https://github.com/vllm-project/vllm/pull/24886
- [Docs] Fix API Reference by @hmellor in https://github.com/vllm-project/vllm/pull/25140
- Retrieve `sliding_window` from text config in Gemma3 MM by @hmellor in https://github.com/vllm-project/vllm/pull/25085
- [Bugfix] when use s3 model cannot use default load_format by @lengrongfu in https://github.com/vllm-project/vllm/pull/24435
- [Qwen] Add fp8 checkpoint support for qwen3-next. by @sighingnow in https://github.com/vllm-project/vllm/pull/25079
- Add 'path' option to ImagePrompt data_format by @gfinol in https://github.com/vllm-project/vllm/pull/25081
- [Doc] Fix cross-reference warnings by @punitvara in https://github.com/vllm-project/vllm/pull/25058
- [Chore] Cleanup guided namespace, move to structured outputs config by @aarnphm in https://github.com/vllm-project/vllm/pull/22772
- Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains by @ihb2032 in https://github.com/vllm-project/vllm/pull/24951
- silu-v1: Fix EPS not being used during max-reduction by @elvircrn in https://github.com/vllm-project/vllm/pull/25069
- [Frontend] Support setting logprobs to -1 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25031
- [Model] Improve Pooling Model by @jeejeelee in https://github.com/vllm-project/vllm/pull/25149
- Move `StructuredOutputsConfig` from `config/__init__.py` to `config/structured_outputs.py` by @hmellor in https://github.com/vllm-project/vllm/pull/25153
- [Docs] Fix pooling-params doc references in openai_compatible_server.md by @yankay in https://github.com/vllm-project/vllm/pull/24939
- [Docs] add the parallel sampling usage in LLMEngine and AsyncLLM by @gigit0000 in https://github.com/vllm-project/vllm/pull/24222
- Fix forward reference warning in documentation by @hmellor in https://github.com/vllm-project/vllm/pull/25150
- Fix `validate-config` pre-commit check by @hmellor in https://github.com/vllm-project/vllm/pull/25157
- [Bugfix][Mamba] - Fix Conv State Kernel FP32 Support by @Josephasafg in https://github.com/vllm-project/vllm/pull/24883
- [Misc] Clean up flags in `vllm bench serve` by @ywang96 in https://github.com/vllm-project/vllm/pull/25138
- [Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils by @shen-shanshan in https://github.com/vllm-project/vllm/pull/21999
- Refactor dense FP8 tensor/channel/block utils and add CT FP8 block by @mgoin in https://github.com/vllm-project/vllm/pull/21404
- [Misc] Add kv-connector label by @NickLucche in https://github.com/vllm-project/vllm/pull/25156
- [Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel by @jvlunteren in https://github.com/vllm-project/vllm/pull/21197
- [PERF] Add `conv1d` metadata to GDN attn by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/25105
- feat(api): Return 503 on /health when engine is dead by @dongbo910220 in https://github.com/vllm-project/vllm/pull/24897
- [New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task by @noooop in https://github.com/vllm-project/vllm/pull/24872
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/25163
- Enable Allgather/ReduceScatter backend for NaiveAllToAll by @wenscarl in https://github.com/vllm-project/vllm/pull/23964
- [Misc] Add codeowner for Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/25180
- [spec decode] Fix MTP inference path for MiMo-7B model by @zixi-qi in https://github.com/vllm-project/vllm/pull/25136
- [ROCm][CI/Build] Use ROCm7.0 as the base by @gshtras in https://github.com/vllm-project/vllm/pull/25178
- [ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation by @Rohan138 in https://github.com/vllm-project/vllm/pull/25104
- [KV offload][1/N] Introduce an offloading component by @orozery in https://github.com/vllm-project/vllm/pull/19848
- [V0 Deprecation] Remove AsyncLLMEngine by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25025
- [fix]: remove data type hardcoding from gptoss model implementation by @nikhil-arm in https://github.com/vllm-project/vllm/pull/23807
- [feat]: Create interface for model-specific M-RoPE by @AzizCode92 in https://github.com/vllm-project/vllm/pull/24194
- [Bug] Fix `returned_lse` not Defined issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25106
- [Bug] Fix torch Compilation Cache Hit Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/25093
- [V0 Deprecation] Remove unused async_timeout.py by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25190
- [KV offload][1b/N] rename offloading to kv_offload by @orozery in https://github.com/vllm-project/vllm/pull/25191
- [BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25206
- [CORE] Prompt Embeddings Support for v1 Engine by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24278
- [KV offload][2/N] Introduce LRU-based CPU offloading management by @orozery in https://github.com/vllm-project/vllm/pull/20075
- [gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming by @qandrew in https://github.com/vllm-project/vllm/pull/24938
- [Perf] Optimize memory peak during EAGLE model loading. by @candyzone in https://github.com/vllm-project/vllm/pull/24585
- [Misc] Clean up MM profiling warnings by @ywang96 in https://github.com/vllm-project/vllm/pull/25222
- [Docs] Fix griffe warnings in vllm/multimodal by @windsonsea in https://github.com/vllm-project/vllm/pull/25216
- [OOT] Support sync_model_loading for OOT by @xuechendi in https://github.com/vllm-project/vllm/pull/25126
- [Build] Update Xgrammar to 0.1.24 to get a CVE fix by @russellb in https://github.com/vllm-project/vllm/pull/25188
- [CPU] Disable oneDNN linear on non-x86 platforms by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25166
- [Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25137
- [Misc] Cleanup test conftest for deprecated encoder-decoder models by @Isotr0py in https://github.com/vllm-project/vllm/pull/25231
- [bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B by @yma11 in https://github.com/vllm-project/vllm/pull/25146
- [Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE by @Isotr0py in https://github.com/vllm-project/vllm/pull/25055
- [Bugfix][Perf] Misc fixes for Qwen3 VL by @ywang96 in https://github.com/vllm-project/vllm/pull/25238
- Move `PoolerConfig` from `config/__init__.py` to `config/pooler.py` by @hmellor in https://github.com/vllm-project/vllm/pull/25181
- [P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy by @NickLucche in https://github.com/vllm-project/vllm/pull/22188
- [V0 Deprecation] Remove V0 logic from `get_input_embeddings` interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25242
- [Qwen] Remove cuda hard-code in qwen3 next by @wxsIcey in https://github.com/vllm-project/vllm/pull/25243
- Update CODEOWNERS by @hmellor in https://github.com/vllm-project/vllm/pull/25269
- Move `ModelConfig` from `config/__init__.py` to `config/model.py` by @hmellor in https://github.com/vllm-project/vllm/pull/25252
- refactor(benchmarks): add type annotations to wait_for_endpoint parameters by @samzong in https://github.com/vllm-project/vllm/pull/25218
- [KV offload][3/N] Add worker-side CPU support by @orozery in https://github.com/vllm-project/vllm/pull/21448
- [Frontend] Pass API server count to each process by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23717
- [Core] Modify the initialization parameters of the lora manager by @jeejeelee in https://github.com/vllm-project/vllm/pull/25249
- Remove Redundant Assignment in Qwen3_VisionPatchMerger by @LJH-LBJ in https://github.com/vllm-project/vllm/pull/25224
- Encoder model support for the Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/25174
- [CI/Build] fix test function_calling by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25072
- [Core][Prefix Hash] Fix prefix hash metrics sliding window maintainance by @Jialin in https://github.com/vllm-project/vllm/pull/24990
- [Docs] add init.py to vllm/model_executor/layers/quantization/compressed_tensors/transform by @samzong in https://github.com/vllm-project/vllm/pull/24974
- [bugfix] fix structured outputs key missing issue from [#24929] by @luccafong in https://github.com/vllm-project/vllm/pull/25195
- [KV offload][4/N] Offloading KV connector by @orozery in https://github.com/vllm-project/vllm/pull/22595
- Optimize triton unified attention performance for sliding window attention by @zixi-qi in https://github.com/vllm-project/vllm/pull/24390
- [Bugfix] GPT OSS Attritbute error on H100 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/25228
- [Bugfix] Fix chunked a2_scales in modular kernels by @bnellnm in https://github.com/vllm-project/vllm/pull/25264
- Specify platform in `pip-compile` `pre-commit` hook so it runs on MacOS by @hmellor in https://github.com/vllm-project/vllm/pull/25273
- [Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available by @mgoin in https://github.com/vllm-project/vllm/pull/21126
- [BugFix] Make FlashInferMetadataBuilder non-blocking by @nvjullin in https://github.com/vllm-project/vllm/pull/25040
- Fix: Correct FusedMoE layer reference in auto_round quantization by @David-Wen2025 in https://github.com/vllm-project/vllm/pull/24818
- [Frontend] Responses API messages out, just harmony for now by @alecsolder in https://github.com/vllm-project/vllm/pull/24985
- [Compile] Fix Compile Warning for Ignoring `MIN_BLOCK_PER_SM` by @yewentao256 in https://github.com/vllm-project/vllm/pull/25193
- Enable modelopt gemma3 nvfp4/fp8, make workflow more robust by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/22771
- allow disable flashinfer prefill by @luccafong in https://github.com/vllm-project/vllm/pull/25276
- [BugFix] Fix async scheduling CPU tensor race take 2 by @njhill in https://github.com/vllm-project/vllm/pull/25279
- [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE [#2969] by @Lucaskabela in https://github.com/vllm-project/vllm/pull/25090
- Don't skip special tokens with hermes-style tool calling by @maxdebayser in https://github.com/vllm-project/vllm/pull/25281
- test: Remove vestigial skip for prompt embeds tests after landing v1 Prompt Embeds support by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25291
- [docs] Prompt Embedding feature support by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25288
- [torch.compile] CUDAGraph Inductor partition integration by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/24281
- [BugFix] Ensure appropriate guards in destructors by @njhill in https://github.com/vllm-project/vllm/pull/25284
- [Misc] Support more collective_rpc return types by @njhill in https://github.com/vllm-project/vllm/pull/25294
- Improve weight loading for encoder models in Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/25289
- [BUGFIX] GPTQ quantization compatibility for Qwen3 Next MOE models (AutoGPTQ and AutoRound-GPTQ) by @JartX in https://github.com/vllm-project/vllm/pull/25268
- [BugFix] Exclude self when checking for port collision by @njhill in https://github.com/vllm-project/vllm/pull/25286
- [BUG FIX][NON-CUDA]quick fix to avoid call cudagraph_unsafe in attention by @xuechendi in https://github.com/vllm-project/vllm/pull/25298
- [Bugfix] fix tool call arguments is empty by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25223
- [Optimization] Avoid repeated model architecture conversion for pooling models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25261
- [Hybrid Allocator] Support full attention with different hidden size by @heheda12345 in https://github.com/vllm-project/vllm/pull/25101
- [Bugfix] Fix Qwen3-VL-MoE weight loading for EP by @ywang96 in https://github.com/vllm-project/vllm/pull/25300
- [V1] Support `LLM.apply_model` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/18465
- [CI Failure] Disable FlashInfer RoPE to unblock CI by @mgoin in https://github.com/vllm-project/vllm/pull/25299
- [Docs] Fix warnings in mkdocs build (continued) by @wwl2755 in https://github.com/vllm-project/vllm/pull/25042
- Generate _ModelInfo properties file when loading to improve loading speed by @manoelmarques in https://github.com/vllm-project/vllm/pull/23558
- [Model] Cleanup InternViT's data parallel implementation by @Isotr0py in https://github.com/vllm-project/vllm/pull/25306
- [Core] Enable sharded state loader for V1 engine and enhance test coverage by @lirong-lirong in https://github.com/vllm-project/vllm/pull/25308
- [V0 Deprecation] Enable the remaining multimodal tests in V1 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25307
- [Docs] Fix warnings in vllm/profiler and vllm/transformers_utils by @windsonsea in https://github.com/vllm-project/vllm/pull/25220
- [V0 Deprecation] Remove LLMEngine by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25033
- [V0 Deprecation] Remove V0 Output Processor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25320
- [Chore] Remove unused sampler in models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25324
- [CI] Skip tests failing on main by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25326
- [V0 Deprecation] Remove V0 core by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25321
- [Doc] improve test-pipeline.yaml documentation by @hl475 in https://github.com/vllm-project/vllm/pull/25305
- [V0 Deprecation] Remove V0 model runner base & simplify worker base by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25328
- [Multi Modal][Performance] Fused Q,K's apply_rope in more models by @wwl2755 in https://github.com/vllm-project/vllm/pull/25005
- [V0 Deprecation] Remove from_seq_group methods by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25330
- [V0 Deprecation] Remove V0 MP executor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25329
- [V1] Add sliding window support to Flex Attention backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/24089
- [MM][Perf] Minor Optimization on Qwen3-VL `fast_pos_embed_interpolate` by @ywang96 in https://github.com/vllm-project/vllm/pull/25337
- [Bugfix] Typos in error message for missing model config file by @simondanielsson in https://github.com/vllm-project/vllm/pull/25339
- [Optimization] Cache chat template result when processor fails to be loaded by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25341
- [V0 Deprecation] Remove V0 Sequence class & Sampler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25332
- [V0 Deprecation] Remove async_output_proc, preemption mode, delay factor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25334
- feat: Enable engine-level arguments with speculators models by @rahul-tuli in https://github.com/vllm-project/vllm/pull/25250
- [V0 Deprecation] Remove V0 sampling metadata by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25345
- [Perf] Further optimization for Qwen3-VL `fast_pos_embed_interpolate` by @Isotr0py in https://github.com/vllm-project/vllm/pull/25347
- Remove V0 attention backends by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25351
- [Bugfix][V0 Deprecation][CI] use async mock and await for async method by @KKSK-DON in https://github.com/vllm-project/vllm/pull/25325
- Multimodal - audio tests by @debroy-rh in https://github.com/vllm-project/vllm/pull/25285
- [Model] Support Dots OCR by @ywang96 in https://github.com/vllm-project/vllm/pull/24645
- [Docs] GSM8K Accuracy Evaluation doc update by @david6666666 in https://github.com/vllm-project/vllm/pull/25360
- [Bugfix] Fix hermes tool parser handling of non-string argument types by @david6666666 in https://github.com/vllm-project/vllm/pull/22002
- [V0 Deprecation] Remove V0-only methods in multi-modal registry by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25362
- [V0 Deprecation] Remove `MultiModalPlaceholderMap` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25366
- Enable Eagle3 speculative decoding for GPT-OSS model by @eldarkurtic in https://github.com/vllm-project/vllm/pull/25246
- [TPU][Bugfix][CI] Fix broken tests/build dependency by @NickLucche in https://github.com/vllm-project/vllm/pull/25255
- [TPU] Deprecate `xm.mark_step` in favor of `torch_xla.sync` by @NickLucche in https://github.com/vllm-project/vllm/pull/25254
- refactor: abstract graph mode support into platform interface by @yiz-liu in https://github.com/vllm-project/vllm/pull/25161
- [Misc] Remove unused encoder-decoder error strings by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25374
- Make pickle import check fast by @hmellor in https://github.com/vllm-project/vllm/pull/25379
- Make `mypy` behave like a proper pre-commit hook by @hmellor in https://github.com/vllm-project/vllm/pull/25313
- MI-300X triton moe configs by @Sara-KS in https://github.com/vllm-project/vllm/pull/23445
- [Bugfix] Fix several issues with p2p xPyD in GET type by @Csrayz in https://github.com/vllm-project/vllm/pull/23993
- [V1][Attention] Split triton_attn in triton-only and rocm specific backends by @bringlein in https://github.com/vllm-project/vllm/pull/24648
- [EPLB] Reduce EPLB Inference Overhead by @abmfy in https://github.com/vllm-project/vllm/pull/24573
- [CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables by @Daisy-Ma-coder in https://github.com/vllm-project/vllm/pull/25274
- [Compiler] Disable Inductor standalone compile by default by @ElizaWszola in https://github.com/vllm-project/vllm/pull/25391
- [CI Failure] Fix fp8 kv cache on <SM90 by @mgoin in https://github.com/vllm-project/vllm/pull/25396
- [DP] support torchrun external launcher with Data Parallelism by @luccafong in https://github.com/vllm-project/vllm/pull/24899
- Remove RFC review hours reference by @simon-mo in https://github.com/vllm-project/vllm/pull/25416
- [torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug (#23091), fix test (#24376), and prep for custom op matching (#24604) by @ProExpertProg in https://github.com/vllm-project/vllm/pull/24542
- [KV offload][5/N] Add `CPUOffloadingSpec` by @orozery in https://github.com/vllm-project/vllm/pull/24251
- [CI/Build] Skip Qwen3-VL initialization tests until models are actually released by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25394
- [TPU] update torch_xla dependency for PyPI compatibility by @jcyang43 in https://github.com/vllm-project/vllm/pull/25278
- [Frontend] Responses API MCP tools for built in tools and to pass through headers by @alecsolder in https://github.com/vllm-project/vllm/pull/24628
- [Bugfix] fix custom op test by @ProExpertProg in https://github.com/vllm-project/vllm/pull/25429
- [Core] Drop overly aggressive whisper assertion by @russellb in https://github.com/vllm-project/vllm/pull/25408
- [Bugfix] Fix missing `clear_connector_metadata` by @NickLucche in https://github.com/vllm-project/vllm/pull/25397
- [BugFix] [DP/EP] Fix slow execution when BS <= DP by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/25407
- [Performance] Remove input pads in cutlass_mla and optimize v_proj output handling by @alexm-redhat in https://github.com/vllm-project/vllm/pull/25184
- [Perf] Apply torch.compile for `per_block_cast_to_fp8` by @yewentao256 in https://github.com/vllm-project/vllm/pull/24611
- [V0 deprecation] Remove platform v1 controling interface by @Isotr0py in https://github.com/vllm-project/vllm/pull/25410
- [V0 deprecation] Remove `_set_default_args_v0` function by @Isotr0py in https://github.com/vllm-project/vllm/pull/25409
- [Bug] Fix Long Context OOM Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25290
- [feat] Support MRoPE + YaRN by @JJJYmmm in https://github.com/vllm-project/vllm/pull/25384
- [XPU] Fix `compile_size` is `None` case. by @jikunshang in https://github.com/vllm-project/vllm/pull/25433
- [benchmarks]allow skip ready check for bench serve by @luccafong in https://github.com/vllm-project/vllm/pull/25420
- [Bugfix] Remove contiguous output req for context parallel MLA by @mgoin in https://github.com/vllm-project/vllm/pull/25414
- [Docs] Fix griffe warnings in vllm/lora/ops by @windsonsea in https://github.com/vllm-project/vllm/pull/25369
- [DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/24588
- [NIXL][OOT platform] support nixl_connector with oot platform and other nixl_backend by @xuechendi in https://github.com/vllm-project/vllm/pull/25121
- [Model] Enable DP for ViT in Qwen2-VL by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25445
- Handle triton kernel import exception by @minosfuture in https://github.com/vllm-project/vllm/pull/25319
- [Frontend] Add a new xml-based tool parser for qwen3-coder by @Zhikaiiii in https://github.com/vllm-project/vllm/pull/25028
- [Misc] Move DP for ViT code inside model executor dir by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25459
- [Test]: Hermes tool parser stream output error in Qwen3 case by @ahartel in https://github.com/vllm-project/vllm/pull/25203
- [Bugfix] Fix idefics3 `tie_word_embeddings` by @Isotr0py in https://github.com/vllm-project/vllm/pull/25454
- [Core] Optimize LoRA weight loading by @jeejeelee in https://github.com/vllm-project/vllm/pull/25403
- [docs] Benchmark Serving Incorrect Arg by @vllmellm in https://github.com/vllm-project/vllm/pull/25474
- [CI/Build] Fix disabled v1 attention backend selection test by @Isotr0py in https://github.com/vllm-project/vllm/pull/25471
- [BugFix] Register expert_map as named buffer for wake_up and sleep by @wuxibin89 in https://github.com/vllm-project/vllm/pull/25458
- [P/D] Support NIXL connector to disconnect during a clean shutdown by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24423
- [test/doc] make NixlConnector example more clear by @panpan0000 in https://github.com/vllm-project/vllm/pull/24249
- [XPU] Fix MOE DP accuracy issue on XPU by @faaany in https://github.com/vllm-project/vllm/pull/25465
- [UX] Change kv-cache-memory log level to debug by @mgoin in https://github.com/vllm-project/vllm/pull/25479
- [V1] Remove V0 code paths for Hybrid models by @tdoublep in https://github.com/vllm-project/vllm/pull/25400
- [Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24845
- Add backward compatibility for `GuidedDecodingParams` by @hmellor in https://github.com/vllm-project/vllm/pull/25422
- [Kernels] Support blocked fp8 quantization for compressed tensors MoE by @bnellnm in https://github.com/vllm-project/vllm/pull/25219
- [BugFix] Fix UB in per_token_group_quant.cu by @rivos-shreeasish in https://github.com/vllm-project/vllm/pull/24913
- [Log] Optimize kv cache memory log from Bytes to GiB by @yewentao256 in https://github.com/vllm-project/vllm/pull/25204
- Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu by @minosfuture in https://github.com/vllm-project/vllm/pull/25346
- [V1][Kernel] Add triton implementation for `reshape_and_cache_flash` by @bringlein in https://github.com/vllm-project/vllm/pull/24503
- [Misc] Reduce initialization time of auto_tune by @wdhongtw in https://github.com/vllm-project/vllm/pull/23682
- [Spec Decode][CI] Add e2e test for `examples/spec_decode.py` and prevent breaking Acceptance Length by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24531
- [Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank by @jeejeelee in https://github.com/vllm-project/vllm/pull/25487
- [ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 by @amd-hhashemi in https://github.com/vllm-project/vllm/pull/24988
- [core] add nccl symmetric memory for all reduce by @Amir-19 in https://github.com/vllm-project/vllm/pull/24532
- [Performance] Move apply_w8a8_block_fp8_linear to an op class by @ElizaWszola in https://github.com/vllm-project/vllm/pull/24666
- [Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE by @mgoin in https://github.com/vllm-project/vllm/pull/25444
- [Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue by @jiahanc in https://github.com/vllm-project/vllm/pull/25406
- [Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 by @mgoin in https://github.com/vllm-project/vllm/pull/25508
- Enable symmetric memory all reduce by default only enabling for TP by @ilmarkov in https://github.com/vllm-project/vllm/pull/25070
- [CI] Fix Pre-commit Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25497
- [Bugfix] gpt-oss container tool output bug by @alecsolder in https://github.com/vllm-project/vllm/pull/25485
- [Build] Update Xgrammar to 0.1.25 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25467
- [Bugfix] Fix for the import error from [#24588] by @gshtras in https://github.com/vllm-project/vllm/pull/25481
- [CI/Build] Fix and re-enable v1 PP test on CI by @Isotr0py in https://github.com/vllm-project/vllm/pull/25496
- [Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] by @Jialin in https://github.com/vllm-project/vllm/pull/24830
- [V0 Deprecation] Remove placeholder attn by @tdoublep in https://github.com/vllm-project/vllm/pull/25510
- Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… by @rouchenzi in https://github.com/vllm-project/vllm/pull/25493
- Fix triton_reshape_and_cache_flash.py triton import by @mgoin in https://github.com/vllm-project/vllm/pull/25522
- [gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI by @qandrew in https://github.com/vllm-project/vllm/pull/25428
- Remove redundant mutates_args and dispatch_key for direct_register_custom_op by @mgoin in https://github.com/vllm-project/vllm/pull/25512
- [BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting by @kouroshHakha in https://github.com/vllm-project/vllm/pull/25359
- Add `VLLM_NVTX_SCOPES_FOR_PROFILING=1` to enable `nvtx.annotate` scopes by @coreylowman in https://github.com/vllm-project/vllm/pull/25501 (see the usage sketch after this list)
- [Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for `_chunk_cumsum_fwd_kernel` by @tdoublep in https://github.com/vllm-project/vllm/pull/25197
- [ROCm] Small functional changes for gptoss by @jpvillam-amd in https://github.com/vllm-project/vllm/pull/25201
- [Perf] Increase default max splits for FA3 full cudagraphs by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25495
- [Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 by @alexm-redhat in https://github.com/vllm-project/vllm/pull/25509
- [BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25505
- Improve output when failing json.loads() on structured output test by @dougbtv in https://github.com/vllm-project/vllm/pull/25483
- Add CUTLASS FP8 MOE benchmark scripts and kernel config by @chenxi-yang in https://github.com/vllm-project/vllm/pull/25302
- [Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' by @yewentao256 in https://github.com/vllm-project/vllm/pull/25519
- [BUG] Allows for RunAI Streamer and Torch.compile cache to be used together by @ahao-anyscale in https://github.com/vllm-project/vllm/pull/24922
- [Model] Support SeedOss Reason Parser by @LuYanFCP in https://github.com/vllm-project/vllm/pull/24263
- [V1][Metrics] Add per-request TPOT histogram by @baxingpiaochong in https://github.com/vllm-project/vllm/pull/24015
- [Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen by @benchislett in https://github.com/vllm-project/vllm/pull/25520
- [Core] Support weight_loader_v2 for `UnquantizedLinearMethod` by @kylesayrs in https://github.com/vllm-project/vllm/pull/23036
- [Compile] Fix AMD Compile Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/25518
- [BugFix] Fix MLA assert with CUTLASS MLA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25478
- [fix]: add Arm 4bit fused moe support by @nikhil-arm in https://github.com/vllm-project/vllm/pull/23809
- [KV sharing] Re-land Gemma3n model changes from [#22628] by @sarckk in https://github.com/vllm-project/vllm/pull/24357
- [Spec Decode] Enable FlashInfer Spec Decoding by @benchislett in https://github.com/vllm-project/vllm/pull/25196
- [Perf] Fix jit compiles at runtime of fla gated delta rule by @coreylowman in https://github.com/vllm-project/vllm/pull/25432
- [Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls by @bbrowning in https://github.com/vllm-project/vllm/pull/25514
- [TPU][Bugfix] fix the missing apply_model in tpu worker by @yaochengji in https://github.com/vllm-project/vllm/pull/25526
- [Misc] Retry HF processing if "Already borrowed" error occurs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25535
- [Bugfix][CPU] Skip unsupported custom op register on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25534
- [CI/Build] Fix v1 OOT registration test by @Isotr0py in https://github.com/vllm-project/vllm/pull/25547
- [Misc] Move processing context to multimodal directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25548
- [CI/Build] add nightly prime-rl integration tests by @Jackmin801 in https://github.com/vllm-project/vllm/pull/25207
- [V0 Deprecation] Remove max_seq_len_to_capture by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25543
- [BugFix] Potential Fix for FA3 full-cudagraph IMA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25490
- [misc] update the warning message by @youkaichao in https://github.com/vllm-project/vllm/pull/25566
- [Bugfix] Fix dummy video number of frames calculation by @ywang96 in https://github.com/vllm-project/vllm/pull/25553
- [Bug] fix import and unit test by @jmkuebler in https://github.com/vllm-project/vllm/pull/25558
- [Benchmark] Fix regression in structured output benchmark by @russellb in https://github.com/vllm-project/vllm/pull/25500
- [docs] fix nixl kv_connector_extra_config.backends key by @panpan0000 in https://github.com/vllm-project/vllm/pull/25565
- [Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output by @taohui in https://github.com/vllm-project/vllm/pull/25405
- Move `DeviceConfig`, `ObservabilityConfig`, `SpeechToTextConfig` to their own files by @hmellor in https://github.com/vllm-project/vllm/pull/25564
- [Misc] Improve type annotations for jsontree by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25577
- [ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled by @gshtras in https://github.com/vllm-project/vllm/pull/25275
- [ROCm][Build][Bugfix] Fix ROCm base docker whls installation order by @gshtras in https://github.com/vllm-project/vllm/pull/25415
- Fixes and updates to bench_per_token_quant_fp8 by @mgoin in https://github.com/vllm-project/vllm/pull/25591
- [Bugfix] Cache the model locally when loading it from object storage by @lengrongfu in https://github.com/vllm-project/vllm/pull/24764
- Support mnnvl all2allv from Flashinfer by @wenscarl in https://github.com/vllm-project/vllm/pull/21003
- Suppress benign cuBLAS warning when capturing cudagraphs with DBO by @SageMoore in https://github.com/vllm-project/vllm/pull/25596
- [Docs] Enable `fail_on_warning` for the docs build in CI by @hmellor in https://github.com/vllm-project/vllm/pull/25580
- [V0 Deprecation] Remove unused classes in attention by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25541
- [Logging] Improve log for when DeepEP HT disables CUDA Graphs by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25531
- feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel by @djmmoss in https://github.com/vllm-project/vllm/pull/25503
- [Refactor] Use DeepGEMM Col Major TMA Aligned Tensor by @yewentao256 in https://github.com/vllm-project/vllm/pull/25517
- Improve `--help` for enhanced user experience by @hmellor in https://github.com/vllm-project/vllm/pull/24903
- [MISC] replace c10::optional with std::optional by @842974287 in https://github.com/vllm-project/vllm/pull/25602
- [Model] Improve DotsOCRForCausalLM by @jeejeelee in https://github.com/vllm-project/vllm/pull/25466
- [Kernel] Support DCP for Triton backend by @frank-wei in https://github.com/vllm-project/vllm/pull/25132
- [Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_function` calling disabled super() by @yewentao256 in https://github.com/vllm-project/vllm/pull/25613
- Enable Fbgemm NVFP4 on Dense models by @samanamp in https://github.com/vllm-project/vllm/pull/25609
- [Model] Add LongCat-Flash by @OftenDream in https://github.com/vllm-project/vllm/pull/23991
- optimize: eliminate duplicate split_enc_dec_inputs calls by @nicole-lihui in https://github.com/vllm-project/vllm/pull/25573
- [Bugfix] fix apply_temperature to avoid nan in probs by @courage17340 in https://github.com/vllm-project/vllm/pull/24734
- [Misc] Simplify PoolerOutput and move to `v1/outputs` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25629
- Map CwmForCausalLM to llama and LlamaForCausalLM by @jacobkahn in https://github.com/vllm-project/vllm/pull/25611
- typo: remove duplicate `is` by @nicole-lihui in https://github.com/vllm-project/vllm/pull/25641
- Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25607
- [fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin by @fadara01 in https://github.com/vllm-project/vllm/pull/25579
- [Misc] Fix Qwen3-VL `video_grid_thw` typing by @ywang96 in https://github.com/vllm-project/vllm/pull/25646
- [Bugfix] Add triton.language.tensor placeholder by @adobrzyn in https://github.com/vllm-project/vllm/pull/25649
- [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling by @Isotr0py in https://github.com/vllm-project/vllm/pull/25648
- [mypy] Further improve MM type annotations by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25654
- [Bugfix] Parse SpeculativeConfig Error by @yyzxw in https://github.com/vllm-project/vllm/pull/25142
- [V0 deprecation] Remove unreachable model_config.supported_tasks by @noooop in https://github.com/vllm-project/vllm/pull/25642
- Add backward compatibility for `guided_...` API by @hmellor in https://github.com/vllm-project/vllm/pull/25615
- [CI/Build] Fix flaky entrypoints test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25663
- [XPU][Triton]add xpu config in triton_reshape_and_cache_flash by @jikunshang in https://github.com/vllm-project/vllm/pull/25643
- [Hardware][RISC-V] Add riscv64 support for vLLM with scalar by @langc23 in https://github.com/vllm-project/vllm/pull/22112
- [mypy] Fix wrong type annotations related to tuple by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25660
- [misc] warning by default for hanging / busy / idle by @youkaichao in https://github.com/vllm-project/vllm/pull/25627
- [torch.compile] Make Query Quantization Fusable by @jmkuebler in https://github.com/vllm-project/vllm/pull/24914
- [CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25652
- [ux] Switch a warning to debug about a pytorch fallback by @russellb in https://github.com/vllm-project/vllm/pull/23750
- [Bugfix] Fix InternS1 video processing after Transformers v4.56 by @Isotr0py in https://github.com/vllm-project/vllm/pull/25644
- [Misc] Remove cruft file in repo by @NickLucche in https://github.com/vllm-project/vllm/pull/25678
- [Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25532
- [BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… by @AlonKejzman in https://github.com/vllm-project/vllm/pull/24662
- Revert "[Bug] Dynamo Unsupported due to
BasevLLMParameter.torch_function
calling disabled super()" by @mgoin in https://github.com/vllm-project/vllm/pull/25681 - [BugFix] Fix DBO hang by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25625
- [Model] Add optional parameter to reasoning parser constructor by @taohui in https://github.com/vllm-project/vllm/pull/25554
- [Model] Define `merge_by_field_config` MM interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25676
- [V0 deprecation] Clean up V0 fallback in compilation config by @Isotr0py in https://github.com/vllm-project/vllm/pull/25675
- [V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/25489
- [V0 deprecation] Clean up LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/25686
- [Misc] Simplify `test_argsort_mm_positions` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25690
- [Optimization] Streamline `InputPreprocessor` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25702
- [Optimization] Use a cheaper cache key in `get_model_architecture` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25682
- [Spec Decode] Add Batch Parallel Ngram. Up to 8x lower overhead. by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24986
- [Core] Enable command line logging for LLMEngine by @zhuohan123 in https://github.com/vllm-project/vllm/pull/25610
- [Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 by @tomeras91 in https://github.com/vllm-project/vllm/pull/25708
- Fix routing_bias dtype by @wenscarl in https://github.com/vllm-project/vllm/pull/25711
- [Refactor] Remove DeepGEMM OP Register by @yewentao256 in https://github.com/vllm-project/vllm/pull/25710
- [Misc] Don't log shm dequeue delay warning on worker side by @njhill in https://github.com/vllm-project/vllm/pull/25720
- Llama 3.1 405B fp4 changes upstreaming from 355_wip by @maleksan85 in https://github.com/vllm-project/vllm/pull/25135
- [Core] Force PIECEWISE CUDAGraph mode for encoder-decoder by @russellb in https://github.com/vllm-project/vllm/pull/25701
- [Misc] Remove unnecessary memoryviews in shm_broadcast.py by @njhill in https://github.com/vllm-project/vllm/pull/25721
- EVS Support (Video tokens pruning) by @BloodAxe in https://github.com/vllm-project/vllm/pull/22980
- [CI/Build] fix doc build warning: Failed to get 'name: description' pair by @yitingdc in https://github.com/vllm-project/vllm/pull/25733
- fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid hidden performance regressions by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25738
- perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25739
- [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X by @xaguilar-amd in https://github.com/vllm-project/vllm/pull/25703
- fix: print output in offline_inference/base/chat.py example by @Iceber in https://github.com/vllm-project/vllm/pull/25744
- [Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. by @sighingnow in https://github.com/vllm-project/vllm/pull/25743
- Remove cuda hard-code in compute_causal_conv1d_metadata by @wxsIcey in https://github.com/vllm-project/vllm/pull/25555
- [misc] refactor speculative config by @yyzxw in https://github.com/vllm-project/vllm/pull/25657
- [Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk by @SageMoore in https://github.com/vllm-project/vllm/pull/25698
- Support LongCat-Flash-Chat tool call by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/24083
- [Doc] Update Batch-level DP docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25757
- [Model] Mamba2 varlen and metadata refactor by @cyang49 in https://github.com/vllm-project/vllm/pull/21467
- [CI] Fix test_shared_storage_connector_hashes by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25748
- [Bugfix] Properly abort pooling request. by @noooop in https://github.com/vllm-project/vllm/pull/25734
- [CI/Build] Split up Distributed Tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25572
- [CI/Build] Fix some V1 tests not being run by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25569
- [Quantization] Add field to skip unquantized modules for GPTQ config by @Isotr0py in https://github.com/vllm-project/vllm/pull/25455
- [BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring `dbo_prefill_token_threshold`) by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25622
- [ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray by @eicherseiji in https://github.com/vllm-project/vllm/pull/25439
- [Fix][torch.compile] fix unique_filepath by @ZJY0516 in https://github.com/vllm-project/vllm/pull/25732
- Eagle3 that supports the Minicpm3 model by @LDLINGLINGLING in https://github.com/vllm-project/vllm/pull/24243
- [Doc]: improve CPU(x86) build-wheel-from-source section by @brokedba in https://github.com/vllm-project/vllm/pull/25617
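
As a usage sketch for the NVTX profiling scopes added in #25501: only the environment variable name comes from that changelog entry; the model, prompt, and offline `LLM` API usage below are illustrative assumptions rather than part of the release.

```python
import os

# Assumption: set the flag before importing vLLM so it is picked up when the
# engine initializes. VLLM_NVTX_SCOPES_FOR_PROFILING=1 comes from #25501; the
# model and prompt are placeholders for a quick smoke test.
os.environ["VLLM_NVTX_SCOPES_FOR_PROFILING"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

Running such a script under Nsight Systems (for example `nsys profile -t cuda,nvtx python profile_example.py`, with a script name of your choosing) should surface the additional `nvtx.annotate` scopes in the timeline.
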
New Contributors
- @SamitHuang made their first contribution in https://github.com/vllm-project/vllm/pull/24733
- @rozeappletree made their first contribution in https://github.com/vllm-project/vllm/pull/24785
- @ChenTaoyu-SJTU made their first contribution in https://github.com/vllm-project/vllm/pull/24732
- @ziliangpeng made their first contribution in https://github.com/vllm-project/vllm/pull/24793
- @chenfengjin made their first contribution in https://github.com/vllm-project/vllm/pull/24823
- @LCAIZJ made their first contribution in https://github.com/vllm-project/vllm/pull/23917
- @xiao-llm made their first contribution in https://github.com/vllm-project/vllm/pull/22222
- @koiker made their first contribution in https://github.com/vllm-project/vllm/pull/20321
- @cboss6 made their first contribution in https://github.com/vllm-project/vllm/pull/23745
- @liangwen12year made their first contribution in https://github.com/vllm-project/vllm/pull/23498
- @lianyiibo made their first contribution in https://github.com/vllm-project/vllm/pull/24571
- @tahsintunan made their first contribution in https://github.com/vllm-project/vllm/pull/24342
- @haoyangli-amd made their first contribution in https://github.com/vllm-project/vllm/pull/24649
- @rouchenzi made their first contribution in https://github.com/vllm-project/vllm/pull/22842
- @xinyu-intel made their first contribution in https://github.com/vllm-project/vllm/pull/25026
- @shijun-yin made their first contribution in https://github.com/vllm-project/vllm/pull/24668
- @Aidyn-A made their first contribution in https://github.com/vllm-project/vllm/pull/24599
- @dolpm made their first contribution in https://github.com/vllm-project/vllm/pull/23774
- @samzong made their first contribution in https://github.com/vllm-project/vllm/pull/25010
- @mmangkad made their first contribution in https://github.com/vllm-project/vllm/pull/24766
- @karan made their first contribution in https://github.com/vllm-project/vllm/pull/25076
- @toncao made their first contribution in https://github.com/vllm-project/vllm/pull/24960
- @666even666 made their first contribution in https://github.com/vllm-project/vllm/pull/23909
- @lumina37 made their first contribution in https://github.com/vllm-project/vllm/pull/24886
- @gfinol made their first contribution in https://github.com/vllm-project/vllm/pull/25081
- @punitvara made their first contribution in https://github.com/vllm-project/vllm/pull/25058
- @gigit0000 made their first contribution in https://github.com/vllm-project/vllm/pull/24222
- @Rohan138 made their first contribution in https://github.com/vllm-project/vllm/pull/25104
- @candyzone made their first contribution in https://github.com/vllm-project/vllm/pull/24585
- @wxsIcey made their first contribution in https://github.com/vllm-project/vllm/pull/25243
- @LJH-LBJ made their first contribution in https://github.com/vllm-project/vllm/pull/25224
- @David-Wen2025 made their first contribution in https://github.com/vllm-project/vllm/pull/24818
- @alecsolder made their first contribution in https://github.com/vllm-project/vllm/pull/24985
- @Lucaskabela made their first contribution in https://github.com/vllm-project/vllm/pull/25090
- @manoelmarques made their first contribution in https://github.com/vllm-project/vllm/pull/23558
- @lirong-lirong made their first contribution in https://github.com/vllm-project/vllm/pull/25308
- @debroy-rh made their first contribution in https://github.com/vllm-project/vllm/pull/25285
- @Sara-KS made their first contribution in https://github.com/vllm-project/vllm/pull/23445
- @Daisy-Ma-coder made their first contribution in https://github.com/vllm-project/vllm/pull/25274
- @jcyang43 made their first contribution in https://github.com/vllm-project/vllm/pull/25278
- @Zhikaiiii made their first contribution in https://github.com/vllm-project/vllm/pull/25028
- @ahartel made their first contribution in https://github.com/vllm-project/vllm/pull/25203
- @wuxibin89 made their first contribution in https://github.com/vllm-project/vllm/pull/25458
- @rivos-shreeasish made their first contribution in https://github.com/vllm-project/vllm/pull/24913
- @Amir-19 made their first contribution in https://github.com/vllm-project/vllm/pull/24532
- @LuYanFCP made their first contribution in https://github.com/vllm-project/vllm/pull/24263
- @baxingpiaochong made their first contribution in https://github.com/vllm-project/vllm/pull/24015
- @Jackmin801 made their first contribution in https://github.com/vllm-project/vllm/pull/25207
- @taohui made their first contribution in https://github.com/vllm-project/vllm/pull/25405
- @OftenDream made their first contribution in https://github.com/vllm-project/vllm/pull/23991
- @nicole-lihui made their first contribution in https://github.com/vllm-project/vllm/pull/25573
- @jacobkahn made their first contribution in https://github.com/vllm-project/vllm/pull/25611
- @fadara01 made their first contribution in https://github.com/vllm-project/vllm/pull/25579
- @langc23 made their first contribution in https://github.com/vllm-project/vllm/pull/22112
- @AlonKejzman made their first contribution in https://github.com/vllm-project/vllm/pull/24662
- @BloodAxe made their first contribution in https://github.com/vllm-project/vllm/pull/22980
- @yitingdc made their first contribution in https://github.com/vllm-project/vllm/pull/25733
- @xaguilar-amd made their first contribution in https://github.com/vllm-project/vllm/pull/25703
- @Iceber made their first contribution in https://github.com/vllm-project/vllm/pull/25744
- @LDLINGLINGLING made their first contribution in https://github.com/vllm-project/vllm/pull/24243
- @brokedba made their first contribution in https://github.com/vllm-project/vllm/pull/25617
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.10.2...v0.11.0