| Name | Modified | Size |
|---|---|---|
| vllm-0.11.0+cu129-cp38-abi3-manylinux1_x86_64.whl | 2025-10-04 | 433.0 MB |
| vllm-0.11.0-cp38-abi3-manylinux1_x86_64.whl | 2025-10-04 | 438.2 MB |
| vllm-0.11.0-cp38-abi3-manylinux2014_aarch64.whl | 2025-10-04 | 401.0 MB |
| vllm-0.11.0.tar.gz | 2025-10-04 | 10.8 MB |
| README.md | 2025-10-03 | 78.7 kB |
| v0.11.0 source code.tar.gz | 2025-10-03 | 10.7 MB |
| v0.11.0 source code.zip | 2025-10-03 | 12.7 MB |
Highlights
This release features 538 commits from 207 contributors (65 of them new)!
- This release completes the removal of the V0 engine. All V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
- This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support PIECEWISE mode; a configuration sketch follows below.
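The new CUDA graph default can be overridden per deployment. A minimal sketch, assuming the `compilation_config` engine argument and its `cudagraph_mode` field (present in recent vLLM releases) accept a plain dict with the mode name as a string; the model ID is illustrative:

```python
from vllm import LLM, SamplingParams

# FULL_AND_PIECEWISE is now the default CUDA graph mode. Models that only
# work with piecewise capture can opt back into the previous behavior.
# Passing the mode as a string inside a dict is an assumption; check the
# CompilationConfig docs for the exact accepted forms.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```

The same override should be expressible on the CLI via `--compilation-config`, though the exact JSON shape is best confirmed against the engine-arguments documentation.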
Model Support
- New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
- Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
- Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
- Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
- Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
- Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
- Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
- Reasoning: SeedOSS reason parser (#24263).
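The newly supported architectures are used through the same offline API as before. A short sketch, where the checkpoint name (assumed to be the public Qwen3-Next instruct release) and the parallelism setting are illustrative rather than recommendations:

```python
from vllm import LLM, SamplingParams

# Qwen3-Next (#24526) is one of the architectures added in this release.
# Adjust the checkpoint and tensor_parallel_size to match your hardware.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
)
outputs = llm.generate(
    ["Summarize the vLLM v0.11.0 release in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```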
Engine Core
- KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
- V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465, illustrated after this list).
- Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
- Async scheduling: Uniprocessor executor support (#24219).
- Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
- Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
- Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
- LoRA: Optimized weight loading (#25403).
- Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
- torch.compile: CUDA graph Inductor partition integration (#24281).
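`LLM.apply_model` (#18465), now available on V1 and listed above, runs a user callable against the underlying `torch.nn.Module`, which is convenient for quick inspection. A small sketch; the return convention (a single value vs. one value per worker) is worth confirming against the API docs:

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")

def count_params(model) -> int:
    # The callable receives the loaded torch.nn.Module.
    return sum(p.numel() for p in model.parameters())

print(llm.apply_model(count_params))
```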
Hardware & Performance
- NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
- DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
- New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
- AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
- Intel XPU: MoE DP accuracy fix (#25465).
Large Scale Serving & Performance
- Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
- Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
- EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
- Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
- MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
- Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).
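For the torchrun data-parallel launcher (#24899), the existing external-launcher pattern is the likely entry point: every torchrun rank builds the same `LLM`, and vLLM derives its placement from the torchrun environment. The sketch below is an assumption-heavy illustration; in particular, passing `data_parallel_size` directly as an engine argument and combining it with `distributed_executor_backend="external_launcher"` should be checked against the torchrun example in the repository. Launch with something like `torchrun --nproc-per-node=4 dp_script.py`:

```python
# dp_script.py, run under torchrun (hypothetical file name).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,                      # 2 GPUs per replica
    data_parallel_size=2,                        # assumption: 2 DP replicas
    distributed_executor_backend="external_launcher",
)
outputs = llm.generate(
    ["Hello from a data-parallel replica"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```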
Quantization
- FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
- FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
- W4A8: Faster preprocessing (#23972).
- Compressed tensors: Blocked FP8 for MoE (#25219).
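The FP8 items above typically surface through two long-standing engine arguments, on-the-fly FP8 weight quantization and an FP8 KV cache; a brief sketch (the model ID is illustrative, and FP8 requires a GPU with hardware FP8 support):

```python
from vllm import LLM, SamplingParams

# Dynamic FP8 weight quantization plus an FP8 KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
out = llm.generate(["FP8 smoke test"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```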
API & Frontend
- OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897); a client example follows this list.
- Multimodal: Media UUID caching (#23950), image path format (#25081).
- Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
- CLI: --enable-logging (#25610), improved --help (#24903).
- Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
- Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
- UX: Removed misleading quantization warning (#25012).
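A client-side sketch of the expanded logprobs support, using the OpenAI Python client against a local `vllm serve` instance; per #25031, `logprobs=-1` requests logprobs over the full vocabulary, but the exact server-side semantics and limits are worth confirming in the docs (model name and port are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Completions-style request; -1 is the new "full vocabulary" setting (#25031).
resp = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The capital of France is",
    max_tokens=1,
    logprobs=-1,
)
print(resp.choices[0].logprobs)
```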
Security
Dependencies
- PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
- Build requirements: C++17 now enforced globally (#24823).
- TPU: Deprecated `xm.mark_step` in favor of `torch_xla.sync` (#25254); a short migration sketch follows this list.
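For TPU integration code, the deprecation above is a one-line rename; a sketch assuming the `torch_xla.sync()` entry point that recent `torch_xla` releases provide as the replacement for `xm.mark_step()`:

```python
import torch_xla

# Before (deprecated): torch_xla.core.xla_model.mark_step()
# After:
torch_xla.sync()
```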
V0 Deprecation
- Engines: AsyncLLMEngine (#25025), LLMEngine (#25033), MQLLMEngine (#25019), core (#25321), model runner (#25328), MP executor (#25329).
- Components: Attention backends (#25351), encoder-decoder (#24907), output processor (#25320), sampling metadata (#25345), Sequence/Sampler (#25332).
- Interfaces: LoRA (#25686), async output processor (#25334), MultiModalPlaceholderMap (#25366), seq group methods (#25330), placeholder attention (#25510), input embeddings (#25242), multimodal registry (#25362), max_seq_len_to_capture (#25543), attention classes (#25541), hybrid models (#25400), backend suffixes (#25489), compilation fallbacks (#25675), default args (#25409).
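Code that built on the removed V0 classes has two paths forward: the high-level `LLM` API and the `vllm serve` entrypoint are unchanged, and direct async use goes through the V1 `AsyncLLM`. The import path and `from_engine_args` constructor below mirror how vLLM's own entrypoints build the V1 engine, but treat them as assumptions and check the current docs before migrating:

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM  # assumed V1 replacement for AsyncLLMEngine


async def main() -> None:
    engine = AsyncLLM.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    final = None
    # The async generator yields cumulative outputs, as the old AsyncLLMEngine did.
    async for output in engine.generate(
        "Hello, my name is",
        SamplingParams(max_tokens=16),
        request_id="demo-0",
    ):
        final = output
    print(final.outputs[0].text)


asyncio.run(main())
```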
What's Changed
- [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 by @jeejeelee in https://github.com/vllm-project/vllm/pull/24707
- [DOCs] Update ROCm installation docs section by @gshtras in https://github.com/vllm-project/vllm/pull/24691
- Enable conversion of multimodal models to pooling tasks by @maxdebayser in https://github.com/vllm-project/vllm/pull/24451
- Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24686
- [Bugfix] Fix MRoPE dispatch on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24712
- [BugFix] Fix Qwen3-Next PP by @njhill in https://github.com/vllm-project/vllm/pull/24709
- [CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order by @heheda12345 in https://github.com/vllm-project/vllm/pull/24640
- [CI] Add ci_envs for convenient local testing by @noooop in https://github.com/vllm-project/vllm/pull/24630
- [CI/Build] Skip prompt embeddings tests on V1-only CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24721
- [Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call by @heheda12345 in https://github.com/vllm-project/vllm/pull/24717
- [Bugfix] Fix BNB name match by @jeejeelee in https://github.com/vllm-project/vllm/pull/24735
- [Kernel] [CPU] refactor `cpu_attn.py:_run_sdpa_forward` for better memory access by @ignaciosica in https://github.com/vllm-project/vllm/pull/24701
- [sleep mode] save memory for on-the-fly quantization by @youkaichao in https://github.com/vllm-project/vllm/pull/24731
- [Multi Modal] Add FA3 in VIT by @wwl2755 in https://github.com/vllm-project/vllm/pull/24347
- [Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec by @sfeng33 in https://github.com/vllm-project/vllm/pull/24548
- [Doc]: fix typos in various files by @didier-durand in https://github.com/vllm-project/vllm/pull/24726
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24740
- [Bugfix] Fix MRoPE dispatch on XPU by @yma11 in https://github.com/vllm-project/vllm/pull/24724
- [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP by @elvircrn in https://github.com/vllm-project/vllm/pull/24739
- [Core] Shared memory based object store for Multimodal data caching and IPC by @dongluw in https://github.com/vllm-project/vllm/pull/20452
- [Bugfix][Frontend] Fix `--enable-log-outputs` does not match the documentation by @kebe7jun in https://github.com/vllm-project/vllm/pull/24626
- [Models] Optimise and simplify `_validate_and_reshape_mm_tensor` by @lgeiger in https://github.com/vllm-project/vllm/pull/24742
- [Models] Prevent CUDA sync in Qwen2.5-VL by @lgeiger in https://github.com/vllm-project/vllm/pull/24741
- [Model] Switch to Fused RMSNorm in GLM-4.1V model by @SamitHuang in https://github.com/vllm-project/vllm/pull/24733
- [UX] Remove AsyncLLM torch profiler disabled log by @mgoin in https://github.com/vllm-project/vllm/pull/24609
- [CI] Speed up model unit tests in CI by @afeldman-nm in https://github.com/vllm-project/vllm/pull/24253
- [Bugfix] Fix incompatibility between [#20452] and [#24548] by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24754
- [CI] Trigger BC Linter when labels are added/removed by @zhewenl in https://github.com/vllm-project/vllm/pull/24767
- [Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints by @smarterclayton in https://github.com/vllm-project/vllm/pull/23937
- [Compilation Bug] Fix Inductor Graph Output with Shape Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24772
- Invert pattern order to make sure that out_proj layers are identified by @anmarques in https://github.com/vllm-project/vllm/pull/24781
- [Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24705
- Add FLASHINFER_MLA to backend selector test by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24753
- [Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) by @sighingnow in https://github.com/vllm-project/vllm/pull/24667
- [Core] Support async scheduling with uniproc executor by @njhill in https://github.com/vllm-project/vllm/pull/24219
- [Frontend][Multimodal] Allow skipping media data when UUIDs are provided. by @huachenheli in https://github.com/vllm-project/vllm/pull/23950
- [Model] Add Olmo3 model implementation by @2015aroras in https://github.com/vllm-project/vllm/pull/24534
- [Bugfix] Fix GPUModelRunner has no attribute lora_manager by @jeejeelee in https://github.com/vllm-project/vllm/pull/24762
- [Chore] Remove unused batched RoPE op & kernel by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24789
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24791
- [Docs] Remove Neuron install doc as backend no longer exists by @hmellor in https://github.com/vllm-project/vllm/pull/24396
- [Doc]: Remove 404 hyperlinks by @rozeappletree in https://github.com/vllm-project/vllm/pull/24785
- [Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization by @elvischenv in https://github.com/vllm-project/vllm/pull/24757
- [Kernels][DP/EP] Optimize Silu Kernel for R1 by @elvircrn in https://github.com/vllm-project/vllm/pull/24054
- [Core][Multimodal] Cache `supports_kw` by @lgeiger in https://github.com/vllm-project/vllm/pull/24773
- [CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe by @mgoin in https://github.com/vllm-project/vllm/pull/24750
- [Misc] Correct an outdated comment. by @russellb in https://github.com/vllm-project/vllm/pull/24765
- [Doc]: fix typos in various files by @didier-durand in https://github.com/vllm-project/vllm/pull/24798
- [CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again by @wwl2755 in https://github.com/vllm-project/vllm/pull/24771
- Remove redundant assignment in xfer_buffers, This is a little fix by @ChenTaoyu-SJTU in https://github.com/vllm-project/vllm/pull/24732
- [Minor] Simplify duplicative device check for cuda by @ziliangpeng in https://github.com/vllm-project/vllm/pull/24793
- [Chore] Minor simplification for non-PP path by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24810
- [Multi Modal][Performance] Fused Q,K's apply_rope into one by @wwl2755 in https://github.com/vllm-project/vllm/pull/24511
- [Misc] Improve `s3_utils` type hints with `BaseClient` by @Zerohertz in https://github.com/vllm-project/vllm/pull/24825
- [Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement by @yewentao256 in https://github.com/vllm-project/vllm/pull/24783
- fix type of sampling rate for encode_base64 by @co63oc in https://github.com/vllm-project/vllm/pull/24826
- [Benchmarks] Throw usage error when using dataset-name random and dataset-path together by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24819
- Force use C++17 globally to avoid compilation error by @chenfengjin in https://github.com/vllm-project/vllm/pull/24823
- [Chore] Remove ipex_ops warning by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/24835
- [Spec Decoding]Support Spec Decoding Metrics in DP Mode by @wuhang2014 in https://github.com/vllm-project/vllm/pull/24049
- [Hybrid Allocator] Support Pipeline Parallel by @heheda12345 in https://github.com/vllm-project/vllm/pull/23974
- [Docs] Have a try to improve frameworks/streamlit.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24841
- [kv cache] update num_free_blocks in the end by @andyxning in https://github.com/vllm-project/vllm/pull/24228
- [Frontend] Skip `stop` in reasoning content by @gaocegege in https://github.com/vllm-project/vllm/pull/14550
- [Bugfix] MiDashengLM model contact error under concurrent testing by @bingchen-mi in https://github.com/vllm-project/vllm/pull/24738
- [Doc]: fix typos in various files by @didier-durand in https://github.com/vllm-project/vllm/pull/24821
- [Misc] rename interval to max_recent_requests by @andyxning in https://github.com/vllm-project/vllm/pull/24229
- [Misc] Own KVConnectors installation by @NickLucche in https://github.com/vllm-project/vllm/pull/24867
- [P/D] `kv_output_aggregator` support heterogeneous by @LCAIZJ in https://github.com/vllm-project/vllm/pull/23917
- [UT] enhance free kv cache block queue popleft_n by @andyxning in https://github.com/vllm-project/vllm/pull/24220
- [XPU] Set consistent default KV cache layout by @NickLucche in https://github.com/vllm-project/vllm/pull/24745
- [Misc] Fix examples openai_pooling_client.py by @noooop in https://github.com/vllm-project/vllm/pull/24853
- [Model]: support Ling2.0 by @ant-yy in https://github.com/vllm-project/vllm/pull/24627
- [Bugfix] Fix GLM4.1V multimodal processor with compatability for Transformers v4.56 by @Isotr0py in https://github.com/vllm-project/vllm/pull/24822
- Fp8 paged attention update by @xiao-llm in https://github.com/vllm-project/vllm/pull/22222
- Reinstate existing torch script by @hmellor in https://github.com/vllm-project/vllm/pull/24729
- [USAGE] Improve error handling for weight initialization in Unquantized… by @koiker in https://github.com/vllm-project/vllm/pull/20321
- Move `MultiModalConfig` from `config/__init__.py` to `config/multimodal.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24659
- [Transform] Deterministic Hadacore Transforms by @kylesayrs in https://github.com/vllm-project/vllm/pull/24106
- Update num_tokens_across_dp to use nccl instead of gloo by @SageMoore in https://github.com/vllm-project/vllm/pull/24105
- Bump Flashinfer to 0.3.1 by @bbartels in https://github.com/vllm-project/vllm/pull/24868
- [gpt-oss] Add IncompleteDetails to ResponsesRepsonse by @qandrew in https://github.com/vllm-project/vllm/pull/24561
- [gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still by @qandrew in https://github.com/vllm-project/vllm/pull/24759
- [Performance] Remove redundant clone() calls in cutlass_mla by @alexm-redhat in https://github.com/vllm-project/vllm/pull/24891
- [Bug] Fix Cutlass Scaled MM Compilation Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24887
- [ci] fix wheel names for arm wheels by @simon-mo in https://github.com/vllm-project/vllm/pull/24898
- [Tests] fix initialization of kv hash in tests by @mickaelseznec in https://github.com/vllm-project/vllm/pull/24273
- [Compile] Fix noop_elimination pass and add tests for noop_elimination by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24880
- `HuggingFace` -> `Hugging Face` in `Integration with Hugging Face` docs by @sergiopaniego in https://github.com/vllm-project/vllm/pull/24889
- Updated CODEOWNERS for flashinfer, mla, fused_moe by @mgoin in https://github.com/vllm-project/vllm/pull/24906
- [Deprecation] Remove DeepGEMM Old Symbol Wrapper by @yewentao256 in https://github.com/vllm-project/vllm/pull/24902
- [ROCm][Bugfix] Fix the case where there's bias by @gshtras in https://github.com/vllm-project/vllm/pull/24895
- Add pytest-cov and .coveragerc by @rzabarazesh in https://github.com/vllm-project/vllm/pull/24778
- [Bug] Fix `is_flashmla_supported` Check Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24774
- [CI] Small Accuracy Eval Test for Deepseek Model by @yewentao256 in https://github.com/vllm-project/vllm/pull/24259
- [Metrics] Hide deprecated metrics with gpu_ prefix by @markmc in https://github.com/vllm-project/vllm/pull/24245
- [Docs] Update instructions for how to using existing torch binary by @zou3519 in https://github.com/vllm-project/vllm/pull/24892
- Upgrade flashinfer to 0.3.1 by @houseroad in https://github.com/vllm-project/vllm/pull/24470
- [XPU] Fix circular import error. by @jikunshang in https://github.com/vllm-project/vllm/pull/24927
- Remove V0 Encoder-Decoder Support by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24907
- [Bugfix] Fix sequence parallelism bug when enable pipeline parallelism by @cascade812 in https://github.com/vllm-project/vllm/pull/24021
- [Bug] [Spec Dec]: Fix kv_cache dtype mismatch for Eagle3 drafter on FP8 target by @vllmellm in https://github.com/vllm-project/vllm/pull/24505
- [QWEN NEXT] Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24924
- [benchmark] Add triton version in the moe tuned config by @jeejeelee in https://github.com/vllm-project/vllm/pull/24769
- [Bugfix] remove duplicate tokens streamed in required tool choice streaming by @Jason-CKY in https://github.com/vllm-project/vllm/pull/23312
- [Mamba] Support TP>1 with quantization for mamba2 mixer in case `n_groups % tp_size == 0` by @tomeras91 in https://github.com/vllm-project/vllm/pull/24593
- [Feat][EPLB] A novel static EPLB placement strategy for MoE models. by @cboss6 in https://github.com/vllm-project/vllm/pull/23745
- Move `SpeculativeConfig` from `config/__init__.py` to `config/speculative.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24904
- [Docs] move benchmarks README to contributing guides by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24820
- feat: Add Grafana and Perces monitoring dashboards for vLLM by @liangwen12year in https://github.com/vllm-project/vllm/pull/23498
- (doc): set cmake c++ compatible standard when building on MacOS CPU. by @teekenl in https://github.com/vllm-project/vllm/pull/23483
- [CI] Add Decode Context Parallelism (DCP) test to CI by @minosfuture in https://github.com/vllm-project/vllm/pull/24487
- [Model] Clean up and simplify Mamba2 Metadata Usage in both V0 and V1 by @cyang49 in https://github.com/vllm-project/vllm/pull/24331
- [Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing by @lgeiger in https://github.com/vllm-project/vllm/pull/24925
- [Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM by @SageMoore in https://github.com/vllm-project/vllm/pull/23693
- [Bugfix] Fix unable to run encoder model when disable_hybrid_kv_cache_manager is true by @lianyiibo in https://github.com/vllm-project/vllm/pull/24571
- [Misc] Add removed encoder-decoder models to previously supported models list by @Isotr0py in https://github.com/vllm-project/vllm/pull/24961
- Directly get max encoder len from VLLM config in V1 by @Sugar-zsg in https://github.com/vllm-project/vllm/pull/24866
- [gpt-oss][1b] streaming add item id, content id by @qandrew in https://github.com/vllm-project/vllm/pull/24788
- [MISC] Add code owners of vllm/v1 to vllm/v1/core by @heheda12345 in https://github.com/vllm-project/vllm/pull/24928
- [ROCm] Add dependencies for ROCm by @Concurrensee in https://github.com/vllm-project/vllm/pull/24900
- [gpt-oss][1][bugfix] fix streaming final output by @qandrew in https://github.com/vllm-project/vllm/pull/24466
- Use kwargs for long lists of `EngineCoreRequest` arguments in tests and fix extra kwargs by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24987
- fp8 kv cache support fix for torch.compile by @maleksan85 in https://github.com/vllm-project/vllm/pull/22758
- [Perf] Reuse workspace for FP8+FP4 Marlin MoE by @mgoin in https://github.com/vllm-project/vllm/pull/20500
- [CI][Bugfix] Fix failing Blackwell test by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24993
- [CI] GPT-OSS GPQA eval test for Blackwell by @mgoin in https://github.com/vllm-project/vllm/pull/24920
- [FP8] Extend per-token-group quantization support to QuantFP8 by @tahsintunan in https://github.com/vllm-project/vllm/pull/24342
- Removes source compilation of nixl dependency by @bbartels in https://github.com/vllm-project/vllm/pull/24874
- [Doc] Add --force-overwrite option to generate_cmake_presets.py by @elvischenv in https://github.com/vllm-project/vllm/pull/24375
- [Core] Use `CpuGpuBuffer` for block table tensors by @njhill in https://github.com/vllm-project/vllm/pull/24795
- [Benchmarks] Add MMVU video dataset support and clean up deprecated datasets by @Isotr0py in https://github.com/vllm-project/vllm/pull/24719
- [UX] Enforce valid choices for envs like VLLM_ATTENTION_BACKEND, etc by @mgoin in https://github.com/vllm-project/vllm/pull/24761
- [Docs] fix invalid doc link by @yyzxw in https://github.com/vllm-project/vllm/pull/25017
- [UX] Remove "quantization is not fully optimized yet" log by @mgoin in https://github.com/vllm-project/vllm/pull/25012
- [misc] fix typo in value error by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/24995
- [Core] Get num_encoder_tokens from scheduler config by @russellb in https://github.com/vllm-project/vllm/pull/24989
- [V0 Deprecation] Remove MQLLMEngine by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25019
- [Model] Support Qwen3-VL Model Series by @ywang96 in https://github.com/vllm-project/vllm/pull/24727
- [Rocm] [quantization] Fix quark ptpc moe and add test case by @haoyangli-amd in https://github.com/vllm-project/vllm/pull/24649
- Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) by @pliops-daniels in https://github.com/vllm-project/vllm/pull/23255
- [XPU] Fix xpu model runner call torch.cuda APIs by @jikunshang in https://github.com/vllm-project/vllm/pull/25011
- [EPLB] Support EPLB for Mixtral Model by @rouchenzi in https://github.com/vllm-project/vllm/pull/22842
- [Core][MultiModalHasher] Hash images without converting image mode by @lgeiger in https://github.com/vllm-project/vllm/pull/24969
- [Model] Pass param prefix to LLMHead by @whx-sjtu in https://github.com/vllm-project/vllm/pull/24862
- [Model] Apply SharedFusedMoE to glm4_moe. by @whx-sjtu in https://github.com/vllm-project/vllm/pull/24849
- [Core] Remove tokenizer group in vLLM by @zhuohan123 in https://github.com/vllm-project/vllm/pull/24078
- [Docs] Fix griffe warning in base_static_graph.py by @windsonsea in https://github.com/vllm-project/vllm/pull/25018
- [DP] Create placement groups by ray_device_key by @xinyu-intel in https://github.com/vllm-project/vllm/pull/25026
- [Frontend] Support returning all prompt logprobs by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24956
- [BugFix] enable DOTALL to match multi-line tool_call parameters in extract_tool_call_required_streaming by @shijun-yin in https://github.com/vllm-project/vllm/pull/24668
- [Misc] Avoid use of deprecated `AutoModelForVision2Seq` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25065
- Add RADIO Vision Encoder Support to vLLM by @danielafrimi in https://github.com/vllm-project/vllm/pull/24595
- [Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25046
- Apply fixes for CUDA 13 by @Aidyn-A in https://github.com/vllm-project/vllm/pull/24599
- [fix] lora benchmarks pass no_lora_flag_cpu by @dolpm in https://github.com/vllm-project/vllm/pull/23774
- [Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. by @sighingnow in https://github.com/vllm-project/vllm/pull/24957
- [Docs] improve code formatting and comments for eliminate griffe build warning. by @samzong in https://github.com/vllm-project/vllm/pull/25010
- Remove old cutlass mla by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23961
- [Docs] vllm/benchmarks/datasets.py fix docstring param format. by @samzong in https://github.com/vllm-project/vllm/pull/24970
- [CI Bugfix] Fix failing test_invalid_env by @mgoin in https://github.com/vllm-project/vllm/pull/25078
- [V0 Deprecation] Remove V0 Core tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25082
- cleanup: remove adapter commons by @simon-mo in https://github.com/vllm-project/vllm/pull/25045
- Remove unused find_cuda_init helper script by @simon-mo in https://github.com/vllm-project/vllm/pull/25044
- [V0 Deprecation] Remove unused output processor util by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25023
- Change log level from info to debug for IOProcessor by @mgoin in https://github.com/vllm-project/vllm/pull/24999
- [CI] Revert back prepare_prompts and check_answers by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25087
- [V0 Deprecation] Remove V0 tests in test_sequence.py by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25088
- [CI Bugfix] Fix failing test_model_load_with_params tests due to tokenizer refactor by @mgoin in https://github.com/vllm-project/vllm/pull/25086
- [V1] Logits processor docs by @afeldman-nm in https://github.com/vllm-project/vllm/pull/22919
- [Misc] Update owners for KV connector and V1 offloading by @ApostaC in https://github.com/vllm-project/vllm/pull/25041
- [Bugfix] Update import path for bc_linter_include by @mmangkad in https://github.com/vllm-project/vllm/pull/24766
- [BUG] Exclude .pth files when pulling remote files by @ahao-anyscale in https://github.com/vllm-project/vllm/pull/25092
- [Kernel] Faster pre-processing time for W4A8 by @czhu-cohere in https://github.com/vllm-project/vllm/pull/23972
- [gpt-oss][2] fix types for streaming by @qandrew in https://github.com/vllm-project/vllm/pull/24556
- [Bugfix][B200] Fix `cutlass_mla` hang by @alexm-redhat in https://github.com/vllm-project/vllm/pull/24966
- [ROCm][Bugfix] Aiter mha fp8 fix by @dllehr-amd in https://github.com/vllm-project/vllm/pull/24991
- Disable failing GPT-OSS Eval (Blackwell) for now by @mgoin in https://github.com/vllm-project/vllm/pull/25107
- [Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic by @elvischenv in https://github.com/vllm-project/vllm/pull/24600
- Add a batched auto tune script by @karan in https://github.com/vllm-project/vllm/pull/25076
- [Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/24833
- [Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses by @bnellnm in https://github.com/vllm-project/vllm/pull/22537
- [V0 Deprecation] Remove V0 Engine tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25114
- [V0 Deprecation] Remove V0 Tracing & Metrics tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25115
- [V0 Deprecation] Remove misc V0 tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25118
- [V0 Deprecation] Skip PP test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25128
- [Kernels] Enable DeepGEMM by default by @bnellnm in https://github.com/vllm-project/vllm/pull/24462
- [MM Encoder] Apply DP ViT for Qwen3-VL model series by @ywang96 in https://github.com/vllm-project/vllm/pull/24955
- [Docs] Clean up the contributing README by @hmellor in https://github.com/vllm-project/vllm/pull/25099
- [Core][MM] Cleanup `MultiModalCache` by @lgeiger in https://github.com/vllm-project/vllm/pull/25006
- [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models by @toncao in https://github.com/vllm-project/vllm/pull/24960
- [Kernels] Overlap shared experts with combine instead of dispatch by @bnellnm in https://github.com/vllm-project/vllm/pull/24254
- [Model] enable data parallel for InternVL vision encoder by @666even666 in https://github.com/vllm-project/vllm/pull/23909
- Mark prompt logprobs as incompatible with prompt embeds at API level by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25077
- [XPU] Whisper model support on XPU Platform by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/25123
- [EPLB] Add EPLB support for hunyuan_v1 by @666even666 in https://github.com/vllm-project/vllm/pull/23078
- [V0 Deprecation] Remove more V0 tests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25117
- [Spec Decode] Efficient padded speculation by @benchislett in https://github.com/vllm-project/vllm/pull/24539
- [benchmark] add peak throughput metrics and plot by @simon-mo in https://github.com/vllm-project/vllm/pull/23867
- [CLI] Use streaming in CLI chat and completion commands by @simon-mo in https://github.com/vllm-project/vllm/pull/23769
- [Kernel] Better inf handling for grouped topk cu by @lumina37 in https://github.com/vllm-project/vllm/pull/24886
- [Docs] Fix API Reference by @hmellor in https://github.com/vllm-project/vllm/pull/25140
- Retrieve `sliding_window` from text config in Gemma3 MM by @hmellor in https://github.com/vllm-project/vllm/pull/25085
- [Bugfix] when use s3 model cannot use default load_format by @lengrongfu in https://github.com/vllm-project/vllm/pull/24435
- [Qwen] Add fp8 checkpoint support for qwen3-next. by @sighingnow in https://github.com/vllm-project/vllm/pull/25079
- Add 'path' option to ImagePrompt data_format by @gfinol in https://github.com/vllm-project/vllm/pull/25081
- [Doc] Fix cross-reference warnings by @punitvara in https://github.com/vllm-project/vllm/pull/25058
- [Chore] Cleanup guided namespace, move to structured outputs config by @aarnphm in https://github.com/vllm-project/vllm/pull/22772
- Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains by @ihb2032 in https://github.com/vllm-project/vllm/pull/24951
- silu-v1: Fix EPS not being used during max-reduction by @elvircrn in https://github.com/vllm-project/vllm/pull/25069
- [Frontend] Support setting logprobs to -1 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25031
- [Model] Improve Pooling Model by @jeejeelee in https://github.com/vllm-project/vllm/pull/25149
- Move `StructuredOutputsConfig` from `config/__init__.py` to `config/structured_outputs.py` by @hmellor in https://github.com/vllm-project/vllm/pull/25153
- [Docs] Fix pooling-params doc references in openai_compatible_server.md by @yankay in https://github.com/vllm-project/vllm/pull/24939
- [Docs] add the parallel sampling usage in LLMEngine and AsyncLLM by @gigit0000 in https://github.com/vllm-project/vllm/pull/24222
- Fix forward reference warning in documentation by @hmellor in https://github.com/vllm-project/vllm/pull/25150
- Fix `validate-config` pre-commit check by @hmellor in https://github.com/vllm-project/vllm/pull/25157
- [Bugfix][Mamba] - Fix Conv State Kernel FP32 Support by @Josephasafg in https://github.com/vllm-project/vllm/pull/24883
- [Misc] Clean up flags in `vllm bench serve` by @ywang96 in https://github.com/vllm-project/vllm/pull/25138
- [Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils by @shen-shanshan in https://github.com/vllm-project/vllm/pull/21999
- Refactor dense FP8 tensor/channel/block utils and add CT FP8 block by @mgoin in https://github.com/vllm-project/vllm/pull/21404
- [Misc] Add kv-connector label by @NickLucche in https://github.com/vllm-project/vllm/pull/25156
- [Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel by @jvlunteren in https://github.com/vllm-project/vllm/pull/21197
- [PERF] Add `conv1d` metadata to GDN attn by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/25105
- feat(api): Return 503 on /health when engine is dead by @dongbo910220 in https://github.com/vllm-project/vllm/pull/24897
- [New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task by @noooop in https://github.com/vllm-project/vllm/pull/24872
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/25163
- Enable Allgather/ReduceScatter backend for NaiveAllToAll by @wenscarl in https://github.com/vllm-project/vllm/pull/23964
- [Misc] Add codeowner for Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/25180
- [spec decode] Fix MTP inference path for MiMo-7B model by @zixi-qi in https://github.com/vllm-project/vllm/pull/25136
- [ROCm][CI/Build] Use ROCm7.0 as the base by @gshtras in https://github.com/vllm-project/vllm/pull/25178
- [ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation by @Rohan138 in https://github.com/vllm-project/vllm/pull/25104
- [KV offload][1/N] Introduce an offloading component by @orozery in https://github.com/vllm-project/vllm/pull/19848
- [V0 Deprecation] Remove AsyncLLMEngine by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25025
- [fix]: remove data type hardcoding from gptoss model implementation by @nikhil-arm in https://github.com/vllm-project/vllm/pull/23807
- [feat]: Create interface for model-specific M-RoPE by @AzizCode92 in https://github.com/vllm-project/vllm/pull/24194
- [Bug] Fix `returned_lse` not Defined issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25106
- [Bug] Fix torch Compilation Cache Hit Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/25093
- [V0 Deprecation] Remove unused async_timeout.py by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25190
- [KV offload][1b/N] rename offloading to kv_offload by @orozery in https://github.com/vllm-project/vllm/pull/25191
- [BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25206
- [CORE] Prompt Embeddings Support for v1 Engine by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24278
- [KV offload][2/N] Introduce LRU-based CPU offloading management by @orozery in https://github.com/vllm-project/vllm/pull/20075
- [gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming by @qandrew in https://github.com/vllm-project/vllm/pull/24938
- [Perf] Optimize memory peak during EAGLE model loading. by @candyzone in https://github.com/vllm-project/vllm/pull/24585
- [Misc] Clean up MM profiling warnings by @ywang96 in https://github.com/vllm-project/vllm/pull/25222
- [Docs] Fix griffe warnings in vllm/multimodal by @windsonsea in https://github.com/vllm-project/vllm/pull/25216
- [OOT] Support sync_model_loading for OOT by @xuechendi in https://github.com/vllm-project/vllm/pull/25126
- [Build] Update Xgrammar to 0.1.24 to get a CVE fix by @russellb in https://github.com/vllm-project/vllm/pull/25188
- [CPU] Disable oneDNN linear on non-x86 platforms by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25166
- [Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25137
- [Misc] Cleanup test conftest for deprecated encoder-decoder models by @Isotr0py in https://github.com/vllm-project/vllm/pull/25231
- [bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B by @yma11 in https://github.com/vllm-project/vllm/pull/25146
- [Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE by @Isotr0py in https://github.com/vllm-project/vllm/pull/25055
- [Bugfix][Perf] Misc fixes for Qwen3 VL by @ywang96 in https://github.com/vllm-project/vllm/pull/25238
- Move `PoolerConfig` from `config/__init__.py` to `config/pooler.py` by @hmellor in https://github.com/vllm-project/vllm/pull/25181
- [P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy by @NickLucche in https://github.com/vllm-project/vllm/pull/22188
- [V0 Deprecation] Remove V0 logic from `get_input_embeddings` interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25242
- [Qwen] Remove cuda hard-code in qwen3 next by @wxsIcey in https://github.com/vllm-project/vllm/pull/25243
- Update CODEOWNERS by @hmellor in https://github.com/vllm-project/vllm/pull/25269
- Move `ModelConfig` from `config/__init__.py` to `config/model.py` by @hmellor in https://github.com/vllm-project/vllm/pull/25252
- refactor(benchmarks): add type annotations to wait_for_endpoint parameters by @samzong in https://github.com/vllm-project/vllm/pull/25218
- [KV offload][3/N] Add worker-side CPU support by @orozery in https://github.com/vllm-project/vllm/pull/21448
- [Frontend] Pass API server count to each process by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23717
- [Core] Modify the initialization parameters of the lora manager by @jeejeelee in https://github.com/vllm-project/vllm/pull/25249
- Remove Redundant Assignment in Qwen3_VisionPatchMerger by @LJH-LBJ in https://github.com/vllm-project/vllm/pull/25224
- Encoder model support for the Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/25174
- [CI/Build] fix test function_calling by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25072
- [Core][Prefix Hash] Fix prefix hash metrics sliding window maintainance by @Jialin in https://github.com/vllm-project/vllm/pull/24990
- [Docs] add init.py to vllm/model_executor/layers/quantization/compressed_tensors/transform by @samzong in https://github.com/vllm-project/vllm/pull/24974
- [bugfix] fix structured outputs key missing issue from [#24929] by @luccafong in https://github.com/vllm-project/vllm/pull/25195
- [KV offload][4/N] Offloading KV connector by @orozery in https://github.com/vllm-project/vllm/pull/22595
- Optimize triton unified attention performance for sliding window attention by @zixi-qi in https://github.com/vllm-project/vllm/pull/24390
- [Bugfix] GPT OSS Attritbute error on H100 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/25228
- [Bugfix] Fix chunked a2_scales in modular kernels by @bnellnm in https://github.com/vllm-project/vllm/pull/25264
- Specify platform in `pip-compile` `pre-commit` hook so it runs on MacOS by @hmellor in https://github.com/vllm-project/vllm/pull/25273
- [Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available by @mgoin in https://github.com/vllm-project/vllm/pull/21126
- [BugFix] Make FlashInferMetadataBuilder non-blocking by @nvjullin in https://github.com/vllm-project/vllm/pull/25040
- Fix: Correct FusedMoE layer reference in auto_round quantization by @David-Wen2025 in https://github.com/vllm-project/vllm/pull/24818
- [Frontend] Responses API messages out, just harmony for now by @alecsolder in https://github.com/vllm-project/vllm/pull/24985
- [Compile] Fix Compile Warning for Ignoring `MIN_BLOCK_PER_SM` by @yewentao256 in https://github.com/vllm-project/vllm/pull/25193
- Enable modelopt gemma3 nvfp4/fp8, make workflow more robust by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/22771
- allow disable flashinfer prefill by @luccafong in https://github.com/vllm-project/vllm/pull/25276
- [BugFix] Fix async scheduling CPU tensor race take 2 by @njhill in https://github.com/vllm-project/vllm/pull/25279
- [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE [#2969] by @Lucaskabela in https://github.com/vllm-project/vllm/pull/25090
- Don't skip special tokens with hermes-style tool calling by @maxdebayser in https://github.com/vllm-project/vllm/pull/25281
- test: Remove vestigial skip for prompt embeds tests after landing v1 Prompt Embeds support by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25291
- [docs] Prompt Embedding feature support by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25288
- [torch.compile] CUDAGraph Inductor partition integration by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/24281
- [BugFix] Ensure appropriate guards in destructors by @njhill in https://github.com/vllm-project/vllm/pull/25284
- [Misc] Support more collective_rpc return types by @njhill in https://github.com/vllm-project/vllm/pull/25294
- Improve weight loading for encoder models in Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/25289
- [BUGFIX] GPTQ quantization compatibility for Qwen3 Next MOE models (AutoGPTQ and AutoRound-GPTQ) by @JartX in https://github.com/vllm-project/vllm/pull/25268
- [BugFix] Exclude self when checking for port collision by @njhill in https://github.com/vllm-project/vllm/pull/25286
- [BUG FIX][NON-CUDA]quick fix to avoid call cudagraph_unsafe in attention by @xuechendi in https://github.com/vllm-project/vllm/pull/25298
- [Bugfix] fix tool call arguments is empty by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25223
- [Optimization] Avoid repeated model architecture conversion for pooling models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25261
- [Hybrid Allocator] Support full attention with different hidden size by @heheda12345 in https://github.com/vllm-project/vllm/pull/25101
- [Bugfix] Fix Qwen3-VL-MoE weight loading for EP by @ywang96 in https://github.com/vllm-project/vllm/pull/25300
- [V1] Support `LLM.apply_model` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/18465
- [CI Failure] Disable FlashInfer RoPE to unblock CI by @mgoin in https://github.com/vllm-project/vllm/pull/25299
- [Docs] Fix warnings in mkdocs build (continued) by @wwl2755 in https://github.com/vllm-project/vllm/pull/25042
- Generate _ModelInfo properties file when loading to improve loading speed by @manoelmarques in https://github.com/vllm-project/vllm/pull/23558
- [Model] Cleanup InternViT's data parallel implementation by @Isotr0py in https://github.com/vllm-project/vllm/pull/25306
- [Core] Enable sharded state loader for V1 engine and enhance test coverage by @lirong-lirong in https://github.com/vllm-project/vllm/pull/25308
- [V0 Deprecation] Enable the remaining multimodal tests in V1 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25307
- [Docs] Fix warnings in vllm/profiler and vllm/transformers_utils by @windsonsea in https://github.com/vllm-project/vllm/pull/25220
- [V0 Deprecation] Remove LLMEngine by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25033
- [V0 Deprecation] Remove V0 Output Processor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25320
- [Chore] Remove unused sampler in models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25324
- [CI] Skip tests failing on main by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25326
- [V0 Deprecation] Remove V0 core by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25321
- [Doc] improve test-pipeline.yaml documentation by @hl475 in https://github.com/vllm-project/vllm/pull/25305
- [V0 Deprecation] Remove V0 model runner base & simplify worker base by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25328
- [Multi Modal][Performance] Fused Q,K's apply_rope in more models by @wwl2755 in https://github.com/vllm-project/vllm/pull/25005
- [V0 Deprecation] Remove from_seq_group methods by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25330
- [V0 Deprecation] Remove V0 MP executor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25329
- [V1] Add sliding window support to Flex Attention backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/24089
- [MM][Perf] Minor Optimization on Qwen3-VL `fast_pos_embed_interpolate` by @ywang96 in https://github.com/vllm-project/vllm/pull/25337
- [Bugfix] Typos in error message for missing model config file by @simondanielsson in https://github.com/vllm-project/vllm/pull/25339
- [Optimization] Cache chat template result when processor fails to be loaded by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25341
- [V0 Deprecation] Remove V0 Sequence class & Sampler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25332
- [V0 Deprecation] Remove async_output_proc, preemption mode, delay factor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25334
- feat: Enable engine-level arguments with speculators models by @rahul-tuli in https://github.com/vllm-project/vllm/pull/25250
- [V0 Deprecation] Remove V0 sampling metadata by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25345
- [Perf] Further optimization for Qwen3-VL `fast_pos_embed_interpolate` by @Isotr0py in https://github.com/vllm-project/vllm/pull/25347
- Remove V0 attention backends by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25351
- [Bugfix][V0 Deprecation][CI] use async mock and await for async method by @KKSK-DON in https://github.com/vllm-project/vllm/pull/25325
- Multimodal - audio tests by @debroy-rh in https://github.com/vllm-project/vllm/pull/25285
- [Model] Support Dots OCR by @ywang96 in https://github.com/vllm-project/vllm/pull/24645
- [Docs] GSM8K Accuracy Evaluation doc update by @david6666666 in https://github.com/vllm-project/vllm/pull/25360
- [Bugfix] Fix hermes tool parser handling of non-string argument types by @david6666666 in https://github.com/vllm-project/vllm/pull/22002
- [V0 Deprecation] Remove V0-only methods in multi-modal registry by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25362
- [V0 Deprecation] Remove `MultiModalPlaceholderMap` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25366
- Enable Eagle3 speculative decoding for GPT-OSS model by @eldarkurtic in https://github.com/vllm-project/vllm/pull/25246
- [TPU][Bugfix][CI] Fix broken tests/build dependency by @NickLucche in https://github.com/vllm-project/vllm/pull/25255
- [TPU] Deprecate `xm.mark_step` in favor of `torch_xla.sync` by @NickLucche in https://github.com/vllm-project/vllm/pull/25254
- refactor: abstract graph mode support into platform interface by @yiz-liu in https://github.com/vllm-project/vllm/pull/25161
- [Misc] Remove unused encoder-decoder error strings by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25374
- Make pickle import check fast by @hmellor in https://github.com/vllm-project/vllm/pull/25379
- Make `mypy` behave like a proper pre-commit hook by @hmellor in https://github.com/vllm-project/vllm/pull/25313
- MI-300X triton moe configs by @Sara-KS in https://github.com/vllm-project/vllm/pull/23445
- [Bugfix] Fix several issues with p2p xPyD in GET type by @Csrayz in https://github.com/vllm-project/vllm/pull/23993
- [V1][Attention] Split triton_attn in triton-only and rocm specific backends by @bringlein in https://github.com/vllm-project/vllm/pull/24648
- [EPLB] Reduce EPLB Inference Overhead by @abmfy in https://github.com/vllm-project/vllm/pull/24573
- [CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables by @Daisy-Ma-coder in https://github.com/vllm-project/vllm/pull/25274
- [Compiler] Disable Inductor standalone compile by default by @ElizaWszola in https://github.com/vllm-project/vllm/pull/25391
- [CI Failure] Fix fp8 kv cache on <SM90 by @mgoin in https://github.com/vllm-project/vllm/pull/25396
- [DP] support torchrun external launcher with Data Parallelism by @luccafong in https://github.com/vllm-project/vllm/pull/24899
- Remove RFC review hours reference by @simon-mo in https://github.com/vllm-project/vllm/pull/25416
- [torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug (#23091), fix test (#24376), and prep for custom op matching (#24604) by @ProExpertProg in https://github.com/vllm-project/vllm/pull/24542
- [KV offload][5/N] Add `CPUOffloadingSpec` by @orozery in https://github.com/vllm-project/vllm/pull/24251
- [CI/Build] Skip Qwen3-VL initialization tests until models are actually released by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25394
- [TPU] update torch_xla dependency for PyPI compatibility by @jcyang43 in https://github.com/vllm-project/vllm/pull/25278
- [Frontend] Responses API MCP tools for built in tools and to pass through headers by @alecsolder in https://github.com/vllm-project/vllm/pull/24628
- [Bugfix] fix custom op test by @ProExpertProg in https://github.com/vllm-project/vllm/pull/25429
- [Core] Drop overly aggressive whisper assertion by @russellb in https://github.com/vllm-project/vllm/pull/25408
- [Bugfix] Fix missing `clear_connector_metadata` by @NickLucche in https://github.com/vllm-project/vllm/pull/25397
- [BugFix] [DP/EP] Fix slow execution when BS <= DP by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/25407
- [Performance] Remove input pads in cutlass_mla and optimize v_proj output handling by @alexm-redhat in https://github.com/vllm-project/vllm/pull/25184
- [Perf] Apply torch.compile for `per_block_cast_to_fp8` by @yewentao256 in https://github.com/vllm-project/vllm/pull/24611
- [V0 deprecation] Remove platform v1 controling interface by @Isotr0py in https://github.com/vllm-project/vllm/pull/25410
- [V0 deprecation] Remove `_set_default_args_v0` function by @Isotr0py in https://github.com/vllm-project/vllm/pull/25409
- [Bug] Fix Long Context OOM Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25290
- [feat] Support MRoPE + YaRN by @JJJYmmm in https://github.com/vllm-project/vllm/pull/25384
- [XPU] Fix `compile_size` is `None` case. by @jikunshang in https://github.com/vllm-project/vllm/pull/25433
- [benchmarks]allow skip ready check for bench serve by @luccafong in https://github.com/vllm-project/vllm/pull/25420
- [Bugfix] Remove contiguous output req for context parallel MLA by @mgoin in https://github.com/vllm-project/vllm/pull/25414
- [Docs] Fix griffe warnings in vllm/lora/ops by @windsonsea in https://github.com/vllm-project/vllm/pull/25369
- [DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/24588
- [NIXL][OOT platform] support nixl_connector with oot platform and other nixl_backend by @xuechendi in https://github.com/vllm-project/vllm/pull/25121
- [Model] Enable DP for ViT in Qwen2-VL by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25445
- Handle triton kernel import exception by @minosfuture in https://github.com/vllm-project/vllm/pull/25319
- [Frontend] Add a new xml-based tool parser for qwen3-coder by @Zhikaiiii in https://github.com/vllm-project/vllm/pull/25028
- [Misc] Move DP for ViT code inside model executor dir by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25459
- [Test]: Hermes tool parser stream output error in Qwen3 case by @ahartel in https://github.com/vllm-project/vllm/pull/25203
- [Bugfix] Fix idefics3 `tie_word_embeddings` by @Isotr0py in https://github.com/vllm-project/vllm/pull/25454
- [Core] Optimize LoRA weight loading by @jeejeelee in https://github.com/vllm-project/vllm/pull/25403
- [docs] Benchmark Serving Incorrect Arg by @vllmellm in https://github.com/vllm-project/vllm/pull/25474
- [CI/Build] Fix disabled v1 attention backend selection test by @Isotr0py in https://github.com/vllm-project/vllm/pull/25471
- [BugFix] Register expert_map as named buffer for wake_up and sleep by @wuxibin89 in https://github.com/vllm-project/vllm/pull/25458
- [P/D] Support NIXL connector to disconnect during a clean shutdown by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24423
- [test/doc] make NixlConnector example more clear by @panpan0000 in https://github.com/vllm-project/vllm/pull/24249
- [XPU] Fix MOE DP accuracy issue on XPU by @faaany in https://github.com/vllm-project/vllm/pull/25465
- [UX] Change kv-cache-memory log level to debug by @mgoin in https://github.com/vllm-project/vllm/pull/25479
- [V1] Remove V0 code paths for Hybrid models by @tdoublep in https://github.com/vllm-project/vllm/pull/25400
- [Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24845
- Add backward compatibility for `GuidedDecodingParams` by @hmellor in https://github.com/vllm-project/vllm/pull/25422
- [Kernels] Support blocked fp8 quantization for compressed tensors MoE by @bnellnm in https://github.com/vllm-project/vllm/pull/25219
- [BugFix] Fix UB in per_token_group_quant.cu by @rivos-shreeasish in https://github.com/vllm-project/vllm/pull/24913
- [Log] Optimize kv cache memory log from Bytes to GiB by @yewentao256 in https://github.com/vllm-project/vllm/pull/25204
- Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu by @minosfuture in https://github.com/vllm-project/vllm/pull/25346
- [V1][Kernel] Add triton implementation for `reshape_and_cache_flash` by @bringlein in https://github.com/vllm-project/vllm/pull/24503
- [Misc] Reduce initialization time of auto_tune by @wdhongtw in https://github.com/vllm-project/vllm/pull/23682
- [Spec Decode][CI] Add e2e test for `examples/spec_decode.py` and prevent breaking Acceptance Length by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24531
- [Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank by @jeejeelee in https://github.com/vllm-project/vllm/pull/25487
- [ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 by @amd-hhashemi in https://github.com/vllm-project/vllm/pull/24988
- [core] add nccl symmetric memory for all reduce by @Amir-19 in https://github.com/vllm-project/vllm/pull/24532
- [Performance] Move apply_w8a8_block_fp8_linear to an op class by @ElizaWszola in https://github.com/vllm-project/vllm/pull/24666
- [Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE by @mgoin in https://github.com/vllm-project/vllm/pull/25444
- [Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue by @jiahanc in https://github.com/vllm-project/vllm/pull/25406
- [Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 by @mgoin in https://github.com/vllm-project/vllm/pull/25508
- Enable symmetric memory all reduce by default only enabling for TP by @ilmarkov in https://github.com/vllm-project/vllm/pull/25070
- [CI] Fix Pre-commit Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25497
- [Bugfix] gpt-oss container tool output bug by @alecsolder in https://github.com/vllm-project/vllm/pull/25485
- [Build] Update Xgrammar to 0.1.25 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25467
- [Bugfix] Fix for the import error from [#24588] by @gshtras in https://github.com/vllm-project/vllm/pull/25481
- [CI/Build] Fix and re-enable v1 PP test on CI by @Isotr0py in https://github.com/vllm-project/vllm/pull/25496
- [Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] by @Jialin in https://github.com/vllm-project/vllm/pull/24830
- [V0 Deprecation] Remove placeholder attn by @tdoublep in https://github.com/vllm-project/vllm/pull/25510
- Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… by @rouchenzi in https://github.com/vllm-project/vllm/pull/25493
- Fix triton_reshape_and_cache_flash.py triton import by @mgoin in https://github.com/vllm-project/vllm/pull/25522
- [gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI by @qandrew in https://github.com/vllm-project/vllm/pull/25428
- Remove redundant mutates_args and dispatch_key for direct_register_custom_op by @mgoin in https://github.com/vllm-project/vllm/pull/25512
- [BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting by @kouroshHakha in https://github.com/vllm-project/vllm/pull/25359
- Add `VLLM_NVTX_SCOPES_FOR_PROFILING=1` to enable `nvtx.annotate` scopes by @coreylowman in https://github.com/vllm-project/vllm/pull/25501 (see the usage sketch after this list)
- [Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for `_chunk_cumsum_fwd_kernel` by @tdoublep in https://github.com/vllm-project/vllm/pull/25197
- [ROCm] Small functional changes for gptoss by @jpvillam-amd in https://github.com/vllm-project/vllm/pull/25201
- [Perf] Increase default max splits for FA3 full cudagraphs by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25495
- [Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 by @alexm-redhat in https://github.com/vllm-project/vllm/pull/25509
- [BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25505
- Improve output when failing json.loads() on structured output test by @dougbtv in https://github.com/vllm-project/vllm/pull/25483
- Add CUTLASS FP8 MOE benchmark scripts and kernel config by @chenxi-yang in https://github.com/vllm-project/vllm/pull/25302
- [Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' by @yewentao256 in https://github.com/vllm-project/vllm/pull/25519
- [BUG] Allows for RunAI Streamer and Torch.compile cache to be used together by @ahao-anyscale in https://github.com/vllm-project/vllm/pull/24922
- [Model] Support SeedOss Reason Parser by @LuYanFCP in https://github.com/vllm-project/vllm/pull/24263
- [V1][Metrics] Add per-request TPOT histogram by @baxingpiaochong in https://github.com/vllm-project/vllm/pull/24015
- [Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen by @benchislett in https://github.com/vllm-project/vllm/pull/25520
- [Core] Support weight_loader_v2 for `UnquantizedLinearMethod` by @kylesayrs in https://github.com/vllm-project/vllm/pull/23036
- [Compile] Fix AMD Compile Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/25518
- [BugFix] Fix MLA assert with CUTLASS MLA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25478
- [fix]: add Arm 4bit fused moe support by @nikhil-arm in https://github.com/vllm-project/vllm/pull/23809
- [KV sharing] Re-land Gemma3n model changes from [#22628] by @sarckk in https://github.com/vllm-project/vllm/pull/24357
- [Spec Decode] Enable FlashInfer Spec Decoding by @benchislett in https://github.com/vllm-project/vllm/pull/25196
- [Perf] Fix jit compiles at runtime of fla gated delta rule by @coreylowman in https://github.com/vllm-project/vllm/pull/25432
- [Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls by @bbrowning in https://github.com/vllm-project/vllm/pull/25514
- [TPU][Bugfix] fix the missing apply_model in tpu worker by @yaochengji in https://github.com/vllm-project/vllm/pull/25526
- [Misc] Retry HF processing if "Already borrowed" error occurs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25535
- [Bugfix][CPU] Skip unsupported custom op register on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25534
- [CI/Build] Fix v1 OOT registration test by @Isotr0py in https://github.com/vllm-project/vllm/pull/25547
- [Misc] Move processing context to multimodal directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25548
- [CI/Build] add nightly prime-rl integration tests by @Jackmin801 in https://github.com/vllm-project/vllm/pull/25207
- [V0 Deprecation] Remove max_seq_len_to_capture by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25543
- [BugFix] Potential Fix for FA3 full-cudagraph IMA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25490
- [misc] update the warning message by @youkaichao in https://github.com/vllm-project/vllm/pull/25566
- [Bugfix] Fix dummy video number of frames calculation by @ywang96 in https://github.com/vllm-project/vllm/pull/25553
- [Bug] fix import and unit test by @jmkuebler in https://github.com/vllm-project/vllm/pull/25558
- [Benchmark] Fix regression in structured output benchmark by @russellb in https://github.com/vllm-project/vllm/pull/25500
- [docs] fix nixl kv_connector_extra_config.backends key by @panpan0000 in https://github.com/vllm-project/vllm/pull/25565
- [Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output by @taohui in https://github.com/vllm-project/vllm/pull/25405
- Move `DeviceConfig`, `ObservabilityConfig`, `SpeechToTextConfig` to their own files by @hmellor in https://github.com/vllm-project/vllm/pull/25564
- [Misc] Improve type annotations for jsontree by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25577
- [ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled by @gshtras in https://github.com/vllm-project/vllm/pull/25275
- [ROCm][Build][Bugfix] Fix ROCm base docker whls installation order by @gshtras in https://github.com/vllm-project/vllm/pull/25415
- Fixes and updates to bench_per_token_quant_fp8 by @mgoin in https://github.com/vllm-project/vllm/pull/25591
- [Bugfix] Cache the model locally when loading it from object storage by @lengrongfu in https://github.com/vllm-project/vllm/pull/24764
- Support mnnvl all2allv from Flashinfer by @wenscarl in https://github.com/vllm-project/vllm/pull/21003
- Suppress benign cuBLAS warning when capturing cudagraphs with DBO by @SageMoore in https://github.com/vllm-project/vllm/pull/25596
- [Docs] Enable `fail_on_warning` for the docs build in CI by @hmellor in https://github.com/vllm-project/vllm/pull/25580
- [V0 Deprecation] Remove unused classes in attention by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25541
- [Logging] Improve log for when DeepEP HT disables CUDA Graphs by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25531
- feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel by @djmmoss in https://github.com/vllm-project/vllm/pull/25503
- [Refactor] Use DeepGEMM Col Major TMA Aligned Tensor by @yewentao256 in https://github.com/vllm-project/vllm/pull/25517
- Improve `--help` for enhanced user experience by @hmellor in https://github.com/vllm-project/vllm/pull/24903
- [MISC] replace c10::optional with std::optional by @842974287 in https://github.com/vllm-project/vllm/pull/25602
- [Model] Improve DotsOCRForCausalLM by @jeejeelee in https://github.com/vllm-project/vllm/pull/25466
- [Kernel] Support DCP for Triton backend by @frank-wei in https://github.com/vllm-project/vllm/pull/25132
- [Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_function` calling disabled super() by @yewentao256 in https://github.com/vllm-project/vllm/pull/25613
- Enable Fbgemm NVFP4 on Dense models by @samanamp in https://github.com/vllm-project/vllm/pull/25609
- [Model] Add LongCat-Flash by @OftenDream in https://github.com/vllm-project/vllm/pull/23991
- optimize: eliminate duplicate split_enc_dec_inputs calls by @nicole-lihui in https://github.com/vllm-project/vllm/pull/25573
- [Bugfix] fix apply_temperature to avoid nan in probs by @courage17340 in https://github.com/vllm-project/vllm/pull/24734
- [Misc] Simplify PoolerOutput and move to `v1/outputs` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25629
- Map CwmForCausalLM to llama and LlamaForCausalLM by @jacobkahn in https://github.com/vllm-project/vllm/pull/25611
- typo: remove duplicate `is` by @nicole-lihui in https://github.com/vllm-project/vllm/pull/25641
- Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25607
- [fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin by @fadara01 in https://github.com/vllm-project/vllm/pull/25579
- [Misc] Fix Qwen3-VL `video_grid_thw` typing by @ywang96 in https://github.com/vllm-project/vllm/pull/25646
- [Bugfix] Add triton.language.tensor placeholder by @adobrzyn in https://github.com/vllm-project/vllm/pull/25649
- [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling by @Isotr0py in https://github.com/vllm-project/vllm/pull/25648
- [mypy] Further improve MM type annotations by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25654
- [Bugfix] Parse SpeculativeConfig Error by @yyzxw in https://github.com/vllm-project/vllm/pull/25142
- [V0 deprecation] Remove unreachable model_config.supported_tasks by @noooop in https://github.com/vllm-project/vllm/pull/25642
- Add backward compatibility for `guided_...` API by @hmellor in https://github.com/vllm-project/vllm/pull/25615
- [CI/Build] Fix flaky entrypoints test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25663
- [XPU][Triton]add xpu config in triton_reshape_and_cache_flash by @jikunshang in https://github.com/vllm-project/vllm/pull/25643
- [Hardware][RISC-V] Add riscv64 support for vLLM with scalar by @langc23 in https://github.com/vllm-project/vllm/pull/22112
- [mypy] Fix wrong type annotations related to tuple by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25660
- [misc] warning by default for hanging / busy / idle by @youkaichao in https://github.com/vllm-project/vllm/pull/25627
- [torch.compile] Make Query Quantization Fusable by @jmkuebler in https://github.com/vllm-project/vllm/pull/24914
- [CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25652
- [ux] Switch a warning to debug about a pytorch fallback by @russellb in https://github.com/vllm-project/vllm/pull/23750
- [Bugfix] Fix InternS1 video processing after Transformers v4.56 by @Isotr0py in https://github.com/vllm-project/vllm/pull/25644
- [Misc] Remove cruft file in repo by @NickLucche in https://github.com/vllm-project/vllm/pull/25678
- [Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25532
- [BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… by @AlonKejzman in https://github.com/vllm-project/vllm/pull/24662
- Revert "[Bug] Dynamo Unsupported due to
BasevLLMParameter.torch_function
calling disabled super()" by @mgoin in https://github.com/vllm-project/vllm/pull/25681 - [BugFix] Fix DBO hang by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25625
- [Model] Add optional parameter to reasoning parser constructor by @taohui in https://github.com/vllm-project/vllm/pull/25554
- [Model] Define `merge_by_field_config` MM interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25676
- [V0 deprecation] Clean up V0 fallback in compilation config by @Isotr0py in https://github.com/vllm-project/vllm/pull/25675
- [V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/25489
- [V0 deprecation] Clean up LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/25686
- [Misc] Simplify `test_argsort_mm_positions` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25690
- [Optimization] Streamline `InputPreprocessor` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25702
- [Optimization] Use a cheaper cache key in `get_model_architecture` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25682
- [Spec Decode] Add Batch Parallel Ngram. Up to 8x lower overhead. by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24986
- [Core] Enable command line logging for LLMEngine by @zhuohan123 in https://github.com/vllm-project/vllm/pull/25610
- [Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 by @tomeras91 in https://github.com/vllm-project/vllm/pull/25708
- Fix routing_bias dtype by @wenscarl in https://github.com/vllm-project/vllm/pull/25711
- [Refactor] Remove DeepGEMM OP Register by @yewentao256 in https://github.com/vllm-project/vllm/pull/25710
- [Misc] Don't log shm dequeue delay warning on worker side by @njhill in https://github.com/vllm-project/vllm/pull/25720
- Llama 3.1 405B fp4 changes upstreaming from 355_wip by @maleksan85 in https://github.com/vllm-project/vllm/pull/25135
- [Core] Force PIECEWISE CUDAGraph mode for encoder-decoder by @russellb in https://github.com/vllm-project/vllm/pull/25701
- [Misc] Remove unnecessary memoryviews in shm_broadcast.py by @njhill in https://github.com/vllm-project/vllm/pull/25721
- EVS Support (Video tokens pruning) by @BloodAxe in https://github.com/vllm-project/vllm/pull/22980
- [CI/Build] fix doc build warning: Failed to get 'name: description' pair by @yitingdc in https://github.com/vllm-project/vllm/pull/25733
- fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid hidden performance regressions by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25738
- perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25739
- [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X by @xaguilar-amd in https://github.com/vllm-project/vllm/pull/25703
- fix: print output in offline_inference/base/chat.py example by @Iceber in https://github.com/vllm-project/vllm/pull/25744
- [Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. by @sighingnow in https://github.com/vllm-project/vllm/pull/25743
- Remove cuda hard-code in compute_causal_conv1d_metadata by @wxsIcey in https://github.com/vllm-project/vllm/pull/25555
- [misc] refactor speculative config by @yyzxw in https://github.com/vllm-project/vllm/pull/25657
- [Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk by @SageMoore in https://github.com/vllm-project/vllm/pull/25698
- Support LongCat-Flash-Chat tool call by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/24083
- [Doc] Update Batch-level DP docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25757
- [Model] Mamba2 varlen and metadata refactor by @cyang49 in https://github.com/vllm-project/vllm/pull/21467
- [CI] Fix test_shared_storage_connector_hashes by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25748
- [Bugfix] Properly abort pooling request. by @noooop in https://github.com/vllm-project/vllm/pull/25734
- [CI/Build] Split up Distributed Tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25572
- [CI/Build] Fix some V1 tests not being run by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25569
- [Quantization] Add field to skip unquantized modules for GPTQ config by @Isotr0py in https://github.com/vllm-project/vllm/pull/25455
- [BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring `dbo_prefill_token_threshold`) by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25622
- [ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray by @eicherseiji in https://github.com/vllm-project/vllm/pull/25439
- [Fix][torch.compile] fix unique_filepath by @ZJY0516 in https://github.com/vllm-project/vllm/pull/25732
- Eagle3 that supports the Minicpm3 model by @LDLINGLINGLING in https://github.com/vllm-project/vllm/pull/24243
- [Doc]: improve CPU(x86) build-wheel-from-source section by @brokedba in https://github.com/vllm-project/vllm/pull/25617
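
As a usage sketch for the NVTX profiling scopes added in #25501: only the environment variable name comes from that changelog entry; the model, prompt, and offline `LLM` API usage below are illustrative assumptions rather than part of the release.

```python
import os

# Assumption: set the flag before importing vLLM so it is picked up when the
# engine initializes. VLLM_NVTX_SCOPES_FOR_PROFILING=1 comes from #25501; the
# model and prompt are placeholders for a quick smoke test.
os.environ["VLLM_NVTX_SCOPES_FOR_PROFILING"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

Running such a script under Nsight Systems (for example `nsys profile -t cuda,nvtx python profile_example.py`, with a script name of your choosing) should surface the additional `nvtx.annotate` scopes in the timeline.
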
New Contributors
- @SamitHuang made their first contribution in https://github.com/vllm-project/vllm/pull/24733
- @rozeappletree made their first contribution in https://github.com/vllm-project/vllm/pull/24785
- @ChenTaoyu-SJTU made their first contribution in https://github.com/vllm-project/vllm/pull/24732
- @ziliangpeng made their first contribution in https://github.com/vllm-project/vllm/pull/24793
- @chenfengjin made their first contribution in https://github.com/vllm-project/vllm/pull/24823
- @LCAIZJ made their first contribution in https://github.com/vllm-project/vllm/pull/23917
- @xiao-llm made their first contribution in https://github.com/vllm-project/vllm/pull/22222
- @koiker made their first contribution in https://github.com/vllm-project/vllm/pull/20321
- @cboss6 made their first contribution in https://github.com/vllm-project/vllm/pull/23745
- @liangwen12year made their first contribution in https://github.com/vllm-project/vllm/pull/23498
- @lianyiibo made their first contribution in https://github.com/vllm-project/vllm/pull/24571
- @tahsintunan made their first contribution in https://github.com/vllm-project/vllm/pull/24342
- @haoyangli-amd made their first contribution in https://github.com/vllm-project/vllm/pull/24649
- @rouchenzi made their first contribution in https://github.com/vllm-project/vllm/pull/22842
- @xinyu-intel made their first contribution in https://github.com/vllm-project/vllm/pull/25026
- @shijun-yin made their first contribution in https://github.com/vllm-project/vllm/pull/24668
- @Aidyn-A made their first contribution in https://github.com/vllm-project/vllm/pull/24599
- @dolpm made their first contribution in https://github.com/vllm-project/vllm/pull/23774
- @samzong made their first contribution in https://github.com/vllm-project/vllm/pull/25010
- @mmangkad made their first contribution in https://github.com/vllm-project/vllm/pull/24766
- @karan made their first contribution in https://github.com/vllm-project/vllm/pull/25076
- @toncao made their first contribution in https://github.com/vllm-project/vllm/pull/24960
- @666even666 made their first contribution in https://github.com/vllm-project/vllm/pull/23909
- @lumina37 made their first contribution in https://github.com/vllm-project/vllm/pull/24886
- @gfinol made their first contribution in https://github.com/vllm-project/vllm/pull/25081
- @punitvara made their first contribution in https://github.com/vllm-project/vllm/pull/25058
- @gigit0000 made their first contribution in https://github.com/vllm-project/vllm/pull/24222
- @Rohan138 made their first contribution in https://github.com/vllm-project/vllm/pull/25104
- @candyzone made their first contribution in https://github.com/vllm-project/vllm/pull/24585
- @wxsIcey made their first contribution in https://github.com/vllm-project/vllm/pull/25243
- @LJH-LBJ made their first contribution in https://github.com/vllm-project/vllm/pull/25224
- @David-Wen2025 made their first contribution in https://github.com/vllm-project/vllm/pull/24818
- @alecsolder made their first contribution in https://github.com/vllm-project/vllm/pull/24985
- @Lucaskabela made their first contribution in https://github.com/vllm-project/vllm/pull/25090
- @manoelmarques made their first contribution in https://github.com/vllm-project/vllm/pull/23558
- @lirong-lirong made their first contribution in https://github.com/vllm-project/vllm/pull/25308
- @debroy-rh made their first contribution in https://github.com/vllm-project/vllm/pull/25285
- @Sara-KS made their first contribution in https://github.com/vllm-project/vllm/pull/23445
- @Daisy-Ma-coder made their first contribution in https://github.com/vllm-project/vllm/pull/25274
- @jcyang43 made their first contribution in https://github.com/vllm-project/vllm/pull/25278
- @Zhikaiiii made their first contribution in https://github.com/vllm-project/vllm/pull/25028
- @ahartel made their first contribution in https://github.com/vllm-project/vllm/pull/25203
- @wuxibin89 made their first contribution in https://github.com/vllm-project/vllm/pull/25458
- @rivos-shreeasish made their first contribution in https://github.com/vllm-project/vllm/pull/24913
- @Amir-19 made their first contribution in https://github.com/vllm-project/vllm/pull/24532
- @LuYanFCP made their first contribution in https://github.com/vllm-project/vllm/pull/24263
- @baxingpiaochong made their first contribution in https://github.com/vllm-project/vllm/pull/24015
- @Jackmin801 made their first contribution in https://github.com/vllm-project/vllm/pull/25207
- @taohui made their first contribution in https://github.com/vllm-project/vllm/pull/25405
- @OftenDream made their first contribution in https://github.com/vllm-project/vllm/pull/23991
- @nicole-lihui made their first contribution in https://github.com/vllm-project/vllm/pull/25573
- @jacobkahn made their first contribution in https://github.com/vllm-project/vllm/pull/25611
- @fadara01 made their first contribution in https://github.com/vllm-project/vllm/pull/25579
- @langc23 made their first contribution in https://github.com/vllm-project/vllm/pull/22112
- @AlonKejzman made their first contribution in https://github.com/vllm-project/vllm/pull/24662
- @BloodAxe made their first contribution in https://github.com/vllm-project/vllm/pull/22980
- @yitingdc made their first contribution in https://github.com/vllm-project/vllm/pull/25733
- @xaguilar-amd made their first contribution in https://github.com/vllm-project/vllm/pull/25703
- @Iceber made their first contribution in https://github.com/vllm-project/vllm/pull/25744
- @LDLINGLINGLING made their first contribution in https://github.com/vllm-project/vllm/pull/24243
- @brokedba made their first contribution in https://github.com/vllm-project/vllm/pull/25617
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.10.2...v0.11.0