vLLM v0.10.0
Name                                                 Modified      Size
vllm-0.10.0+cu118-cp38-abi3-manylinux1_x86_64.whl    2025-07-25    245.5 MB
vllm-0.10.0+cu126-cp38-abi3-manylinux1_x86_64.whl    2025-07-25    360.9 MB
vllm-0.10.0-cp38-abi3-manylinux1_x86_64.whl          2025-07-25    386.6 MB
vllm-0.10.0.tar.gz                                   2025-07-25    9.2 MB
README.md                                            2025-07-24    122.4 kB
v0.10.0 source code.tar.gz                           2025-07-24    9.1 MB
v0.10.0 source code.zip                              2025-07-24    10.9 MB
Totals: 7 items                                                    1.0 GB

Highlights

The v0.10.0 release includes 308 commits from 168 contributors (62 new!).

NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long-context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and plan to continue deleting code that is no longer used.

Model Support

  • New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
  • Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
  • Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
  • VLM improvements: VLM support with the transformers backend (#20543; see the sketch after this list), PrithviMAE on the V1 engine (#20577).
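
The transformers-backend item above can be exercised from the offline LLM API. A minimal sketch, assuming the model_impl engine argument selects the Transformers modeling path and using an illustrative model id:

```python
from vllm import LLM

# Sketch only: the model id is illustrative; any VLM with Transformers support
# should work. model_impl="transformers" is assumed to route through the
# Transformers modeling backend (#20543).
llm = LLM(model="HuggingFaceTB/SmolVLM-Instruct", model_impl="transformers")

# Plain-text prompting is enough to exercise the backend; image inputs go
# through the usual multi-modal prompt plumbing.
print(llm.generate(["Describe what a reranker does."])[0].outputs[0].text)
```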

Engine Core

  • Experimental async scheduling via the --async-scheduling flag, which overlaps engine-core scheduling with the GPU runner (#19970); see the sketch after this list.
  • V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
  • Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
  • RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
  • Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
  • Startup time reduction: faster CUDA graph capture via a frozen garbage collector during capture (#21146).
  • Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
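
For the experimental async scheduling above, a minimal offline sketch; it assumes the --async-scheduling CLI flag is mirrored by an async_scheduling engine argument, and the model id is illustrative:

```python
from vllm import LLM, SamplingParams

# Sketch only: assumes the experimental --async-scheduling flag (#19970) is also
# exposed as an async_scheduling engine argument; the benefit shows up mainly
# under concurrent load, where scheduling overlaps with GPU execution.
llm = LLM(model="Qwen/Qwen3-0.6B", async_scheduling=True)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```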

Hardware & Performance

  • NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
  • Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
  • Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU Ray distributed execution (#20659; see the sketch after this list), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).
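
As a sketch of the Ray-based distributed execution referenced in the XPU item, the executor backend can be selected through engine arguments; this assumes an XPU-enabled build, a local Ray installation, and an illustrative model id:

```python
from vllm import LLM

# Sketch only: requires an XPU-enabled vLLM build and Ray installed (#20659);
# the same argument also selects the Ray executor on other accelerators.
llm = LLM(
    model="Qwen/Qwen3-0.6B",              # illustrative model id
    tensor_parallel_size=2,               # shard across two devices
    distributed_executor_backend="ray",   # use the Ray distributed executor
)
print(llm.generate(["Hello from the Ray executor."])[0].outputs[0].text)
```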

Quantization

  • New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061); see the sketch after this list.
  • Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
  • Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).
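
A minimal sketch of the in-flight BNB quantization path for a MoE checkpoint, as referenced in the first item above; it assumes the bitsandbytes package is installed and uses an illustrative model id:

```python
from vllm import LLM

# Sketch only: quantizes the checkpoint on the fly with bitsandbytes
# (#20893, #20061); requires `pip install bitsandbytes` and enough GPU memory
# for the quantized MoE weights.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative MoE model id
    quantization="bitsandbytes",
)
print(llm.generate(["What is mixture-of-experts?"])[0].outputs[0].text)
```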

API & Frontend

  • OpenAI compatibility: Responses API implementation (#20504, #20975; see the sketch after this list), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).
  • New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
  • Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
  • CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
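
The Responses API mentioned in the first item can be reached through the standard OpenAI Python client once a server is up. A minimal sketch, assuming `vllm serve Qwen/Qwen3-0.6B` is already running on localhost:8000:

```python
from openai import OpenAI

# Sketch only: talks to a locally running vLLM OpenAI-compatible server that
# exposes the new Responses API (#20504); the model name must match the
# model being served.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.responses.create(
    model="Qwen/Qwen3-0.6B",
    input="Summarize what a KV cache is in one sentence.",
)
print(resp.output_text)
```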

Dependencies

  • Updated PyTorch to 2.7.1 for CUDA (#21011)
  • FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed

New Contributors

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.9.1...v0.10.0

Source: README.md, updated 2025-07-24