SGLang v0.5.9

Highlights

  • LoRA Weight Loading Overlap with Computation: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% with large adapters: [#15512]

  • TRT-LLM NSA Kernel Integration for DeepSeek V3.2: Integrate TRT-LLM DSA kernels as the Native Sparse Attention (NSA) backend, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms when trtllm is set for both --nsa-prefill-backend and --nsa-decode-backend, with a minor accuracy drop (launch sketch after this list): [#16758], [#17662], [#18389]

  • Flashinfer All-to-All MoE Dispatcher: Add the Flashinfer all-to-all MoE dispatcher for efficient expert parallelism communication, enabling optimized routing in MoE models: [#14668]

  • FA4 (FP4 Attention) Support for Multimodal Encoder: Introduce FP4 attention backend and variable-length attention function for multimodal encoders, enabling lower-precision inference for vision-language models: [#13539]

  • Anthropic-Compatible API Endpoint: Add native Anthropic API compatibility to SGLang, allowing direct integration with tools and clients built for the Anthropic API format (client sketch after this list): [#18630]

  • SGLang-Diffusion Advanced Optimizations: Production-ready improvements including token-level sequence sharding, parallel VAE decoding, fused kernels, Nunchaku and FP8 support, and multiple new models in the ComfyUI plugin: blog

  • Spec V2 Critical Bug Fix: Fix an out-of-bounds index bug caused by torch garbage collection in speculative decoding v2, improving the reliability of speculative verification: [#18958]

  • Deploying DeepSeek on GB300 NVL72: Optimization work for long-context inference using prefill-decode disaggregation and other SGLang features on NVIDIA's latest GB300 platform: blog

  • Bump AITER Version to 0.1.10.post3: Adds FP8 prefill, FP8 decode, and FP8 KV cache support

  • Commit-to-Version Lookup in docs.sglang.io: Easily find the earliest official version that includes a given PR or commit, streamlining release tracking for users and developers: [#18450]
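
For the TRT-LLM NSA integration above, a minimal launch sketch via SGLang's offline Engine API. This is a sketch under stated assumptions: the kwarg spellings are assumed to mirror the --nsa-prefill-backend and --nsa-decode-backend flags named above, and the model id and tp_size are only illustrative.

```python
# Sketch: enable the TRT-LLM NSA kernels through SGLang's offline Engine.
# Assumptions: the kwargs below mirror the --nsa-prefill-backend and
# --nsa-decode-backend CLI flags; model id and tp_size are illustrative
# for a Blackwell deployment, not a verified recipe.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3.2-Exp",  # illustrative model id
    tp_size=8,                                   # illustrative parallelism
    nsa_prefill_backend="trtllm",
    nsa_decode_backend="trtllm",
)

out = llm.generate(
    "Explain sparse attention in one sentence.",
    {"max_new_tokens": 64, "temperature": 0.0},
)
print(out["text"])
llm.shutdown()
```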
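
For the Anthropic-compatible endpoint, a minimal client sketch using the official anthropic Python SDK pointed at a local SGLang server. The port, served model name, and placeholder API key are assumptions for illustration.

```python
# Sketch: exercise SGLang's Anthropic-compatible endpoint with the official
# `anthropic` SDK by overriding the base URL. The port, model name, and
# placeholder API key are assumptions; use whatever model the server serves.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:30000",  # local SGLang server, not api.anthropic.com
    api_key="EMPTY",                    # placeholder; assumes no auth is configured
)

message = client.messages.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative served model
    max_tokens=256,
    messages=[{"role": "user", "content": "What's new in SGLang v0.5.9?"}],
)
print(message.content[0].text)
```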

New Model Support

  • Kimi-K2.5: [#17789], cookbook
  • GLM-5: cookbook (still requires a custom Docker image for the transformers upgrade; an RC release will follow, since upgrading transformers is risky)
  • Qwen 3.5: [#18489], [#18926], [#18937], cookbook
  • MiniMax 2.5: cookbook
  • Ernie4.5-VL: [#15679]
  • Step3-VL: [#17513]
  • Step-3.5-Flash: [#18084], cookbook
  • LLaDA 2.1: cookbook
  • Ring 2.5 1T / Ling 2.5 1T: [#18598], cookbook, cookbook
  • MOVA (Diffusion): [#17704]
  • GLM-OCR: [#17582], cookbook
  • DeepSeek-OCR-2: [#17897]

SGLang-Diffusion

  • Support for multiple new models in the ComfyUI plugin
  • Parallel Folding and Parallel VAE Decoding for faster image/video generation
  • Nunchaku and FP8 support for diffusion models
  • Sequence Sharding (token-level) replacing Frame Sharding for improved efficiency
  • LTX-2 support: [#17495], [#17496]
  • MOVA model support: [#17704]
  • Cache-DiT optimizations and fused kernel improvements
  • Numerous bug fixes and refactors across the diffusion pipeline

Performance

  • Integrate TRT-LLM NSA kernels for a 3x-5x speedup on Blackwell: [#16758], [#17662], [#18389]
  • LoRA weight loading overlap, reducing TTFT by ~78% (see the sketch after this list): [#15512]
  • Flashinfer all-to-all MoE dispatcher: [#14668]
  • FA4 for multimodal encoder: [#13539]
  • Optimize GDN decode for Qwen3 Next: [#17094]
  • Tune fused MoE kernels for Llama-4-Scout, MiniMax M2: [#17891], [#18851], [#18833]
  • Symmetric memory pre-allocation to avoid fragmentation: [#17089]
  • Optimize fused_moe triton kernel TMA: [#18782]
  • Fused triton kernel for Ernie4.5-VL rotary embedding: [#18856]
  • Support MxINT4 Flashinfer TRT-LLM MoE GEMM: [#16892]
  • AITER bias MoE support for GPT-OSS MxFP4: [#17735]
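
A minimal sketch of the LoRA serving path that the weight-loading/compute overlap accelerates. The "name=path" spelling of lora_paths, the per-request lora_path argument, and the adapter and model ids are all illustrative assumptions; consult the LoRA docs for the authoritative interface.

```python
# Sketch: serve a base model with one LoRA adapter and select it per request,
# the path whose weight loading now overlaps with computation. Assumptions:
# the lora_paths spelling, the per-request lora_path argument, and the
# adapter/model ids are illustrative only.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",       # illustrative base model
    lora_paths=["my_adapter=/models/loras/my_adapter"],  # hypothetical adapter
)

out = llm.generate(
    "Rewrite politely: send the report now.",
    {"max_new_tokens": 64},
    lora_path="my_adapter",  # route this request through the adapter
)
print(out["text"])
llm.shutdown()
```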

Prefill-Decode Disaggregation

  • Support KV transfer with MORI-IO: [#14626]
  • Mooncake intra-node NVLink KV transfer: [#17866]
  • Improve KV offset calculation for MHA models with differing TP sizes: [#18163]
  • Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL (usage sketch after this list): [#18259]
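
A minimal sketch of opting into the Mooncake memory pool variable documented above, set before engine startup; the accepted value ("true") is an assumption, and [#18259] is the authoritative reference.

```python
# Sketch: opt into Mooncake's custom memory pool for KV transfer by setting
# the documented variable before the engine process starts. The accepted
# value ("true") is an assumption; see #18259 for the authoritative spelling.
import os

os.environ["SGLANG_MOONCAKE_CUSTOM_MEM_POOL"] = "true"
# ...then launch the prefill and decode workers as usual.
```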

Diffusion LLM (dLLM)

  • Remove cuda graph batch size limitation: [#17458]
  • JointThreshold algorithm for joint M2T and T2T decoding: [#18171]
  • Basic dLLM scheduling strategy and implementation: [#17484]

Speculative Decoding

  • Fix an out-of-bounds index bug caused by torch garbage collection in Spec V2: [#18958]
  • Move forward timeout before verify to fix Eagle v1 filter mismatch: [#18760]

Dependencies

  • Flashinfer updated to 0.6.3: [#17700]
  • AITER updated to 0.1.10.post3: [#18741]
  • Mooncake transfer engine updated to 0.3.9: [#18316]

AMD Hardware

  • AITER updated to v0.1.10.post3 with FP8 Prefill, FP8 Decode, FP8 KV Cache support
  • ROCm 7 standardization and ROCm 6.3 deprecation: [#17785]
  • Kimi K2.5 Day 0 ROCm support: [#17863]
  • FP8 prefill attention kernel integration: [#18528]
  • Two-batch overlapping for MORI EP: [#17953]
  • DeepSeek V3.2 and Kimi-K2 nightly CI tests: [#17523]

NPU/Ascend

  • Support for MiniCPM3-4B: [#16866]
  • Qwen 3.5 support on Ascend: [#18544]
  • Accuracy improvements for StableLM-2: [#17470]
  • Bug fixes for DeepSeek V3.2 and DeepSeek-VL2: [#17007]

CPU Backend

  • Optimize Qwen3-Next model on CPU: [#12525]
  • Optimize flash_attn_varlen_func: [#15708]
  • Add INT4 kernels for CPU: [#8226]

Kernel Slimming

  • Migrate GPTQ-Marlin repack kernel to JIT: [#18543]
  • Migrate AWQ Marlin repack kernel to JIT: [#18949]

Documentation

  • Add RL documentation: [#17663]
  • Update torch compile description: [#17819]
  • Refine spec decode docs for SpecV2/STANDALONE/NGRAM: [#18321]
  • Consolidate diffusion documentation: [#18095]

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.8...v0.5.9
