Highlights

The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance: with the FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations combined on H200, SGLang achieves nearly 100 tokens/s, currently the fastest open-source implementation. Look out for new optimizations coming soon!
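
As a hedged sketch of how these optimizations combine: `--enable-flashinfer-mla` and `SGL_ENABLE_JIT_DEEPGEMM=1` come from the notes below, and the launch command and `--enable-torch-compile` are standard SGLang usage, but the model path and `--tp 8` are illustrative assumptions, not the team's exact benchmark setup.

    # A minimal sketch, not the team's exact setup: serve DeepSeek R1 on one
    # H200 node with the optimizations named above. Flag spellings can vary
    # by version; check `python3 -m sglang.launch_server --help`.
    export SGL_ENABLE_JIT_DEEPGEMM=1   # DeepGEMM JIT kernels (Hopper GPUs)
    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-R1 \
      --tp 8 --trust-remote-code \
      --enable-flashinfer-mla \
      --enable-torch-compile
    # MTP is enabled through the speculative-decoding flags, which are
    # version-dependent; tune them with the bench_speculative script
    # mentioned under Optimizations below.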

Many thanks to the xAI, NVIDIA, AMD, LinkedIn, Baseten, and Meituan teams, and to the open-source community users, for their contributions!

Beyond the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang to accelerate DeepSeek R1 inference. We are very happy to have received recognition and adoption from these teams!

There will surely be bugs that we discover and quickly patch in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!

Optimizations

  • AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog

  • Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with --enable-flashinfer-mla

  • Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script (see the tuning sketch after this list)

  • DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with export SGL_ENABLE_JIT_DEEPGEMM=1

  • Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models (a serving sketch follows this list):

    • meituan/DeepSeek-R1-Channel-INT8

    • meituan/DeepSeek-R1-Block-INT8

  • Hardware Optimizations:

    • Block-Scale FP8 GEMM support for the Blackwell architecture

    • Support for page sizes greater than 1 (https://github.com/sgl-project/sglang/pull/4356)

    • Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), including a 15%+ improvement on sm89

    • Enhanced distributed parallelism, e.g., two-node configurations with DP 2, TP 8 (https://github.com/sgl-project/sglang/pull/4390); a launch sketch follows this list
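
The MTP item above points to the bench_speculative script for tuning. The script path and argument below are hypothetical placeholders, not the verified interface; check the repository for the real one.

    # Hypothetical invocation of the bench_speculative tuning script; the
    # path and argument names are assumptions, not the verified interface.
    python3 scripts/playground/bench_speculative.py \
      --model-path deepseek-ai/DeepSeek-R1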
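
A minimal sketch for serving the channel-wise INT8 checkpoint listed above, assuming SGLang's --quantization w8a8_int8 option; INT8 weights for DeepSeek R1 are on the order of 700 GB, so a large tensor-parallel (likely multi-node) setup is assumed.

    # Minimal sketch: serve the channel-wise INT8 DeepSeek R1 checkpoint.
    # --quantization w8a8_int8 and --tp 16 are assumptions to verify; for a
    # two-node tp 16 run, add the distribution flags from the sketch below.
    python3 -m sglang.launch_server \
      --model-path meituan/DeepSeek-R1-Channel-INT8 \
      --tp 16 --trust-remote-code \
      --quantization w8a8_int8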
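
For the two-node DP 2, TP 8 configuration, here is one plausible launch. The flag combination (--dp, --enable-dp-attention, and the multi-node flags) reflects SGLang's distributed serving options but is not taken verbatim from the PR, so verify it against --help for your version.

    # Node 0 of 2; NODE0_IP is a placeholder for the first node's address.
    # The exact DP/TP flag combination is an assumption; see PR #4390.
    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-R1 \
      --tp 8 --dp 2 --enable-dp-attention \
      --nnodes 2 --node-rank 0 \
      --dist-init-addr NODE0_IP:5000 \
      --trust-remote-code
    # On node 1, run the same command with --node-rank 1.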

Coming soon

What's Changed

New Contributors

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.3...v0.4.4
