Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2025-03-13 | 46.5 kB | |
Release v0.4.4 source code.tar.gz | 2025-03-13 | 3.4 MB | |
Release v0.4.4 source code.zip | 2025-03-13 | 4.0 MB | |
Totals: 3 Items | | 7.4 MB | 0 |
Highlights
The SGLang team is excited to announce the release of v0.4.4. We continue to improve DeepSeek V3/R1 performance: by combining FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, SGLang reaches nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!
Many thanks to the xAI, NVIDIA, AMD, LinkedIn, Baseten, and Meituan teams and to the open-source community users for their contributions!
Beyond the users mentioned in the announcement, teams such as Tencent and Ant Group also use SGLang to accelerate DeepSeek R1 inference. We are very happy to have received recognition and usage from these teams!
There will surely be bugs that we discover and quickly patch in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel: https://slack.sglang.ai/ Cheers!
Optimizations
- AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog
- Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with `--enable-flashinfer-mla`
- Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script
- DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with `export SGL_ENABLE_JIT_DEEPGEMM=1`
- Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models:
- Hardware Optimizations:
  - Blackwell architecture Block Scale FP8 GEMM support
  - Support page size greater than 1 https://github.com/sgl-project/sglang/pull/4356
  - Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
  - Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 8) https://github.com/sgl-project/sglang/pull/4390
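As a rough illustration of the two opt-in switches mentioned above, the sketch below assembles a server launch command in Python without running it. The flag `--enable-flashinfer-mla` and the variable `SGL_ENABLE_JIT_DEEPGEMM` come from these notes. The helper name `build_launch`, the model path `deepseek-ai/DeepSeek-V3`, and the `--tp 8` setting are assumptions chosen for illustration. The `sglang.launch_server` entry point and its `--model-path`, `--tp`, and `--trust-remote-code` arguments should be checked against the SGLang documentation for your installed version.

```python
import os
import shlex


def build_launch(model_path, tp_size, use_flashinfer_mla=True, use_deepgemm=True):
    """Return (argv, env) for an SGLang server launch with the v0.4.4 opt-ins.

    build_launch is a hypothetical helper, not part of SGLang.
    """
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp_size),
        "--trust-remote-code",
    ]
    if use_flashinfer_mla:
        # Flag named in the release notes for FlashInfer MLA.
        argv.append("--enable-flashinfer-mla")

    # Copy the current environment so the caller's os.environ is not modified.
    env = dict(os.environ)
    if use_deepgemm:
        # Environment variable named in the release notes for DeepGEMM (Hopper).
        env["SGL_ENABLE_JIT_DEEPGEMM"] = "1"
    return argv, env


# Model path and tensor-parallel size are illustrative assumptions.
argv, env = build_launch("deepseek-ai/DeepSeek-V3", tp_size=8)
print(shlex.join(argv))
```

To actually start the server, the pair could be passed to `subprocess.run(argv, env=env, check=True)` on a machine where SGLang and suitable GPUs are available.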
Coming soon
- Integrate Flash Attention https://github.com/sgl-project/sglang/issues/4385
- Integrate FlashMLA https://github.com/sgl-project/sglang/issues/4384
- EAGLE 2 optimization https://github.com/sgl-project/sglang/pull/4383
- EAGLE 3 day one support https://github.com/sgl-project/sglang/pull/4247
- Integrate DeepEP https://github.com/sgl-project/sglang/pull/4232
- Prefill and Decoding Disaggregation
What's Changed
- update flashinfer-python by @zhyncs in https://github.com/sgl-project/sglang/pull/3557
- fix doc by @zhyncs in https://github.com/sgl-project/sglang/pull/3558
- Add support for OpenAI API o1 model by @ChuyueSun in https://github.com/sgl-project/sglang/pull/3363
- fix sgl-kernel codestyle by @BBuf in https://github.com/sgl-project/sglang/pull/3563
- docs: update install by @zhyncs in https://github.com/sgl-project/sglang/pull/3581
- Copy config files for MI300X to support in virtualized environments by @yosoyjay in https://github.com/sgl-project/sglang/pull/3505
- ROCm docker: triton update by @HaiShaw in https://github.com/sgl-project/sglang/pull/3584
- [fix] added support for vlm in offline inference by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3548
- Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by @ispobock in https://github.com/sgl-project/sglang/pull/3582
- [CI] Improve Docs CI Efficiency by @shuaills in https://github.com/sgl-project/sglang/pull/3587
- doc: emphasize and notify the usage of chat_template by @mickqian in https://github.com/sgl-project/sglang/pull/3589
- fix eagle unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/3591
- fix high qps crash when enable mtp by @zhyncs in https://github.com/sgl-project/sglang/pull/3592
- fix apply_token_bitmask_inplace_cuda by @zhyncs in https://github.com/sgl-project/sglang/pull/3594
- [docs] added favicon to sphinx html by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3564
- fix lockfile and port_registry file permission error by @Jiadalee in https://github.com/sgl-project/sglang/pull/3598
- feat: Support Qwen 2.5 vl by @mickqian in https://github.com/sgl-project/sglang/pull/3258
- [ROCm] Use `tl.range()` in block GEMM kernels with `num_stages` set by host. by @whchung in https://github.com/sgl-project/sglang/pull/3535
- Update to latest amd image. by @saienduri in https://github.com/sgl-project/sglang/pull/3597
- Benchmark for reasoning models by @simveit in https://github.com/sgl-project/sglang/pull/3532
- Draft of updated doc for sampling params. by @simveit in https://github.com/sgl-project/sglang/pull/3260
- [docs] Update sampling_params.md by @shuaills in https://github.com/sgl-project/sglang/pull/3617
- [docker] added rdma support by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3619
- Revert "[ROCm] Use `tl.range()` in block GEMM kernels with `num_stage… by @zhyncs in https://github.com/sgl-project/sglang/pull/3632
- add mtp unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/3634
- update unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/3636
- chore: bump v0.4.3.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/3638
- h800 deepseek r1 config and support multi-gpu block-gemm tuning by @BBuf in https://github.com/sgl-project/sglang/pull/3639
- feat: support flashinfer mla with prefix cache by @zhyncs in https://github.com/sgl-project/sglang/pull/3643
- chore: update flashinfer v0.2.1.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/3644
- chore: bump v0.4.3.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/3645
- use transformers 4.48.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/3650
- [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by @whchung in https://github.com/sgl-project/sglang/pull/3616
- [ROCm] Optimal MOE Tuning for AMD Radeon Graphics by @BruceXcluding in https://github.com/sgl-project/sglang/pull/3567
- Deploy multi-node inference (LWS method) using sglang in a K8s cluster by @whybeyoung in https://github.com/sgl-project/sglang/pull/3624
- Update amd docker image. by @saienduri in https://github.com/sgl-project/sglang/pull/3654
- [Feature] Apply Cublas Grouped Gemm kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/3629
- update pr-test by @zhyncs in https://github.com/sgl-project/sglang/pull/3663
- Fix draft decode max batch size by @ispobock in https://github.com/sgl-project/sglang/pull/3676
- fix: remove dependency on latest transformers impl by @mickqian in https://github.com/sgl-project/sglang/pull/3635
- AMD Prefill optimize by @fsx950223 in https://github.com/sgl-project/sglang/pull/3665
- fix: apply cache size limit of attention mask for VisionAttention by @mickqian in https://github.com/sgl-project/sglang/pull/3657
- set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by @zhyncs in https://github.com/sgl-project/sglang/pull/3698
- use warp shuffle style reduce and flashinfer vectorize by @BBuf in https://github.com/sgl-project/sglang/pull/3628
- [Docs] Add SkyPilot DeepSeek example by @Michaelvll in https://github.com/sgl-project/sglang/pull/3706
- [k8s] remove unnecessary hostIPC for security concern by @panpan0000 in https://github.com/sgl-project/sglang/pull/3700
- [moe] optim: reduce memory consumption in fused_moe by @ch-wan in https://github.com/sgl-project/sglang/pull/3692
- [Improve] Fix Multi-User Port Allocation Conflicts by @shuaills in https://github.com/sgl-project/sglang/pull/3601
- Variance measure for reasoning benchmark by @simveit in https://github.com/sgl-project/sglang/pull/3677
- Docs: Fix layout with sub-section by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3710
- add control for cutlass fp8 blockwise gemm by @yizhang2077 in https://github.com/sgl-project/sglang/pull/3727
- revert BLOCK and num_warps on HIP by @HaiShaw in https://github.com/sgl-project/sglang/pull/3722
- Optimize triton attention custom mask by @ispobock in https://github.com/sgl-project/sglang/pull/3731
- [Bugfix] Fix scores mask for moe topk by @Chen-XiaoBing in https://github.com/sgl-project/sglang/pull/3705
- [Docs] Modify ep related server args and remove cublas part of deepseek by @Fridge003 in https://github.com/sgl-project/sglang/pull/3732
- [Fix] Fix bugs and refactor codes in lora for better scalability. by @aoshen524 in https://github.com/sgl-project/sglang/pull/3652
- docs: fix 404 link by @trayvonpan in https://github.com/sgl-project/sglang/pull/3588
- [docs] added torch.compile cache to dpsk manual by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3737
- AMD/ROCm: update AITER repo to ROCm/aiter by @HaiShaw in https://github.com/sgl-project/sglang/pull/3747
- feat: update grouped_topk to support softmax and sigmoid by @zixuanzhang226 in https://github.com/sgl-project/sglang/pull/3680
- feat: Add SageMaker support by @andjsmi in https://github.com/sgl-project/sglang/pull/3740
- Change description of nvidia jetson docs by @shahizat in https://github.com/sgl-project/sglang/pull/3761
- [Fix] fix OpenAI API adapter tokenizer encoding by @shuaills in https://github.com/sgl-project/sglang/pull/3432
- [bug] fixed batch api by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3754
- Adjustments to docs by @simveit in https://github.com/sgl-project/sglang/pull/3733
- docs: Add offline engine launch example and documentation by @shuaills in https://github.com/sgl-project/sglang/pull/3771
- Update offline_engine_api.ipynb by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3773
- Support Qwen RM model. by @simveit in https://github.com/sgl-project/sglang/pull/3772
- Add support for nvidia modelopt fp8 kv cache by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/3223
- Tiny fix Olmo2 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3348
- fix lm head weights in Qwen models by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3777
- Fix weight loader error when LM head weights are tied by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3766
- Let DetokenizerManager use TypeBasedDispatcher by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3117
- bench: Add a benchmark for vLM: MMMU by @mickqian in https://github.com/sgl-project/sglang/pull/3562
- Extract generation_manager from tokenizer_manager by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3115
- Rename TokenizerManager to StdOrchestrator by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3116
- [Docs]Add instruction for manually stopping nsys profiler by @Fridge003 in https://github.com/sgl-project/sglang/pull/3795
- Hierarchical Caching for SGLang by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/2693
- Update readme by @merrymercy in https://github.com/sgl-project/sglang/pull/3809
- Fix dependency by @merrymercy in https://github.com/sgl-project/sglang/pull/3813
- Refactor flashinfer logic for deepseek v3 and fix accuracy bug by @Fridge003 in https://github.com/sgl-project/sglang/pull/3785
- Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in https://github.com/sgl-project/sglang/pull/3730
- Fix pandas dependency in CI by @merrymercy in https://github.com/sgl-project/sglang/pull/3818
- Revert "Rename TokenizerManager to StdOrchestrator" by @merrymercy in https://github.com/sgl-project/sglang/pull/3828
- Revert "Extract generation_manager from tokenizer_manager" by @merrymercy in https://github.com/sgl-project/sglang/pull/3829
- Fix CI and install docs by @merrymercy in https://github.com/sgl-project/sglang/pull/3821
- typos by @WrRan in https://github.com/sgl-project/sglang/pull/3801
- doc: fix dead link in router.md by @He1pa in https://github.com/sgl-project/sglang/pull/3799
- Fix doc site copyright to current year by @wilsonwu in https://github.com/sgl-project/sglang/pull/3741
- [Doc] Fix typo in server-argument description by @yuanheng-zhao in https://github.com/sgl-project/sglang/pull/3641
- [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 by @lcskrishna in https://github.com/sgl-project/sglang/pull/3237
- [BugFix]: Add missing clamp to llavavid by @PanJason in https://github.com/sgl-project/sglang/pull/3787
- [dist] made timeout configurable by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3803
- Fix allgather ops inside cuda graphs by @nvcastet in https://github.com/sgl-project/sglang/pull/3709
- fix capture_bs by @fsx950223 in https://github.com/sgl-project/sglang/pull/3857
- [BugFix] Fix crash when receive a req with structed output in DP attention mode. by @hcyz33 in https://github.com/sgl-project/sglang/pull/3841
- Fix maximum recursion depth triggered on exception exit by @kebe7jun in https://github.com/sgl-project/sglang/pull/3519
- [doc] added quantization doc for dpsk by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3843
- [doc] fixed dpsk quant faq by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3865
- Expert Parallelism (EP) Support for DeepSeek V3/R1 by @sleepcoo in https://github.com/sgl-project/sglang/pull/3602
- Revert recent changes by @simveit in https://github.com/sgl-project/sglang/pull/3845
- Feature/improve docs by @simveit in https://github.com/sgl-project/sglang/pull/3860
- [Feature] Support llguidance for constrained decoding by @JC1DA in https://github.com/sgl-project/sglang/pull/3298
- Move dpsk docs forward a step by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3894
- Docs: Reorngaize dpsk links by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3900
- Implemented frontend docs by @simveit in https://github.com/sgl-project/sglang/pull/3791
- [doc] update sponsorship by @whybeyoung in https://github.com/sgl-project/sglang/pull/3903
- [Rocm] Fix to the rocm_mla_decode_rope.py returning random result by @Chi-Chu319 in https://github.com/sgl-project/sglang/pull/3898
- [doc] Update document for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/3907
- Add return hidden state in the native API by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3897
- [Docs] Disable notebook CI when merge to main by @xqoasis in https://github.com/sgl-project/sglang/pull/3905
- [Docs] Improve DPSK docs in dark mode by @hebiao064 in https://github.com/sgl-project/sglang/pull/3914
- [Doc] Add experimental tag for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/3925
- Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in https://github.com/sgl-project/sglang/pull/3922
- xgrammar 0.1.14 by @qeternity in https://github.com/sgl-project/sglang/pull/3593
- revert "Docs: Reorngaize dpsk links [#3900]" by @zhyncs in https://github.com/sgl-project/sglang/pull/3933
- upgrade flashinfer v0.2.2.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/3934
- Fix the doc link for sampling params by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3861
- [feat] Add Vertex AI compatible prediction route for /generate by @KCFindstr in https://github.com/sgl-project/sglang/pull/3866
- [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/3613
- Fix bench_serving not recognizing OPENAI_API_KEY by @kebe7jun in https://github.com/sgl-project/sglang/pull/3870
- set a strict sgl-kernel version by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3950
- [Bugfix] Fix tokenizer_manager not getting 400 when req is too long by @CatherineSue in https://github.com/sgl-project/sglang/pull/3678
- [Feature] integrate Structural Tag in xgrammar backend for function calling by @minleminzui in https://github.com/sgl-project/sglang/pull/3566
- SGLang + Verl by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3852
- Remove unused imports from rocm mla kernel. by @lcskrishna in https://github.com/sgl-project/sglang/pull/3963
- Update cutlass dependency by @elfiegg in https://github.com/sgl-project/sglang/pull/3966
- [Feature]Support ragged prefill in flashinfer mla backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/3967
- Docs: add type hint to smapling parameters by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3975
- Add redline to highlight main process by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3977
- rename FunctionCallReqInput to ParseFunctionCallReq by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3976
- Docs: add special warning to engine docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3979
- Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3982
- Move return_hidden_states to the generate input by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3985
- Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/3989
- add deepgemm and sglang fp8 block-wise gemm benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/3893
- fix typo by @BBuf in https://github.com/sgl-project/sglang/pull/3991
- Fix all gather torch compile by @ispobock in https://github.com/sgl-project/sglang/pull/3992
- Add accuracy test for TP torch compile by @ispobock in https://github.com/sgl-project/sglang/pull/3994
- Enable custom AR for AMD GPUs and maintain it in sgl-kernel by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/3406
- Add Benchmark for DeepGEMM Group GEMM by @hebiao064 in https://github.com/sgl-project/sglang/pull/3993
- [feat] add small vocab table for eagle's draft model[1]. by @Zhou-sx in https://github.com/sgl-project/sglang/pull/3822
- Add fast decode plan for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/3987
- Revert "Add fast decode plan for flashinfer mla" by @merrymercy in https://github.com/sgl-project/sglang/pull/4008
- Add examples to token-in-token-out for LLM by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4010
- Fix nightly-test CI by @yinfan98 in https://github.com/sgl-project/sglang/pull/3826
- Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark by @hebiao064 in https://github.com/sgl-project/sglang/pull/4014
- Improve code styles by @merrymercy in https://github.com/sgl-project/sglang/pull/4021
- Clean up custom allreduce by @merrymercy in https://github.com/sgl-project/sglang/pull/4029
- remove cache configs in model definitions by @merrymercy in https://github.com/sgl-project/sglang/pull/4031
- Update metrics documentation by @binarycrayon in https://github.com/sgl-project/sglang/pull/3264
- Reorganize c++ source files in sgl-kernel with multiple folders by @merrymercy in https://github.com/sgl-project/sglang/pull/4025
- Reorganize python source files in sgl-kernel with multiple files by @merrymercy in https://github.com/sgl-project/sglang/pull/4027
- Misc clean up; Remove the support of jump forward by @merrymercy in https://github.com/sgl-project/sglang/pull/4032
- Docs: Fix sampling parameter by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4034
- Remove outdated test utils and fix links for the doc of sampling params by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3999
- Add examples in sampling parameters by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4039
- Share target model embed and head weights for nextn by @ispobock in https://github.com/sgl-project/sglang/pull/4033
- Add a link to the roadmap in README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/4043
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/4044
- Fix assert options.num_stages != 0 error in the latest ROCm build image by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/4049
- Reasoning parser by @xihuai18 in https://github.com/sgl-project/sglang/pull/4000
- HotFix for [#3988] using blockwise_int8 by @xihuai18 in https://github.com/sgl-project/sglang/pull/4023
- Fix breakage problem when using custom_ar by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/4052
- ROCm: update aiter and its usage to fused moe (bloat16, fp8, fp8 block-quant) by @HaiShaw in https://github.com/sgl-project/sglang/pull/4053
- Fix `debug_tensor_dump_output_folder` optional key missing by @Qubitium in https://github.com/sgl-project/sglang/pull/4046
- Remove grafana dashboard's datasource uid by @kebe7jun in https://github.com/sgl-project/sglang/pull/4051
- [Fix & Style] Refactor the grammar backend to reduce human errors and improve readability by @DarkSharpness in https://github.com/sgl-project/sglang/pull/4030
- [XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. by @cboss6 in https://github.com/sgl-project/sglang/pull/3954
- sgl-router - issues on routing and project build. (#3870) by @michaelfeil in https://github.com/sgl-project/sglang/pull/3948
- fix: support gelu_new activation function in gpt2 by @Xiuyu-Li in https://github.com/sgl-project/sglang/pull/3712
- remove unused max_jobs by @sgjzfzzf in https://github.com/sgl-project/sglang/pull/3607
- [Feature] Add test for speculative_token_map by @Achazwl in https://github.com/sgl-project/sglang/pull/4016
- Revert "Fix nightly-test CI" by @merrymercy in https://github.com/sgl-project/sglang/pull/4065
- Update nextn ci test by @ispobock in https://github.com/sgl-project/sglang/pull/4071
- Simplify eagle tests and TP sync in grammar backend by @merrymercy in https://github.com/sgl-project/sglang/pull/4066
- Add examples for returning hidden states when using the server by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4074
- [Minor] more code cleanup by @merrymercy in https://github.com/sgl-project/sglang/pull/4077
- test: add vlm to token in & out example by @mickqian in https://github.com/sgl-project/sglang/pull/3941
- [QUANT] Add GPTQModel Dynamic Quantization + `lm_head` Quantization by @Qubitium in https://github.com/sgl-project/sglang/pull/3790
- bench: add dataset param for bench_multiturn by @zeroorhero in https://github.com/sgl-project/sglang/pull/3990
- ROCM: AITER BLOCK GEMM by @BruceXcluding in https://github.com/sgl-project/sglang/pull/4075
- [Eagle] Refactor eagle speculative decoding by @Ying1123 in https://github.com/sgl-project/sglang/pull/3986
- Fix the moe padding conditional logic by @HaiShaw in https://github.com/sgl-project/sglang/pull/4081
- [Revision] Add fast decode plan for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/4012
- Fix triton kernel illegal memory issue for eagle by @ispobock in https://github.com/sgl-project/sglang/pull/4100
- Add update_weights_from_disk endpoint to Engine by @jhinpan in https://github.com/sgl-project/sglang/pull/4102
- Add DeepSeek optimization ablations documentation by @M0gician in https://github.com/sgl-project/sglang/pull/4107
- reorganize dpsk docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4108
- Add examples for server token-in-token-out by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4103
- revert deepseek docs by @zhyncs in https://github.com/sgl-project/sglang/pull/4109
- Create release-docker-amd-nightly.yml by @saienduri in https://github.com/sgl-project/sglang/pull/4105
- remove testing on PR workflow change by @saienduri in https://github.com/sgl-project/sglang/pull/4110
- Debug radixcache: refactor recursive helper methods by @luzengxiangcn in https://github.com/sgl-project/sglang/pull/3029
- Online serving benchmarks of real datasets for hierarchical KV caching by @PanJason in https://github.com/sgl-project/sglang/pull/3211
- fix cross-reference error and spelling mistakes by @samzong in https://github.com/sgl-project/sglang/pull/4101
- fix Non-consecutive header level increase in docs/router/router.md by @samzong in https://github.com/sgl-project/sglang/pull/4099
- chore: bump v0.4.3.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4114
- [Hoxfix] Fix incomplete token_to_kv_pool refactor by @Edenzzzz in https://github.com/sgl-project/sglang/pull/4121
- Remove prefill-only-one-req by @merrymercy in https://github.com/sgl-project/sglang/pull/4117
- Add a pointer to the real KV cache pool by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4113
- feat: support docs auto live-reload with sphinx-autobuild by @samzong in https://github.com/sgl-project/sglang/pull/4111
- EAGLE docs by @simveit in https://github.com/sgl-project/sglang/pull/4038
- Add codeowners for eagle implementations by @Ying1123 in https://github.com/sgl-project/sglang/pull/4131
- Add tag suffix to nightly docker builds. by @saienduri in https://github.com/sgl-project/sglang/pull/4129
- remove unused max_jobs in setup_rocm.py by @sgjzfzzf in https://github.com/sgl-project/sglang/pull/4126
- Split the init of scheduler as smaller functions. Improve the eagle tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4128
- [Minor] make the `__init__` function of model_runner.py shorter by @merrymercy in https://github.com/sgl-project/sglang/pull/4132
- AMD/ROCm: update base image string by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/4137
- Update CODEOWNER by @merrymercy in https://github.com/sgl-project/sglang/pull/4138
- fix bench serving bug by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/4135
- Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle by @merrymercy in https://github.com/sgl-project/sglang/pull/4134
- Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant by @yinfan98 in https://github.com/sgl-project/sglang/pull/4147
- Fix constrained generation errors by adding datasets dependency by @olliestanley in https://github.com/sgl-project/sglang/pull/4142
- Release v0.4.3.post4 by @merrymercy in https://github.com/sgl-project/sglang/pull/4140
- [docs] fix HF reference script command by @adarshxs in https://github.com/sgl-project/sglang/pull/4148
- Docs: add torch compile cache by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4151
- Hot fix small vocal eagle in docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4154
- ROCm: enable trillion-parameter MoE models with INT4-FP8 single node by @HaiShaw in https://github.com/sgl-project/sglang/pull/4152
- Add Support for Qwen2-VL Multi-modal Embedding Models by @Titan-p in https://github.com/sgl-project/sglang/pull/3694
- [quant kernel] sgl-kernel support per_tensor_quant fp8 by @BBuf in https://github.com/sgl-project/sglang/pull/3786
- Add sgl_per_token_quant_fp8 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4089
- [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) by @HandH1998 in https://github.com/sgl-project/sglang/pull/3888
- [Refactor] Reducing code duplication across FP8 CUDA quantization kernels by @hebiao064 in https://github.com/sgl-project/sglang/pull/4163
- [Docs] Fix links and grammar issues by @windsonsea in https://github.com/sgl-project/sglang/pull/4162
- Remove non-existent AMD header include by @hebiao064 in https://github.com/sgl-project/sglang/pull/4166
- Put utils in ifndef USE_ROCM to fix CI (#4167) by @zhyncs in https://github.com/sgl-project/sglang/pull/4168
- Memory pool fix for upstream change about eagle by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4170
- chore: bump v0.0.3.post7 for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4176
- Add an example of using deepseekv3 int8 sglang. by @sleepcoo in https://github.com/sgl-project/sglang/pull/4177
- fix int8 doc link by @zhyncs in https://github.com/sgl-project/sglang/pull/4179
- [Docs] Improve bullets appearance and grammar by @windsonsea in https://github.com/sgl-project/sglang/pull/4174
- ROCm: Flex Attention Enablement with custom backends by @HaiShaw in https://github.com/sgl-project/sglang/pull/4178
- Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4186
- use same version for ci and pyproject by @zhyncs in https://github.com/sgl-project/sglang/pull/4187
- Fix eagle hang issue for max_new_tokens=1 by @ispobock in https://github.com/sgl-project/sglang/pull/4185
- Update amd ci docker image to v0.4.3.post4-rocm630. by @saienduri in https://github.com/sgl-project/sglang/pull/4189
- New clang format for sgl kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4194
- Remove the vllm dependency from the moe_align function by @sleepcoo in https://github.com/sgl-project/sglang/pull/4164
- Minor improvement to per_tensor_quant_fp8 by @zcnrex in https://github.com/sgl-project/sglang/pull/4197
- Revert "Minor improvement to per_tensor_quant_fp8 (#4197)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4198
- lazy import attn backends by @merrymercy in https://github.com/sgl-project/sglang/pull/4200
- Fix bench_serving flush cache not recognizing OPENAI_API_KEY by @brighill in https://github.com/sgl-project/sglang/pull/4181
- Use clang format 18 in pr-test-sgl-kernel.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/4203
- Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB by @kebe7jun in https://github.com/sgl-project/sglang/pull/3749
- Test no vllm custom allreduce by @merrymercy in https://github.com/sgl-project/sglang/pull/4210
- refine quant kernel code style by @BBuf in https://github.com/sgl-project/sglang/pull/4211
- Split test_mla.py into two files (deepseek v2 and deepseek v3) by @merrymercy in https://github.com/sgl-project/sglang/pull/4216
- docs(reasoning content): :memo: deepseek-r1 parser support qwq by @xihuai18 in https://github.com/sgl-project/sglang/pull/4124
- revert pr 3628 to pass test_mla ci by @BBuf in https://github.com/sgl-project/sglang/pull/4219
- use latest sgl-kernel for mla test by @zhyncs in https://github.com/sgl-project/sglang/pull/4222
- Rename files in sgl kernel to avoid nested folder structure by @merrymercy in https://github.com/sgl-project/sglang/pull/4213
- chore: bump v0.0.4 for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4223
- Lazily import lora backends by @merrymercy in https://github.com/sgl-project/sglang/pull/4225
- [docker] Distributed Serving with k8s Statefulset ( good example for DeepSeek-R1) by @panpan0000 in https://github.com/sgl-project/sglang/pull/3631
- [docs] Unhide production metrics page by @hebiao064 in https://github.com/sgl-project/sglang/pull/4193
- use sgl-kernel 0.0.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4224
- Support nextn for flashinfer mla attention backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/4218
- Apply sgl w8a8 fp8 kernel by @HandH1998 in https://github.com/sgl-project/sglang/pull/3148
- Check eagle server args by @Ying1123 in https://github.com/sgl-project/sglang/pull/4217
- update sgl-kernel 3rdparty by @zhyncs in https://github.com/sgl-project/sglang/pull/4228
- Update bench speculative script by @ispobock in https://github.com/sgl-project/sglang/pull/4235
- Fix test of flashinfer mla with nextn by @Fridge003 in https://github.com/sgl-project/sglang/pull/4237
- Move rope and bmm into sgl-kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4241
- Revert "Check eagle server args" by @merrymercy in https://github.com/sgl-project/sglang/pull/4242
- Minor style fix for sgl-kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4243
- Auto balance CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4238
- Clean up fp8 support by @merrymercy in https://github.com/sgl-project/sglang/pull/4230
- Move activation.cu to sgl-kernel/elementwise by @merrymercy in https://github.com/sgl-project/sglang/pull/4250
- DeepGemm integrate to sgl-kernel by @laixinn in https://github.com/sgl-project/sglang/pull/4165
- [Bug fixed] fixed the crash when enable the dp-attention on the single card by @DavidChan0519 in https://github.com/sgl-project/sglang/pull/3958
- Added example for multimodal embedding by @simveit in https://github.com/sgl-project/sglang/pull/4206
- Simplify tests & Fix trtllm custom allreduce registration by @merrymercy in https://github.com/sgl-project/sglang/pull/4252
- fix the input_ids is None error by @Young1993 in https://github.com/sgl-project/sglang/pull/4144
- fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 by @BBuf in https://github.com/sgl-project/sglang/pull/4231
- Release sgl-kernel v0.0.4.post1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4255
- Fix quantization and nightly tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4258
- increase the timeout of nightly-test.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/4262
- Optimize rope in sgl kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4267
- Test no vllm custom allreduce by @merrymercy in https://github.com/sgl-project/sglang/pull/4256
- Amd test fp8 by @HandH1998 in https://github.com/sgl-project/sglang/pull/4261
- add THIRDPARTYNOTICES for DeepGEMM by @zhyncs in https://github.com/sgl-project/sglang/pull/4272
- upgrade xgrammar 0.1.15 by @zhyncs in https://github.com/sgl-project/sglang/pull/4275
- Fix nightly eval for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 by @merrymercy in https://github.com/sgl-project/sglang/pull/4279
- Update cutlass dependency for its bug fix. by @elfiegg in https://github.com/sgl-project/sglang/pull/4277
- update deepgemm by @zhyncs in https://github.com/sgl-project/sglang/pull/4284
- bump sgl-kernel 0.0.4.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/4288
- Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) by @lambert0312 in https://github.com/sgl-project/sglang/pull/4136
- update sgl-kernel 0.0.4.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/4291
- linear support deepgemm by @sleepcoo in https://github.com/sgl-project/sglang/pull/4199
- Update MTP doc by @ispobock in https://github.com/sgl-project/sglang/pull/4290
- Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @yych0745 in https://github.com/sgl-project/sglang/pull/4287
- update doc by @zhyncs in https://github.com/sgl-project/sglang/pull/4299
- [AMD] Fix rocm sgl-kernel missing modules error by @BruceXcluding in https://github.com/sgl-project/sglang/pull/4311
- Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/4220
- refactor: move image processors to separate files by @mickqian in https://github.com/sgl-project/sglang/pull/4229
- upgrade flashinfer 0.2.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4317
- unify is_cuda and is_hip by @zhyncs in https://github.com/sgl-project/sglang/pull/4321
- Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @lambert0312 in https://github.com/sgl-project/sglang/pull/4323
- [Docs] Clean up benchmark_and_profiling.md by @windsonsea in https://github.com/sgl-project/sglang/pull/4297
- refine sgl_moe_align_block_size_benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/4327
- Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% by @hebiao064 in https://github.com/sgl-project/sglang/pull/4215
- Add awq dequantize kernel to sgl with 1x to 3x speedup by @zcnrex in https://github.com/sgl-project/sglang/pull/4104
- fix awq_dequantize by @zhyncs in https://github.com/sgl-project/sglang/pull/4333
- release 0.0.4.post3 sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4331
- upgrade sgl-kernel 0.0.4.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4334
- Add INT8 support MTP NextN function by @lambert0312 in https://github.com/sgl-project/sglang/pull/3911
- [Fix] fix _yarn_linear_ramp_mask with device parameter by @Alcanderian in https://github.com/sgl-project/sglang/pull/4337
- remove the unused readline dependency from the Qwen2 model implementa… by @yych0745 in https://github.com/sgl-project/sglang/pull/4340
- model: Support Janus-pro by @mickqian in https://github.com/sgl-project/sglang/pull/3203
- Hierarchical Caching Refactoring and Fixing TP issue by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4082
- Support Blackwell Block Scale FP8 Gemm by @elfiegg in https://github.com/sgl-project/sglang/pull/4278
- typo: Update http_server.py by @WrRan in https://github.com/sgl-project/sglang/pull/4350
- Update nightly tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4352
- [Fix Doc.] Enable internal forwarding when starting the router by @shizhediao in https://github.com/sgl-project/sglang/pull/4355
- Move output processing logic from scheduler.py into a separate file by @merrymercy in https://github.com/sgl-project/sglang/pull/4354
- Fix scheduler proctitle suffix is None by @cnwenf in https://github.com/sgl-project/sglang/pull/4326
- feat: support ep size < 32 for sgl kernel by @shuaills in https://github.com/sgl-project/sglang/pull/4348
- Fix per token fp8 quant precision by @qingquansong in https://github.com/sgl-project/sglang/pull/4362
- Remove the choices in --speculative-eagle-topk argument by @Achazwl in https://github.com/sgl-project/sglang/pull/4329
- docs: add parameter --log-requests-level by @panpan0000 in https://github.com/sgl-project/sglang/pull/4335
- simple bugfix by @WrRan in https://github.com/sgl-project/sglang/pull/4342
- Fix the doc of FR-Spec by @Achazwl in https://github.com/sgl-project/sglang/pull/4295
- [Fix] Check the device backend before calling empty_cache function by @cboss6 in https://github.com/sgl-project/sglang/pull/4212
- [FIX] fix incorrect output when enable both deepgemm and torch compile by @AniZpZ in https://github.com/sgl-project/sglang/pull/4359
- add INT8 example into dsv3 README by @laixinn in https://github.com/sgl-project/sglang/pull/4079
- Avoid duplicated request ids in batch APIs by @tanconghui in https://github.com/sgl-project/sglang/pull/4026
- example: add async offline inference demo by @kuizhiqing in https://github.com/sgl-project/sglang/pull/3961
- Add device detection and count functions to utils. by @vshekhawat-hlab in https://github.com/sgl-project/sglang/pull/3962
- Move `aiohttp` into public dependencies by @stevapple in https://github.com/sgl-project/sglang/pull/3980
- [tools] add fp8 max/min constant in utils by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/3959
- HotFix: json serialization error when using OAI v1/batches endpoint with logprobs by @dcfidalgo in https://github.com/sgl-project/sglang/pull/3896
- [docs] Update outdated description about `torch.compile` by @junliu-mde in https://github.com/sgl-project/sglang/pull/3844
- [Doc] Fix typo in backend/sampling_params by @yang-ybb in https://github.com/sgl-project/sglang/pull/3835
- Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation by @HermitSun in https://github.com/sgl-project/sglang/pull/3814
- [moe] fix: correct the cache size in the last chunk by @ch-wan in https://github.com/sgl-project/sglang/pull/3679
- Support page size > 1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4356
- [XPU][CPU] Enable the native path of DeepSeek by @airMeng in https://github.com/sgl-project/sglang/pull/4086
- Revert "[XPU][CPU] Enable the native path of DeepSeek" by @merrymercy in https://github.com/sgl-project/sglang/pull/4367
- Update grafana.json by @dblate in https://github.com/sgl-project/sglang/pull/4374
- fix accuracy issue by @zhyncs in https://github.com/sgl-project/sglang/pull/4376
- bump 0.0.5 sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4377
- upgrade sgl-kernel 0.0.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/4381
- chore: bump v0.4.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4041
New Contributors
- @yosoyjay made their first contribution in https://github.com/sgl-project/sglang/pull/3505
- @FrankLeeeee made their first contribution in https://github.com/sgl-project/sglang/pull/3548
- @Jiadalee made their first contribution in https://github.com/sgl-project/sglang/pull/3598
- @whybeyoung made their first contribution in https://github.com/sgl-project/sglang/pull/3624
- @fsx950223 made their first contribution in https://github.com/sgl-project/sglang/pull/3665
- @panpan0000 made their first contribution in https://github.com/sgl-project/sglang/pull/3700
- @ch-wan made their first contribution in https://github.com/sgl-project/sglang/pull/3692
- @Chen-XiaoBing made their first contribution in https://github.com/sgl-project/sglang/pull/3705
- @aoshen524 made their first contribution in https://github.com/sgl-project/sglang/pull/3652
- @trayvonpan made their first contribution in https://github.com/sgl-project/sglang/pull/3588
- @zixuanzhang226 made their first contribution in https://github.com/sgl-project/sglang/pull/3680
- @andjsmi made their first contribution in https://github.com/sgl-project/sglang/pull/3740
- @shahizat made their first contribution in https://github.com/sgl-project/sglang/pull/3761
- @laixinn made their first contribution in https://github.com/sgl-project/sglang/pull/3730
- @He1pa made their first contribution in https://github.com/sgl-project/sglang/pull/3799
- @wilsonwu made their first contribution in https://github.com/sgl-project/sglang/pull/3741
- @yuanheng-zhao made their first contribution in https://github.com/sgl-project/sglang/pull/3641
- @nvcastet made their first contribution in https://github.com/sgl-project/sglang/pull/3709
- @hcyz33 made their first contribution in https://github.com/sgl-project/sglang/pull/3841
- @kebe7jun made their first contribution in https://github.com/sgl-project/sglang/pull/3519
- @JC1DA made their first contribution in https://github.com/sgl-project/sglang/pull/3298
- @Chi-Chu319 made their first contribution in https://github.com/sgl-project/sglang/pull/3898
- @Qiaolin-Yu made their first contribution in https://github.com/sgl-project/sglang/pull/3897
- @xqoasis made their first contribution in https://github.com/sgl-project/sglang/pull/3905
- @KCFindstr made their first contribution in https://github.com/sgl-project/sglang/pull/3866
- @elfiegg made their first contribution in https://github.com/sgl-project/sglang/pull/3966
- @Zhou-sx made their first contribution in https://github.com/sgl-project/sglang/pull/3822
- @xihuai18 made their first contribution in https://github.com/sgl-project/sglang/pull/4000
- @cboss6 made their first contribution in https://github.com/sgl-project/sglang/pull/3954
- @Xiuyu-Li made their first contribution in https://github.com/sgl-project/sglang/pull/3712
- @sgjzfzzf made their first contribution in https://github.com/sgl-project/sglang/pull/3607
- @zeroorhero made their first contribution in https://github.com/sgl-project/sglang/pull/3990
- @samzong made their first contribution in https://github.com/sgl-project/sglang/pull/4101
- @olliestanley made their first contribution in https://github.com/sgl-project/sglang/pull/4142
- @windsonsea made their first contribution in https://github.com/sgl-project/sglang/pull/4162
- @zcnrex made their first contribution in https://github.com/sgl-project/sglang/pull/4197
- @brighill made their first contribution in https://github.com/sgl-project/sglang/pull/4181
- @DavidChan0519 made their first contribution in https://github.com/sgl-project/sglang/pull/3958
- @Young1993 made their first contribution in https://github.com/sgl-project/sglang/pull/4144
- @lambert0312 made their first contribution in https://github.com/sgl-project/sglang/pull/4136
- @yych0745 made their first contribution in https://github.com/sgl-project/sglang/pull/4287
- @Ximingwang-09 made their first contribution in https://github.com/sgl-project/sglang/pull/4220
- @Alcanderian made their first contribution in https://github.com/sgl-project/sglang/pull/4337
- @shizhediao made their first contribution in https://github.com/sgl-project/sglang/pull/4355
- @cnwenf made their first contribution in https://github.com/sgl-project/sglang/pull/4326
- @qingquansong made their first contribution in https://github.com/sgl-project/sglang/pull/4362
- @AniZpZ made their first contribution in https://github.com/sgl-project/sglang/pull/4359
- @tanconghui made their first contribution in https://github.com/sgl-project/sglang/pull/4026
- @kuizhiqing made their first contribution in https://github.com/sgl-project/sglang/pull/3961
- @vshekhawat-hlab made their first contribution in https://github.com/sgl-project/sglang/pull/3962
- @stevapple made their first contribution in https://github.com/sgl-project/sglang/pull/3980
- @dcfidalgo made their first contribution in https://github.com/sgl-project/sglang/pull/3896
- @junliu-mde made their first contribution in https://github.com/sgl-project/sglang/pull/3844
- @yang-ybb made their first contribution in https://github.com/sgl-project/sglang/pull/3835
- @airMeng made their first contribution in https://github.com/sgl-project/sglang/pull/4086
- @dblate made their first contribution in https://github.com/sgl-project/sglang/pull/4374
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.3...v0.4.4