SGLang - Browse /v0.4.4 at SourceForge.net

Name	Modified	Size	Downloads / Week
README.md	2025-03-13	46.5 kB	0
Release v0.4.4 source code.tar.gz	2025-03-13	3.4 MB	0
Release v0.4.4 source code.zip	2025-03-13	4.0 MB	0
Totals: 3 Items		7.4 MB	0

Highlights

The SGLang team is excited to announce the release of v0.4.4. We continue to improve DeepSeek V3/R1 performance: with the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, SGLang can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!

Many thanks to the xAI, NVIDIA, AMD, LinkedIn, Baseten, and Meituan teams and to the open source community users for their contributions!

In addition to the users mentioned in the announcement, teams such as Tencent and Ant Group also use SGLang for DeepSeek R1 inference acceleration. We are very happy to have received recognition and usage from these teams!

There will surely be bugs that we discover and quickly patch in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel: https://slack.sglang.ai/ Cheers!

Optimizations

AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog
Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with --enable-flashinfer-mla
Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script
DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with export SGL_ENABLE_JIT_DEEPGEMM=1
Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models:
  meituan/DeepSeek-R1-Channel-INT8
  meituan/DeepSeek-R1-Block-INT8
Hardware Optimizations:
  Blackwell architecture Block Scale FP8 GEMM support
  Support page size greater than 1 https://github.com/sgl-project/sglang/pull/4356
  Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
  Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 8) https://github.com/sgl-project/sglang/pull/4390
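
The two opt-in switches named above (the --enable-flashinfer-mla flag and the SGL_ENABLE_JIT_DEEPGEMM=1 environment variable) could be combined in a launch along the following lines. This is a minimal sketch, not an official recipe: the helper name build_launch, the model path deepseek-ai/DeepSeek-V3, and the tensor-parallel size of 8 are assumptions chosen for illustration, and the script only assembles and prints the command rather than starting a server.

```python
import os
import shlex


def build_launch(model_path, tp_size, flashinfer_mla=True, jit_deepgemm=True):
    """Return (env, argv) for an SGLang server launch; nothing is executed.

    build_launch is a hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    if jit_deepgemm:
        # Environment variable named in the release notes (NVIDIA Hopper).
        env["SGL_ENABLE_JIT_DEEPGEMM"] = "1"
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,  # caller-supplied; example value is an assumption
        "--tp", str(tp_size),        # tensor-parallel size; 8 below is an assumption
        "--trust-remote-code",
    ]
    if flashinfer_mla:
        # Flag named in the release notes.
        argv.append("--enable-flashinfer-mla")
    return env, argv


env, argv = build_launch("deepseek-ai/DeepSeek-V3", tp_size=8)
# Print the equivalent shell command instead of running it.
print("SGL_ENABLE_JIT_DEEPGEMM=" + env["SGL_ENABLE_JIT_DEEPGEMM"], shlex.join(argv))
```

To actually start the server, the returned values could be passed to subprocess.run(argv, env=env) on a machine with SGLang installed; which other flags are appropriate depends on the hardware and model and is not covered here.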

Coming soon

Integrate Flash Attention https://github.com/sgl-project/sglang/issues/4385
Integrate FlashMLA https://github.com/sgl-project/sglang/issues/4384
EAGLE 2 optimization https://github.com/sgl-project/sglang/pull/4383
EAGLE 3 day one support https://github.com/sgl-project/sglang/pull/4247
Integrate DeepEP https://github.com/sgl-project/sglang/pull/4232
Prefill and Decoding Disaggregation

What's Changed

update flashinfer-python by @zhyncs in https://github.com/sgl-project/sglang/pull/3557
fix doc by @zhyncs in https://github.com/sgl-project/sglang/pull/3558
Add support for OpenAI API o1 model by @ChuyueSun in https://github.com/sgl-project/sglang/pull/3363
fix sgl-kernel codestyle by @BBuf in https://github.com/sgl-project/sglang/pull/3563
docs: update install by @zhyncs in https://github.com/sgl-project/sglang/pull/3581
Copy config files for MI300X to support in virtualized environments by @yosoyjay in https://github.com/sgl-project/sglang/pull/3505
ROCm docker: triton update by @HaiShaw in https://github.com/sgl-project/sglang/pull/3584
[fix] added support for vlm in offline inference by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3548
Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by @ispobock in https://github.com/sgl-project/sglang/pull/3582
[CI] Improve Docs CI Efficiency by @shuaills in https://github.com/sgl-project/sglang/pull/3587
doc: emphasize and notify the usage of chat_template by @mickqian in https://github.com/sgl-project/sglang/pull/3589
fix eagle unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/3591
fix high qps crash when enable mtp by @zhyncs in https://github.com/sgl-project/sglang/pull/3592
fix apply_token_bitmask_inplace_cuda by @zhyncs in https://github.com/sgl-project/sglang/pull/3594
[docs] added favicon to sphinx html by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3564
fix lockfile and port_registry file permission error by @Jiadalee in https://github.com/sgl-project/sglang/pull/3598
feat: Support Qwen 2.5 vl by @mickqian in https://github.com/sgl-project/sglang/pull/3258
[ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. by @whchung in https://github.com/sgl-project/sglang/pull/3535
Update to latest amd image. by @saienduri in https://github.com/sgl-project/sglang/pull/3597
Benchmark for reasoning models by @simveit in https://github.com/sgl-project/sglang/pull/3532
Draft of updated doc for sampling params. by @simveit in https://github.com/sgl-project/sglang/pull/3260
[docs] Update sampling_params.md by @shuaills in https://github.com/sgl-project/sglang/pull/3617
[docker] added rdma support by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3619
Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… by @zhyncs in https://github.com/sgl-project/sglang/pull/3632
add mtp unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/3634
update unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/3636
chore: bump v0.4.3.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/3638
h800 deepseek r1 config and support multi-gpu block-gemm tuning by @BBuf in https://github.com/sgl-project/sglang/pull/3639
feat: support flashinfer mla with prefix cache by @zhyncs in https://github.com/sgl-project/sglang/pull/3643
chore: update flashinfer v0.2.1.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/3644
chore: bump v0.4.3.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/3645
use transformers 4.48.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/3650
[ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by @whchung in https://github.com/sgl-project/sglang/pull/3616
[ROCm] Optimal MOE Tuning for AMD Radeon Graphics by @BruceXcluding in https://github.com/sgl-project/sglang/pull/3567
Deploy multi-node inference (LWS method) using sglang in a K8s cluster by @whybeyoung in https://github.com/sgl-project/sglang/pull/3624
Update amd docker image. by @saienduri in https://github.com/sgl-project/sglang/pull/3654
[Feature] Apply Cublas Grouped Gemm kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/3629
update pr-test by @zhyncs in https://github.com/sgl-project/sglang/pull/3663
Fix draft decode max batch size by @ispobock in https://github.com/sgl-project/sglang/pull/3676
fix: remove dependency on latest transformers impl by @mickqian in https://github.com/sgl-project/sglang/pull/3635
AMD Prefill optimize by @fsx950223 in https://github.com/sgl-project/sglang/pull/3665
fix: apply cache size limit of attention mask for VisionAttention by @mickqian in https://github.com/sgl-project/sglang/pull/3657
set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by @zhyncs in https://github.com/sgl-project/sglang/pull/3698
use warp shuffle style reduce and flashinfer vectorize by @BBuf in https://github.com/sgl-project/sglang/pull/3628
[Docs] Add SkyPilot DeepSeek example by @Michaelvll in https://github.com/sgl-project/sglang/pull/3706
[k8s] remove unnecessary hostIPC for security concern by @panpan0000 in https://github.com/sgl-project/sglang/pull/3700
[moe] optim: reduce memory consumption in fused_moe by @ch-wan in https://github.com/sgl-project/sglang/pull/3692
[Improve] Fix Multi-User Port Allocation Conflicts by @shuaills in https://github.com/sgl-project/sglang/pull/3601
Variance measure for reasoning benchmark by @simveit in https://github.com/sgl-project/sglang/pull/3677
Docs: Fix layout with sub-section by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3710
add control for cutlass fp8 blockwise gemm by @yizhang2077 in https://github.com/sgl-project/sglang/pull/3727
revert BLOCK and num_warps on HIP by @HaiShaw in https://github.com/sgl-project/sglang/pull/3722
Optimize triton attention custom mask by @ispobock in https://github.com/sgl-project/sglang/pull/3731
[Bugfix] Fix scores mask for moe topk by @Chen-XiaoBing in https://github.com/sgl-project/sglang/pull/3705
[Docs] Modify ep related server args and remove cublas part of deepseek by @Fridge003 in https://github.com/sgl-project/sglang/pull/3732
[Fix] Fix bugs and refactor codes in lora for better scalability. by @aoshen524 in https://github.com/sgl-project/sglang/pull/3652
docs: fix 404 link by @trayvonpan in https://github.com/sgl-project/sglang/pull/3588
[docs] added torch.compile cache to dpsk manual by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3737
AMD/ROCm: update AITER repo to ROCm/aiter by @HaiShaw in https://github.com/sgl-project/sglang/pull/3747
feat: update grouped_topk to support softmax and sigmoid by @zixuanzhang226 in https://github.com/sgl-project/sglang/pull/3680
feat: Add SageMaker support by @andjsmi in https://github.com/sgl-project/sglang/pull/3740
Change description of nvidia jetson docs by @shahizat in https://github.com/sgl-project/sglang/pull/3761
[Fix] fix OpenAI API adapter tokenizer encoding by @shuaills in https://github.com/sgl-project/sglang/pull/3432
[bug] fixed batch api by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3754
Adjustments to docs by @simveit in https://github.com/sgl-project/sglang/pull/3733
docs: Add offline engine launch example and documentation by @shuaills in https://github.com/sgl-project/sglang/pull/3771
Update offline_engine_api.ipynb by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3773
Support Qwen RM model. by @simveit in https://github.com/sgl-project/sglang/pull/3772
Add support for nvidia modelopt fp8 kv cache by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/3223
Tiny fix Olmo2 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3348
fix lm head weights in Qwen models by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3777
Fix weight loader error when LM head weights are tied by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3766
Let DetokenizerManager use TypeBasedDispatcher by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3117
bench: Add a benchmark for vLM: MMMU by @mickqian in https://github.com/sgl-project/sglang/pull/3562
Extract generation_manager from tokenizer_manager by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3115
Rename TokenizerManager to StdOrchestrator by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3116
[Docs]Add instruction for manually stopping nsys profiler by @Fridge003 in https://github.com/sgl-project/sglang/pull/3795
Hierarchical Caching for SGLang by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/2693
Update readme by @merrymercy in https://github.com/sgl-project/sglang/pull/3809
Fix dependency by @merrymercy in https://github.com/sgl-project/sglang/pull/3813
Refactor flashinfer logic for deepseek v3 and fix accuracy bug by @Fridge003 in https://github.com/sgl-project/sglang/pull/3785
Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in https://github.com/sgl-project/sglang/pull/3730
Fix pandas dependency in CI by @merrymercy in https://github.com/sgl-project/sglang/pull/3818
Revert "Rename TokenizerManager to StdOrchestrator" by @merrymercy in https://github.com/sgl-project/sglang/pull/3828
Revert "Extract generation_manager from tokenizer_manager" by @merrymercy in https://github.com/sgl-project/sglang/pull/3829
Fix CI and install docs by @merrymercy in https://github.com/sgl-project/sglang/pull/3821
typos by @WrRan in https://github.com/sgl-project/sglang/pull/3801
doc: fix dead link in router.md by @He1pa in https://github.com/sgl-project/sglang/pull/3799
Fix doc site copyright to current year by @wilsonwu in https://github.com/sgl-project/sglang/pull/3741
[Doc] Fix typo in server-argument description by @yuanheng-zhao in https://github.com/sgl-project/sglang/pull/3641
[ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 by @lcskrishna in https://github.com/sgl-project/sglang/pull/3237
[BugFix]: Add missing clamp to llavavid by @PanJason in https://github.com/sgl-project/sglang/pull/3787
[dist] made timeout configurable by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3803
Fix allgather ops inside cuda graphs by @nvcastet in https://github.com/sgl-project/sglang/pull/3709
fix capture_bs by @fsx950223 in https://github.com/sgl-project/sglang/pull/3857
[BugFix] Fix crash when receive a req with structed output in DP attention mode. by @hcyz33 in https://github.com/sgl-project/sglang/pull/3841
Fix maximum recursion depth triggered on exception exit by @kebe7jun in https://github.com/sgl-project/sglang/pull/3519
[doc] added quantization doc for dpsk by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3843
[doc] fixed dpsk quant faq by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/3865
Expert Parallelism (EP) Support for DeepSeek V3/R1 by @sleepcoo in https://github.com/sgl-project/sglang/pull/3602
Revert recent changes by @simveit in https://github.com/sgl-project/sglang/pull/3845
Feature/improve docs by @simveit in https://github.com/sgl-project/sglang/pull/3860
[Feature] Support llguidance for constrained decoding by @JC1DA in https://github.com/sgl-project/sglang/pull/3298
Move dpsk docs forward a step by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3894
Docs: Reorngaize dpsk links by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3900
Implemented frontend docs by @simveit in https://github.com/sgl-project/sglang/pull/3791
[doc] update sponsorship by @whybeyoung in https://github.com/sgl-project/sglang/pull/3903
[Rocm] Fix to the rocm_mla_decode_rope.py returning random result by @Chi-Chu319 in https://github.com/sgl-project/sglang/pull/3898
[doc] Update document for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/3907
Add return hidden state in the native API by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3897
[Docs] Disable notebook CI when merge to main by @xqoasis in https://github.com/sgl-project/sglang/pull/3905
[Docs] Improve DPSK docs in dark mode by @hebiao064 in https://github.com/sgl-project/sglang/pull/3914
[Doc] Add experimental tag for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/3925
Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in https://github.com/sgl-project/sglang/pull/3922
xgrammar 0.1.14 by @qeternity in https://github.com/sgl-project/sglang/pull/3593
revert "Docs: Reorngaize dpsk links [#3900]" by @zhyncs in https://github.com/sgl-project/sglang/pull/3933
upgrade flashinfer v0.2.2.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/3934
Fix the doc link for sampling params by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3861
[feat] Add Vertex AI compatible prediction route for /generate by @KCFindstr in https://github.com/sgl-project/sglang/pull/3866
[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/3613
Fix bench_serving not recognizing OPENAI_API_KEY by @kebe7jun in https://github.com/sgl-project/sglang/pull/3870
set a strict sgl-kernel version by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3950
[Bugfix] Fix tokenizer_manager not getting 400 when req is too long by @CatherineSue in https://github.com/sgl-project/sglang/pull/3678
[Feature] integrate Structural Tag in xgrammar backend for function calling by @minleminzui in https://github.com/sgl-project/sglang/pull/3566
SGLang + Verl by @fzyzcjy in https://github.com/sgl-project/sglang/pull/3852
Remove unused imports from rocm mla kernel. by @lcskrishna in https://github.com/sgl-project/sglang/pull/3963
Update cutlass dependency by @elfiegg in https://github.com/sgl-project/sglang/pull/3966
[Feature]Support ragged prefill in flashinfer mla backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/3967
Docs: add type hint to smapling parameters by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3975
Add redline to highlight main process by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3977
rename FunctionCallReqInput to ParseFunctionCallReq by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3976
Docs: add special warning to engine docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3979
Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3982
Move return_hidden_states to the generate input by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3985
Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/3989
add deepgemm and sglang fp8 block-wise gemm benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/3893
fix typo by @BBuf in https://github.com/sgl-project/sglang/pull/3991
Fix all gather torch compile by @ispobock in https://github.com/sgl-project/sglang/pull/3992
Add accuracy test for TP torch compile by @ispobock in https://github.com/sgl-project/sglang/pull/3994
Enable custom AR for AMD GPUs and maintain it in sgl-kernel by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/3406
Add Benchmark for DeepGEMM Group GEMM by @hebiao064 in https://github.com/sgl-project/sglang/pull/3993
[feat] add small vocab table for eagle's draft model[1]. by @Zhou-sx in https://github.com/sgl-project/sglang/pull/3822
Add fast decode plan for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/3987
Revert "Add fast decode plan for flashinfer mla" by @merrymercy in https://github.com/sgl-project/sglang/pull/4008
Add examples to token-in-token-out for LLM by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4010
Fix nightly-test CI by @yinfan98 in https://github.com/sgl-project/sglang/pull/3826
Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark by @hebiao064 in https://github.com/sgl-project/sglang/pull/4014
Improve code styles by @merrymercy in https://github.com/sgl-project/sglang/pull/4021
Clean up custom allreduce by @merrymercy in https://github.com/sgl-project/sglang/pull/4029
remove cache configs in model definitions by @merrymercy in https://github.com/sgl-project/sglang/pull/4031
Update metrics documentation by @binarycrayon in https://github.com/sgl-project/sglang/pull/3264
Reorganize c++ source files in sgl-kernel with multiple folders by @merrymercy in https://github.com/sgl-project/sglang/pull/4025
Reorganize python source files in sgl-kernel with multiple files by @merrymercy in https://github.com/sgl-project/sglang/pull/4027
Misc clean up; Remove the support of jump forward by @merrymercy in https://github.com/sgl-project/sglang/pull/4032
Docs: Fix sampling parameter by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4034
Remove outdated test utils and fix links for the doc of sampling params by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3999
Add examples in sampling parameters by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4039
Share target model embed and head weights for nextn by @ispobock in https://github.com/sgl-project/sglang/pull/4033
Add a link to the roadmap in README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/4043
docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/4044
Fix assert options.num_stages != 0 error in the latest ROCm build image by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/4049
Reasoning parser by @xihuai18 in https://github.com/sgl-project/sglang/pull/4000
HotFix for [#3988] using blockwise_int8 by @xihuai18 in https://github.com/sgl-project/sglang/pull/4023
Fix breakage problem when using custom_ar by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/4052
ROCm: update aiter and its usage to fused moe (bloat16, fp8, fp8 block-quant) by @HaiShaw in https://github.com/sgl-project/sglang/pull/4053
Fix debug_tensor_dump_output_folder optional key missing by @Qubitium in https://github.com/sgl-project/sglang/pull/4046
Remove grafana dashboard's datasource uid by @kebe7jun in https://github.com/sgl-project/sglang/pull/4051
[Fix & Style] Refactor the grammar backend to reduce human errors and improve readability by @DarkSharpness in https://github.com/sgl-project/sglang/pull/4030
[XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. by @cboss6 in https://github.com/sgl-project/sglang/pull/3954
sgl-router - issues on routing and project build. (#3870) by @michaelfeil in https://github.com/sgl-project/sglang/pull/3948
fix: support gelu_new activation function in gpt2 by @Xiuyu-Li in https://github.com/sgl-project/sglang/pull/3712
remove unused max_jobs by @sgjzfzzf in https://github.com/sgl-project/sglang/pull/3607
[Feature] Add test for speculative_token_map by @Achazwl in https://github.com/sgl-project/sglang/pull/4016
Revert "Fix nightly-test CI" by @merrymercy in https://github.com/sgl-project/sglang/pull/4065
Update nextn ci test by @ispobock in https://github.com/sgl-project/sglang/pull/4071
Simplify eagle tests and TP sync in grammar backend by @merrymercy in https://github.com/sgl-project/sglang/pull/4066
Add examples for returning hidden states when using the server by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4074
[Minor] more code cleanup by @merrymercy in https://github.com/sgl-project/sglang/pull/4077
test: add vlm to token in & out example by @mickqian in https://github.com/sgl-project/sglang/pull/3941
[QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization by @Qubitium in https://github.com/sgl-project/sglang/pull/3790
bench: add dataset param for bench_multiturn by @zeroorhero in https://github.com/sgl-project/sglang/pull/3990
ROCM: AITER BLOCK GEMM by @BruceXcluding in https://github.com/sgl-project/sglang/pull/4075
[Eagle] Refactor eagle speculative decoding by @Ying1123 in https://github.com/sgl-project/sglang/pull/3986
Fix the moe padding conditional logic by @HaiShaw in https://github.com/sgl-project/sglang/pull/4081
[Revision] Add fast decode plan for flashinfer mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/4012
Fix triton kernel illegal memory issue for eagle by @ispobock in https://github.com/sgl-project/sglang/pull/4100
Add update_weights_from_disk endpoint to Engine by @jhinpan in https://github.com/sgl-project/sglang/pull/4102
Add DeepSeek optimization ablations documentation by @M0gician in https://github.com/sgl-project/sglang/pull/4107
reorganize dpsk docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4108
Add examples for server token-in-token-out by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4103
revert deepseek docs by @zhyncs in https://github.com/sgl-project/sglang/pull/4109
Create release-docker-amd-nightly.yml by @saienduri in https://github.com/sgl-project/sglang/pull/4105
remove testing on PR workflow change by @saienduri in https://github.com/sgl-project/sglang/pull/4110
Debug radixcache: refactor recursive helper methods by @luzengxiangcn in https://github.com/sgl-project/sglang/pull/3029
Online serving benchmarks of real datasets for hierarchical KV caching by @PanJason in https://github.com/sgl-project/sglang/pull/3211
fix cross-reference error and spelling mistakes by @samzong in https://github.com/sgl-project/sglang/pull/4101
fix Non-consecutive header level increase in docs/router/router.md by @samzong in https://github.com/sgl-project/sglang/pull/4099
chore: bump v0.4.3.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4114
[Hoxfix] Fix incomplete token_to_kv_pool refactor by @Edenzzzz in https://github.com/sgl-project/sglang/pull/4121
Remove prefill-only-one-req by @merrymercy in https://github.com/sgl-project/sglang/pull/4117
Add a pointer to the real KV cache pool by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4113
feat: support docs auto live-reload with sphinx-autobuild by @samzong in https://github.com/sgl-project/sglang/pull/4111
EAGLE docs by @simveit in https://github.com/sgl-project/sglang/pull/4038
Add codeowners for eagle implementations by @Ying1123 in https://github.com/sgl-project/sglang/pull/4131
Add tag suffix to nightly docker builds. by @saienduri in https://github.com/sgl-project/sglang/pull/4129
remove unused max_jobs in setup_rocm.py by @sgjzfzzf in https://github.com/sgl-project/sglang/pull/4126
Split the init of scheduler as smaller functions. Improve the eagle tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4128
[Minor] make the __init__ function of model_runner.py shorter by @merrymercy in https://github.com/sgl-project/sglang/pull/4132
AMD/ROCm: update base image string by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/4137
Update CODEOWNER by @merrymercy in https://github.com/sgl-project/sglang/pull/4138
fix bench serving bug by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/4135
Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle by @merrymercy in https://github.com/sgl-project/sglang/pull/4134
Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant by @yinfan98 in https://github.com/sgl-project/sglang/pull/4147
Fix constrained generation errors by adding datasets dependency by @olliestanley in https://github.com/sgl-project/sglang/pull/4142
Release v0.4.3.post4 by @merrymercy in https://github.com/sgl-project/sglang/pull/4140
[docs] fix HF reference script command by @adarshxs in https://github.com/sgl-project/sglang/pull/4148
Docs: add torch compile cache by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4151
Hot fix small vocal eagle in docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4154
ROCm: enable trillion-parameter MoE models with INT4-FP8 single node by @HaiShaw in https://github.com/sgl-project/sglang/pull/4152
Add Support for Qwen2-VL Multi-modal Embedding Models by @Titan-p in https://github.com/sgl-project/sglang/pull/3694
[quant kernel] sgl-kernel support per_tensor_quant fp8 by @BBuf in https://github.com/sgl-project/sglang/pull/3786
Add sgl_per_token_quant_fp8 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4089
[Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) by @HandH1998 in https://github.com/sgl-project/sglang/pull/3888
[Refactor] Reducing code duplication across FP8 CUDA quantization kernels by @hebiao064 in https://github.com/sgl-project/sglang/pull/4163
[Docs] Fix links and grammar issues by @windsonsea in https://github.com/sgl-project/sglang/pull/4162
Remove non-existent AMD header include by @hebiao064 in https://github.com/sgl-project/sglang/pull/4166
Put utils in ifndef USE_ROCM to fix CI (#4167) by @zhyncs in https://github.com/sgl-project/sglang/pull/4168
Memory pool fix for upstream change about eagle by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4170
chore: bump v0.0.3.post7 for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4176
Add an example of using deepseekv3 int8 sglang. by @sleepcoo in https://github.com/sgl-project/sglang/pull/4177
fix int8 doc link by @zhyncs in https://github.com/sgl-project/sglang/pull/4179
[Docs] Improve bullets appearance and grammar by @windsonsea in https://github.com/sgl-project/sglang/pull/4174
ROCm: Flex Attention Enablement with custom backends by @HaiShaw in https://github.com/sgl-project/sglang/pull/4178
Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4186
use same version for ci and pyproject by @zhyncs in https://github.com/sgl-project/sglang/pull/4187
Fix eagle hang issue for max_new_tokens=1 by @ispobock in https://github.com/sgl-project/sglang/pull/4185
Update amd ci docker image to v0.4.3.post4-rocm630. by @saienduri in https://github.com/sgl-project/sglang/pull/4189
New clang format for sgl kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4194
Remove the vllm dependency from the moe_align function by @sleepcoo in https://github.com/sgl-project/sglang/pull/4164
Minor improvement to per_tensor_quant_fp8 by @zcnrex in https://github.com/sgl-project/sglang/pull/4197
Revert "Minor improvement to per_tensor_quant_fp8 (#4197)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4198
lazy import attn backends by @merrymercy in https://github.com/sgl-project/sglang/pull/4200
Fix bench_serving flush cache not recognizing OPENAI_API_KEY by @brighill in https://github.com/sgl-project/sglang/pull/4181
Use clang format 18 in pr-test-sgl-kernel.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/4203
Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB by @kebe7jun in https://github.com/sgl-project/sglang/pull/3749
Test no vllm custom allreduce by @merrymercy in https://github.com/sgl-project/sglang/pull/4210
refine quant kernel code style by @BBuf in https://github.com/sgl-project/sglang/pull/4211
Split test_mla.py into two files (deepseek v2 and deepseek v3) by @merrymercy in https://github.com/sgl-project/sglang/pull/4216
docs(reasoning content): :memo: deepseek-r1 parser support qwq by @xihuai18 in https://github.com/sgl-project/sglang/pull/4124
revert pr 3628 to pass test_mla ci by @BBuf in https://github.com/sgl-project/sglang/pull/4219
use latest sgl-kernel for mla test by @zhyncs in https://github.com/sgl-project/sglang/pull/4222
Rename files in sgl kernel to avoid nested folder structure by @merrymercy in https://github.com/sgl-project/sglang/pull/4213
chore: bump v0.0.4 for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4223
Lazily import lora backends by @merrymercy in https://github.com/sgl-project/sglang/pull/4225
[docker] Distributed Serving with k8s Statefulset ( good example for DeepSeek-R1) by @panpan0000 in https://github.com/sgl-project/sglang/pull/3631
[docs] Unhide production metrics page by @hebiao064 in https://github.com/sgl-project/sglang/pull/4193
use sgl-kernel 0.0.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4224
Support nextn for flashinfer mla attention backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/4218
Apply sgl w8a8 fp8 kernel by @HandH1998 in https://github.com/sgl-project/sglang/pull/3148
Check eagle server args by @Ying1123 in https://github.com/sgl-project/sglang/pull/4217
update sgl-kernel 3rdparty by @zhyncs in https://github.com/sgl-project/sglang/pull/4228
Update bench speculative script by @ispobock in https://github.com/sgl-project/sglang/pull/4235
Fix test of flashinfer mla with nextn by @Fridge003 in https://github.com/sgl-project/sglang/pull/4237
Move rope and bmm into sgl-kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4241
Revert "Check eagle server args" by @merrymercy in https://github.com/sgl-project/sglang/pull/4242
Minor style fix for sgl-kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4243
Auto balance CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4238
Clean up fp8 support by @merrymercy in https://github.com/sgl-project/sglang/pull/4230
Move activation.cu to sgl-kernel/elementwise by @merrymercy in https://github.com/sgl-project/sglang/pull/4250
DeepGemm integrate to sgl-kernel by @laixinn in https://github.com/sgl-project/sglang/pull/4165
[Bug fixed] fixed the crash when enable the dp-attention on the single card by @DavidChan0519 in https://github.com/sgl-project/sglang/pull/3958
Added example for multimodal embedding by @simveit in https://github.com/sgl-project/sglang/pull/4206
Simplify tests & Fix trtllm custom allreduce registration by @merrymercy in https://github.com/sgl-project/sglang/pull/4252
fix the input_ids is None error by @Young1993 in https://github.com/sgl-project/sglang/pull/4144
fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 by @BBuf in https://github.com/sgl-project/sglang/pull/4231
Release sgl-kernel v0.0.4.post1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4255
Fix quantization and nightly tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4258
increase the timeout of nightly-test.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/4262
Optimize rope in sgl kernel by @merrymercy in https://github.com/sgl-project/sglang/pull/4267
Test no vllm custom allreduce by @merrymercy in https://github.com/sgl-project/sglang/pull/4256
Amd test fp8 by @HandH1998 in https://github.com/sgl-project/sglang/pull/4261
add THIRDPARTYNOTICES for DeepGEMM by @zhyncs in https://github.com/sgl-project/sglang/pull/4272
upgrade xgrammar 0.1.15 by @zhyncs in https://github.com/sgl-project/sglang/pull/4275
Fix nightly eval for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 by @merrymercy in https://github.com/sgl-project/sglang/pull/4279
Update cutlass dependency for its bug fix by @elfiegg in https://github.com/sgl-project/sglang/pull/4277
update deepgemm by @zhyncs in https://github.com/sgl-project/sglang/pull/4284
bump sgl-kernel 0.0.4.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/4288
Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) by @lambert0312 in https://github.com/sgl-project/sglang/pull/4136
update sgl-kernel 0.0.4.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/4291
linear support deepgemm by @sleepcoo in https://github.com/sgl-project/sglang/pull/4199
Update MTP doc by @ispobock in https://github.com/sgl-project/sglang/pull/4290
Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @yych0745 in https://github.com/sgl-project/sglang/pull/4287
update doc by @zhyncs in https://github.com/sgl-project/sglang/pull/4299
[AMD] Fix rocm sgl-kernel missing modules error by @BruceXcluding in https://github.com/sgl-project/sglang/pull/4311
Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/4220
refactor: move image processors to separate files by @mickqian in https://github.com/sgl-project/sglang/pull/4229
upgrade flashinfer 0.2.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4317
unify is_cuda and is_hip by @zhyncs in https://github.com/sgl-project/sglang/pull/4321
Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @lambert0312 in https://github.com/sgl-project/sglang/pull/4323
[Docs] Clean up benchmark_and_profiling.md by @windsonsea in https://github.com/sgl-project/sglang/pull/4297
refine sgl_moe_align_block_size_benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/4327
Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% by @hebiao064 in https://github.com/sgl-project/sglang/pull/4215
Add awq dequantize kernel to sgl with 1x to 3x speedup by @zcnrex in https://github.com/sgl-project/sglang/pull/4104
fix awq_dequantize by @zhyncs in https://github.com/sgl-project/sglang/pull/4333
release 0.0.4.post3 sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4331
upgrade sgl-kernel 0.0.4.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4334
Add INT8 support MTP NextN function by @lambert0312 in https://github.com/sgl-project/sglang/pull/3911
[Fix] fix _yarn_linear_ramp_mask with device parameter by @Alcanderian in https://github.com/sgl-project/sglang/pull/4337
remove the unused readline dependency from the Qwen2 model implementa… by @yych0745 in https://github.com/sgl-project/sglang/pull/4340
model: Support Janus-pro by @mickqian in https://github.com/sgl-project/sglang/pull/3203
Hierarchical Caching Refactoring and Fixing TP issue by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4082
Support Blackwell Block Scale FP8 Gemm by @elfiegg in https://github.com/sgl-project/sglang/pull/4278
typo: Update http_server.py by @WrRan in https://github.com/sgl-project/sglang/pull/4350
Update nightly tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4352
[Fix Doc.] Enable internal forwarding when starting the router by @shizhediao in https://github.com/sgl-project/sglang/pull/4355
Move output processing logic from scheduler.py into a separate file by @merrymercy in https://github.com/sgl-project/sglang/pull/4354
Fix scheduler proctitle suffix is None by @cnwenf in https://github.com/sgl-project/sglang/pull/4326
feat: support ep size < 32 for sgl kernel by @shuaills in https://github.com/sgl-project/sglang/pull/4348
Fix per token fp8 quant precision by @qingquansong in https://github.com/sgl-project/sglang/pull/4362
Remove the choices in --speculative-eagle-topk argument by @Achazwl in https://github.com/sgl-project/sglang/pull/4329
docs: add parameter --log-requests-level by @panpan0000 in https://github.com/sgl-project/sglang/pull/4335
simple bugfix by @WrRan in https://github.com/sgl-project/sglang/pull/4342
Fix the doc of FR-Spec by @Achazwl in https://github.com/sgl-project/sglang/pull/4295
[Fix] Check the device backend before calling empty_cache function by @cboss6 in https://github.com/sgl-project/sglang/pull/4212
[FIX] fix incorrect output when enable both deepgemm and torch compile by @AniZpZ in https://github.com/sgl-project/sglang/pull/4359
add INT8 example into dsv3 README by @laixinn in https://github.com/sgl-project/sglang/pull/4079
Avoid duplicated request ids in batch APIs by @tanconghui in https://github.com/sgl-project/sglang/pull/4026
example: add async offline inference demo by @kuizhiqing in https://github.com/sgl-project/sglang/pull/3961
Add device detection and count functions to utils. by @vshekhawat-hlab in https://github.com/sgl-project/sglang/pull/3962
Move aiohttp into public dependencies by @stevapple in https://github.com/sgl-project/sglang/pull/3980
[tools] add fp8 max/min constant in utils by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/3959
HotFix: json serialization error when using OAI v1/batches endpoint with logprobs by @dcfidalgo in https://github.com/sgl-project/sglang/pull/3896
[docs] Update outdated description about torch.compile by @junliu-mde in https://github.com/sgl-project/sglang/pull/3844
[Doc] Fix typo in backend/sampling_params by @yang-ybb in https://github.com/sgl-project/sglang/pull/3835
Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation by @HermitSun in https://github.com/sgl-project/sglang/pull/3814
[moe] fix: correct the cache size in the last chunk by @ch-wan in https://github.com/sgl-project/sglang/pull/3679
Support page size > 1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4356
[XPU][CPU] Enable the native path of DeepSeek by @airMeng in https://github.com/sgl-project/sglang/pull/4086
Revert "[XPU][CPU] Enable the native path of DeepSeek" by @merrymercy in https://github.com/sgl-project/sglang/pull/4367
Update grafana.json by @dblate in https://github.com/sgl-project/sglang/pull/4374
fix accuracy issue by @zhyncs in https://github.com/sgl-project/sglang/pull/4376
bump 0.0.5 sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4377
upgrade sgl-kernel 0.0.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/4381
chore: bump v0.4.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4041

New Contributors

@yosoyjay made their first contribution in https://github.com/sgl-project/sglang/pull/3505
@FrankLeeeee made their first contribution in https://github.com/sgl-project/sglang/pull/3548
@Jiadalee made their first contribution in https://github.com/sgl-project/sglang/pull/3598
@whybeyoung made their first contribution in https://github.com/sgl-project/sglang/pull/3624
@fsx950223 made their first contribution in https://github.com/sgl-project/sglang/pull/3665
@panpan0000 made their first contribution in https://github.com/sgl-project/sglang/pull/3700
@ch-wan made their first contribution in https://github.com/sgl-project/sglang/pull/3692
@Chen-XiaoBing made their first contribution in https://github.com/sgl-project/sglang/pull/3705
@aoshen524 made their first contribution in https://github.com/sgl-project/sglang/pull/3652
@trayvonpan made their first contribution in https://github.com/sgl-project/sglang/pull/3588
@zixuanzhang226 made their first contribution in https://github.com/sgl-project/sglang/pull/3680
@andjsmi made their first contribution in https://github.com/sgl-project/sglang/pull/3740
@shahizat made their first contribution in https://github.com/sgl-project/sglang/pull/3761
@laixinn made their first contribution in https://github.com/sgl-project/sglang/pull/3730
@He1pa made their first contribution in https://github.com/sgl-project/sglang/pull/3799
@wilsonwu made their first contribution in https://github.com/sgl-project/sglang/pull/3741
@yuanheng-zhao made their first contribution in https://github.com/sgl-project/sglang/pull/3641
@nvcastet made their first contribution in https://github.com/sgl-project/sglang/pull/3709
@hcyz33 made their first contribution in https://github.com/sgl-project/sglang/pull/3841
@kebe7jun made their first contribution in https://github.com/sgl-project/sglang/pull/3519
@JC1DA made their first contribution in https://github.com/sgl-project/sglang/pull/3298
@Chi-Chu319 made their first contribution in https://github.com/sgl-project/sglang/pull/3898
@Qiaolin-Yu made their first contribution in https://github.com/sgl-project/sglang/pull/3897
@xqoasis made their first contribution in https://github.com/sgl-project/sglang/pull/3905
@KCFindstr made their first contribution in https://github.com/sgl-project/sglang/pull/3866
@elfiegg made their first contribution in https://github.com/sgl-project/sglang/pull/3966
@Zhou-sx made their first contribution in https://github.com/sgl-project/sglang/pull/3822
@xihuai18 made their first contribution in https://github.com/sgl-project/sglang/pull/4000
@cboss6 made their first contribution in https://github.com/sgl-project/sglang/pull/3954
@Xiuyu-Li made their first contribution in https://github.com/sgl-project/sglang/pull/3712
@sgjzfzzf made their first contribution in https://github.com/sgl-project/sglang/pull/3607
@zeroorhero made their first contribution in https://github.com/sgl-project/sglang/pull/3990
@samzong made their first contribution in https://github.com/sgl-project/sglang/pull/4101
@olliestanley made their first contribution in https://github.com/sgl-project/sglang/pull/4142
@windsonsea made their first contribution in https://github.com/sgl-project/sglang/pull/4162
@zcnrex made their first contribution in https://github.com/sgl-project/sglang/pull/4197
@brighill made their first contribution in https://github.com/sgl-project/sglang/pull/4181
@DavidChan0519 made their first contribution in https://github.com/sgl-project/sglang/pull/3958
@Young1993 made their first contribution in https://github.com/sgl-project/sglang/pull/4144
@lambert0312 made their first contribution in https://github.com/sgl-project/sglang/pull/4136
@yych0745 made their first contribution in https://github.com/sgl-project/sglang/pull/4287
@Ximingwang-09 made their first contribution in https://github.com/sgl-project/sglang/pull/4220
@Alcanderian made their first contribution in https://github.com/sgl-project/sglang/pull/4337
@shizhediao made their first contribution in https://github.com/sgl-project/sglang/pull/4355
@cnwenf made their first contribution in https://github.com/sgl-project/sglang/pull/4326
@qingquansong made their first contribution in https://github.com/sgl-project/sglang/pull/4362
@AniZpZ made their first contribution in https://github.com/sgl-project/sglang/pull/4359
@tanconghui made their first contribution in https://github.com/sgl-project/sglang/pull/4026
@kuizhiqing made their first contribution in https://github.com/sgl-project/sglang/pull/3961
@vshekhawat-hlab made their first contribution in https://github.com/sgl-project/sglang/pull/3962
@stevapple made their first contribution in https://github.com/sgl-project/sglang/pull/3980
@dcfidalgo made their first contribution in https://github.com/sgl-project/sglang/pull/3896
@junliu-mde made their first contribution in https://github.com/sgl-project/sglang/pull/3844
@yang-ybb made their first contribution in https://github.com/sgl-project/sglang/pull/3835
@airMeng made their first contribution in https://github.com/sgl-project/sglang/pull/4086
@dblate made their first contribution in https://github.com/sgl-project/sglang/pull/4374

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.3...v0.4.4

Source: README.md, updated 2025-03-13

SGLang Files

SGLang is a fast serving framework for large language models
