Name	Modified	Size	Downloads / Week
README.md	2025-04-07	39.8 kB	0
Release v0.4.5 source code.tar.gz	2025-04-07	3.6 MB	0
Release v0.4.5 source code.zip	2025-04-07	4.3 MB	0
Totals: 3 Items		8.0 MB	0

Highlights

The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, a FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.

New Features

Llama 4 Support: We added support for the Llama 4 models, with accuracy matching the official benchmark numbers: a zero-shot score of 75.2 on the MMLU Pro dataset for Llama-4-Scout-17B-16E-Instruct and 80.7 for Llama-4-Maverick-17B-128E-Instruct. https://github.com/sgl-project/sglang/pull/5092
FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. https://github.com/sgl-project/sglang/issues/4709
EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. https://github.com/sgl-project/sglang/pull/4247
DeepEP Integration: By incorporating DeepEP, we improved performance for MoE inference. https://github.com/sgl-project/sglang/pull/4232
Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations. https://github.com/sgl-project/sglang/pull/4654
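
As a rough illustration of how the FlashAttention 3 backend might be selected, the sketch below assembles a server launch command using the `--attention-backend fa3` flag named in the changelog. The helper `build_launch_command`, the placeholder model path, and the port value are assumptions made for this example and are not part of the release; the command is only printed, not executed.

```python
import shlex
from typing import List


def build_launch_command(
    model_path: str,
    attention_backend: str = "fa3",
    port: int = 30000,
) -> List[str]:
    """Assemble (but do not run) an SGLang server launch command.

    `build_launch_command` is a hypothetical helper for this example;
    `--attention-backend fa3` is the flag named in the release notes.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--attention-backend", attention_backend,
        "--port", str(port),
    ]


# Placeholder model path (an assumption); substitute a model you have access to.
cmd = build_launch_command("your-org/your-model")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes the backend choice easy to inspect or log before a server is actually started, for example by passing `cmd` to `subprocess.run`.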

Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!

Coming Soon

Disaggregated Prefill and Decoding: https://github.com/sgl-project/sglang/issues/4655
Llama 4 Optimization: https://github.com/sgl-project/sglang/issues/5118
EP Enhancement: https://github.com/sgl-project/sglang/issues/4734
FA3 Enhancement: https://github.com/sgl-project/sglang/issues/4709

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

What's Changed

Fix a regression introduced by overlapping KV cache writing by @merrymercy in https://github.com/sgl-project/sglang/pull/4375
Update ci_install_dependency.sh to use accelerate 1.4.0 by @merrymercy in https://github.com/sgl-project/sglang/pull/4392
Improve DP attention by @merrymercy in https://github.com/sgl-project/sglang/pull/4390
Fix auto merge & add back get_flat_data_by_layer by @merrymercy in https://github.com/sgl-project/sglang/pull/4393
Add some fused elementwise kernels for grok-1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4398
Fix Llama3.3 tool call support by @CatherineSue in https://github.com/sgl-project/sglang/pull/4320
Fix the output of hidden states after HTTP requests by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4269
Add a dummy grok test case by @merrymercy in https://github.com/sgl-project/sglang/pull/4399
Hot fix for hicache with new page aligned radixtree by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4397
bump v0.4.4.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/4402
Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/4403
Hierarchical Caching supports MLA by @zeroorhero in https://github.com/sgl-project/sglang/pull/4009
cleanup deps 1/n by @zhyncs in https://github.com/sgl-project/sglang/pull/4400
feat(remote_model): support variable remote backend for model loader by @DellCurry in https://github.com/sgl-project/sglang/pull/3964
[bug] fix duplicate variable MAX_PIXELS in qwen_vl.py by @qibaoyuan in https://github.com/sgl-project/sglang/pull/4419
[Doc] fix wrong flag in deepseek documentation by @lausannel in https://github.com/sgl-project/sglang/pull/4427
Add moe topk softmax templated from vllm by @qingquansong in https://github.com/sgl-project/sglang/pull/4302
bump v0.0.5.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/4437
Fix maximum recursion depth triggered on exception exit by @merrymercy in https://github.com/sgl-project/sglang/pull/4438
use topk_softmax with sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4439
docs: hot fix torch compile cache by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4442
ci: update transformers==4.48.3 by @mickqian in https://github.com/sgl-project/sglang/pull/4451
Fix test_create_kvindices unit test by @sleepcoo in https://github.com/sgl-project/sglang/pull/4452
[Fix] Fix errors when using the device except cuda. by @cboss6 in https://github.com/sgl-project/sglang/pull/4455
docs: Add Llama 3.3 to supported models by @JiangJiaWei1103 in https://github.com/sgl-project/sglang/pull/4453
Update bench_serving.py by @xu-song in https://github.com/sgl-project/sglang/pull/4454
bugfix: Update sampling_params.py by @WrRan in https://github.com/sgl-project/sglang/pull/4413
typos: Update sampling_params.md by @WrRan in https://github.com/sgl-project/sglang/pull/4391
Auto-detect device if not specified in server arguments. by @vshekhawat-hlab in https://github.com/sgl-project/sglang/pull/4423
Add support for upcoming QwenMoe by @michaelfeil in https://github.com/sgl-project/sglang/pull/4447
perf: update fused moe config by @mickqian in https://github.com/sgl-project/sglang/pull/4459
typos by @WrRan in https://github.com/sgl-project/sglang/pull/4368
Fix minor style by @merrymercy in https://github.com/sgl-project/sglang/pull/4460
cleanup deps 2/n by @zhyncs in https://github.com/sgl-project/sglang/pull/4464
feat: Add FlashMLA submodule by @shuaills in https://github.com/sgl-project/sglang/pull/4449
[Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. by @Alcanderian in https://github.com/sgl-project/sglang/pull/4466
Fix finish step for pr tests and notebook tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4467
Remove filter for pr-tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4468
Add greedy verification kernel by @Ying1123 in https://github.com/sgl-project/sglang/pull/4383
Release sgl-kernel v0.0.5.post2 by @merrymercy in https://github.com/sgl-project/sglang/pull/4469
Revert "feat: Add FlashMLA submodule (#4449)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4470
[Eagle] Remove the greedy branch and some redundant code by @Ying1123 in https://github.com/sgl-project/sglang/pull/4363
Support FlashMLA backend by @sleepcoo in https://github.com/sgl-project/sglang/pull/4472
fix custom allreduce performance/accuracy problem by @yizhang2077 in https://github.com/sgl-project/sglang/pull/4477
400 on empty input_ids by @yinghai in https://github.com/sgl-project/sglang/pull/4481
Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/4484
Statistical Analysis of the Output Stability of the Deepseek Model by @tanzelin430 in https://github.com/sgl-project/sglang/pull/4202
model: support gemma-3-it by @mickqian in https://github.com/sgl-project/sglang/pull/4424
Initialize image processor for skip-tokenizer-init codepath by @yinghai in https://github.com/sgl-project/sglang/pull/4479
Fix: modelscope env comment by @huiwq1990 in https://github.com/sgl-project/sglang/pull/4474
Fix: Complete int32 to int64 conversion by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4465
[ROCm] enable moe topk softmax in amd by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/4448
Feat/support code completion by @woodx9 in https://github.com/sgl-project/sglang/pull/3612
Add endpoint for file support, purely to speed up processing of input_embeds. by @RinRin-32 in https://github.com/sgl-project/sglang/pull/2797
Set xgrammar as the default grammar backend by @minleminzui in https://github.com/sgl-project/sglang/pull/4386
Fix router test by @ByronHsu in https://github.com/sgl-project/sglang/pull/4483
[Fix] use torch.inference_mode() instead of torch.no_grad() by @Alcanderian in https://github.com/sgl-project/sglang/pull/4372
[Feature] Support Deepseek-VL2 by @ccw1996 in https://github.com/sgl-project/sglang/pull/2798
config: Update fused moe config by @mickqian in https://github.com/sgl-project/sglang/pull/4493
Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. by @solrex in https://github.com/sgl-project/sglang/pull/4418
Support Online Quantization for W8A8 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4485
Tool call with text by @xihuai18 in https://github.com/sgl-project/sglang/pull/4067
Nicer standalone engine inferface by @yinghai in https://github.com/sgl-project/sglang/pull/4480
[Fix] Resolve GPU Memory Leak in update_weights_from_tensor by @U-rara in https://github.com/sgl-project/sglang/pull/4446
[Doc] add doc for quantization w8a8_fp8 or w8a8_int8 by @HandH1998 in https://github.com/sgl-project/sglang/pull/4495
Fix data parallel + tensor parallel by @merrymercy in https://github.com/sgl-project/sglang/pull/4499
[ROCm] fix dtype by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/4510
Remove redundant type conversion by @merrymercy in https://github.com/sgl-project/sglang/pull/4513
Update readme by @merrymercy in https://github.com/sgl-project/sglang/pull/4517
[sgl-router] improvement to avoid hang by @yinghai in https://github.com/sgl-project/sglang/pull/4482
Revert "feat: update grouped_topk to support softmax and sigmoid" by @ispobock in https://github.com/sgl-project/sglang/pull/4505
bump v0.0.5.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4520
upgrade sgl-kernel 0.0.5.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4522
sglang quant module remove vllm dependency by @BBuf in https://github.com/sgl-project/sglang/pull/4507
Unit test for Hierarchical Caching by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4486
refactor: rewrite bench-mmmu-sglang by @mickqian in https://github.com/sgl-project/sglang/pull/4458
fix: second_per_grid_ts should be used to get mrope position by @mickqian in https://github.com/sgl-project/sglang/pull/3682
[Hotfix] solve fp8 w8a8 ci test fail by @BBuf in https://github.com/sgl-project/sglang/pull/4531
remove useless backend forward in rotary_embedding by @BBuf in https://github.com/sgl-project/sglang/pull/4500
Fix the incorrect args in benchmark_and_profiling.md by @tianyuzhou95 in https://github.com/sgl-project/sglang/pull/4542
cleanup deps 3/n by @zhyncs in https://github.com/sgl-project/sglang/pull/4541
Add deepseek v2 torch compile pr test by @ispobock in https://github.com/sgl-project/sglang/pull/4538
use sgl custom all reduce by @zhyncs in https://github.com/sgl-project/sglang/pull/4441
[Fix] Type annotation correction for UpdateWeightsFromTensorReqInput by @U-rara in https://github.com/sgl-project/sglang/pull/4532
[Feature] Support EAGLE 3 by @chromecast56 in https://github.com/sgl-project/sglang/pull/4247
Reduce computation and communication in DP attention by @ch-wan in https://github.com/sgl-project/sglang/pull/4521
[Feature] Support Tensor Parallelism and Weight Slicing for Lora by @aoshen524 in https://github.com/sgl-project/sglang/pull/4274
Optimize Triton decoding kernel for dynamic workload by @Alcanderian in https://github.com/sgl-project/sglang/pull/4553
[Fix] Fix raw_bs bug when using flashinfer mla and eagle by @Fridge003 in https://github.com/sgl-project/sglang/pull/4557
Create col-major and tma-aligned x_scale for deep_gemm.gemm_fp8_fp8_bf16_nt by @strgrb in https://github.com/sgl-project/sglang/pull/4515
[Feature] Integrate DeepEP into SGLang by @liz-badada in https://github.com/sgl-project/sglang/pull/4232
Support FlashMLA backend cuda graph by @sleepcoo in https://github.com/sgl-project/sglang/pull/4514
Add clang-format to pre-commit config by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/4583
[fix] fix initialization of _ENABLE_TORCH_INFERENCE_MODE by @Alcanderian in https://github.com/sgl-project/sglang/pull/4549
avoid cudaStreamSynchronize in DeepSeekV2AttentionMLA by @strgrb in https://github.com/sgl-project/sglang/pull/4577
Support n in OpenAI API completions by @ChuyueSun in https://github.com/sgl-project/sglang/pull/3446
[fix] fix illegal mem access and clean up triton attention backend by @Alcanderian in https://github.com/sgl-project/sglang/pull/4571
Enable setting sglang logger from Env Variable SGLANG_LOGGING_CONFIG_PATH by @guoyuhong in https://github.com/sgl-project/sglang/pull/4592
Update doc for MTP and DP attention by @ispobock in https://github.com/sgl-project/sglang/pull/4622
Support fp8 gemm for blackwell by @wenscarl in https://github.com/sgl-project/sglang/pull/4558
fix SUPPORT_CUTLASS_BLOCK_FP8 flag by @ch-wan in https://github.com/sgl-project/sglang/pull/4640
Set deepgemm to the default value in the hopper architecture. by @sleepcoo in https://github.com/sgl-project/sglang/pull/4613
[docs] Add links and fix grammars in deploy_on_k8s.md by @windsonsea in https://github.com/sgl-project/sglang/pull/4641
Align completion and chat_completion response to OpenAI API by @guoyuhong in https://github.com/sgl-project/sglang/pull/4637
[PD] Release initial code by @ByronHsu in https://github.com/sgl-project/sglang/pull/4654
fix: fix ipython running error for Engine due to outlines nest_asyncio by @minleminzui in https://github.com/sgl-project/sglang/pull/4582
update news for README by @zhyncs in https://github.com/sgl-project/sglang/pull/4664
Speed up per token and per tensor quant by 15% by @zcnrex in https://github.com/sgl-project/sglang/pull/4639
[quantization] fix channelwise conversion with scalar weight scale by @yundai424 in https://github.com/sgl-project/sglang/pull/4596
Correcting default configuration when benchmarking fused_moe by @penguin-wwy in https://github.com/sgl-project/sglang/pull/4665
[1/3] fix dsv3 awq issue by @AniZpZ in https://github.com/sgl-project/sglang/pull/4556
[Docs] Update docs for gemma3 and VLM chat templates by @adarshxs in https://github.com/sgl-project/sglang/pull/4674
[CI fix] test skipping modelopt on AMD by @adarshxs in https://github.com/sgl-project/sglang/pull/4677
fix flaky ut by @zhyncs in https://github.com/sgl-project/sglang/pull/4670
Add EAGLE mtbench benchmark script by @ispobock in https://github.com/sgl-project/sglang/pull/4676
Bug fix for metrics counter by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4660
[Bug Fix] Add partial rotary factor support for Phi-4 and upgrade to transformers v4.50.0 by @adarshxs in https://github.com/sgl-project/sglang/pull/3984
Optimize Permute Kernel in DeepEP by @xutizhou in https://github.com/sgl-project/sglang/pull/4643
fix typo SGLang supports three grammar backends by @BroadbentJim in https://github.com/sgl-project/sglang/pull/4679
close gemma2 in test_verl_engine.py temporarily by @yizhang2077 in https://github.com/sgl-project/sglang/pull/4685
Multiple tiny code cleanups by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4608
Support async in DeepEP by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4610
refactor: bug fixes and refactor for vlm by @mickqian in https://github.com/sgl-project/sglang/pull/4661
Move mem_state update into debug mode by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4525
Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B by @lkm2835 in https://github.com/sgl-project/sglang/pull/4064
Unify variable naming: replace is_in_free_group with is_not_in_free_group by @c1lovez1 in https://github.com/sgl-project/sglang/pull/4698
[ROCm] Enable MTP (NextN) on AMD GPU by @alexsun07 in https://github.com/sgl-project/sglang/pull/4631
Support FA3 as Attention backend by using --attention-backend fa3 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4680
rename benchmark_deepgemm_fp8_group_gemm.py by @tbzhang in https://github.com/sgl-project/sglang/pull/4605
[Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster by @zcnrex in https://github.com/sgl-project/sglang/pull/4396
Support dynamic version name in sglang's pyproject.toml by @guoyuhong in https://github.com/sgl-project/sglang/pull/4720
update pyproject by @zhyncs in https://github.com/sgl-project/sglang/pull/4731
[PD] Remove invalid parameter by @XucSh in https://github.com/sgl-project/sglang/pull/4721
Fix EAGLE3 for llama3.3 70b by @ispobock in https://github.com/sgl-project/sglang/pull/4716
Fix circular imports in gptq.py and unblock test explorer by @hebiao064 in https://github.com/sgl-project/sglang/pull/4736
[Model] Support Qwen2ForSequenceClassification by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/4609
Support FP4 gemm (1/2) by @trevor-m in https://github.com/sgl-project/sglang/pull/3899
Add DeepEP tests into CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4737
model: Minicpmo by @mickqian in https://github.com/sgl-project/sglang/pull/3023
support cu128 sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4744
[Benchmark] tilelang vs deepgemm vs w8a8_block_fp8_matmul by @zcnrex in https://github.com/sgl-project/sglang/pull/4735
Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4738
fix FlashMLA cudagraph config by @sleepcoo in https://github.com/sgl-project/sglang/pull/4691
Speedup warmup when DP > 1 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4695
Add endpoints to dump selected expert ids by @yuhsuan-t in https://github.com/sgl-project/sglang/pull/4435
add dsv3 int8 test by @HandH1998 in https://github.com/sgl-project/sglang/pull/4705
[Feature] Support "strict" in function calling by @DarkSharpness in https://github.com/sgl-project/sglang/pull/4310
Revert "Add DeepEP tests into CI (#4737)" by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4751
Fix test_expert_distribution failure by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4752
Fix warmup error when dp=1 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4753
Add retry for flaky tests in CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4755
[Fix] Fix unexpected idx bug of Phi-3-small by @Fridge003 in https://github.com/sgl-project/sglang/pull/4728
Warn users when release_memory_occupation is called without memory saver enabled by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4566
fix(typo): fix reply to replay in base_attn_backend.py by @Thysrael in https://github.com/sgl-project/sglang/pull/4784
Support recording experts workload in QWen2-MoE by @ch-wan in https://github.com/sgl-project/sglang/pull/4775
Fix popen_launch_server wait for 20 minutes when child process exits by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4777
Use metadata to detect version of package by @kebe7jun in https://github.com/sgl-project/sglang/pull/4782
Fix shared memory OOM on sm86 GPUs. by @Conless in https://github.com/sgl-project/sglang/pull/4797
Support compressed tensors fp8w8a8 by @BBuf in https://github.com/sgl-project/sglang/pull/4743
bump v0.4.4.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/4669
[3/3] fix dsv3 awq issue by @laixinn in https://github.com/sgl-project/sglang/pull/4719
Update supported_models.md: adding open-r1 Olympic Code 32B by HuggingFace by @didier-durand in https://github.com/sgl-project/sglang/pull/4628
Align finish reason and stream mode in openai api by @xihuai18 in https://github.com/sgl-project/sglang/pull/4388
support clip embedding model by @Titan-p in https://github.com/sgl-project/sglang/pull/4506
update xgrammar 0.1.17 by @zhyncs in https://github.com/sgl-project/sglang/pull/4804
Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4565
[FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4745
support cmake for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4706
Use apply_rope_with_cos_sin_cache_inplace for DeepSeek by @strgrb in https://github.com/sgl-project/sglang/pull/4764
Fix ut mla-test-1-gpu-amd by @strgrb in https://github.com/sgl-project/sglang/pull/4813
Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner by @gmlwns2000 in https://github.com/sgl-project/sglang/pull/4638
[k8s] Clarified the usage of shared memory. by @jsuchome in https://github.com/sgl-project/sglang/pull/4341
gemma3: impl get_attention_sliding_window_size for attn init by @vhain in https://github.com/sgl-project/sglang/pull/4823
add partial_json_parser and einops by @zhyncs in https://github.com/sgl-project/sglang/pull/4827
fix the release doc dependency issue by @zhyncs in https://github.com/sgl-project/sglang/pull/4828
Update doc for DeepSeek-V3-0324 by @ispobock in https://github.com/sgl-project/sglang/pull/4825
deps: lazy import optional dependencies gguf and torchvision by @vhain in https://github.com/sgl-project/sglang/pull/4826
Update MMMU Benchmark instructions by @ravi03071991 in https://github.com/sgl-project/sglang/pull/4694
Fix the nightly eval by lowering the threshold of neuralmagic/gemma-2-2b-it-FP8 by @merrymercy in https://github.com/sgl-project/sglang/pull/4830
Basic Cleanup by @danielholanda in https://github.com/sgl-project/sglang/pull/4833
Support (1 <= dp < tp) in the dp attention in DeepEP by @tarinkk in https://github.com/sgl-project/sglang/pull/4770
[Fix] Add compressed_tensors as deps by @ocss884 in https://github.com/sgl-project/sglang/pull/4819
Fix error due to CustomAllreduce setup failure by @kebe7jun in https://github.com/sgl-project/sglang/pull/4815
use default for torch.ops by @zhyncs in https://github.com/sgl-project/sglang/pull/4835
[CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder by @b8zhong in https://github.com/sgl-project/sglang/pull/3969
[Misc] Fix issues reported by torchfix by @b8zhong in https://github.com/sgl-project/sglang/pull/4837
Include context length in /v1/models response. by @jondurbin in https://github.com/sgl-project/sglang/pull/4809
[Fix] self.worker assignment in TpModelWorker and refactor references by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/4788
Fix the lora adapter when lora path is none by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4799
fix: fix typo of comments in w8a8_fp8.py by @ZhuJiaqi9905 in https://github.com/sgl-project/sglang/pull/4843
Remove retry in nightly tests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4846
Fix CI of test_patch_torch by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4844
IPv6 support by @vincent-4 in https://github.com/sgl-project/sglang/pull/3949
ci: add condition for daily docker build by @warjiang in https://github.com/sgl-project/sglang/pull/4487
[Fix] fix output_top_logprobs is not exist by @lambert0312 in https://github.com/sgl-project/sglang/pull/4597
fix: when use SGLANG_PORT this env,port is str by @lengrongfu in https://github.com/sgl-project/sglang/pull/4528
Support Page Size > 1 for FA3 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4832
Fix Engine error when enabling DP attention by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4648
fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest by @BroadbentJim in https://github.com/sgl-project/sglang/pull/4681
Support controlling nsys start and end range programmatically by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4688
Remove empty tool function name by @kebe7jun in https://github.com/sgl-project/sglang/pull/4704
Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. by @vshekhawat-hlab in https://github.com/sgl-project/sglang/pull/4712
get the python version from env by @DavidChan0519 in https://github.com/sgl-project/sglang/pull/4729
Fix torch.cuda.MemPool() internal assertion failure by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4687
Super tiny remove unused code by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4750
Support with_stack and record_shapes in profiler by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4740
test: reduce mem_fraction_static for gemma3 vision test by @vhain in https://github.com/sgl-project/sglang/pull/4840
Fix CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4853
Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed by @qingquansong in https://github.com/sgl-project/sglang/pull/4855
Revert "get the python version from env (#4729)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4863
[Feature] add multi-rank support for Lora by @jcbjcbjc in https://github.com/sgl-project/sglang/pull/4492
Clean up import vllm in quantization/init.py by @merrymercy in https://github.com/sgl-project/sglang/pull/4834
Fix wrong variable name when stopping memory profile by @Fr4nk1inCs in https://github.com/sgl-project/sglang/pull/4772
[Feat] support deepgemm for cmake by @yinfan98 in https://github.com/sgl-project/sglang/pull/4864
Make torch compile configurable for biased_grouped_topk by @qingquansong in https://github.com/sgl-project/sglang/pull/4749
update sgl-kernel test ci by @zhyncs in https://github.com/sgl-project/sglang/pull/4866
fix sampling issue by @zhyncs in https://github.com/sgl-project/sglang/pull/4871
bump sgl-kernel 0.0.5.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4768
fix sgl-kernel cu118 build by @zhyncs in https://github.com/sgl-project/sglang/pull/4872
[Feature] Support FA3 backend for MLA by @Fridge003 in https://github.com/sgl-project/sglang/pull/4831
upgrade sgl-kernel 0.0.5.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4873
update torch compile doc by @ispobock in https://github.com/sgl-project/sglang/pull/4874
bump v0.4.4.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4878
Fix BadRequestError wrong arguments and remove openai dependency by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4882
Improve stack trace of retry errors by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4845
Tiny fix doc error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4795
[Docs] Update DeepGemm at README.md by @yinfan98 in https://github.com/sgl-project/sglang/pull/4886
Update CODEOWNERS by @zhyncs in https://github.com/sgl-project/sglang/pull/4889
Delete test_deep_gemm.py by @yinfan98 in https://github.com/sgl-project/sglang/pull/4891
Add deepseek style fused moe group gate selection kernel by @qingquansong in https://github.com/sgl-project/sglang/pull/4530
quick fix: add default for new kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/4898
remove setup for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4899
[Misc] Clean m.def and add Development Tips by @yinfan98 in https://github.com/sgl-project/sglang/pull/4890
fix allreduce test by @yizhang2077 in https://github.com/sgl-project/sglang/pull/4909
Support page size > 1 + eagle by @merrymercy in https://github.com/sgl-project/sglang/pull/4908
Fix retract for page size > 1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4914
[Feature] use pytest for sgl-kernel by @adarshxs in https://github.com/sgl-project/sglang/pull/4896
fix bmm fp8 by @zhyncs in https://github.com/sgl-project/sglang/pull/4926
Fix the timeout for unit-test-2-gpu in pr-test.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/4927
Fix 2-gpu CI test and suppress some warnings by @merrymercy in https://github.com/sgl-project/sglang/pull/4930
[feat] add fa3 in sgl-kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/4902
Fix sglang frontend's incorrect dependency on torch by @seplos in https://github.com/sgl-project/sglang/pull/4931
[Fix] avoid stream sync and torch compile in prefill for fa3 backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/4932
cleanup sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4933
[Fix] Improve Lora tests and reduce CI runtime by @Fridge003 in https://github.com/sgl-project/sglang/pull/4925
Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4883
[Fix] Add torch compile for torch.clamp back by @Fridge003 in https://github.com/sgl-project/sglang/pull/4936
Fix oom error for large page size by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4913
[feat] interface for platforms abstraction by @Alcanderian in https://github.com/sgl-project/sglang/pull/4928
[Fix] revert clean m.def for cudagraph by @yinfan98 in https://github.com/sgl-project/sglang/pull/4944
refactor: multimodal data by @mickqian in https://github.com/sgl-project/sglang/pull/4754
bump sgl-kernel v0.0.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/4950
[Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu by @guoyuhong in https://github.com/sgl-project/sglang/pull/4953
use fa3 in sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4954
Revert PR 4764 & 4813 related to R1 RoPE by @guoyuhong in https://github.com/sgl-project/sglang/pull/4959
[Feature] Support DeepEP Low Latency by @liz-badada in https://github.com/sgl-project/sglang/pull/4767
update bench_serving by @zhyncs in https://github.com/sgl-project/sglang/pull/4958
Prevent memory leak of retract_decode when page_size > 1 by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4977
[VLM RLHF] Take Image input for verl vlm rollout by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/4915
Large page size aligned hierarchical caching by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4581
bug fix for hicache host eviction by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4989
sgl scaled_fp8_quant support output padding by @BBuf in https://github.com/sgl-project/sglang/pull/4861
Add Eagle Speculative Decoding to FA3 Backend by @qingquansong in https://github.com/sgl-project/sglang/pull/4951
Update tokenizer_manager.py by @yangky11 in https://github.com/sgl-project/sglang/pull/5008
[sgl-kernel] per token group quant support COLUMN MAJOR by @BBuf in https://github.com/sgl-project/sglang/pull/4817
update cutlass tag by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5011
Feature/revise docs ci by @renxinx in https://github.com/sgl-project/sglang/pull/5009
fix: fix illegal cuda memory access at fused_moe_kernel by @saltyfish66 in https://github.com/sgl-project/sglang/pull/4727
[Build] Support build sgl-kernel with ccache by @guoyuhong in https://github.com/sgl-project/sglang/pull/5020
fix deepgemm as well by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5030
try to fix ci oserror by @BBuf in https://github.com/sgl-project/sglang/pull/5024
Replace enable_flashinfer_mla argument with attention_backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/5005
Small refactor DeepEPMode to clean up code a bit by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4992
[Fix] fix fa3 build at cu118 by @yinfan98 in https://github.com/sgl-project/sglang/pull/5036
Revert "Replace enable_flashinfer_mla argument with attention_backend" by @merrymercy in https://github.com/sgl-project/sglang/pull/5048
bump sgl-kernel v0.0.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/5046
update eagle-3 docs by @simveit in https://github.com/sgl-project/sglang/pull/4796
Add LlavaLlamaForCausaLM in MultiModal Processors by @ravi03071991 in https://github.com/sgl-project/sglang/pull/5039
Update the retry count by @zhyncs in https://github.com/sgl-project/sglang/pull/5051
upgrade sgl-kernel v0.0.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/5049
[2/3] fix dsv3 awq issue by @AniZpZ in https://github.com/sgl-project/sglang/pull/4625
Feature/revise docs ci by @renxinx in https://github.com/sgl-project/sglang/pull/5056
Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 by @M0gician in https://github.com/sgl-project/sglang/pull/5057
[fix] remove cuda_device_count_stateless by @Alcanderian in https://github.com/sgl-project/sglang/pull/5060
Small refactor DeepEPDispatcher into subclasses by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4994
Support async DeepEP by splitting into two stages by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4995
Cleanup unused resources after DeepEP operation by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4996
Add DeepSeek V3/R1 shared experts fusion by @BBuf in https://github.com/sgl-project/sglang/pull/4918
[deepep] fix: shared experts are not initialized when shared experts fusion is disabled by @ch-wan in https://github.com/sgl-project/sglang/pull/5072
fix dummy-load deepseekv2 by @inkcherry in https://github.com/sgl-project/sglang/pull/4535
support sgl-kernel on blackwell by @zhyncs in https://github.com/sgl-project/sglang/pull/5074
FA3 Spec Decoding to support top k = 1 and add cuda graph support by @hebiao064 in https://github.com/sgl-project/sglang/pull/5050
[Revision] Replace enable_flashinfer_mla argument with attention_backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/5052
upgrade transformers 4.51.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/5088
sgl-kernel transfer custom allreduce from trt kernel to vllm kernel by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5079
bump sgl-kernel 0.0.8 by @zhyncs in https://github.com/sgl-project/sglang/pull/5089
python transfer custom allreduce from trt kernel to vllm kernel by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5080
bump v0.4.4.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/5091
Fix: Reduce the number of document ci attempts to avoid long ci running by @minleminzui in https://github.com/sgl-project/sglang/pull/5097
Add Llama4 support by @CatherineSue in https://github.com/sgl-project/sglang/pull/5092
Fix refactor error - fp8.py by @HaiShaw in https://github.com/sgl-project/sglang/pull/5106
bump v0.4.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/5117

New Contributors

@DellCurry made their first contribution in https://github.com/sgl-project/sglang/pull/3964
@lausannel made their first contribution in https://github.com/sgl-project/sglang/pull/4427
@JiangJiaWei1103 made their first contribution in https://github.com/sgl-project/sglang/pull/4453
@xu-song made their first contribution in https://github.com/sgl-project/sglang/pull/4454
@yinghai made their first contribution in https://github.com/sgl-project/sglang/pull/4481
@tanzelin430 made their first contribution in https://github.com/sgl-project/sglang/pull/4202
@huiwq1990 made their first contribution in https://github.com/sgl-project/sglang/pull/4474
@woodx9 made their first contribution in https://github.com/sgl-project/sglang/pull/3612
@ccw1996 made their first contribution in https://github.com/sgl-project/sglang/pull/2798
@solrex made their first contribution in https://github.com/sgl-project/sglang/pull/4418
@U-rara made their first contribution in https://github.com/sgl-project/sglang/pull/4446
@tianyuzhou95 made their first contribution in https://github.com/sgl-project/sglang/pull/4542
@chromecast56 made their first contribution in https://github.com/sgl-project/sglang/pull/4247
@strgrb made their first contribution in https://github.com/sgl-project/sglang/pull/4515
@liz-badada made their first contribution in https://github.com/sgl-project/sglang/pull/4232
@Hongbosherlock made their first contribution in https://github.com/sgl-project/sglang/pull/4583
@guoyuhong made their first contribution in https://github.com/sgl-project/sglang/pull/4592
@wenscarl made their first contribution in https://github.com/sgl-project/sglang/pull/4558
@penguin-wwy made their first contribution in https://github.com/sgl-project/sglang/pull/4665
@xutizhou made their first contribution in https://github.com/sgl-project/sglang/pull/4643
@BroadbentJim made their first contribution in https://github.com/sgl-project/sglang/pull/4679
@lkm2835 made their first contribution in https://github.com/sgl-project/sglang/pull/4064
@c1lovez1 made their first contribution in https://github.com/sgl-project/sglang/pull/4698
@alexsun07 made their first contribution in https://github.com/sgl-project/sglang/pull/4631
@tbzhang made their first contribution in https://github.com/sgl-project/sglang/pull/4605
@XucSh made their first contribution in https://github.com/sgl-project/sglang/pull/4721
@yuhsuan-t made their first contribution in https://github.com/sgl-project/sglang/pull/4435
@Thysrael made their first contribution in https://github.com/sgl-project/sglang/pull/4784
@Conless made their first contribution in https://github.com/sgl-project/sglang/pull/4797
@gmlwns2000 made their first contribution in https://github.com/sgl-project/sglang/pull/4638
@jsuchome made their first contribution in https://github.com/sgl-project/sglang/pull/4341
@danielholanda made their first contribution in https://github.com/sgl-project/sglang/pull/4833
@tarinkk made their first contribution in https://github.com/sgl-project/sglang/pull/4770
@ocss884 made their first contribution in https://github.com/sgl-project/sglang/pull/4819
@b8zhong made their first contribution in https://github.com/sgl-project/sglang/pull/3969
@jondurbin made their first contribution in https://github.com/sgl-project/sglang/pull/4809
@JustinTong0323 made their first contribution in https://github.com/sgl-project/sglang/pull/4788
@ZhuJiaqi9905 made their first contribution in https://github.com/sgl-project/sglang/pull/4843
@vincent-4 made their first contribution in https://github.com/sgl-project/sglang/pull/3949
@warjiang made their first contribution in https://github.com/sgl-project/sglang/pull/4487
@lengrongfu made their first contribution in https://github.com/sgl-project/sglang/pull/4528
@jcbjcbjc made their first contribution in https://github.com/sgl-project/sglang/pull/4492
@Fr4nk1inCs made their first contribution in https://github.com/sgl-project/sglang/pull/4772
@seplos made their first contribution in https://github.com/sgl-project/sglang/pull/4931
@yangky11 made their first contribution in https://github.com/sgl-project/sglang/pull/5008
@renxinx made their first contribution in https://github.com/sgl-project/sglang/pull/5009
@saltyfish66 made their first contribution in https://github.com/sgl-project/sglang/pull/4727
@inkcherry made their first contribution in https://github.com/sgl-project/sglang/pull/4535

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.4...v0.4.5

Source: README.md, updated 2025-04-07

SGLang Files

SGLang is a fast serving framework for large language models
