Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2025-04-07 | 39.8 kB | |
Release v0.4.5 source code.tar.gz | 2025-04-07 | 3.6 MB | |
Release v0.4.5 source code.zip | 2025-04-07 | 4.3 MB | |
Totals: 3 Items | | 8.0 MB | 0 |
Highlights
The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.
New Features
- Llama 4 Support: We added support for the Llama 4 models with accuracy matching official benchmark numbers, achieving a zero-shot score of 75.2 on the MMLU Pro dataset for the `Llama-4-Scout-17B-16E-Instruct` model and 80.7 for the `Llama-4-Maverick-17B-128E-Instruct` model. https://github.com/sgl-project/sglang/pull/5092
- FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. https://github.com/sgl-project/sglang/issues/4709
- EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. https://github.com/sgl-project/sglang/pull/4247
- DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.
- Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!
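For readers new to speculative decoding, the idea behind the EAGLE3 highlight can be illustrated with a generic greedy verification step: a cheap draft model proposes several tokens, the target model checks them in one pass, and the longest matching prefix is accepted. This is a minimal sketch under stated assumptions, not SGLang's kernel or the EAGLE3 algorithm (which drafts from model features and can verify trees of candidates); the function name `greedy_verify` and its inputs are hypothetical.

```python
from typing import List, Sequence


def greedy_verify(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one speculative decoding step.

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 greedy choices of the target model, where
                   target_tokens[i] is the target's pick after the prefix
                   extended by draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    accepted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            # First mismatch: emit the target's own token and stop.
            accepted.append(target_tokens[i])
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_tokens[-1])
    return accepted


if __name__ == "__main__":
    # Two draft tokens match, the third does not.
    print(greedy_verify([5, 7, 9], [5, 7, 2, 4]))
```

Each step emits at least one token, so the output is the same as plain greedy decoding with the target model; the throughput gain comes from the steps in which several draft tokens are accepted at once.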
Coming Soon
- Disaggregated Prefill and Decoding: https://github.com/sgl-project/sglang/issues/4655
- Llama 4 Optimization: https://github.com/sgl-project/sglang/issues/5118
- EP Enhancement: https://github.com/sgl-project/sglang/issues/4734
- FA3 Enhancement: https://github.com/sgl-project/sglang/issues/4709
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
What's Changed
- Fix a regression introduced by overlapping KV cache writing by @merrymercy in https://github.com/sgl-project/sglang/pull/4375
- Update ci_install_dependency.sh to use accelerate 1.4.0 by @merrymercy in https://github.com/sgl-project/sglang/pull/4392
- Improve DP attention by @merrymercy in https://github.com/sgl-project/sglang/pull/4390
- Fix auto merge & add back get_flat_data_by_layer by @merrymercy in https://github.com/sgl-project/sglang/pull/4393
- Add some fused elementwise kernels for grok-1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4398
- Fix Llama3.3 tool call support by @CatherineSue in https://github.com/sgl-project/sglang/pull/4320
- Fix the output of hidden states after HTTP requests by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4269
- Add a dummy grok test case by @merrymercy in https://github.com/sgl-project/sglang/pull/4399
- Hot fix for hicache with new page aligned radixtree by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4397
- bump v0.4.4.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/4402
- Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/4403
- Hierarchical Caching supports MLA by @zeroorhero in https://github.com/sgl-project/sglang/pull/4009
- cleanup deps 1/n by @zhyncs in https://github.com/sgl-project/sglang/pull/4400
- feat(remote_model): support variable remote backend for model loader by @DellCurry in https://github.com/sgl-project/sglang/pull/3964
- [bug] fix duplicate variable MAX_PIXELS in qwen_vl.py by @qibaoyuan in https://github.com/sgl-project/sglang/pull/4419
- [Doc] fix wrong flag in deepseek documentation by @lausannel in https://github.com/sgl-project/sglang/pull/4427
- Add moe topk softmax templated from vllm by @qingquansong in https://github.com/sgl-project/sglang/pull/4302
- bump v0.0.5.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/4437
- Fix maximum recursion depth triggered on exception exit by @merrymercy in https://github.com/sgl-project/sglang/pull/4438
- use topk_softmax with sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4439
- docs: hot fix torch compile cache by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4442
- ci: update transformers==4.48.3 by @mickqian in https://github.com/sgl-project/sglang/pull/4451
- Fix test_create_kvindices unit test by @sleepcoo in https://github.com/sgl-project/sglang/pull/4452
- [Fix] Fix errors when using the device except cuda. by @cboss6 in https://github.com/sgl-project/sglang/pull/4455
- docs: Add Llama 3.3 to supported models by @JiangJiaWei1103 in https://github.com/sgl-project/sglang/pull/4453
- Update bench_serving.py by @xu-song in https://github.com/sgl-project/sglang/pull/4454
- bugfix: Update sampling_params.py by @WrRan in https://github.com/sgl-project/sglang/pull/4413
- typos: Update sampling_params.md by @WrRan in https://github.com/sgl-project/sglang/pull/4391
- Auto-detect device if not specified in server arguments. by @vshekhawat-hlab in https://github.com/sgl-project/sglang/pull/4423
- Add support for upcoming QwenMoe by @michaelfeil in https://github.com/sgl-project/sglang/pull/4447
- perf: update fused moe config by @mickqian in https://github.com/sgl-project/sglang/pull/4459
- typos by @WrRan in https://github.com/sgl-project/sglang/pull/4368
- Fix minor style by @merrymercy in https://github.com/sgl-project/sglang/pull/4460
- cleanup deps 2/n by @zhyncs in https://github.com/sgl-project/sglang/pull/4464
- feat: Add FlashMLA submodule by @shuaills in https://github.com/sgl-project/sglang/pull/4449
- [Fix] use `torch.cat` instead of `torch.concat` to prevent entering the `Autograd` backends. by @Alcanderian in https://github.com/sgl-project/sglang/pull/4466
- Fix finish step for pr tests and notebook tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4467
- Remove filter for pr-tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4468
- Add greedy verification kernel by @Ying1123 in https://github.com/sgl-project/sglang/pull/4383
- Release sgl-kernel v0.0.5.post2 by @merrymercy in https://github.com/sgl-project/sglang/pull/4469
- Revert "feat: Add FlashMLA submodule (#4449)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4470
- [Eagle] Remove the greedy branch and some redundant code by @Ying1123 in https://github.com/sgl-project/sglang/pull/4363
- Support FlashMLA backend by @sleepcoo in https://github.com/sgl-project/sglang/pull/4472
- fix custom allreduce performance/accuracy problem by @yizhang2077 in https://github.com/sgl-project/sglang/pull/4477
- 400 on empty input_ids by @yinghai in https://github.com/sgl-project/sglang/pull/4481
- Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/4484
- Statistical Analysis of the Output Stability of the Deepseek Model by @tanzelin430 in https://github.com/sgl-project/sglang/pull/4202
- model: support gemma-3-it by @mickqian in https://github.com/sgl-project/sglang/pull/4424
- Initialize image processor for skip-tokenizer-init codepath by @yinghai in https://github.com/sgl-project/sglang/pull/4479
- Fix: modelscope env comment by @huiwq1990 in https://github.com/sgl-project/sglang/pull/4474
- Fix: Complete int32 to int64 conversion by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4465
- [ROCm] enable moe topk softmax in amd by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/4448
- Feat/support code completion by @woodx9 in https://github.com/sgl-project/sglang/pull/3612
- Add endpoint for file support, purely to speed up processing of input_embeds. by @RinRin-32 in https://github.com/sgl-project/sglang/pull/2797
- Set xgrammar as the default grammar backend by @minleminzui in https://github.com/sgl-project/sglang/pull/4386
- Fix router test by @ByronHsu in https://github.com/sgl-project/sglang/pull/4483
- [Fix] use `torch.inference_mode()` instead of `torch.no_grad()` by @Alcanderian in https://github.com/sgl-project/sglang/pull/4372
- [Feature] Support Deepseek-VL2 by @ccw1996 in https://github.com/sgl-project/sglang/pull/2798
- config: Update fused moe config by @mickqian in https://github.com/sgl-project/sglang/pull/4493
- Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. by @solrex in https://github.com/sgl-project/sglang/pull/4418
- Support Online Quantization for W8A8 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4485
- Tool call with text by @xihuai18 in https://github.com/sgl-project/sglang/pull/4067
- Nicer standalone engine inferface by @yinghai in https://github.com/sgl-project/sglang/pull/4480
- [Fix] Resolve GPU Memory Leak in update_weights_from_tensor by @U-rara in https://github.com/sgl-project/sglang/pull/4446
- [Doc] add doc for quantization w8a8_fp8 or w8a8_int8 by @HandH1998 in https://github.com/sgl-project/sglang/pull/4495
- Fix data parallel + tensor parallel by @merrymercy in https://github.com/sgl-project/sglang/pull/4499
- [ROCm] fix dtype by @yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/4510
- Remove redundant type conversion by @merrymercy in https://github.com/sgl-project/sglang/pull/4513
- Update readme by @merrymercy in https://github.com/sgl-project/sglang/pull/4517
- [sgl-router] improvement to avoid hang by @yinghai in https://github.com/sgl-project/sglang/pull/4482
- Revert "feat: update grouped_topk to support softmax and sigmoid" by @ispobock in https://github.com/sgl-project/sglang/pull/4505
- bump v0.0.5.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4520
- upgrade sgl-kernel 0.0.5.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4522
- sglang quant module remove vllm dependency by @BBuf in https://github.com/sgl-project/sglang/pull/4507
- Unit test for Hierarchical Caching by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4486
- refactor: rewrite bench-mmmu-sglang by @mickqian in https://github.com/sgl-project/sglang/pull/4458
- fix: second_per_grid_ts should be used to get mrope position by @mickqian in https://github.com/sgl-project/sglang/pull/3682
- [Hotfix] solve fp8 w8a8 ci test fail by @BBuf in https://github.com/sgl-project/sglang/pull/4531
- remove useless backend forward in rotary_embedding by @BBuf in https://github.com/sgl-project/sglang/pull/4500
- Fix the incorrect args in benchmark_and_profiling.md by @tianyuzhou95 in https://github.com/sgl-project/sglang/pull/4542
- cleanup deps 3/n by @zhyncs in https://github.com/sgl-project/sglang/pull/4541
- Add deepseek v2 torch compile pr test by @ispobock in https://github.com/sgl-project/sglang/pull/4538
- use sgl custom all reduce by @zhyncs in https://github.com/sgl-project/sglang/pull/4441
- [Fix] Type annotation correction for UpdateWeightsFromTensorReqInput by @U-rara in https://github.com/sgl-project/sglang/pull/4532
- [Feature] Support EAGLE 3 by @chromecast56 in https://github.com/sgl-project/sglang/pull/4247
- Reduce computation and communication in DP attention by @ch-wan in https://github.com/sgl-project/sglang/pull/4521
- [Feature] Support Tensor Parallelism and Weight Slicing for Lora by @aoshen524 in https://github.com/sgl-project/sglang/pull/4274
- Optimize Triton decoding kernel for dynamic workload by @Alcanderian in https://github.com/sgl-project/sglang/pull/4553
- [Fix] Fix raw_bs bug when using flashinfer mla and eagle by @Fridge003 in https://github.com/sgl-project/sglang/pull/4557
- Create col-major and tma-aligned x_scale for deep_gemm.gemm_fp8_fp8_bf16_nt by @strgrb in https://github.com/sgl-project/sglang/pull/4515
- [Feature] Integrate DeepEP into SGLang by @liz-badada in https://github.com/sgl-project/sglang/pull/4232
- Support FlashMLA backend cuda graph by @sleepcoo in https://github.com/sgl-project/sglang/pull/4514
- Add clang-format to pre-commit config by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/4583
- [fix] fix initialization of _ENABLE_TORCH_INFERENCE_MODE by @Alcanderian in https://github.com/sgl-project/sglang/pull/4549
- avoid cudaStreamSynchronize in DeepSeekV2AttentionMLA by @strgrb in https://github.com/sgl-project/sglang/pull/4577
- Support `n` in OpenAI API completions by @ChuyueSun in https://github.com/sgl-project/sglang/pull/3446
- [fix] fix illegal mem access and clean up triton attention backend by @Alcanderian in https://github.com/sgl-project/sglang/pull/4571
- Enable setting sglang logger from Env Variable `SGLANG_LOGGING_CONFIG_PATH` by @guoyuhong in https://github.com/sgl-project/sglang/pull/4592
- Update doc for MTP and DP attention by @ispobock in https://github.com/sgl-project/sglang/pull/4622
- Support fp8 gemm for blackwell by @wenscarl in https://github.com/sgl-project/sglang/pull/4558
- fix SUPPORT_CUTLASS_BLOCK_FP8 flag by @ch-wan in https://github.com/sgl-project/sglang/pull/4640
- Set deepgemm to the default value in the hopper architecture. by @sleepcoo in https://github.com/sgl-project/sglang/pull/4613
- [docs] Add links and fix grammars in deploy_on_k8s.md by @windsonsea in https://github.com/sgl-project/sglang/pull/4641
- Align completion and chat_completion response to OpenAI API by @guoyuhong in https://github.com/sgl-project/sglang/pull/4637
- [PD] Release initial code by @ByronHsu in https://github.com/sgl-project/sglang/pull/4654
- fix: fix ipython running error for Engine due to outlines nest_asyncio by @minleminzui in https://github.com/sgl-project/sglang/pull/4582
- update news for README by @zhyncs in https://github.com/sgl-project/sglang/pull/4664
- Speed up per token and per tensor quant by 15% by @zcnrex in https://github.com/sgl-project/sglang/pull/4639
- [quantization] fix channelwise conversion with scalar weight scale by @yundai424 in https://github.com/sgl-project/sglang/pull/4596
- Correcting default configuration when benchmarking fused_moe by @penguin-wwy in https://github.com/sgl-project/sglang/pull/4665
- [1/3] fix dsv3 awq issue by @AniZpZ in https://github.com/sgl-project/sglang/pull/4556
- [Docs] Update docs for gemma3 and VLM chat templates by @adarshxs in https://github.com/sgl-project/sglang/pull/4674
- [CI fix] test skipping modelopt on AMD by @adarshxs in https://github.com/sgl-project/sglang/pull/4677
- fix flaky ut by @zhyncs in https://github.com/sgl-project/sglang/pull/4670
- Add EAGLE mtbench benchmark script by @ispobock in https://github.com/sgl-project/sglang/pull/4676
- Bug fix for metrics counter by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4660
- [Bug Fix] Add partial rotary factor support for Phi-4 and upgrade to transformers v4.50.0 by @adarshxs in https://github.com/sgl-project/sglang/pull/3984
- Optimize Permute Kernel in DeepEP by @xutizhou in https://github.com/sgl-project/sglang/pull/4643
- fix typo SGLang supports three grammar backends by @BroadbentJim in https://github.com/sgl-project/sglang/pull/4679
- close gemma2 in test_verl_engine.py temporarily by @yizhang2077 in https://github.com/sgl-project/sglang/pull/4685
- Multiple tiny code cleanups by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4608
- Support async in DeepEP by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4610
- refactor: bug fixes and refactor for vlm by @mickqian in https://github.com/sgl-project/sglang/pull/4661
- Move mem_state update into debug mode by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4525
- Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B by @lkm2835 in https://github.com/sgl-project/sglang/pull/4064
- Unify variable naming: replace is_in_free_group with is_not_in_free_group by @c1lovez1 in https://github.com/sgl-project/sglang/pull/4698
- [ROCm] Enable MTP (NextN) on AMD GPU by @alexsun07 in https://github.com/sgl-project/sglang/pull/4631
- Support FA3 as Attention backend by using `--attention-backend fa3` by @hebiao064 in https://github.com/sgl-project/sglang/pull/4680
- rename benchmark_deepgemm_fp8_group_gemm.py by @tbzhang in https://github.com/sgl-project/sglang/pull/4605
- [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster by @zcnrex in https://github.com/sgl-project/sglang/pull/4396
- Support dynamic version name in sglang's pyproject.toml by @guoyuhong in https://github.com/sgl-project/sglang/pull/4720
- update pyproject by @zhyncs in https://github.com/sgl-project/sglang/pull/4731
- [PD] Remove invalid parameter by @XucSh in https://github.com/sgl-project/sglang/pull/4721
- Fix EAGLE3 for llama3.3 70b by @ispobock in https://github.com/sgl-project/sglang/pull/4716
- Fix circular imports in gptq.py and unblock test explorer by @hebiao064 in https://github.com/sgl-project/sglang/pull/4736
- [Model] Support Qwen2ForSequenceClassification by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/4609
- Support FP4 gemm (1/2) by @trevor-m in https://github.com/sgl-project/sglang/pull/3899
- Add DeepEP tests into CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4737
- model: Minicpmo by @mickqian in https://github.com/sgl-project/sglang/pull/3023
- support cu128 sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4744
- [Benchmark] tilelang vs deepgemm vs w8a8_block_fp8_matmul by @zcnrex in https://github.com/sgl-project/sglang/pull/4735
- Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4738
- fix FlashMLA cudagraph config by @sleepcoo in https://github.com/sgl-project/sglang/pull/4691
- Speedup warmup when DP > 1 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4695
- Add endpoints to dump selected expert ids by @yuhsuan-t in https://github.com/sgl-project/sglang/pull/4435
- add dsv3 int8 test by @HandH1998 in https://github.com/sgl-project/sglang/pull/4705
- [Feature] Support "strict" in function calling by @DarkSharpness in https://github.com/sgl-project/sglang/pull/4310
- Revert "Add DeepEP tests into CI (#4737)" by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4751
- Fix test_expert_distribution failure by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4752
- Fix warmup error when dp=1 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4753
- Add retry for flaky tests in CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4755
- [Fix] Fix unexpected idx bug of Phi-3-small by @Fridge003 in https://github.com/sgl-project/sglang/pull/4728
- Warn users when release_memory_occupation is called without memory saver enabled by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4566
- fix(typo): fix `reply` to `replay` in `base_attn_backend.py` by @Thysrael in https://github.com/sgl-project/sglang/pull/4784
- Support recording experts workload in QWen2-MoE by @ch-wan in https://github.com/sgl-project/sglang/pull/4775
- Fix popen_launch_server wait for 20 minutes when child process exits by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4777
- Use metadata to detect version of package by @kebe7jun in https://github.com/sgl-project/sglang/pull/4782
- Fix shared memory OOM on sm86 GPUs. by @Conless in https://github.com/sgl-project/sglang/pull/4797
- Support compressed tensors fp8w8a8 by @BBuf in https://github.com/sgl-project/sglang/pull/4743
- bump v0.4.4.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/4669
- [3/3] fix dsv3 awq issue by @laixinn in https://github.com/sgl-project/sglang/pull/4719
- Update supported_models.md: adding open-r1 Olympic Code 32B by HuggingFace by @didier-durand in https://github.com/sgl-project/sglang/pull/4628
- Align finish reason and stream mode in openai api by @xihuai18 in https://github.com/sgl-project/sglang/pull/4388
- support clip embedding model by @Titan-p in https://github.com/sgl-project/sglang/pull/4506
- update xgrammar 0.1.17 by @zhyncs in https://github.com/sgl-project/sglang/pull/4804
- Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4565
- [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4745
- support cmake for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4706
- Use apply_rope_with_cos_sin_cache_inplace for DeepSeek by @strgrb in https://github.com/sgl-project/sglang/pull/4764
- Fix ut mla-test-1-gpu-amd by @strgrb in https://github.com/sgl-project/sglang/pull/4813
- Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner by @gmlwns2000 in https://github.com/sgl-project/sglang/pull/4638
- [k8s] Clarified the usage of shared memory. by @jsuchome in https://github.com/sgl-project/sglang/pull/4341
- gemma3: impl `get_attention_sliding_window_size` for attn init by @vhain in https://github.com/sgl-project/sglang/pull/4823
- add partial_json_parser and einops by @zhyncs in https://github.com/sgl-project/sglang/pull/4827
- fix the release doc dependency issue by @zhyncs in https://github.com/sgl-project/sglang/pull/4828
- Update doc for DeepSeek-V3-0324 by @ispobock in https://github.com/sgl-project/sglang/pull/4825
- deps: lazy import optional dependencies `gguf` and `torchvision` by @vhain in https://github.com/sgl-project/sglang/pull/4826
- Update MMMU Benchmark instructions by @ravi03071991 in https://github.com/sgl-project/sglang/pull/4694
- Fix the nightly eval by lowering the threshold of `neuralmagic/gemma-2-2b-it-FP8` by @merrymercy in https://github.com/sgl-project/sglang/pull/4830
- Basic Cleanup by @danielholanda in https://github.com/sgl-project/sglang/pull/4833
- Support (1 <= dp < tp) in the dp attention in DeepEP by @tarinkk in https://github.com/sgl-project/sglang/pull/4770
- [Fix] Add compressed_tensors as deps by @ocss884 in https://github.com/sgl-project/sglang/pull/4819
- Fix error due to CustomAllreduce setup failure by @kebe7jun in https://github.com/sgl-project/sglang/pull/4815
- use default for torch.ops by @zhyncs in https://github.com/sgl-project/sglang/pull/4835
- [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder by @b8zhong in https://github.com/sgl-project/sglang/pull/3969
- [Misc] Fix issues reported by torchfix by @b8zhong in https://github.com/sgl-project/sglang/pull/4837
- Include context length in /v1/models response. by @jondurbin in https://github.com/sgl-project/sglang/pull/4809
- [Fix] `self.worker` assignment in `TpModelWorker` and refactor references by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/4788
- Fix the lora adapter when lora path is none by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4799
- fix: fix typo of comments in w8a8_fp8.py by @ZhuJiaqi9905 in https://github.com/sgl-project/sglang/pull/4843
- Remove retry in nightly tests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4846
- Fix CI of test_patch_torch by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4844
- IPv6 support by @vincent-4 in https://github.com/sgl-project/sglang/pull/3949
- ci: add condition for daily docker build by @warjiang in https://github.com/sgl-project/sglang/pull/4487
- [Fix] fix output_top_logprobs is not exist by @lambert0312 in https://github.com/sgl-project/sglang/pull/4597
- fix: when use SGLANG_PORT this env,port is str by @lengrongfu in https://github.com/sgl-project/sglang/pull/4528
- Support Page Size > 1 for FA3 by @hebiao064 in https://github.com/sgl-project/sglang/pull/4832
- Fix Engine error when enabling DP attention by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4648
- fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest by @BroadbentJim in https://github.com/sgl-project/sglang/pull/4681
- Support controlling nsys start and end range programmatically by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4688
- Remove empty tool function name by @kebe7jun in https://github.com/sgl-project/sglang/pull/4704
- Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. by @vshekhawat-hlab in https://github.com/sgl-project/sglang/pull/4712
- get the python version from env by @DavidChan0519 in https://github.com/sgl-project/sglang/pull/4729
- Fix torch.cuda.MemPool() internal assertion failure by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4687
- Super tiny remove unused code by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4750
- Support with_stack and record_shapes in profiler by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4740
- test: reduce `mem_fraction_static` for gemma3 vision test by @vhain in https://github.com/sgl-project/sglang/pull/4840
- Fix CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/4853
- Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed by @qingquansong in https://github.com/sgl-project/sglang/pull/4855
- Revert "get the python version from env (#4729)" by @zhyncs in https://github.com/sgl-project/sglang/pull/4863
- [Feature] add multi-rank support for Lora by @jcbjcbjc in https://github.com/sgl-project/sglang/pull/4492
- Clean up `import vllm` in quantization/__init__.py by @merrymercy in https://github.com/sgl-project/sglang/pull/4834
- Fix wrong variable name when stopping memory profile by @Fr4nk1inCs in https://github.com/sgl-project/sglang/pull/4772
- [Feat] support deepgemm for cmake by @yinfan98 in https://github.com/sgl-project/sglang/pull/4864
- Make torch compile configurable for biased_grouped_topk by @qingquansong in https://github.com/sgl-project/sglang/pull/4749
- update sgl-kernel test ci by @zhyncs in https://github.com/sgl-project/sglang/pull/4866
- fix sampling issue by @zhyncs in https://github.com/sgl-project/sglang/pull/4871
- bump sgl-kernel 0.0.5.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4768
- fix sgl-kernel cu118 build by @zhyncs in https://github.com/sgl-project/sglang/pull/4872
- [Feature] Support FA3 backend for MLA by @Fridge003 in https://github.com/sgl-project/sglang/pull/4831
- upgrade sgl-kernel 0.0.5.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/4873
- update torch compile doc by @ispobock in https://github.com/sgl-project/sglang/pull/4874
- bump v0.4.4.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/4878
- Fix BadRequestError wrong arguments and remove openai dependency by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4882
- Improve stack trace of retry errors by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4845
- Tiny fix doc error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4795
- [Docs] Update DeepGemm at README.md by @yinfan98 in https://github.com/sgl-project/sglang/pull/4886
- Update CODEOWNERS by @zhyncs in https://github.com/sgl-project/sglang/pull/4889
- Delete test_deep_gemm.py by @yinfan98 in https://github.com/sgl-project/sglang/pull/4891
- Add deepseek style fused moe group gate selection kernel by @qingquansong in https://github.com/sgl-project/sglang/pull/4530
- quick fix: add default for new kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/4898
- remove setup for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4899
- [Misc] Clean m.def and add Development Tips by @yinfan98 in https://github.com/sgl-project/sglang/pull/4890
- fix allreduce test by @yizhang2077 in https://github.com/sgl-project/sglang/pull/4909
- Support page size > 1 + eagle by @merrymercy in https://github.com/sgl-project/sglang/pull/4908
- Fix retract for page size > 1 by @merrymercy in https://github.com/sgl-project/sglang/pull/4914
- [Feature] use pytest for sgl-kernel by @adarshxs in https://github.com/sgl-project/sglang/pull/4896
- fix bmm fp8 by @zhyncs in https://github.com/sgl-project/sglang/pull/4926
- Fix the timeout for unit-test-2-gpu in pr-test.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/4927
- Fix 2-gpu CI test and suppress some warnings by @merrymercy in https://github.com/sgl-project/sglang/pull/4930
- [feat] add fa3 in sgl-kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/4902
- Fix sglang frontend's incorrect dependency on torch by @seplos in https://github.com/sgl-project/sglang/pull/4931
- [Fix] avoid stream sync and torch compile in prefill for fa3 backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/4932
- cleanup sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4933
- [Fix] Improve Lora tests and reduce CI runtime by @Fridge003 in https://github.com/sgl-project/sglang/pull/4925
- Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4883
- [Fix] Add torch compile for torch.clamp back by @Fridge003 in https://github.com/sgl-project/sglang/pull/4936
- Fix oom error for large page size by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4913
- [feat] interface for platforms abstraction by @Alcanderian in https://github.com/sgl-project/sglang/pull/4928
- [Fix] revert clean m.def for cudagraph by @yinfan98 in https://github.com/sgl-project/sglang/pull/4944
- refactor: multimodal data by @mickqian in https://github.com/sgl-project/sglang/pull/4754
- bump sgl-kernel v0.0.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/4950
- [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu by @guoyuhong in https://github.com/sgl-project/sglang/pull/4953
- use fa3 in sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/4954
- Revert PR 4764 & 4813 related to R1 RoPE by @guoyuhong in https://github.com/sgl-project/sglang/pull/4959
- [Feature] Support DeepEP Low Latency by @liz-badada in https://github.com/sgl-project/sglang/pull/4767
- update bench_serving by @zhyncs in https://github.com/sgl-project/sglang/pull/4958
- Prevent memory leak of retract_decode when page_size > 1 by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4977
- [VLM RLHF] Take Image input for verl vlm rollout by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/4915
- Large page size aligned hierarchical caching by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4581
- bug fix for hicache host eviction by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4989
- sgl scaled_fp8_quant support output padding by @BBuf in https://github.com/sgl-project/sglang/pull/4861
- Add Eagle Speculative Decoding to FA3 Backend by @qingquansong in https://github.com/sgl-project/sglang/pull/4951
- Update tokenizer_manager.py by @yangky11 in https://github.com/sgl-project/sglang/pull/5008
- [sgl-kernel] per token group quant support COLUMN MAJOR by @BBuf in https://github.com/sgl-project/sglang/pull/4817
- update cutlass tag by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5011
- Feature/revise docs ci by @renxinx in https://github.com/sgl-project/sglang/pull/5009
- fix: fix illegal cuda memory access at fused_moe_kernel by @saltyfish66 in https://github.com/sgl-project/sglang/pull/4727
- [Build] Support build sgl-kernel with ccache by @guoyuhong in https://github.com/sgl-project/sglang/pull/5020
- fix deepgemm as well by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5030
- try to fix ci oserror by @BBuf in https://github.com/sgl-project/sglang/pull/5024
- Replace enable_flashinfer_mla argument with attention_backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/5005
- Small refactor DeepEPMode to clean up code a bit by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4992
- [Fix] fix fa3 build at cu118 by @yinfan98 in https://github.com/sgl-project/sglang/pull/5036
- Revert "Replace enable_flashinfer_mla argument with attention_backend" by @merrymercy in https://github.com/sgl-project/sglang/pull/5048
- bump sgl-kernel v0.0.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/5046
- update eagle-3 docs by @simveit in https://github.com/sgl-project/sglang/pull/4796
- Add LlavaLlamaForCausaLM in MultiModal Processors by @ravi03071991 in https://github.com/sgl-project/sglang/pull/5039
- Update the retry count by @zhyncs in https://github.com/sgl-project/sglang/pull/5051
- upgrade sgl-kernel v0.0.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/5049
- [2/3] fix dsv3 awq issue by @AniZpZ in https://github.com/sgl-project/sglang/pull/4625
- Feature/revise docs ci by @renxinx in https://github.com/sgl-project/sglang/pull/5056
- Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 by @M0gician in https://github.com/sgl-project/sglang/pull/5057
- [fix] remove `cuda_device_count_stateless` by @Alcanderian in https://github.com/sgl-project/sglang/pull/5060
- Small refactor DeepEPDispatcher into subclasses by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4994
- Support async DeepEP by splitting into two stages by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4995
- Cleanup unused resources after DeepEP operation by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4996
- Add DeepSeek V3/R1 shared experts fusion by @BBuf in https://github.com/sgl-project/sglang/pull/4918
- [deepep] fix: shared experts are not initialized when shared experts fusion is disabled by @ch-wan in https://github.com/sgl-project/sglang/pull/5072
- fix dummy-load deepseekv2 by @inkcherry in https://github.com/sgl-project/sglang/pull/4535
- support sgl-kernel on blackwell by @zhyncs in https://github.com/sgl-project/sglang/pull/5074
- FA3 Spec Decoding to support top k = 1 and add cuda graph support by @hebiao064 in https://github.com/sgl-project/sglang/pull/5050
- [Revision] Replace enable_flashinfer_mla argument with attention_backend by @Fridge003 in https://github.com/sgl-project/sglang/pull/5052
- upgrade transformers 4.51.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/5088
- sgl-kernel transfer custom allreduce from trt kernel to vllm kernel by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5079
- bump sgl-kernel 0.0.8 by @zhyncs in https://github.com/sgl-project/sglang/pull/5089
- python transfer custom allreduce from trt kernel to vllm kernel by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5080
- bump v0.4.4.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/5091
- Fix: Reduce the number of document ci attempts to avoid long ci running by @minleminzui in https://github.com/sgl-project/sglang/pull/5097
- Add Llama4 support by @CatherineSue in https://github.com/sgl-project/sglang/pull/5092
- Fix refactor error - fp8.py by @HaiShaw in https://github.com/sgl-project/sglang/pull/5106
- bump v0.4.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/5117
New Contributors
- @DellCurry made their first contribution in https://github.com/sgl-project/sglang/pull/3964
- @lausannel made their first contribution in https://github.com/sgl-project/sglang/pull/4427
- @JiangJiaWei1103 made their first contribution in https://github.com/sgl-project/sglang/pull/4453
- @xu-song made their first contribution in https://github.com/sgl-project/sglang/pull/4454
- @yinghai made their first contribution in https://github.com/sgl-project/sglang/pull/4481
- @tanzelin430 made their first contribution in https://github.com/sgl-project/sglang/pull/4202
- @huiwq1990 made their first contribution in https://github.com/sgl-project/sglang/pull/4474
- @woodx9 made their first contribution in https://github.com/sgl-project/sglang/pull/3612
- @ccw1996 made their first contribution in https://github.com/sgl-project/sglang/pull/2798
- @solrex made their first contribution in https://github.com/sgl-project/sglang/pull/4418
- @U-rara made their first contribution in https://github.com/sgl-project/sglang/pull/4446
- @tianyuzhou95 made their first contribution in https://github.com/sgl-project/sglang/pull/4542
- @chromecast56 made their first contribution in https://github.com/sgl-project/sglang/pull/4247
- @strgrb made their first contribution in https://github.com/sgl-project/sglang/pull/4515
- @liz-badada made their first contribution in https://github.com/sgl-project/sglang/pull/4232
- @Hongbosherlock made their first contribution in https://github.com/sgl-project/sglang/pull/4583
- @guoyuhong made their first contribution in https://github.com/sgl-project/sglang/pull/4592
- @wenscarl made their first contribution in https://github.com/sgl-project/sglang/pull/4558
- @penguin-wwy made their first contribution in https://github.com/sgl-project/sglang/pull/4665
- @xutizhou made their first contribution in https://github.com/sgl-project/sglang/pull/4643
- @BroadbentJim made their first contribution in https://github.com/sgl-project/sglang/pull/4679
- @lkm2835 made their first contribution in https://github.com/sgl-project/sglang/pull/4064
- @c1lovez1 made their first contribution in https://github.com/sgl-project/sglang/pull/4698
- @alexsun07 made their first contribution in https://github.com/sgl-project/sglang/pull/4631
- @tbzhang made their first contribution in https://github.com/sgl-project/sglang/pull/4605
- @XucSh made their first contribution in https://github.com/sgl-project/sglang/pull/4721
- @yuhsuan-t made their first contribution in https://github.com/sgl-project/sglang/pull/4435
- @Thysrael made their first contribution in https://github.com/sgl-project/sglang/pull/4784
- @Conless made their first contribution in https://github.com/sgl-project/sglang/pull/4797
- @gmlwns2000 made their first contribution in https://github.com/sgl-project/sglang/pull/4638
- @jsuchome made their first contribution in https://github.com/sgl-project/sglang/pull/4341
- @danielholanda made their first contribution in https://github.com/sgl-project/sglang/pull/4833
- @tarinkk made their first contribution in https://github.com/sgl-project/sglang/pull/4770
- @ocss884 made their first contribution in https://github.com/sgl-project/sglang/pull/4819
- @b8zhong made their first contribution in https://github.com/sgl-project/sglang/pull/3969
- @jondurbin made their first contribution in https://github.com/sgl-project/sglang/pull/4809
- @JustinTong0323 made their first contribution in https://github.com/sgl-project/sglang/pull/4788
- @ZhuJiaqi9905 made their first contribution in https://github.com/sgl-project/sglang/pull/4843
- @vincent-4 made their first contribution in https://github.com/sgl-project/sglang/pull/3949
- @warjiang made their first contribution in https://github.com/sgl-project/sglang/pull/4487
- @lengrongfu made their first contribution in https://github.com/sgl-project/sglang/pull/4528
- @jcbjcbjc made their first contribution in https://github.com/sgl-project/sglang/pull/4492
- @Fr4nk1inCs made their first contribution in https://github.com/sgl-project/sglang/pull/4772
- @seplos made their first contribution in https://github.com/sgl-project/sglang/pull/4931
- @yangky11 made their first contribution in https://github.com/sgl-project/sglang/pull/5008
- @renxinx made their first contribution in https://github.com/sgl-project/sglang/pull/5009
- @saltyfish66 made their first contribution in https://github.com/sgl-project/sglang/pull/4727
- @inkcherry made their first contribution in https://github.com/sgl-project/sglang/pull/4535
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.4...v0.4.5