Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2025-06-10 | 72.0 kB | |
Release v0.4.7 source code.tar.gz | 2025-06-10 | 4.1 MB | |
Release v0.4.7 source code.zip | 2025-06-10 | 5.1 MB | |
Totals: 3 Items | | 9.2 MB | 0 |
Highlights
- The PD (prefill-decode) disaggregation and large-scale EP (expert parallelism) features previously described in the blog post have now been fully merged into the latest release.
- The results in the blog post have been reproduced by more than six industry teams, including the TensorRT LLM team.
- SGLang’s large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at large scale, running on GPU clusters with thousands of devices.
- In addition to DeepSeek V3/R1, PD disaggregation and large-scale EP now also support Qwen 3.
- Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3. Further optimizations are underway.
- SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.
We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team and open source community users. Your support and collaboration are deeply appreciated!
What's Changed
- Update nightly-test.yml by @merrymercy in https://github.com/sgl-project/sglang/pull/5797
- [CI] Improve github summary & enable fa3 for more models by @merrymercy in https://github.com/sgl-project/sglang/pull/5796
- [Docs] update grafana setup guide in production metrics by @PopSoda2002 in https://github.com/sgl-project/sglang/pull/5643
- [Misc] add structure logging, write to file and log tracing for SGL R… by @slin1237 in https://github.com/sgl-project/sglang/pull/5741
- Improve overlap scheduling by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5788
- Add Cutlass MLA attention backend by @trevor-m in https://github.com/sgl-project/sglang/pull/5390
- chore: upgrade sgl-kernel 0.1.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/5690
- Dockerfile.dev pip scikit_build_core by @BBuf in https://github.com/sgl-project/sglang/pull/5807
- Add a doc to fix sgl-kernel build link error in py39 with ccache by @BBuf in https://github.com/sgl-project/sglang/pull/5809
- Turn on overlap scheduler for multimodal models by @merrymercy in https://github.com/sgl-project/sglang/pull/5771
- Tiny refactor DefaultModelLoader.Source by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5482
- [Docs] Replace lists with tables for cleanup and readability in server_arguments by @windsonsea in https://github.com/sgl-project/sglang/pull/5276
- Revert "Tiny refactor DefaultModelLoader.Source" by @merrymercy in https://github.com/sgl-project/sglang/pull/5825
- Feat: add support for thinking mode via chat_template_kwargs.enable_t… by @minleminzui in https://github.com/sgl-project/sglang/pull/5551
- fix: fix the error where the content is None when reasoning and tool … by @minleminzui in https://github.com/sgl-project/sglang/pull/5838
- feat: Add fused moe triton config for qwen3 moe on h100 by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5833
- fused moe triton tuning script support qwen3 by @BBuf in https://github.com/sgl-project/sglang/pull/5842
- feat: Add fused moe triton config for qwen3bf16 moe on h20 by @yhyang201 in https://github.com/sgl-project/sglang/pull/5839
- [PD] support pd fake transfer for warmup by @whybeyoung in https://github.com/sgl-project/sglang/pull/5726
- [qwen3] qwen3moe_tune_h20 fp8 tp4 by @whybeyoung in https://github.com/sgl-project/sglang/pull/5846
- [Doc] Recover history of server_arguments.md by @Fridge003 in https://github.com/sgl-project/sglang/pull/5851
- feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 by @GeLee-Q in https://github.com/sgl-project/sglang/pull/5850
- [CI] test chunked prefill more by @merrymercy in https://github.com/sgl-project/sglang/pull/5798
- ROCm: update AITER by @HaiShaw in https://github.com/sgl-project/sglang/pull/5816
- [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/5847
- [Fix] Missing bootstrap_port field by @xutianyi1999 in https://github.com/sgl-project/sglang/pull/5823
- feat: update is_fa3_default_architecture by @zhyncs in https://github.com/sgl-project/sglang/pull/5854
- add fused moe config for qwen3moe fp8/bf16 by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5849
- chore: bump v0.4.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5845
- Support `max_completion_tokens` for OpenAIChatCompletions by @CatherineSue in https://github.com/sgl-project/sglang/pull/5857
- simplify fused_moe config logging by @BBuf in https://github.com/sgl-project/sglang/pull/5801
- [CI] tune the test order to warmup the server by @merrymercy in https://github.com/sgl-project/sglang/pull/5860
- Cutlass MLA decode - fix dtype error by @trevor-m in https://github.com/sgl-project/sglang/pull/5868
- cutlass 3.9 supported to improve fp8_blockwise_gemm by @BBuf in https://github.com/sgl-project/sglang/pull/5820
- [Feature] support auto chat template by @woodx9 in https://github.com/sgl-project/sglang/pull/4949
- Feat: support cuda graph for LoRA by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4115
- Add qwen3 30b fused moe config by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5859
- [Fix] Fix a bug for flashmla to run R1 model by @pengcuo in https://github.com/sgl-project/sglang/pull/5875
- Add A800 fused moe config for qwen3 30b by @lambert0312 in https://github.com/sgl-project/sglang/pull/5880
- [Misc] add service discovery for sgl router by @slin1237 in https://github.com/sgl-project/sglang/pull/5865
- [fix]: PyO3 macOS linking and consolidate on tracing for logging by @slin1237 in https://github.com/sgl-project/sglang/pull/5856
- chore: update Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5894
- [Docs] Update docs for Qwen3 and Qwen3MoE by @adarshxs in https://github.com/sgl-project/sglang/pull/5836
- Tables instead of bulletpoints for sampling doc by @simveit in https://github.com/sgl-project/sglang/pull/5841
- chore: update CODEOWNERS by @zhyncs in https://github.com/sgl-project/sglang/pull/5895
- [FEATURE] Enhance platform compatibility for ARM by @johnnynunez in https://github.com/sgl-project/sglang/pull/5746
- [CI] Add test_function_calling.py to run_suite.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/5896
- Auto set draft model path for MTP by @ispobock in https://github.com/sgl-project/sglang/pull/5793
- [fix] relax mem_fraction_static for h200 by @Alcanderian in https://github.com/sgl-project/sglang/pull/5893
- feat: support pythonic tool call and index in tool call streaming by @CatherineSue in https://github.com/sgl-project/sglang/pull/5725
- [Bugfix]: fix missing queue_time_start for requests from grammar_queue by @CatherineSue in https://github.com/sgl-project/sglang/pull/5696
- Add AMD MI300x Nightly Testing. by @saienduri in https://github.com/sgl-project/sglang/pull/5861
- chore: use torch 2.6 for sgl-kernel build by @zhyncs in https://github.com/sgl-project/sglang/pull/5898
- Fix check_env script by @lambert0312 in https://github.com/sgl-project/sglang/pull/5901
- [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels [#134] by @whybeyoung in https://github.com/sgl-project/sglang/pull/5830
- Bump Flashinfer to 0.2.5 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5870
- [Fix] Unload lora in HF_Runner if needed by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/5899
- Add A800 fused moe config for qwen3 235b by @lambert0312 in https://github.com/sgl-project/sglang/pull/5900
- Add sm_120 for blackwell by @zhjunqin in https://github.com/sgl-project/sglang/pull/5903
- [Feature] add support kimi vl model by @liwenju0 in https://github.com/sgl-project/sglang/pull/5383
- support vlm benchmark profile by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5905
- [fix] kimi-vl test in test_vision_openai_server.py by @Alcanderian in https://github.com/sgl-project/sglang/pull/5910
- [Misc] use parallel build for cmake in sgl-kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/5919
- [qwen3] support qwen3 ep moe by @laixinn in https://github.com/sgl-project/sglang/pull/5917
- Add TP2 MOE benchmarks for AMD. by @saienduri in https://github.com/sgl-project/sglang/pull/5909
- [Feat] Scale up fa3 kernel to sm8x arch by @yinfan98 in https://github.com/sgl-project/sglang/pull/5912
- chore: bump sgl-kernel 0.1.1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5932
- chore: upgrade sgl-kernel 0.1.1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5933
- Remove unused method `calculate_num_image_tokens` from qwen2_vl.py by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5783
- [PP] Add pipeline parallelism by @Ying1123 in https://github.com/sgl-project/sglang/pull/5724
- Fix lora batch processing when input lora_path contains None by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/5930
- add Thor & Spark by @johnnynunez in https://github.com/sgl-project/sglang/pull/5915
- fix: correct stream response when enable_thinking is set to false by @minleminzui in https://github.com/sgl-project/sglang/pull/5881
- fix: update model runner by @zhyncs in https://github.com/sgl-project/sglang/pull/5934
- chore: bump v0.4.6.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5939
- Support XiaomiMiMo/MiMo model inference by @ryang-max in https://github.com/sgl-project/sglang/pull/5921
- [PD] Vectorise group_concurrent_contiguous in NumPy by @yuan-luo in https://github.com/sgl-project/sglang/pull/5834
- Remove extra contiguous by @ispobock in https://github.com/sgl-project/sglang/pull/5953
- Update ci test and doc for MTP api change by @ispobock in https://github.com/sgl-project/sglang/pull/5952
- docs: Fix Qwen model typo by @JiangJiaWei1103 in https://github.com/sgl-project/sglang/pull/5944
- Optimize a pad operation to accelerate 25us by @hebiao064 in https://github.com/sgl-project/sglang/pull/5945
- Properly return error response in vertex_generate HTTP endpoint by @KCFindstr in https://github.com/sgl-project/sglang/pull/5956
- feat: add concurrency evaluation logic in mmmu benchmark by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5782
- Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. by @saienduri in https://github.com/sgl-project/sglang/pull/5960
- feat: Refactor DeepSeekV3 function call by @CatherineSue in https://github.com/sgl-project/sglang/pull/5908
- Remove token in token out in Native API by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/5967
- Support InternVL3 by @xiaomin-D in https://github.com/sgl-project/sglang/pull/5350
- Support MMMU benchmark for InternVL by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5968
- FA3 speed up: skip len operation and get batch size directly from forward batch by @lifuhuang in https://github.com/sgl-project/sglang/pull/5969
- [PD] NIXL backend Prefill TP & Decode TP+DP by @jokerwyt in https://github.com/sgl-project/sglang/pull/5681
- Fix set kv cache multi-stream by @ispobock in https://github.com/sgl-project/sglang/pull/5975
- Overlap qk norm with two streams by @ispobock in https://github.com/sgl-project/sglang/pull/5977
- fix: only upgrade nccl for cu128 by @zhyncs in https://github.com/sgl-project/sglang/pull/5986
- Fix Phi3 serving which was broke by earlier change by @hebiao064 in https://github.com/sgl-project/sglang/pull/5991
- [perf] H100 DeepSeek-V3 fused moe tuned config by @Alcanderian in https://github.com/sgl-project/sglang/pull/5998
- [Fix] Suppress dynamo logging when using flashinfer backend with torch compile by @Fridge003 in https://github.com/sgl-project/sglang/pull/5992
- [Minor] Fix duplicate method definitions in conversation.py by @lifuhuang in https://github.com/sgl-project/sglang/pull/6012
- Fix flaky issues of lora and add multi batch tests by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/5957
- Add `chat_template_kwargs` documentation by @vincentzed in https://github.com/sgl-project/sglang/pull/5679
- fix: fix broadcast_pyobj breaking VerlEngine by @ocss884 in https://github.com/sgl-project/sglang/pull/5997
- [PD] Allow customizing reserved tokens to avoid KV cache waste by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6002
- Update dev container config to support live code sync and improve docker setup guide by @lifuhuang in https://github.com/sgl-project/sglang/pull/6018
- [PD] Optimize disaggregation ib device help info by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5781
- [Test] Add flashmla attention backend test by @PopSoda2002 in https://github.com/sgl-project/sglang/pull/5587
- Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" by @Edenzzzz in https://github.com/sgl-project/sglang/pull/5555
- feat: Add a unified merge_state API by @DefTruth in https://github.com/sgl-project/sglang/pull/5428
- feat: append more comprehensive fields in messages instead of merely role and content by @minleminzui in https://github.com/sgl-project/sglang/pull/5996
- [Security][Bug] Prevent binding to all TCP interfaces by @adarshxs in https://github.com/sgl-project/sglang/pull/5752
- Fix prefill OOM error in the case of large page size by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5081
- Fix problem of large page size with chunked prefill by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/6046
- docs: add Google Cloud Vertex AI in Adoption and Sponsorship by @zhyncs in https://github.com/sgl-project/sglang/pull/6047
- docs: add new blog by @zhyncs in https://github.com/sgl-project/sglang/pull/6048
- Fix not "import os" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/6057
- Better PD initialization by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5751
- fix: deepep dockerfile, use pip install deepep. by @HanHan009527 in https://github.com/sgl-project/sglang/pull/5885
- [Fix] Fix and rename flashmla CI test by @Fridge003 in https://github.com/sgl-project/sglang/pull/6045
- chore: upgrade cutlass 3.9.2 by @zhyncs in https://github.com/sgl-project/sglang/pull/6004
- Fix sgl-kernel build on aarch64 platforms by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/6062
- Add DeepEP to CI PR Test by @liz-badada in https://github.com/sgl-project/sglang/pull/5655
- fix custom_allreduce namespace by @BBuf in https://github.com/sgl-project/sglang/pull/6039
- feat: add release workflow for SGLang kernels on aarch64 by @johnnynunez in https://github.com/sgl-project/sglang/pull/6010
- [Feature] Support for Ascend NPU backend by @botieking98 in https://github.com/sgl-project/sglang/pull/3853
- Fix the timeout for 8 gpu tests by @merrymercy in https://github.com/sgl-project/sglang/pull/6084
- Hint users DeepEP normal mode is incompatible with CUDA Graph by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5014
- Super tiny fix doc by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5233
- [Doc]Fix description for dp_size argument by @Fridge003 in https://github.com/sgl-project/sglang/pull/6063
- feat(engine): add bootstrap parameters to generate methods (dynamo) by @ishandhanani in https://github.com/sgl-project/sglang/pull/6075
- [refactor] slightly tidy fp8 module by @Alcanderian in https://github.com/sgl-project/sglang/pull/5993
- Clean up fa3 test from 8 gpus by @hebiao064 in https://github.com/sgl-project/sglang/pull/6105
- Deferring 8 GPU test by @ch-wan in https://github.com/sgl-project/sglang/pull/6102
- Update doc for MLA attention backends by @Fridge003 in https://github.com/sgl-project/sglang/pull/6034
- Clean logs for DeepSeek-V3 launching by @Fridge003 in https://github.com/sgl-project/sglang/pull/6079
- [CI]Add performance CI for VLM by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6038
- adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/6111
- optimize pad operations in fa3 to accelarate 100+us by @zminglei in https://github.com/sgl-project/sglang/pull/6077
- Overlap shared expert and routed expert computations by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5121
- Tiny refactor ModelConfig.from_server_args by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5219
- Tiny refactor weight loading logic by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5232
- [PD] Add control to slow down a server by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5572
- Change AMD test threshold by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6091
- DeepEP normal support deepgemm-contiguous by @sleepcoo in https://github.com/sgl-project/sglang/pull/5626
- [fix] fix pyproject.toml dependencies by @Alcanderian in https://github.com/sgl-project/sglang/pull/6119
- [Feature] Add FlashAttention3 as a backend for VisionAttention by @Othame in https://github.com/sgl-project/sglang/pull/5764
- [perf] dsv3 bmm fallback to bf16 by @Alcanderian in https://github.com/sgl-project/sglang/pull/5662
- [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/6097
- [sgl-kernel] fix: fix cu118 compile error by @yinfan98 in https://github.com/sgl-project/sglang/pull/6123
- upgrade xgrammar to 0.1.19 by @Ubospica in https://github.com/sgl-project/sglang/pull/6129
- Remove unecessary is_fa3_supported check by @hebiao064 in https://github.com/sgl-project/sglang/pull/6112
- chore: bump sgl-kernel 0.1.2 by @zhyncs in https://github.com/sgl-project/sglang/pull/6131
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/6132
- [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP by @yhyang201 in https://github.com/sgl-project/sglang/pull/5745
- Cutlass MLA: Disable split kv due to https://github.com/NVIDIA/cutlass/issues/2274 by @trevor-m in https://github.com/sgl-project/sglang/pull/6101
- opt flashinfer mla cat by @xu-yfei in https://github.com/sgl-project/sglang/pull/5822
- Update amd nightly concurrency. by @saienduri in https://github.com/sgl-project/sglang/pull/6141
- sampling_params: add thinking_budget by @thyecust in https://github.com/sgl-project/sglang/pull/6089
- [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph by @CatherineSue in https://github.com/sgl-project/sglang/pull/6162
- fix bug that gpu0 occupies more memory when hicache is turned on by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/5778
- chore: bump v0.4.6.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/6165
- KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost by @Simon-Li in https://github.com/sgl-project/sglang/pull/6016
- [fix] fix determine_n_share_experts_fusion by @Alcanderian in https://github.com/sgl-project/sglang/pull/6118
- Fix and Clean up chat-template requirement for VLM by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6114
- [Docs]Delete duplicate content by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/6146
- Revert "feat: add thinking_budget (#6089)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6181
- Added async_encode method to Engine by @shimizust in https://github.com/sgl-project/sglang/pull/4701
- Fix data parallel perf regression by @merrymercy in https://github.com/sgl-project/sglang/pull/6183
- Fix request abortion by @merrymercy in https://github.com/sgl-project/sglang/pull/6184
- Add typo checker in pre-commit by @applesaucethebun in https://github.com/sgl-project/sglang/pull/6179
- Remove duplicate IO Struct test by @emmanuel-ferdman in https://github.com/sgl-project/sglang/pull/6180
- [PD] Add simple unit test for disaggregation feature by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5654
- [CI] Disabled deepep tests temporarily because it takes too much time. by @merrymercy in https://github.com/sgl-project/sglang/pull/6186
- feat: support loogle eval by @zhyncs in https://github.com/sgl-project/sglang/pull/6190
- [fix] remove mixtral from is_fa3_default_architecture by @Alcanderian in https://github.com/sgl-project/sglang/pull/6191
- fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode by @GaoYusong in https://github.com/sgl-project/sglang/pull/6169
- chore: upgrade deepgemm by @zhyncs in https://github.com/sgl-project/sglang/pull/6073
- chore: bump sgl-kernel v0.1.2.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6195
- chore: upgrade sgl-kernel v0.1.2.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6196
- Handle empty input string for embedding models by @ravi03071991 in https://github.com/sgl-project/sglang/pull/5621
- doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct by @minleminzui in https://github.com/sgl-project/sglang/pull/6199
- [Docs] minor Qwen3 and reasoning parser docs fix by @adarshxs in https://github.com/sgl-project/sglang/pull/6032
- Improve structured outputs: fix race condition, server crash, metrics and style by @merrymercy in https://github.com/sgl-project/sglang/pull/6188
- [CI] Reorganize the 8 gpu tests by @merrymercy in https://github.com/sgl-project/sglang/pull/6192
- Add dev-deepep docker image by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6198
- Replace time.time() to time.perf_counter() for benchmarking. by @lifuhuang in https://github.com/sgl-project/sglang/pull/6178
- Update README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/6202
- Fix release-docs.yml to not use python 3.9 by @merrymercy in https://github.com/sgl-project/sglang/pull/6204
- Fix start_profile does not support with_stack and record_shapes by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6043
- [doc] add a note for --n-share-experts-fusion args by @BBuf in https://github.com/sgl-project/sglang/pull/6154
- Performing Vocabulary Parallelism for LM Head across Attention TP Groups by @ch-wan in https://github.com/sgl-project/sglang/pull/5558
- Update AMD CI docker to v0.4.6.post3-rocm630. by @saienduri in https://github.com/sgl-project/sglang/pull/6213
- Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs by @merrymercy in https://github.com/sgl-project/sglang/pull/6201
- [CI] Fix PD mooncake dependency error by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6212
- [CI] Re-enable pd disaggregation test by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6231
- fix some typos by @applesaucethebun in https://github.com/sgl-project/sglang/pull/6209
- [Docs] Add docs for `SGLANG_` and `SGL_` environment variables by @b8zhong in https://github.com/sgl-project/sglang/pull/6206
- [PP] Fix init_memory_pool desync & add PP for mixtral by @Ying1123 in https://github.com/sgl-project/sglang/pull/6223
- Revert "fix some typos" by @merrymercy in https://github.com/sgl-project/sglang/pull/6244
- chore: add hf_xet dep by @zhyncs in https://github.com/sgl-project/sglang/pull/6243
- Update AMD nightly deps. by @saienduri in https://github.com/sgl-project/sglang/pull/6241
- [PD] Add support for different TP sizes per DP rank by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5922
- Support incremental streaming of logprob/token_ids between scheduler and detokenizer by @merrymercy in https://github.com/sgl-project/sglang/pull/6225
- fix typo by @zhyncs in https://github.com/sgl-project/sglang/pull/6248
- Support tuning moe for llama 4 model by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6042
- Skip the flaky test_stateful_custom_logit_processor by @merrymercy in https://github.com/sgl-project/sglang/pull/6251
- [Llama4] Add docs note about enable multimodal by @b8zhong in https://github.com/sgl-project/sglang/pull/6235
- [VERL Use Case] Add torch_memory_saver into deps by @hebiao064 in https://github.com/sgl-project/sglang/pull/6247
- Fix two issues related to `--moe-dense-tp-size=1` by @ch-wan in https://github.com/sgl-project/sglang/pull/5657
- model(vlm): pixtral by @KivenChen in https://github.com/sgl-project/sglang/pull/5084
- [misc] deep_gemm fallback to NVRTC when NVCC not found by @Alcanderian in https://github.com/sgl-project/sglang/pull/6252
- Enable MI325X AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6259
- chore: bump v0.4.6.post4 by @zhyncs in https://github.com/sgl-project/sglang/pull/6245
- [CPU] Add CMakeLists.txt for sgl-kernel by @blzheng in https://github.com/sgl-project/sglang/pull/6115
- perf: optimize local_block_table memory allocation by @CatherineSue in https://github.com/sgl-project/sglang/pull/6273
- Fix a bug in schedule_policy by @Ying1123 in https://github.com/sgl-project/sglang/pull/6276
- [Bug] Fix accidental logger override caused by internVL. by @lifuhuang in https://github.com/sgl-project/sglang/pull/6282
- doc: update developer guide regarding mllms by @mickqian in https://github.com/sgl-project/sglang/pull/6138
- docs: fix a bad redirect by @b8zhong in https://github.com/sgl-project/sglang/pull/6300
- Enable unit tests for AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6283
- [AMD] Fix Llama 4 Scout and Maverick accuracy issues on MI300X by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/6274
- feat: add flush cache to EngineBase and HttpServerEngineAdapter by @ocss884 in https://github.com/sgl-project/sglang/pull/6009
- [fix][RL] Remove the incorrect barrier in init_weights_update_group by @zhuzilin in https://github.com/sgl-project/sglang/pull/5914
- [Feat] Support FlashMLA backend with MTP and FP8 KV cache by @quinnrong94 in https://github.com/sgl-project/sglang/pull/6109
- [misc] remove redundant platform codes by @Alcanderian in https://github.com/sgl-project/sglang/pull/6298
- Add fp8 gemm kernel for CPU in sgl-kernel and add gemm UT by @chunyuan-w in https://github.com/sgl-project/sglang/pull/6216
- Fix gpu mem check on CPU by @yiliu30 in https://github.com/sgl-project/sglang/pull/6317
- Reduce MoE memory usage by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6147
- Fix lora bench by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/6302
- Minor improvements of TokenizerManager / health check by @merrymercy in https://github.com/sgl-project/sglang/pull/6327
- Upgrade CUTLASS 4.0 by @elfiegg in https://github.com/sgl-project/sglang/pull/6336
- Support precomputed multimodal features for Qwen-VL and Gemma3 models. by @ysulsky in https://github.com/sgl-project/sglang/pull/6136
- [Fix] Improve dependencies for Blackwell image by @Fridge003 in https://github.com/sgl-project/sglang/pull/6334
- [2/2] Add python wrapper for CUTLASS FP8 Blockscale MoE Kernel. by @elfiegg in https://github.com/sgl-project/sglang/pull/5694
- feat: add dp attention support for Qwen 2/3 MoE models, fixes [#6088] by @Fr4nk1inCs in https://github.com/sgl-project/sglang/pull/6121
- Update CODEOWNERS by @merrymercy in https://github.com/sgl-project/sglang/pull/6359
- [Minor] cleanup unused imports by @merrymercy in https://github.com/sgl-project/sglang/pull/6358
- Fix amd ci by @merrymercy in https://github.com/sgl-project/sglang/pull/6360
- docs: update readme by @zhyncs in https://github.com/sgl-project/sglang/pull/6361
- model(vlm): mistral 3.1 by @KivenChen in https://github.com/sgl-project/sglang/pull/5099
- Fix one wasted kernel in DeepSeek and minor refactor by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6316
- Minor code cleanup refactor for DeepSeek models by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6324
- chore: bump sgl-kernel v0.1.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/6368
- perf: Optimize local attention memory allocation in FlashAttentionBackend by @CatherineSue in https://github.com/sgl-project/sglang/pull/6356
- docs: Update the MD files by @vincentzed in https://github.com/sgl-project/sglang/pull/6373
- [router] Add /list_workers endpoint to router by @zhuzilin in https://github.com/sgl-project/sglang/pull/6366
- Speed up when having padding tokens in DeepEP by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6175
- Use monotonic clock for interval measurement by @lifuhuang in https://github.com/sgl-project/sglang/pull/6211
- [fix] illegal memory in _fwd_kernel_ep_scatter_2 and _fwd_kernel_ep_gather by @xutizhou in https://github.com/sgl-project/sglang/pull/6348
- Fix stop_profile does not wait for finishing by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4741
- Support outputing details for bench_serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6107
- Tiny refactor bench_serving to improve extensibility by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6134
- Tiny refactor bench_serving to extract RequestFuncOutput.init_new by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6108
- Support custom DeepEP tuning config by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6257
- Fix expert distribution recorder and profiler command stuck forever by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6284
- Reland tiny refactor DefaultModelLoader.Source by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6041
- Add expert distribution APIs for engine by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6290
- fix: allow `launch_dummy_health_check_server` to start inside of running asyncio loop by @ishandhanani in https://github.com/sgl-project/sglang/pull/6330
- [Fix Chat API] add request id for chat/completion for tracing by @whybeyoung in https://github.com/sgl-project/sglang/pull/6364
- Fix CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/6362
- chore: upgrade sgl-kernel v0.1.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/6377
- Do not use FA3 for mistral by @merrymercy in https://github.com/sgl-project/sglang/pull/6379
- refactor: minor refactors regarding multimodal processing by @mickqian in https://github.com/sgl-project/sglang/pull/6187
- Add pipeline parallelism for Qwen2 and Qwen3 Model by @libratiger in https://github.com/sgl-project/sglang/pull/6250
- Clean up AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6365
- feat: add long context example by @zhyncs in https://github.com/sgl-project/sglang/pull/6391
- The Gemma template is missing a newline after the user role. by @ysulsky in https://github.com/sgl-project/sglang/pull/6331
- chore: tiny remove duplicated code by @doujiang24 in https://github.com/sgl-project/sglang/pull/6392
- Add 4-GPU runner tests and split existing tests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6383
- Add fp8 shared_expert kernel for CPU in sgl-kernel and add UT by @chunyuan-w in https://github.com/sgl-project/sglang/pull/6339
- [fix] fix fa3 forward_decode with spec_decode by @Alcanderian in https://github.com/sgl-project/sglang/pull/6395
- Add missing model to doc by @applesaucethebun in https://github.com/sgl-project/sglang/pull/6396
- [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat by @CatherineSue in https://github.com/sgl-project/sglang/pull/6397
- [Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/6333
- refactor: Extract repeated member variables in KVCache subclasses to base class. by @wangxiyu191 in https://github.com/sgl-project/sglang/pull/6323
- Refactor DeepSeek MoE layer to unify the two forward branches by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6325
- vlm: tensor hash kernel by @mickqian in https://github.com/sgl-project/sglang/pull/5974
- [Bugfix] Fix field error in v1_embedding_request by @CatherineSue in https://github.com/sgl-project/sglang/pull/6400
- Fix request id error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6401
- Implement `return_hidden_states` for the OpenAI API by @kyle-pena-kuzco in https://github.com/sgl-project/sglang/pull/6137
- Fix nodeepgemm init by @sleepcoo in https://github.com/sgl-project/sglang/pull/6417
- Improve supported models doc by @simveit in https://github.com/sgl-project/sglang/pull/6430
- Fix throughput threshold for amd ci test by @Fridge003 in https://github.com/sgl-project/sglang/pull/6414
- [Metrics] Add KV events publishing by @trevor-m in https://github.com/sgl-project/sglang/pull/6098
- [BUG] fix stop_profile crash by @yizhang2077 in https://github.com/sgl-project/sglang/pull/6431
- Revert "Implement `return_hidden_states` for the OpenAI API (#6137)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6440
- Expert distribution recording without overhead for EPLB by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4957
- Refactor communication logic of DeepSeek for extensibility and understandability by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6321
- Remove `Cargo.lock`, add it into .gitignore by @hnyls2002 in https://github.com/sgl-project/sglang/pull/6438
- Refactor DeepSeek logic into atomic operations by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6326
- Support loading weights when physical experts are different from logical experts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6386
- Support DeepSeek EPLB algorithm with static distributions by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6387
- Address performance regression: disable multiple streams on ROCm by @HaiShaw in https://github.com/sgl-project/sglang/pull/6412
- [QuickFix] fix gptq model initialize by @yinfan98 in https://github.com/sgl-project/sglang/pull/6429
- Update extend/decode attention kernel for CPU in sgl-kernel and add UTs by @yanbing-j in https://github.com/sgl-project/sglang/pull/6405
- [doc] add note for get_num_kv_splits in triton_backend by @Alcanderian in https://github.com/sgl-project/sglang/pull/6444
- Support dispatching logical to physical experts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6385
- Fix master CI for DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6447
- [docs] Fix torch version by @Edenzzzz in https://github.com/sgl-project/sglang/pull/6472
- Disable all two stream overlap on amd by @merrymercy in https://github.com/sgl-project/sglang/pull/6475
- Refactor group_concurrent_contiguous in NIXL by @yuan-luo in https://github.com/sgl-project/sglang/pull/6214
- aiter attention-backend (default enabled on AMD/ROCm) by @HaiShaw in https://github.com/sgl-project/sglang/pull/6381
- Implement Siglip Vision model, and support BNB quantization for gemma3-mm by @guapisolo in https://github.com/sgl-project/sglang/pull/5339
- [router] support http2 in router by @zhuzilin in https://github.com/sgl-project/sglang/pull/6487
- [RL] allow weight updation with dp attention enabled by @zhuzilin in https://github.com/sgl-project/sglang/pull/6311
- Refactor DeepSeek attention dispatching by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6476
- Fix num_qps_per_rank computation when providing custom DeepEP configuration by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6468
- Tiny add stage assertions to DeepEPDispatcher to avoid misuse by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6467
- Support redundant experts in expert parallel by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6461
- Tiny make Lint CI show diff by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6445
- Let bench_one_batch_server use sharegpt data to make expert distribution more natural by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5573
- Fix bench_one_batch_server by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6503
- [Fix]Fix capture fail bug for DeepSeek by @Fridge003 in https://github.com/sgl-project/sglang/pull/6275
- [CPU] Fix build issue by @blzheng in https://github.com/sgl-project/sglang/pull/6419
- fix: EXAONE when using tie_word_embeddings by @lkm2835 in https://github.com/sgl-project/sglang/pull/5759
- doc: Update README.md with adding deepwiki badge to enable weekly auto-refresh by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6508
- Recover from corrupted cache file in bench serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6510
- Apply constraint grammar to EAGLE by @ispobock in https://github.com/sgl-project/sglang/pull/6499
- [1/2] Support Qserve by @HandH1998 in https://github.com/sgl-project/sglang/pull/6457
- [PD] Add doc and simplify sender.send by @ByronHsu in https://github.com/sgl-project/sglang/pull/6019
- [PD] Abort request if transfer fails by @ByronHsu in https://github.com/sgl-project/sglang/pull/6504
- Add main for merge state tests by @yuan-luo in https://github.com/sgl-project/sglang/pull/6492
- Support updating expert locations dynamically by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6388
- [RL] Remove the w13 weight_scale and input_scale for UnquantizedEPMoE… by @zhuzilin in https://github.com/sgl-project/sglang/pull/6308
- Support logging expert balancedness metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6482
- Support dynamically rebalancing experts using EPLB by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6469
- Fix missing http status import for PD failure handler by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6520
- chore: bump sgl-kernel v0.1.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/6522
- Support qwen3 deepep by @sleepcoo in https://github.com/sgl-project/sglang/pull/6120
- chore: upgrade sgl-kernel v0.1.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/6532
- Support XiaomiMiMo inference with mtp by @ryang-max in https://github.com/sgl-project/sglang/pull/6059
- misc: fix accept_length by @zhyncs in https://github.com/sgl-project/sglang/pull/6536
- [PD] Fix failure abort by @ByronHsu in https://github.com/sgl-project/sglang/pull/6535
- [VLM] Support chunk prefill for VLM by @CatherineSue in https://github.com/sgl-project/sglang/pull/6355
- Add fp8 qkv_proj_with_rope kernel for CPU in sgl-kernel and add UT by @blzheng in https://github.com/sgl-project/sglang/pull/6493
- Add fp8 fused_experts kernel for CPU in sgl-kernel and add UT by @chunyuan-w in https://github.com/sgl-project/sglang/pull/6404
- Update sgl-kernel UTs for activation/topk/norm/rope kernels by @yanbing-j in https://github.com/sgl-project/sglang/pull/6452
- Fix topk inference performance reduce by @lambert0312 in https://github.com/sgl-project/sglang/pull/6474
- [PD] support spec decode by @ByronHsu in https://github.com/sgl-project/sglang/pull/6507
- [2/2] Support Qserve by @HandH1998 in https://github.com/sgl-project/sglang/pull/6521
- [PD] Support logprob & Add failure test by @ByronHsu in https://github.com/sgl-project/sglang/pull/6558
- fix: remove content=none test when tool called by @shuaills in https://github.com/sgl-project/sglang/pull/6347
- Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. by @MiterV1 in https://github.com/sgl-project/sglang/pull/6524
- Bugfix: handle flatten_batch constraint for multiple images by @CatherineSue in https://github.com/sgl-project/sglang/pull/6562
- support eplb for qwen3 by @yizhang2077 in https://github.com/sgl-project/sglang/pull/6533
- feat(Tool Calling): Support `required` and specific function mode by @CatherineSue in https://github.com/sgl-project/sglang/pull/6550
- [PD] Support structured output by @ByronHsu in https://github.com/sgl-project/sglang/pull/6560
- [FIX]remove ServerArgs duplicate code by @pc-neo in https://github.com/sgl-project/sglang/pull/6485
- Fix accuracy is zero when enabling moe-dense-tp-size as in large scale EP by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6567
- chore: bump v0.4.6.post5 by @zhyncs in https://github.com/sgl-project/sglang/pull/6566
- Temporarily disable MI325x 8 gpu testing. by @saienduri in https://github.com/sgl-project/sglang/pull/6576
- Fix GPU OOM by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/6564
- Refactor attention into multiple stages by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6477
- Add back DeepSeek non-TBO branches by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6578
- Utilize static dispatching for communicator by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6577
- Support overlapping two batches by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4068
- Refactor vlm embedding routine to use precomputed feature by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6543
- [OAI] Support non-normalized logprobs in OpenAI server by @CatherineSue in https://github.com/sgl-project/sglang/pull/5961
- Support Phi-4 Multi-Modal (text + vision only) by @lifuhuang in https://github.com/sgl-project/sglang/pull/6494
- Sgl-router Prometheus metrics endpoint and usage track metrics by @upfixer in https://github.com/sgl-project/sglang/pull/6537
- added support for tied weights in qwen pipeline parallelism by @FrankLeeeee in https://github.com/sgl-project/sglang/pull/6546
- Hint users when weight update timeouts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6570
- Fix some issues with current docs. by @simveit in https://github.com/sgl-project/sglang/pull/6588
- [PD] Fix prefill_servers in mini_lb by @wangxiyu191 in https://github.com/sgl-project/sglang/pull/6527
- Fix bench_serving does not support changing warmup requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6439
- Support fake perfectly balanced EP dispatch algorithm by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6571
- Fix profiling will crash the server when using num_steps by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6586
- Improve performance of two batch overlap in some imbalanced cases by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6593
- Logging and minor fixes to two batch overlap and EPLB by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6595
- Tiny change killall_sglang.sh by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6596
- Auto handle PD disaggregation in bench_serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6587
- Support accurate length control for bench serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6594
- Tiny fix lint CI does not trigger on master by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6609
- chore: upgrade transformers 4.52.3 by @zhyncs in https://github.com/sgl-project/sglang/pull/6575
- Revert "Tiny fix lint CI does not trigger on master (#6609)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6610
- refactor qwen moe code, use communicator to support tp+dp by @yizhang2077 in https://github.com/sgl-project/sglang/pull/6581
- feat: Improve Mistral and Qwen25 function call parsing by @CatherineSue in https://github.com/sgl-project/sglang/pull/6597
- qwen3moe support two batch overlap by @yizhang2077 in https://github.com/sgl-project/sglang/pull/6598
- Tiny fix CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6611
- Supported precomputed feature for Kimi VL by @lifuhuang in https://github.com/sgl-project/sglang/pull/6599
- [FA][Test] Fix Sparse FA test by @b8zhong in https://github.com/sgl-project/sglang/pull/6306
- fix qwen3moe eplb prefill bug by @yizhang2077 in https://github.com/sgl-project/sglang/pull/6617
- Automatically configure for EPLB-related args by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6628
- Fix EPLB algorithm fail to run when using 3 nodes for prefill by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6629
- Tiny fix missing expert location dispatch info by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6620
- Update nightly thresholds and dependencies. by @saienduri in https://github.com/sgl-project/sglang/pull/6635
- Tiny fix sampler error when prob is not contiguous by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6639
- [PD] Handle P/D failure and reconnect without affecting other instances by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6263
- follow-up: move Idefics2 to a shared location to eliminate unexpected dependency. by @lifuhuang in https://github.com/sgl-project/sglang/pull/6603
- fix: added "\n" to qwen25 tool parser structural tags by @shuaills in https://github.com/sgl-project/sglang/pull/6631
- [New Model] Devstral support by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6547
- chore: upgrade mooncake-transfer-engine by @zhyncs in https://github.com/sgl-project/sglang/pull/6643
- Tiny refactor communicator by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6646
- Support TP in attention for two batch overlap by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6634
- Super tiny rename environment variable by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6648
- Refactor LoRA handling to support adapter tensors in fused format by @lifuhuang in https://github.com/sgl-project/sglang/pull/6585
- [Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py by @CatherineSue in https://github.com/sgl-project/sglang/pull/6650
- update toc for doc and dockerfile code style format by @habaohaba in https://github.com/sgl-project/sglang/pull/6450
- Add note to add supported model to documentation by @b8zhong in https://github.com/sgl-project/sglang/pull/6640
- docs: Update documentation to reflect xgrammar as default grammar backend by @vincentzed in https://github.com/sgl-project/sglang/pull/6601
- Add environment flag for disabling message queue broadcaster by @Fridge003 in https://github.com/sgl-project/sglang/pull/6403
- fix: fix nightly test from updating transformers by @mickqian in https://github.com/sgl-project/sglang/pull/6658
- Fix qwen3 tbo/dp-lm-head by @yizhang2077 in https://github.com/sgl-project/sglang/pull/6652
- fix communicator for non-dp lm head by @ch-wan in https://github.com/sgl-project/sglang/pull/6662
- Support EAGLE draft extend CUDA graph by @ispobock in https://github.com/sgl-project/sglang/pull/6606
- DeepSeek: enable none block-quant FP8 quantizations by @HaiShaw in https://github.com/sgl-project/sglang/pull/6638
- Fix OOM when updating expert locations by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6660
- Speed up expert location update by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6661
- Revert "fix communicator for non-dp lm head (#6662)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6677
- [PD] Make bootstrap code common between NIXL and Mooncake by @trevor-m in https://github.com/sgl-project/sglang/pull/6473
- [CI] update verlengine ci to 4-gpu test by @ocss884 in https://github.com/sgl-project/sglang/pull/6007
- Fix DeepEP error in Qwen 3 MoE models by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6673
- Support gathering expert distribution details by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6665
- Disable compiling arch below sm_90 in aarch64 by default by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/6380
- fix(tool call): Fix tool_index in PythonicDetector and issues with mixed output in non-streaming by @CatherineSue in https://github.com/sgl-project/sglang/pull/6678
- Add batch test for draft extend by @ispobock in https://github.com/sgl-project/sglang/pull/6672
- feat: Add warnings for invalid tool_choice and UTs by @CatherineSue in https://github.com/sgl-project/sglang/pull/6582
- Update amd docker and nightly models. by @saienduri in https://github.com/sgl-project/sglang/pull/6687
- Refine pre_reorder_triton_kernel slightly to improve performance by @yuan-luo in https://github.com/sgl-project/sglang/pull/6627
- fix log_info_on_rank0 error when run benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/6260
- fix(deepseekv3): Fix DeepSeekV3Detector tool_index assignment and multi-tool call streaming support by @CatherineSue in https://github.com/sgl-project/sglang/pull/6655
- [Bugfix] Fix missing abort finish reason for PD with ChatCompletion by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6693
- [CI] Fix flaky pp single node test by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6689
- [PD] Abort unbootstrapped prefill requests through timeout by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6685
- [PD Perf] replace Queue to FastQueue by @whybeyoung in https://github.com/sgl-project/sglang/pull/6649
- [Bugfix] Fix slice operation when chunk size mismatch by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6697
- [Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6703
- [CI] Fix setup of disaggregation with different tp by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6706
- [PD] Remove Unnecessary Exception Handling for FastQueue.get() by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/6712
- Fuse routed_scaling_factor in DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6710
- Overlap two kernels in DeepSeek with communication by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6711
- Minor refactor two-batch overlap by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6682
- Speed up when having padding tokens two-batch overlap by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6668
- [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/6479
- Fix LoRA bench by @Edenzzzz in https://github.com/sgl-project/sglang/pull/6719
- Fix PP for Qwen3 MoE by @jinyouzhi in https://github.com/sgl-project/sglang/pull/6709
- [feat] triton kernel for get_last_loc by @Alcanderian in https://github.com/sgl-project/sglang/pull/6676
- [fix] more mem for draft_extend cuda_graph by @Alcanderian in https://github.com/sgl-project/sglang/pull/6726
- [PD] bug fix: Update status if nixl receiver send a dummy req. by @thesues in https://github.com/sgl-project/sglang/pull/6720
- Tune memory arguments on B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6718
- Add DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in https://github.com/sgl-project/sglang/pull/6725
- refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor `parse_streaming_increment` by @CatherineSue in https://github.com/sgl-project/sglang/pull/6715
- Add draft extend CUDA graph for Triton backend by @ispobock in https://github.com/sgl-project/sglang/pull/6705
- refactor apply_w8a8_block_fp8_linear in fp by @ChangyiYang in https://github.com/sgl-project/sglang/pull/6545
- [PD] Support completion endpoint by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6729
- Init PD Rust LB (PO2) by @hnyls2002 in https://github.com/sgl-project/sglang/pull/6437
- Super tiny enable sole usage of expert distribution metrics and update doc by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6680
- Support picking variants of EPLB algorithms by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6728
- Support tuning DeepEP configs by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6742
- [test] add ut and bm for get_last_loc by @Alcanderian in https://github.com/sgl-project/sglang/pull/6746
- Fix mem_fraction_static for AMD CI by @Fridge003 in https://github.com/sgl-project/sglang/pull/6748
- [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight by @zhuzilin in https://github.com/sgl-project/sglang/pull/6265
- Improve EPLB logical to physical dispatch map by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6727
- Update DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in https://github.com/sgl-project/sglang/pull/6765
- [PD] Optimize time out logic and add env var doc for mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6761
- Fix aiohttp 'Chunk too big' in bench_serving by @guoyuhong in https://github.com/sgl-project/sglang/pull/6737
- Support sliding window in triton backend by @NorthmanPKU in https://github.com/sgl-project/sglang/pull/6509
- Fix shared experts fusion error by @lambert0312 in https://github.com/sgl-project/sglang/pull/6289
- Fix one bug in the grouped-gemm triton kernel by @ch-wan in https://github.com/sgl-project/sglang/pull/6772
- update llama4 chat template and pythonic parser by @upfixer in https://github.com/sgl-project/sglang/pull/6679
- feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream by @CatherineSue in https://github.com/sgl-project/sglang/pull/6784
- Support token-level quantization for EP MoE by @ch-wan in https://github.com/sgl-project/sglang/pull/6782
- Temporarily lower mmlu threshold for triton sliding window backend by @NorthmanPKU in https://github.com/sgl-project/sglang/pull/6785
- ci: relax test_function_call_required by @CatherineSue in https://github.com/sgl-project/sglang/pull/6786
- Add intel_amx backend for Radix Attention for CPU by @yanbing-j in https://github.com/sgl-project/sglang/pull/6408
- Fix incorrect LoRA weight loading for fused gate_up_proj by @lifuhuang in https://github.com/sgl-project/sglang/pull/6734
- fix(PD-disaggregation): Can not get local ip by @storyicon in https://github.com/sgl-project/sglang/pull/6792
- [FIX] mmmu bench serving result display error (#6525) by @Arist12 in https://github.com/sgl-project/sglang/pull/6791
- Bump torch to 2.7.0 by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/6788
- chore: bump sgl-kernel v0.1.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/6794
- Improve profiler and integrate profiler in bench_one_batch_server by @merrymercy in https://github.com/sgl-project/sglang/pull/6787
- chore: upgrade sgl-kernel v0.1.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/6795
- [Minor] Always append newline after image token when parsing chat message by @lifuhuang in https://github.com/sgl-project/sglang/pull/6797
- Update CI tests for Llama4 models by @ravi03071991 in https://github.com/sgl-project/sglang/pull/6421
- [Feat] Enable PDL automatically on Hopper architecture by @PopSoda2002 in https://github.com/sgl-project/sglang/pull/5981
- chore: update blackwell docker by @zhyncs in https://github.com/sgl-project/sglang/pull/6800
- misc: cache is_hopper_arch by @Edenzzzz in https://github.com/sgl-project/sglang/pull/6799
- Remove contiguous before Flashinfer groupwise fp8 gemm by @Fridge003 in https://github.com/sgl-project/sglang/pull/6804
- Correctly abort the failed grammar requests & Improve the handling of abort by @merrymercy in https://github.com/sgl-project/sglang/pull/6803
- [EP] Add cuda kernel for moe_ep_pre_reorder by @yuan-luo in https://github.com/sgl-project/sglang/pull/6699
- Add draft extend CUDA graph for flashinfer backend by @ispobock in https://github.com/sgl-project/sglang/pull/6805
- Refactor CustomOp to avoid confusing bugs by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5382
- Tiny log prefill time by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6780
- Tiny fix EPLB assertion about rebalancing period and recorder window size by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6813
- Add simple utility to dump tensors for debugging by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6815
- Fix profiles do not have consistent names by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6811
- Speed up rebalancing when using non-static dispatch algorithms by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6812
- [1/2] Add Kernel support for Cutlass based Fused FP4 MoE by @pavanimajety in https://github.com/sgl-project/sglang/pull/6093
- [Router] Fix k8s Service Discovery by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/6766
- Add CPU optimized kernels for topk and rope fusions by @jianan-gu in https://github.com/sgl-project/sglang/pull/6456
- fix new_page_count_next_decode by @pansicheng in https://github.com/sgl-project/sglang/pull/6671
- Fix wrong weight reference in dynamic EPLB by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6818
- Minor add metrics to expert location updater by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6816
- [Refactor] Rename `n_share_experts_fusion` as `num_fused_shared_experts` by @ch-wan in https://github.com/sgl-project/sglang/pull/6735
- [FEAT] Add transformers backend support by @SunMarc in https://github.com/sgl-project/sglang/pull/5929
- [fix] recover auto-dispatch for rmsnorm and rope by @Alcanderian in https://github.com/sgl-project/sglang/pull/6745
- fix ep_moe_reorder kernel bugs by @BBuf in https://github.com/sgl-project/sglang/pull/6858
- [Refactor] Multimodal data processing for VLM by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6659
- Decoder-only Scoring API by @chanh in https://github.com/sgl-project/sglang/pull/6460
- feat: add dp-rank to KV events by @ishandhanani in https://github.com/sgl-project/sglang/pull/6852
- Set `num_fused_shared_experts` as `num_shared_experts` when shared_experts fusion is not disabled by @ch-wan in https://github.com/sgl-project/sglang/pull/6736
- Fix one missing arg in DeepEP by @ch-wan in https://github.com/sgl-project/sglang/pull/6878
- Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug. by @lifuhuang in https://github.com/sgl-project/sglang/pull/6861
- support 1 shot allreduce in 1-node and 2-node using mscclpp by @zyksir in https://github.com/sgl-project/sglang/pull/6277
- Fix Qwen3MoE missing token padding optimization by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6820
- Tiny update error hints by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6846
- Support layerwise rebalancing experts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6851
- Tiny allow profiler API to auto create directory by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6865
- Support Blackwell DeepEP docker images by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6868
- [EP] Add cuda kernel for moe_ep_post_reorder by @yuan-luo in https://github.com/sgl-project/sglang/pull/6837
- Fix OpenAI Client error with single request via batch api by @ravi03071991 in https://github.com/sgl-project/sglang/pull/6170
- [PD] Fix potential perf spike caused by tracker gc and optimize doc by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6764
- Use deepgemm instead of triton for fused_qkv_a_proj_with_mqa by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6890
- [CUTLASS-FP4-MOE] Introduce CutlassMoEParams class for easy initialization of Cutlass Grouped Gems Metadata by @pavanimajety in https://github.com/sgl-project/sglang/pull/6887
- bugfix(OAI): Fix image_data processing for jinja chat templates by @CatherineSue in https://github.com/sgl-project/sglang/pull/6877
- [CPU] enable CI for PRs, add Dockerfile and auto build task by @ZailiWang in https://github.com/sgl-project/sglang/pull/6458
- AITER backend extension and workload optimizations by @HaiShaw in https://github.com/sgl-project/sglang/pull/6838
- [Feature] Support Flashinfer fmha on Blackwell by @NorthmanPKU in https://github.com/sgl-project/sglang/pull/6930
- Fix a bug in abort & Improve docstrings for abort by @merrymercy in https://github.com/sgl-project/sglang/pull/6931
- Tiny support customize DeepEP max dispatch tokens per rank by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6934
- Sync the changes on cuda graph runners by @merrymercy in https://github.com/sgl-project/sglang/pull/6932
- [PD] Optimize transfer queue forward logic for dummy rank by @ShangmingCai in https://github.com/sgl-project/sglang/pull/6922
- [Refactor] image data process in bench_serving by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/6879
- [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. by @MiterV1 in https://github.com/sgl-project/sglang/pull/6767
- Add triton fused moe kernel config for E=257 on B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6939
- [sgl-kernel] update deepgemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/6942
- chore: bump sgl-kernel v0.1.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/6943
- Minor compile fused topk by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6944
- [Bugfix] pipeline parallelism and Eagle Qwen2 by @Swipe4057 in https://github.com/sgl-project/sglang/pull/6910
- Tiny re-introduce profile id logging by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6912
- Add triton version as a fused_moe_triton config search key to avoid performance decrease in different Triton version by @BBuf in https://github.com/sgl-project/sglang/pull/5955
- reduce torch.zeros overhead in moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/6369
- chore: upgrade sgl-kernel v0.1.6 by @Alcanderian in https://github.com/sgl-project/sglang/pull/6945
- add fbgemm moe grouped gemm kernel benchmark by @BBuf in https://github.com/sgl-project/sglang/pull/6924
- [Docker] Add docker file for SGL Router by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/6915
- Disabling mixed chunked prefill when eagle is enabled by @Swipe4057 in https://github.com/sgl-project/sglang/pull/6874
- Add canary for EPLB rebalancing by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6895
- Refactor global_server_args_dict by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6866
- Fuse routed scaling factor in topk_reduce kernel by @BBuf in https://github.com/sgl-project/sglang/pull/6220
- Update server timeout time in AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6953
- [misc] add is_cpu() by @Alcanderian in https://github.com/sgl-project/sglang/pull/6950
- Add H20 fused MoE kernel tuning configs for DeepSeek-R1/V3 by @Xu-Wenqing in https://github.com/sgl-project/sglang/pull/6885
- Add a CUDA kernel for fusing mapping and weighted sum for MoE. by @elfiegg in https://github.com/sgl-project/sglang/pull/6916
- chore: bump sgl-kernel v0.1.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6955
- chore: upgrade sgl-kernel v0.1.6.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/6957
- [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model by @pavanimajety in https://github.com/sgl-project/sglang/pull/6853
- Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" by @zhyncs in https://github.com/sgl-project/sglang/pull/6968
- [AMD] Add more tests to per-commit-amd by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/6926
- chore: bump sgl-kernel v0.1.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/6963
- Slightly improve the sampler to skip unnecessary steps by @merrymercy in https://github.com/sgl-project/sglang/pull/6956
- rebase h20 fused_moe config by @BBuf in https://github.com/sgl-project/sglang/pull/6966
- Fix CI and triton moe Configs by @merrymercy in https://github.com/sgl-project/sglang/pull/6974
- Remove unnecessary kernels of num_token_non_padded by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6965
- Extend cuda graph capture bs for B200 by @Fridge003 in https://github.com/sgl-project/sglang/pull/6937
- Fuse routed scaling factor in deepseek by @BBuf in https://github.com/sgl-project/sglang/pull/6970
- Sync cuda graph runners by @merrymercy in https://github.com/sgl-project/sglang/pull/6976
- Fix draft extend ut stability with flush cache by @ispobock in https://github.com/sgl-project/sglang/pull/6979
- Fix triton sliding window test case by @merrymercy in https://github.com/sgl-project/sglang/pull/6981
- Fix OOM caused by expert distribution dumping by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6967
- Minor: remove one kernel for DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6977
- [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 by @Alcanderian in https://github.com/sgl-project/sglang/pull/6929
- Enable more unit tests for AMD CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6983
- Use torch.compile to fuse flash attention decode metadata preparation by @merrymercy in https://github.com/sgl-project/sglang/pull/6973
- Speed up set_lora_info by eliminating unnecessary H2D transfers by @lifuhuang in https://github.com/sgl-project/sglang/pull/6960
- Support Qwen3 embedding by @Titan-p in https://github.com/sgl-project/sglang/pull/6990
- Fix torch profiler bugs for bench_offline_throughput.py by @PanJason in https://github.com/sgl-project/sglang/pull/6557
- chore: upgrade flashinfer v0.2.6.post1 jit by @zhyncs in https://github.com/sgl-project/sglang/pull/6958
- cleanup tmp dir by @zhyncs in https://github.com/sgl-project/sglang/pull/7007
- chore: update pr test xeon by @zhyncs in https://github.com/sgl-project/sglang/pull/7008
- Fix cutlass MLA getting almost zero accuracy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6998
- Update amd nightly models CI. by @saienduri in https://github.com/sgl-project/sglang/pull/6992
- feat: add direct routing strategy to DP worker by @ishandhanani in https://github.com/sgl-project/sglang/pull/6884
- Fall back to lower triton version for missing fused moe configs by @Fridge003 in https://github.com/sgl-project/sglang/pull/7013
- Fix torchvision version for Blackwell by @Edenzzzz in https://github.com/sgl-project/sglang/pull/7015
- Simplify prepare_extend_after_decode by @merrymercy in https://github.com/sgl-project/sglang/pull/6987
- Migrate to assertEqual by @emmanuel-ferdman in https://github.com/sgl-project/sglang/pull/6741
- Fix torch version in blackwell dockerfile by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/7017
- chore: update pr test xeon by @zhyncs in https://github.com/sgl-project/sglang/pull/7018
- Update default settings for blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/7023
- Support both approximate and exact expert distribution collection by @fzyzcjy in https://github.com/sgl-project/sglang/pull/6964
- Add decode req pool by @ByronHsu in https://github.com/sgl-project/sglang/pull/6980
- [CI] Add CI workflow for sgl-router docker build by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/7027
- Fix fused_moe triton configs by @yudian0504 in https://github.com/sgl-project/sglang/pull/7029
- CPU: map changes from developing branch in sgl-kernel by @yanbing-j in https://github.com/sgl-project/sglang/pull/6833
- chore: bump v0.4.7 by @zhyncs in https://github.com/sgl-project/sglang/pull/7038
New Contributors
- @slin1237 made their first contribution in https://github.com/sgl-project/sglang/pull/5741
- @GeLee-Q made their first contribution in https://github.com/sgl-project/sglang/pull/5850
- @xutianyi1999 made their first contribution in https://github.com/sgl-project/sglang/pull/5823
- @pengcuo made their first contribution in https://github.com/sgl-project/sglang/pull/5875
- @johnnynunez made their first contribution in https://github.com/sgl-project/sglang/pull/5746
- @zhjunqin made their first contribution in https://github.com/sgl-project/sglang/pull/5903
- @xiaomin-D made their first contribution in https://github.com/sgl-project/sglang/pull/5350
- @lifuhuang made their first contribution in https://github.com/sgl-project/sglang/pull/5969
- @botieking98 made their first contribution in https://github.com/sgl-project/sglang/pull/3853
- @ishandhanani made their first contribution in https://github.com/sgl-project/sglang/pull/6075
- @zminglei made their first contribution in https://github.com/sgl-project/sglang/pull/6077
- @Othame made their first contribution in https://github.com/sgl-project/sglang/pull/5764
- @xu-yfei made their first contribution in https://github.com/sgl-project/sglang/pull/5822
- @Simon-Li made their first contribution in https://github.com/sgl-project/sglang/pull/6016
- @shimizust made their first contribution in https://github.com/sgl-project/sglang/pull/4701
- @applesaucethebun made their first contribution in https://github.com/sgl-project/sglang/pull/6179
- @emmanuel-ferdman made their first contribution in https://github.com/sgl-project/sglang/pull/6180
- @KivenChen made their first contribution in https://github.com/sgl-project/sglang/pull/5084
- @blzheng made their first contribution in https://github.com/sgl-project/sglang/pull/6115
- @zhuzilin made their first contribution in https://github.com/sgl-project/sglang/pull/5914
- @quinnrong94 made their first contribution in https://github.com/sgl-project/sglang/pull/6109
- @yiliu30 made their first contribution in https://github.com/sgl-project/sglang/pull/6317
- @ysulsky made their first contribution in https://github.com/sgl-project/sglang/pull/6136
- @doujiang24 made their first contribution in https://github.com/sgl-project/sglang/pull/6392
- @wangxiyu191 made their first contribution in https://github.com/sgl-project/sglang/pull/6323
- @yanbing-j made their first contribution in https://github.com/sgl-project/sglang/pull/6405
- @guapisolo made their first contribution in https://github.com/sgl-project/sglang/pull/5339
- @MiterV1 made their first contribution in https://github.com/sgl-project/sglang/pull/6524
- @upfixer made their first contribution in https://github.com/sgl-project/sglang/pull/6537
- @habaohaba made their first contribution in https://github.com/sgl-project/sglang/pull/6450
- @jinyouzhi made their first contribution in https://github.com/sgl-project/sglang/pull/6709
- @thesues made their first contribution in https://github.com/sgl-project/sglang/pull/6720
- @ChangyiYang made their first contribution in https://github.com/sgl-project/sglang/pull/6545
- @NorthmanPKU made their first contribution in https://github.com/sgl-project/sglang/pull/6509
- @storyicon made their first contribution in https://github.com/sgl-project/sglang/pull/6792
- @Arist12 made their first contribution in https://github.com/sgl-project/sglang/pull/6791
- @pavanimajety made their first contribution in https://github.com/sgl-project/sglang/pull/6093
- @YouNeedCryDear made their first contribution in https://github.com/sgl-project/sglang/pull/6766
- @jianan-gu made their first contribution in https://github.com/sgl-project/sglang/pull/6456
- @pansicheng made their first contribution in https://github.com/sgl-project/sglang/pull/6671
- @SunMarc made their first contribution in https://github.com/sgl-project/sglang/pull/5929
- @chanh made their first contribution in https://github.com/sgl-project/sglang/pull/6460
- @zyksir made their first contribution in https://github.com/sgl-project/sglang/pull/6277
- @ZailiWang made their first contribution in https://github.com/sgl-project/sglang/pull/6458
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.6...v0.4.7