Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2025-04-27 | 43.0 kB | |
Release v0.4.6 source code.tar.gz | 2025-04-27 | 3.7 MB | |
Release v0.4.6 source code.zip | 2025-04-27 | 4.5 MB | |
Totals: 3 Items | 8.3 MB | 0 |
Highlights
- Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.). https://github.com/sgl-project/sglang/issues/4709#issuecomment-2817728855
- PD disaggregation with mooncake and NIXL transfer backends [#4880] [#5477] [#4655]
- DeepSeek performance improvements: turn on DeepGemm by default and some kernel fusions. [#5580] [#5628]
- Update torch to 2.6.0. Fix torch.compile cache. [#5417] [#5213]
- Preliminary support for Blackwell [#5303]
Many thanks to the LinkedIn team, Alibaba Cloud, the Mooncake team, the NVIDIA team, the AMD team, the PyTorch team, Ant Group, the Baseten team, the Oracle team, the Meituan team, the iFlytek MaaS team, and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon
- Large scale expert parallelism + PD disaggregation [#4734] [#5524]
- Pipeline Parallelism [#5724]
- MLA Cutlass Backend [#5390]
What's Changed
- [ci] fix llama4 ci error by @BBuf in https://github.com/sgl-project/sglang/pull/5126
- Refactor and Optimize FA3 Code by @hebiao064 in https://github.com/sgl-project/sglang/pull/5090
- Add Llama4 user guide by @ispobock in https://github.com/sgl-project/sglang/pull/5133
- [Misc] Use pytest.mark.skipif in sgl-kernel test by @yinfan98 in https://github.com/sgl-project/sglang/pull/5137
- feat: disable grammar restrictions within reasoning sections by @minleminzui in https://github.com/sgl-project/sglang/pull/4984
- [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method by @yundai424 in https://github.com/sgl-project/sglang/pull/5145
- [AMD] Fix missing per_token_group_quant_fp8 for ROCm by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/5140
- fix multimodal hash feature by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/5083
- Fix run time error in ROCm platform by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/5147
- [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct by @zcnrex in https://github.com/sgl-project/sglang/pull/5103
- Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 by @yubofredwang in https://github.com/sgl-project/sglang/pull/4760
- Use public model for FA3 speculative decode testing by @yubofredwang in https://github.com/sgl-project/sglang/pull/5152
- Add dummy grok test to amd CI. by @saienduri in https://github.com/sgl-project/sglang/pull/5115
- fix empty_cache error in pt_weights_iterator by @dangkai4u in https://github.com/sgl-project/sglang/pull/5151
- Fix torch compile errors by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/5158
- Fix loading KV quantization scale; Enable modelopt kv cache by @yundai424 in https://github.com/sgl-project/sglang/pull/4686
- [PD] Fix unclosed prefill connection warning of mini_lb by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5155
- Add optimized native kernels in sgl-kernel by @mingfeima in https://github.com/sgl-project/sglang/pull/5150
- [PD] Simplify mini LB by @ByronHsu in https://github.com/sgl-project/sglang/pull/4911
- Small improvement of native api docs by @simveit in https://github.com/sgl-project/sglang/pull/5139
- [feat&refactor] Enhance multimodal input support with refactor io_struct by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/4938
- Support 2x8xH100 for Llama 4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5159
- FP4 weight loading and inference (2/2) by @trevor-m in https://github.com/sgl-project/sglang/pull/3972
- Fix multimodal hashing error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5174
- Tiny disable model that does not work by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5175
- [Bugfix] Fix index out of bounds in local attention with large sequences by @CatherineSue in https://github.com/sgl-project/sglang/pull/5173
- [Fix] DeepEP Compatibility with Low Latency by @liz-badada in https://github.com/sgl-project/sglang/pull/5068
- docs: remove the use of Downward API for LWS_WORKER_INDEX by @yankay in https://github.com/sgl-project/sglang/pull/5110
- feat: add DeepGEMM build warning by @zhyncs in https://github.com/sgl-project/sglang/pull/5176
- fix: use DeepEPDispatcher on CUDA by @zhyncs in https://github.com/sgl-project/sglang/pull/5180
- [DeepEP] fix: import buffer error by @ch-wan in https://github.com/sgl-project/sglang/pull/5179
- Let `bench_one_batch` support `enable_dp_attention` by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4058
- [Misc] clean up vllm in sgl-kernel test by @yinfan98 in https://github.com/sgl-project/sglang/pull/5189
- Fix ci test "test_eval_fp8_accuracy" failed by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/5185
- Optimize topk operation in llama4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5128
- Support Llama4 fp8 inference by @HandH1998 in https://github.com/sgl-project/sglang/pull/5194
- [ci] fix ci test fused_moe op by @BBuf in https://github.com/sgl-project/sglang/pull/5102
- model: support mllama4 by @mickqian in https://github.com/sgl-project/sglang/pull/5144
- Rework grok test. by @saienduri in https://github.com/sgl-project/sglang/pull/5171
- sgl-kernel use cutlass latest version for fp8 blockwise gemm by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5207
- Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 by @Muuuchen in https://github.com/sgl-project/sglang/pull/5196
- fix: log warning when disable cuda graph by @zhyncs in https://github.com/sgl-project/sglang/pull/5209
- [metrics] Add in queue metrics by @hebiao064 in https://github.com/sgl-project/sglang/pull/4444
- Fix DeepSeek error when using DeepEP mode by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5190
- reduce moe_align_block_size_kernel small batch mode overhead by @BBuf in https://github.com/sgl-project/sglang/pull/5086
- [PD] Support KV transfer with mooncake by @stmatengss in https://github.com/sgl-project/sglang/pull/4880
- [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool by @stmatengss in https://github.com/sgl-project/sglang/pull/5204
- Update deps for mllama4 by @ispobock in https://github.com/sgl-project/sglang/pull/5215
- Fix deepseek-v3 with torch.compile in PyTorch 2.6. by @zou3519 in https://github.com/sgl-project/sglang/pull/5213
- ROCm sgl-kernel: compatible to later torch by @HaiShaw in https://github.com/sgl-project/sglang/pull/5167
- [Misc] Clean sgl-kernel test by @yinfan98 in https://github.com/sgl-project/sglang/pull/5216
- Update Makefile / build script to avoid installing incompatible torch dependency by @elfiegg in https://github.com/sgl-project/sglang/pull/5245
- Fix torch.compile caching by @zou3519 in https://github.com/sgl-project/sglang/pull/5259
- ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations by @HaiShaw in https://github.com/sgl-project/sglang/pull/5228
- Optimize attention in llama4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5127
- Optimize GPU memory usage in FlashAttentionBackend's strided indexing by @CatherineSue in https://github.com/sgl-project/sglang/pull/5262
- Support `--enable-llama4-multimodal` by @ch-wan in https://github.com/sgl-project/sglang/pull/5254
- [fix] fix mrope positions not picked up by @mickqian in https://github.com/sgl-project/sglang/pull/5265
- doc: nested loop code for offline engine by @minleminzui in https://github.com/sgl-project/sglang/pull/5244
- fix: examples for token_in_token_out_vlm by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5193
- Fix a 404 link in send_request.ipynb by @windsonsea in https://github.com/sgl-project/sglang/pull/5280
- fix: enable fp4 compilation on cu128 by @zhyncs in https://github.com/sgl-project/sglang/pull/5286
- feat: add cu128 identifier for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/5287
- chore: relax the torch version restriction for sgl-kernel compilation by @zhyncs in https://github.com/sgl-project/sglang/pull/5288
- chore: bump sgl-kernel v0.0.8.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5289
- [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout by @GaoYusong in https://github.com/sgl-project/sglang/pull/5292
- [Docs] Supported Model Docs - Major restructuring by @adarshxs in https://github.com/sgl-project/sglang/pull/5290
- fix: update update_wheel_index for cu128 by @zhyncs in https://github.com/sgl-project/sglang/pull/5300
- [Docs] Remove the older supported docs section by @adarshxs in https://github.com/sgl-project/sglang/pull/5301
- remove moe_align_block_size torch.zeros in small batch/expert mode by @BBuf in https://github.com/sgl-project/sglang/pull/5298
- feat: add blackwell Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5302
- feat: add blackwell workflow by @zhyncs in https://github.com/sgl-project/sglang/pull/5303
- fix: use fa3 unit test on hopper only by @zhyncs in https://github.com/sgl-project/sglang/pull/5304
- misc: update blackwell Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5306
- fix: remove cublas_grouped_gemm by @zhyncs in https://github.com/sgl-project/sglang/pull/5307
- fix: update flash attn by @zhyncs in https://github.com/sgl-project/sglang/pull/5308
- fix: use deepgemm only on hopper by @zhyncs in https://github.com/sgl-project/sglang/pull/5310
- [VLM] Adopt fast image processor by default by @mickqian in https://github.com/sgl-project/sglang/pull/5065
- Adjust ci test threshold by @ispobock in https://github.com/sgl-project/sglang/pull/5271
- Blackwell Cutlass MLA kernel by @trevor-m in https://github.com/sgl-project/sglang/pull/5142
- misc: cleanup 3rdparty by @zhyncs in https://github.com/sgl-project/sglang/pull/5311
- update variable naming and comments for rocm by @Lzy17 in https://github.com/sgl-project/sglang/pull/5299
- Fix w8a8_int8 model shared experts fusion load weights error by @lambert0312 in https://github.com/sgl-project/sglang/pull/5120
- Add flash_attn_varlen_func to sgl-kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/5315
- Fix fa3 window size setup by @qingquansong in https://github.com/sgl-project/sglang/pull/5316
- chore: bump sgl-kernel v0.0.8.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5317
- feat: use fa3 mla by default on hopper by @zhyncs in https://github.com/sgl-project/sglang/pull/5210
- Fix: docs/backend/structured_outputs.ipynb by @thyecust in https://github.com/sgl-project/sglang/pull/4884
- Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… by @BBuf in https://github.com/sgl-project/sglang/pull/5321
- refine fused_moe tuning docs by @BBuf in https://github.com/sgl-project/sglang/pull/5294
- Support server based rollout in Verlengine by @yitianlian in https://github.com/sgl-project/sglang/pull/4848
- [Feat] Add sparse attn to sgl-kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/5327
- fix: solve cu118 issue for cutlass mla by @zhyncs in https://github.com/sgl-project/sglang/pull/5331
- chore: bump sgl-kernel v0.0.8.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/5332
- ci: update release node by @zhyncs in https://github.com/sgl-project/sglang/pull/5333
- fix: determine if flashinfer is installed by @zhyncs in https://github.com/sgl-project/sglang/pull/5336
- feat: adapt merge_state by @zhyncs in https://github.com/sgl-project/sglang/pull/5337
- misc: update sagemaker Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5341
- Fix: ensure tensors used in dist.broadcast are created on the correct… by @minleminzui in https://github.com/sgl-project/sglang/pull/5322
- docs: update adoption and sponsorship list with Oracle by @zhyncs in https://github.com/sgl-project/sglang/pull/5343
- chore: upgrade sgl-kernel 0.0.8.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/5342
- Fix typo: infight -> inflight by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5357
- [PD] Add transfer backend abstraction by @ByronHsu in https://github.com/sgl-project/sglang/pull/5328
- fix MLATokenToKVPoolHost get_size_per_token bug by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/5161
- fix [#5322] by @zhyncs in https://github.com/sgl-project/sglang/pull/5359
- feat: update experiment_runner by @zhyncs in https://github.com/sgl-project/sglang/pull/5360
- [DeepEP] Reduce routed scaling overhead by @yuleil in https://github.com/sgl-project/sglang/pull/5277
- Free metadata_buffer_index after transfer finished by @jokerwyt in https://github.com/sgl-project/sglang/pull/5364
- Fix DeepSeek DP Attention + torch compile by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5367
- Support for Qwen2.5-VL Model in bitsandbytes Format by @yhyang201 in https://github.com/sgl-project/sglang/pull/5003
- Fix PD disaggregation bugs by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5326
- [PD Bug] fix MLA get_contiguous_buf_infos error by @whybeyoung in https://github.com/sgl-project/sglang/pull/5384
- [perf] experimental enhance fp8 per-tensor quant by @Alcanderian in https://github.com/sgl-project/sglang/pull/5370
- Apply deepseek cuda rope by @ispobock in https://github.com/sgl-project/sglang/pull/5385
- apply fused moe gate in ds v3/r1 by @BBuf in https://github.com/sgl-project/sglang/pull/5371
- fix: update test config by @zhyncs in https://github.com/sgl-project/sglang/pull/5392
- [Fix] Turn off DeepGEMM by default by @Fridge003 in https://github.com/sgl-project/sglang/pull/5263
- minor clean up of sgl-kernel/CMakeLists.txt by @merrymercy in https://github.com/sgl-project/sglang/pull/5393
- Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @lambert0312 in https://github.com/sgl-project/sglang/pull/5368
- Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/5291
- [fix/misc] remove duplicate row in deepseek v2 model by @yyccli in https://github.com/sgl-project/sglang/pull/5279
- chore: upgrade DeepGEMM by @zhyncs in https://github.com/sgl-project/sglang/pull/5395
- fix: update pr-test-sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/5399
- kernel: support slightly faster merge_state_v2 cuda kernel by @DefTruth in https://github.com/sgl-project/sglang/pull/5381
- chore: bump sgl-kernel 0.0.9 by @zhyncs in https://github.com/sgl-project/sglang/pull/5400
- chore: upgrade sgl-kernel 0.0.9 by @zhyncs in https://github.com/sgl-project/sglang/pull/5401
- Tiny fix DeepseekScalingRotaryEmbedding always use forward_native by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5406
- Fix bench_serving with random-ids by @guoyuhong in https://github.com/sgl-project/sglang/pull/5214
- [misc] fix ci flaky case by @Alcanderian in https://github.com/sgl-project/sglang/pull/5352
- [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP by @Muuuchen in https://github.com/sgl-project/sglang/pull/5412
- Support dynamic connection and TP 16 by @yuan-luo in https://github.com/sgl-project/sglang/pull/5351
- Fix broadcast use cuda device lead to memory capacity unbalanced by @lambert0312 in https://github.com/sgl-project/sglang/pull/5416
- [PD] Fix dynamic port support and MLA buffer for Mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5415
- Distinguish bootstrap key only in decode server by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5422
- [PD] Remove unused bootstrap param and fix port table type by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5423
- [minor] cleanup cmakelists.txt by @merrymercy in https://github.com/sgl-project/sglang/pull/5420
- bugfix: fix merge_state_v2 cuda graph by @DefTruth in https://github.com/sgl-project/sglang/pull/5419
- chore: bump sgl-kernel v0.0.9.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5430
- fix: solve release issue by @zhyncs in https://github.com/sgl-project/sglang/pull/5434
- Blackwell cutlass mla: Add check for bad page size/block num combinations by @trevor-m in https://github.com/sgl-project/sglang/pull/5431
- feat: update model_specific_adjustment by @zhyncs in https://github.com/sgl-project/sglang/pull/5344
- chore: upgrade sgl-kernel 0.0.9.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5436
- Fix ignore_eos parameter when loading a chat template by @CatherineSue in https://github.com/sgl-project/sglang/pull/5264
- add attention backend supporting matrix in the doc by @mRSun15 in https://github.com/sgl-project/sglang/pull/5211
- Support BNB quantization for llama/mllama by @ryang-max in https://github.com/sgl-project/sglang/pull/5038
- [Docs] Update start/install.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5398
- [Minor] Move torch.compile patch to a better place by @merrymercy in https://github.com/sgl-project/sglang/pull/5397
- [Bug fix] need record start time in pd mode by @whybeyoung in https://github.com/sgl-project/sglang/pull/5425
- Support MHA with chunked prefix cache for DeepSeek chunked prefill by @Fridge003 in https://github.com/sgl-project/sglang/pull/5113
- chore: bump v0.4.5.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5445
- Fix several minor issues in PD disaggregation by @ch-wan in https://github.com/sgl-project/sglang/pull/5444
- [doc] Update benchmark_and_profiling.md by @BBuf in https://github.com/sgl-project/sglang/pull/5449
- Update cutlass dependency. by @elfiegg in https://github.com/sgl-project/sglang/pull/5447
- add multi-lora feature in README.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/5463
- Clean up imports by @merrymercy in https://github.com/sgl-project/sglang/pull/5467
- [verl] Modify the update_weights func to align with verl's resharding by @BearBiscuit05 in https://github.com/sgl-project/sglang/pull/5345
- [Model Support] unsloth/Phi-4-mini bnb model by @yyihuang in https://github.com/sgl-project/sglang/pull/4982
- Update attention_backend.md: plural form by @didier-durand in https://github.com/sgl-project/sglang/pull/5489
- Add test for flash_attn_varlen_func kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/5484
- Deprecate disable-mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/5481
- Deprecate enable-flashinfer-mla and enable-flashmla by @Fridge003 in https://github.com/sgl-project/sglang/pull/5480
- Feat/support encoder model (like bert) by @woodx9 in https://github.com/sgl-project/sglang/pull/4887
- Enable local attention during decode by @CatherineSue in https://github.com/sgl-project/sglang/pull/5479
- Refactor DeepSeek decoder layer branches by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5205
- Fix a link in sgl-kernel/README.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5493
- [Bug fix] use correct func path in deepseek by @XucSh in https://github.com/sgl-project/sglang/pull/5496
- Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B by @minleminzui in https://github.com/sgl-project/sglang/pull/5503
- [Feat] Update sgl-kernel flashinfer to latest main version by @yinfan98 in https://github.com/sgl-project/sglang/pull/5500
- Fix: Incorrect parameters passed to forward_batch_generation (#5506) by @u4lr451 in https://github.com/sgl-project/sglang/pull/5511
- Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … by @minleminzui in https://github.com/sgl-project/sglang/pull/5426
- [docs] Fix several consistency issues in sampling_params.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5373
- Configuration qwen2_moe.py - qkv_bias now in transformers by @michaelfeil in https://github.com/sgl-project/sglang/pull/5512
- Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4836
- Sgl kernel fused_moe_gate support n_shared_experts by @BBuf in https://github.com/sgl-project/sglang/pull/5440
- chore: bump sgl-kernel 0.0.9.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5518
- use sglang_per_token_group_quant_fp8 from sgl-kernel instead of triton kernel by @strgrb in https://github.com/sgl-project/sglang/pull/5473
- fix kimi vl running bug after rebase main by @BBuf in https://github.com/sgl-project/sglang/pull/5461
- fix bug of VLLM_AVAILABLE not defined by @liwenju0 in https://github.com/sgl-project/sglang/pull/5497
- Avoid computing lse in Ragged Prefill when there's no prefix. by @Edenzzzz in https://github.com/sgl-project/sglang/pull/5476
- [Model] Adding Qwen3 and Qwen3MoE by @yhyang201 in https://github.com/sgl-project/sglang/pull/4693
- fix util import by @zhyncs in https://github.com/sgl-project/sglang/pull/5542
- Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… by @zhyncs in https://github.com/sgl-project/sglang/pull/5544
- chore: upgrade sgl-kernel 0.0.9.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5540
- Fix DeepGEMM masked cannot be run on groups not being multiple of 4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5340
- Make profiler output file names consistent by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5548
- [PD] Tiny fix timeout error when generate by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5545
- [PD] Fix no cache connect for receiver by @whybeyoung in https://github.com/sgl-project/sglang/pull/5534
- feat: use flashinfer jit package by @zhyncs in https://github.com/sgl-project/sglang/pull/5547
- [PD] Remove the requirement of config file for mooncake backend by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5460
- restruct compressed_tensors_w8a8_fp8 by @BBuf in https://github.com/sgl-project/sglang/pull/5475
- simplify the control logic for using shared experts fusion by @BBuf in https://github.com/sgl-project/sglang/pull/5504
- Remove one kernel in per_tensor_quant_mla_fp8 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5549
- Fix sampler nan check when calling top_k_top_p_sampling_from_probs by @yubofredwang in https://github.com/sgl-project/sglang/pull/5546
- [PD] Support page size > 1 by @ByronHsu in https://github.com/sgl-project/sglang/pull/5561
- fix hicache write back by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5543
- Minor update for ROCm variable style by @Lzy17 in https://github.com/sgl-project/sglang/pull/5562
- Fix bench_one_batch producing unnatural results for expert parallel by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5149
- [perf] introduce deep gemm group_gemm_masked as bmm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5432
- [PD] Fix DeepSeek cannot be run on latest master by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5568
- Fix BumpAllocator error when no input_ids by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5564
- enable DeepSeek V3 shared_experts_fusion in sm90 by @BBuf in https://github.com/sgl-project/sglang/pull/5571
- [Fix] fix outlines and xgrammar by @Alcanderian in https://github.com/sgl-project/sglang/pull/4947
- [Doc]Add instruction for profiling with bench_one_batch by @Fridge003 in https://github.com/sgl-project/sglang/pull/5581
- Release v0.4.5.post2 by @merrymercy in https://github.com/sgl-project/sglang/pull/5582
- Fix bench_serving fail when zero warmup requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5574
- Fix DeepEP cannot run on latest master by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5567
- Fix torch memory saver not enabled in DP scenario by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5560
- Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5559
- Add document for LoRA serving by @Fridge003 in https://github.com/sgl-project/sglang/pull/5521
- Tiny improve error message by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5526
- [PD] Fix server crash when using batch requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5531
- [Feat] upgrade pytorch2.6 by @sleepcoo in https://github.com/sgl-project/sglang/pull/5417
- Fix enable chunked prefill for Llama4 by @tarinkk in https://github.com/sgl-project/sglang/pull/5575
- fix: use fa3 for gemma2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5586
- Fix ChatCompletionMessageGenericParam to allow for None content by @Amadeus-Winarto in https://github.com/sgl-project/sglang/pull/5452
- [PD] Fix large page size + chunk prefill by @ByronHsu in https://github.com/sgl-project/sglang/pull/5588
- Add test config yamls for Deepseek v3 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5433
- [Feature] Prefill assistant response - add continue_final_message parameter by @adarshxs in https://github.com/sgl-project/sglang/pull/4226
- add function call parser for DeepSeek V3 by @finger92 in https://github.com/sgl-project/sglang/pull/5224
- smaller and non gated models for docs by @simveit in https://github.com/sgl-project/sglang/pull/5378
- Feat: Implement JSON Mode (response_format.type="json_object") by @kyle-pena-kuzco in https://github.com/sgl-project/sglang/pull/4733
- check marlin format before attempting conversion by @qeternity in https://github.com/sgl-project/sglang/pull/4675
- compressed_tensors: port w8a16 fp8 from vllm by @vhain in https://github.com/sgl-project/sglang/pull/4852
- Fix one more issue reported by torchfix by @b8zhong in https://github.com/sgl-project/sglang/pull/4859
- Add sanity check for max_running_requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5016
- Correct grafana heatmap. by @mac0ne in https://github.com/sgl-project/sglang/pull/5019
- Perform Batch Tokenization. by @sundar24295s in https://github.com/sgl-project/sglang/pull/5141
- Speedup shared expert weight construction by avoid cloning by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5188
- Tiny add Engine.flush_cache API by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5241
- [misc] remove is_cuda_available by @Alcanderian in https://github.com/sgl-project/sglang/pull/5319
- Fix flush cache by @merrymercy in https://github.com/sgl-project/sglang/pull/5590
- Add Speculative Decoding Eagle3 topk > 1 by @qingquansong in https://github.com/sgl-project/sglang/pull/5318
- upstream hicache fixes by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5570
- Tiny add warning when cannot recognize bool env var by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5348
- Modify metrics service endpoint by @lambert0312 in https://github.com/sgl-project/sglang/pull/3443
- Update protocol.py to fix [#4589] by @relic-yuexi in https://github.com/sgl-project/sglang/pull/4590
- [Feat.] Enable grafana to show metrics by @PopSoda2002 in https://github.com/sgl-project/sglang/pull/4718
- [Fix] Enhance DP Attention for IPv6 Compatibility by @Lucius-THU in https://github.com/sgl-project/sglang/pull/4937
- Support o1 model on Azure by @ChuyueSun in https://github.com/sgl-project/sglang/pull/4980
- Tiny remove duplicated code by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5021
- Tiny update error hint by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5037
- Support PD bootstrap fields on /v1/chat/completions endpoint by @jokerwyt in https://github.com/sgl-project/sglang/pull/5488
- [PD] Fix generate endpoint of min_lb for PD by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5598
- [PD] Fix edge case and simplify large page size + chunked prefill by @ByronHsu in https://github.com/sgl-project/sglang/pull/5589
- [PD] Add NIXL transfer backend by @trevor-m in https://github.com/sgl-project/sglang/pull/5477
- [PD] Support decode overlap schedule by @ByronHsu in https://github.com/sgl-project/sglang/pull/5608
- [PD] Support prefill overlap + Ensure no race condition by @ByronHsu in https://github.com/sgl-project/sglang/pull/5609
- Enhance GPU memory settings by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5604
- [feature] enable pre compile jit deep_gemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5580
- Clean up mem settings by @merrymercy in https://github.com/sgl-project/sglang/pull/5610
- Support aiter RMSNorm in AMD by @michael-amd in https://github.com/sgl-project/sglang/pull/5510
- chore: bump v0.4.5.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/5611
- Remove extra copy in deepseek forward absorb by @ispobock in https://github.com/sgl-project/sglang/pull/5578
- [Doc] Fix a 404 link to llama-405b by @windsonsea in https://github.com/sgl-project/sglang/pull/5615
- [fix] force use deepgemm in compile_deep_gemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5618
- [fix] fix compile_deep_gemm missing kv_b_proj by @Alcanderian in https://github.com/sgl-project/sglang/pull/5620
- fix: gemma 3 not use softcap by @zhyncs in https://github.com/sgl-project/sglang/pull/5622
- Fix FA3 DeepSeek prefill performance regression by @Alcanderian in https://github.com/sgl-project/sglang/pull/5624
- [NFC] Remove duplicate `compressed-tensors` by @c8ef in https://github.com/sgl-project/sglang/pull/5640
- Fix shared experts fusion error without quantization by @lambert0312 in https://github.com/sgl-project/sglang/pull/5632
- [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 by @saltyfish66 in https://github.com/sgl-project/sglang/pull/5641
- fix flashmla bug by @sleepcoo in https://github.com/sgl-project/sglang/pull/5272
- [fix] reduce dp capture bs by @Alcanderian in https://github.com/sgl-project/sglang/pull/5634
- Remove q concat in FA3 backend for DeepSeek decode by @ispobock in https://github.com/sgl-project/sglang/pull/5638
- Revert "Support aiter RMSNorm in AMD" by @HaiShaw in https://github.com/sgl-project/sglang/pull/5646
- fix: update bench_speculative by @zhyncs in https://github.com/sgl-project/sglang/pull/5649
- Turn on DeepGemm By Default and Update Doc by @Fridge003 in https://github.com/sgl-project/sglang/pull/5628
- Fuse q_a_proj and kv_a_proj for DeepSeek models by @Fridge003 in https://github.com/sgl-project/sglang/pull/5619
- Remove unnecessary `torch.full` in DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5601
- [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell by @elfiegg in https://github.com/sgl-project/sglang/pull/5281
- fix sgl-kernel unit tests by @zhyncs in https://github.com/sgl-project/sglang/pull/5666
- fix awq_dequantize import by @zhyncs in https://github.com/sgl-project/sglang/pull/5669
- Integrating PD disaggregation with DP attention and DeepEP by @ch-wan in https://github.com/sgl-project/sglang/pull/5435
- fix gemma3 unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/5670
- fix torchvision::nms not exist by @zhyncs in https://github.com/sgl-project/sglang/pull/5671
- [PD] Add support for dp attention with mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5530
- tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py by @merrymercy in https://github.com/sgl-project/sglang/pull/5677
- [Doc] Fix two 404 links caused by sglang typo by @windsonsea in https://github.com/sgl-project/sglang/pull/5667
- fix: update truss bench_serving by @zhyncs in https://github.com/sgl-project/sglang/pull/5683
- fix: only compile ApplyTokenBitmaskInplace cu124+ by @zhyncs in https://github.com/sgl-project/sglang/pull/5686
- chore: bump sgl-kernel 0.1.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/5688
- vlm: enable radix cache for qwen-vl models by @mickqian in https://github.com/sgl-project/sglang/pull/5349
- [BugFix] Fix combination of MTP and `--n-share-experts-fusion` with R1 by @guoyuhong in https://github.com/sgl-project/sglang/pull/5707
- Fix weight loading bug for Deepseek v3+nextn by @Fridge003 in https://github.com/sgl-project/sglang/pull/5684
- Add example to use sgl engine with fastapi by @ravi03071991 in https://github.com/sgl-project/sglang/pull/5648
- [Doc] Fix a link to Weilin Zhao by @windsonsea in https://github.com/sgl-project/sglang/pull/5706
- Add MMMU benchmark results by @ravi03071991 in https://github.com/sgl-project/sglang/pull/4491
- [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) by @b8zhong in https://github.com/sgl-project/sglang/pull/5078
- [PD] Better logs by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5715
- [PD] Add kvargs table and thread pool for kvcache sender of mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5738
- [PD]: Support Multi Prefill in one node by @hcyz33 in https://github.com/sgl-project/sglang/pull/5704
- Fix: deepseek forward absorb by @michael-amd in https://github.com/sgl-project/sglang/pull/5723
- Pin torch audio to 2.6.0 by @merrymercy in https://github.com/sgl-project/sglang/pull/5750
- Revert "[Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct)" by @merrymercy in https://github.com/sgl-project/sglang/pull/5754
- Disable flaky eagle tests by @merrymercy in https://github.com/sgl-project/sglang/pull/5753
- update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. by @BBuf in https://github.com/sgl-project/sglang/pull/5740
- [Docs] Update runtime/engine/readme.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5737
- Reorder loop in shared expert weight loading by @ispobock in https://github.com/sgl-project/sglang/pull/5719
- fix: fix one more bug from merging mm_inputs by @mickqian in https://github.com/sgl-project/sglang/pull/5718
- [Fix]: support deepseek-vl2-tiny model by @bppps in https://github.com/sgl-project/sglang/pull/5552
- Bugfix for minicpmo vision test by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5760
- [Minor] fix documentations by @merrymercy in https://github.com/sgl-project/sglang/pull/5756
- Add an assertion to enhance the robustness of the operator by @liwenju0 in https://github.com/sgl-project/sglang/pull/5736
- fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 by @lkm2835 in https://github.com/sgl-project/sglang/pull/5733
- Use device_id in dist init to reduce NCCL communicator warmup & creation overhead by @Edenzzzz in https://github.com/sgl-project/sglang/pull/5728
- [fix] fix potential bumpy throughput with deepgemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5722
- Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups by @guoyuhong in https://github.com/sgl-project/sglang/pull/5720
- perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling by @saltyfish66 in https://github.com/sgl-project/sglang/pull/5716
- we fix the non existent access of `decrypted_config_file` by @vincentzed in https://github.com/sgl-project/sglang/pull/5685
- CI: rewrite test_vision_chunked_prefill to speedup by @mickqian in https://github.com/sgl-project/sglang/pull/5682
- Fuse MLA set kv cache kernel by @ispobock in https://github.com/sgl-project/sglang/pull/5748
- Update amd docker image to `sglang:v0.4.5.post3-rocm630`. by @saienduri in https://github.com/sgl-project/sglang/pull/5697
- [feature] support for roberta embedding models by @DavidBao03 in https://github.com/sgl-project/sglang/pull/5730
- [fix] fix bench_one_batch_server by @Alcanderian in https://github.com/sgl-project/sglang/pull/5607
- support for the DeepSeek model by enabling streaming response parsing by @Frank-Jie in https://github.com/sgl-project/sglang/pull/5592
- fix: Use `is not None` instead of `!= None` for None checks. by @vincentzed in https://github.com/sgl-project/sglang/pull/5687
- Add Llama 4 to FA3 test by @hebiao064 in https://github.com/sgl-project/sglang/pull/5509
- [misc] more decode step log for batch_one_batch by @Alcanderian in https://github.com/sgl-project/sglang/pull/5565
- Handle JSONDecodeError while processing request data by @yan97ao in https://github.com/sgl-project/sglang/pull/5599
- fix(srt): check if sample_indices is not None before usage. by @aoshen524 in https://github.com/sgl-project/sglang/pull/5633
- update llguidance to 0.7.11; adds StructTag by @mmoskal in https://github.com/sgl-project/sglang/pull/4870
- Use sgl-kernel sgl_per_token_group_quant_int8 by @lambert0312 in https://github.com/sgl-project/sglang/pull/4971
- Add memory_saver check by @kebe7jun in https://github.com/sgl-project/sglang/pull/4986
- add switch to disable open api doc by @congcongke in https://github.com/sgl-project/sglang/pull/3744
- Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" by @merrymercy in https://github.com/sgl-project/sglang/pull/5772
- Fix eagle test case by @merrymercy in https://github.com/sgl-project/sglang/pull/5776
- Split local attention test from fa3 test by @Fridge003 in https://github.com/sgl-project/sglang/pull/5774
- Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" by @merrymercy in https://github.com/sgl-project/sglang/pull/5777
- Simplify FA3 tests by @merrymercy in https://github.com/sgl-project/sglang/pull/5779
- Revert "[fix] fix bench_one_batch_server" by @merrymercy in https://github.com/sgl-project/sglang/pull/5785
- Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" by @merrymercy in https://github.com/sgl-project/sglang/pull/5786
- [CI] Tune threshold by @merrymercy in https://github.com/sgl-project/sglang/pull/5787
- [CI] fix port conflicts by @merrymercy in https://github.com/sgl-project/sglang/pull/5789
- [CI] Fix ci tests by @merrymercy in https://github.com/sgl-project/sglang/pull/5769
- [PD]Reduce kv transfer threads by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5791
- [CI] Fix test case by @merrymercy in https://github.com/sgl-project/sglang/pull/5790
- Add 8-GPU Test for Deepseek-V3 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5691
- Release v0.4.6 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5795
New Contributors
- @huangtingwei9988 made their first contribution in https://github.com/sgl-project/sglang/pull/5083
- @yubofredwang made their first contribution in https://github.com/sgl-project/sglang/pull/4760
- @dangkai4u made their first contribution in https://github.com/sgl-project/sglang/pull/5151
- @ShangmingCai made their first contribution in https://github.com/sgl-project/sglang/pull/5155
- @mingfeima made their first contribution in https://github.com/sgl-project/sglang/pull/5150
- @yankay made their first contribution in https://github.com/sgl-project/sglang/pull/5110
- @Muuuchen made their first contribution in https://github.com/sgl-project/sglang/pull/5196
- @stmatengss made their first contribution in https://github.com/sgl-project/sglang/pull/4880
- @zou3519 made their first contribution in https://github.com/sgl-project/sglang/pull/5213
- @GaoYusong made their first contribution in https://github.com/sgl-project/sglang/pull/5292
- @Lzy17 made their first contribution in https://github.com/sgl-project/sglang/pull/5299
- @thyecust made their first contribution in https://github.com/sgl-project/sglang/pull/4884
- @yitianlian made their first contribution in https://github.com/sgl-project/sglang/pull/4848
- @yuleil made their first contribution in https://github.com/sgl-project/sglang/pull/5277
- @jokerwyt made their first contribution in https://github.com/sgl-project/sglang/pull/5364
- @yhyang201 made their first contribution in https://github.com/sgl-project/sglang/pull/5003
- @yyccli made their first contribution in https://github.com/sgl-project/sglang/pull/5279
- @DefTruth made their first contribution in https://github.com/sgl-project/sglang/pull/5381
- @yuan-luo made their first contribution in https://github.com/sgl-project/sglang/pull/5351
- @mRSun15 made their first contribution in https://github.com/sgl-project/sglang/pull/5211
- @ryang-max made their first contribution in https://github.com/sgl-project/sglang/pull/5038
- @BearBiscuit05 made their first contribution in https://github.com/sgl-project/sglang/pull/5345
- @yyihuang made their first contribution in https://github.com/sgl-project/sglang/pull/4982
- @u4lr451 made their first contribution in https://github.com/sgl-project/sglang/pull/5511
- @liwenju0 made their first contribution in https://github.com/sgl-project/sglang/pull/5497
- @Amadeus-Winarto made their first contribution in https://github.com/sgl-project/sglang/pull/5452
- @finger92 made their first contribution in https://github.com/sgl-project/sglang/pull/5224
- @kyle-pena-kuzco made their first contribution in https://github.com/sgl-project/sglang/pull/4733
- @mac0ne made their first contribution in https://github.com/sgl-project/sglang/pull/5019
- @sundar24295s made their first contribution in https://github.com/sgl-project/sglang/pull/5141
- @relic-yuexi made their first contribution in https://github.com/sgl-project/sglang/pull/4590
- @PopSoda2002 made their first contribution in https://github.com/sgl-project/sglang/pull/4718
- @Lucius-THU made their first contribution in https://github.com/sgl-project/sglang/pull/4937
- @michael-amd made their first contribution in https://github.com/sgl-project/sglang/pull/5510
- @c8ef made their first contribution in https://github.com/sgl-project/sglang/pull/5640
- @bppps made their first contribution in https://github.com/sgl-project/sglang/pull/5552
- @vincentzed made their first contribution in https://github.com/sgl-project/sglang/pull/5685
- @DavidBao03 made their first contribution in https://github.com/sgl-project/sglang/pull/5730
- @Frank-Jie made their first contribution in https://github.com/sgl-project/sglang/pull/5592
- @yan97ao made their first contribution in https://github.com/sgl-project/sglang/pull/5599
- @mmoskal made their first contribution in https://github.com/sgl-project/sglang/pull/4870
- @congcongke made their first contribution in https://github.com/sgl-project/sglang/pull/3744
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.5...v0.4.6