SGLang - Browse /v0.4.6 at SourceForge.net


Name	Modified	Size	Downloads / Week
README.md	2025-04-27	43.0 kB	0
Release v0.4.6 source code.tar.gz	2025-04-27	3.7 MB	0
Release v0.4.6 source code.zip	2025-04-27	4.5 MB	0
Totals: 3 Items		8.3 MB	0

Highlights

Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.). https://github.com/sgl-project/sglang/issues/4709#issuecomment-2817728855
PD disaggregation with mooncake and NIXL transfer backends [#4880] [#5477] [#4655]
DeepSeek performance improvements: turn on DeepGemm by default and some kernel fusions. [#5580] [#5628]
Update torch to 2.6.0. Fix torch.compile cache. [#5417] [#5213]
Preliminary support for Blackwell [#5303]
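
The backend choices in the highlights above are normally made on the server command line. The following is a minimal sketch that only assembles an argument list and does not start a server. The flag names `--attention-backend`, `--disaggregation-mode` and `--disaggregation-transfer-backend`, the values `fa3`, `prefill`, `decode`, `mooncake` and `nixl`, and the model name are assumptions not stated in these notes; check them against `python -m sglang.launch_server --help` for your installed version.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    model_path: str,
    attention_backend: Optional[str] = None,
    disaggregation_mode: Optional[str] = None,
    transfer_backend: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `python -m sglang.launch_server`.

    All flag names below are assumptions (see the note above); this helper
    only builds the list and never launches anything.
    """
    cmd = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    if attention_backend is not None:
        # Assumed flag: pin the attention backend instead of relying on the default.
        cmd += ["--attention-backend", attention_backend]
    if disaggregation_mode is not None:
        # Assumed flag: run this server as the prefill or the decode side of PD.
        cmd += ["--disaggregation-mode", disaggregation_mode]
    if transfer_backend is not None:
        # Assumed flag: choose the KV transfer backend used between the two sides.
        cmd += ["--disaggregation-transfer-backend", transfer_backend]
    return cmd


if __name__ == "__main__":
    model = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model name
    # Explicitly request FlashAttention3.
    print(shlex.join(build_launch_command(model, attention_backend="fa3")))
    # One prefill server and one decode server sharing a transfer backend.
    for mode in ("prefill", "decode"):
        print(shlex.join(build_launch_command(
            model, disaggregation_mode=mode, transfer_backend="mooncake")))
```

Building the list separately from launching keeps the example checkable without a GPU; the result can be passed to `subprocess.run` once the flags are confirmed.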

Thanks very much to the LinkedIn team, Alibaba Cloud, Mooncake team, NVIDIA Team, AMD Team, PyTorch Team, Ant Group, Baseten Team, Oracle Team, Meituan Team, iFlytek MaaS team and the open source community users for their contributions!

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

Coming Soon

Large scale expert parallelism + PD disaggregation [#4734] [#5524]
Pipeline Parallelism [#5724]
MLA Cutlass Backend [#5390]

What's Changed

[ci] fix llama4 ci error by @BBuf in https://github.com/sgl-project/sglang/pull/5126
Refactor and Optimize FA3 Code by @hebiao064 in https://github.com/sgl-project/sglang/pull/5090
Add Llama4 user guide by @ispobock in https://github.com/sgl-project/sglang/pull/5133
[Misc] Use pytest.mark.skipif in sgl-kernel test by @yinfan98 in https://github.com/sgl-project/sglang/pull/5137
feat: disable grammar restrictions within reasoning sections by @minleminzui in https://github.com/sgl-project/sglang/pull/4984
[modelopt] automatically inspect if model is ModelOpt quantized and set quantization method by @yundai424 in https://github.com/sgl-project/sglang/pull/5145
[AMD] Fix missing per_token_group_quant_fp8 for ROCm by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/5140
fix multimodal hash feature by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/5083
Fix run time error in ROCm platform by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/5147
[FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct by @zcnrex in https://github.com/sgl-project/sglang/pull/5103
Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 by @yubofredwang in https://github.com/sgl-project/sglang/pull/4760
Use public model for FA3 speculative decode testing by @yubofredwang in https://github.com/sgl-project/sglang/pull/5152
Add dummy grok test to amd CI. by @saienduri in https://github.com/sgl-project/sglang/pull/5115
fix empty_cache error in pt_weights_iterator by @dangkai4u in https://github.com/sgl-project/sglang/pull/5151
Fix torch compile errors by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/5158
Fix loading KV quantization scale; Enable modelopt kv cache by @yundai424 in https://github.com/sgl-project/sglang/pull/4686
[PD] Fix unclosed prefill connection warning of mini_lb by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5155
Add optimized native kernels in sgl-kernel by @mingfeima in https://github.com/sgl-project/sglang/pull/5150
[PD] Simplify mini LB by @ByronHsu in https://github.com/sgl-project/sglang/pull/4911
Small improvement of native api docs by @simveit in https://github.com/sgl-project/sglang/pull/5139
[feat&refactor] Enhance multimodal input support with refactor io_struct by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/4938
Support 2x8xH100 for Llama 4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5159
FP4 weight loading and inference (2/2) by @trevor-m in https://github.com/sgl-project/sglang/pull/3972
Fix multimodal hashing error by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5174
Tiny disable model that does not work by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5175
[Bugfix] Fix index out of bounds in local attention with large sequences by @CatherineSue in https://github.com/sgl-project/sglang/pull/5173
[Fix] DeepEP Compatibility with Low Latency by @liz-badada in https://github.com/sgl-project/sglang/pull/5068
docs: remove the use of Downward API for LWS_WORKER_INDEX by @yankay in https://github.com/sgl-project/sglang/pull/5110
feat: add DeepGEMM build warning by @zhyncs in https://github.com/sgl-project/sglang/pull/5176
fix: use DeepEPDispatcher on CUDA by @zhyncs in https://github.com/sgl-project/sglang/pull/5180
[DeepEP] fix: import buffer error by @ch-wan in https://github.com/sgl-project/sglang/pull/5179
Let bench_one_batch support enable_dp_attention by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4058
[Misc] clean up vllm in sgl-kernel test by @yinfan98 in https://github.com/sgl-project/sglang/pull/5189
Fix ci test "test_eval_fp8_accuracy" failed by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/5185
Optimize topk operation in llama4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5128
Support Llama4 fp8 inference by @HandH1998 in https://github.com/sgl-project/sglang/pull/5194
[ci] fix ci test fused_moe op by @BBuf in https://github.com/sgl-project/sglang/pull/5102
model: support mllama4 by @mickqian in https://github.com/sgl-project/sglang/pull/5144
Rework grok test. by @saienduri in https://github.com/sgl-project/sglang/pull/5171
sgl-kernel use cutlass latest version for fp8 blockwise gemm by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5207
Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 by @Muuuchen in https://github.com/sgl-project/sglang/pull/5196
fix: log warning when disable cuda graph by @zhyncs in https://github.com/sgl-project/sglang/pull/5209
[metrics] Add in queue metrics by @hebiao064 in https://github.com/sgl-project/sglang/pull/4444
Fix DeepSeek error when using DeepEP mode by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5190
reduce moe_align_block_size_kernel small batch mode overhead by @BBuf in https://github.com/sgl-project/sglang/pull/5086
[PD] Support KV transfer with mooncake by @stmatengss in https://github.com/sgl-project/sglang/pull/4880
[PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool by @stmatengss in https://github.com/sgl-project/sglang/pull/5204
Update deps for mllama4 by @ispobock in https://github.com/sgl-project/sglang/pull/5215
Fix deepseek-v3 with torch.compile in PyTorch 2.6. by @zou3519 in https://github.com/sgl-project/sglang/pull/5213
ROCm sgl-kernel: compatible to later torch by @HaiShaw in https://github.com/sgl-project/sglang/pull/5167
[Misc] Clean sgl-kernel test by @yinfan98 in https://github.com/sgl-project/sglang/pull/5216
Update Makefile / build script to avoid installing incompatible torch dependency by @elfiegg in https://github.com/sgl-project/sglang/pull/5245
Fix torch.compile caching by @zou3519 in https://github.com/sgl-project/sglang/pull/5259
ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations by @HaiShaw in https://github.com/sgl-project/sglang/pull/5228
Optimize attention in llama4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5127
Optimize GPU memory usage in FlashAttentionBackend's strided indexing by @CatherineSue in https://github.com/sgl-project/sglang/pull/5262
Support --enable-llama4-multimodal by @ch-wan in https://github.com/sgl-project/sglang/pull/5254
[fix] fix mrope positions not picked up by @mickqian in https://github.com/sgl-project/sglang/pull/5265
doc: nested loop code for offline engine by @minleminzui in https://github.com/sgl-project/sglang/pull/5244
fix: examples for token_in_token_out_vlm by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/5193
Fix a 404 link in send_request.ipynb by @windsonsea in https://github.com/sgl-project/sglang/pull/5280
fix: enable fp4 compilation on cu128 by @zhyncs in https://github.com/sgl-project/sglang/pull/5286
feat: add cu128 identifier for sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/5287
chore: relax the torch version restriction for sgl-kernel compilation by @zhyncs in https://github.com/sgl-project/sglang/pull/5288
chore: bump sgl-kernel v0.0.8.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5289
[PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout by @GaoYusong in https://github.com/sgl-project/sglang/pull/5292
[Docs] Supported Model Docs - Major restructuring by @adarshxs in https://github.com/sgl-project/sglang/pull/5290
fix: update update_wheel_index for cu128 by @zhyncs in https://github.com/sgl-project/sglang/pull/5300
[Docs] Remove the older supported docs section by @adarshxs in https://github.com/sgl-project/sglang/pull/5301
remove moe_align_block_size torch.zeros in small batch/expert mode by @BBuf in https://github.com/sgl-project/sglang/pull/5298
feat: add blackwell Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5302
feat: add blackwell workflow by @zhyncs in https://github.com/sgl-project/sglang/pull/5303
fix: use fa3 unit test on hopper only by @zhyncs in https://github.com/sgl-project/sglang/pull/5304
misc: update blackwell Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5306
fix: remove cublas_grouped_gemm by @zhyncs in https://github.com/sgl-project/sglang/pull/5307
fix: update flash attn by @zhyncs in https://github.com/sgl-project/sglang/pull/5308
fix: use deepgemm only on hopper by @zhyncs in https://github.com/sgl-project/sglang/pull/5310
[VLM] Adopt fast image processor by default by @mickqian in https://github.com/sgl-project/sglang/pull/5065
Adjust ci test threshold by @ispobock in https://github.com/sgl-project/sglang/pull/5271
Blackwell Cutlass MLA kernel by @trevor-m in https://github.com/sgl-project/sglang/pull/5142
misc: cleanup 3rdparty by @zhyncs in https://github.com/sgl-project/sglang/pull/5311
update variable naming and comments for rocm by @Lzy17 in https://github.com/sgl-project/sglang/pull/5299
Fix w8a8_int8 model shared experts fusion load weights error by @lambert0312 in https://github.com/sgl-project/sglang/pull/5120
Add flash_attn_varlen_func to sgl-kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/5315
Fix fa3 window size setup by @qingquansong in https://github.com/sgl-project/sglang/pull/5316
chore: bump sgl-kernel v0.0.8.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5317
feat: use fa3 mla by default on hopper by @zhyncs in https://github.com/sgl-project/sglang/pull/5210
Fix: docs/backend/structured_outputs.ipynb by @thyecust in https://github.com/sgl-project/sglang/pull/4884
Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… by @BBuf in https://github.com/sgl-project/sglang/pull/5321
refine fused_moe tuning docs by @BBuf in https://github.com/sgl-project/sglang/pull/5294
Support server based rollout in Verlengine by @yitianlian in https://github.com/sgl-project/sglang/pull/4848
[Feat] Add sparse attn to sgl-kernel by @yinfan98 in https://github.com/sgl-project/sglang/pull/5327
fix: solve cu118 issue for cutlass mla by @zhyncs in https://github.com/sgl-project/sglang/pull/5331
chore: bump sgl-kernel v0.0.8.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/5332
ci: update release node by @zhyncs in https://github.com/sgl-project/sglang/pull/5333
fix: determine if flashinfer is installed by @zhyncs in https://github.com/sgl-project/sglang/pull/5336
feat: adapt merge_state by @zhyncs in https://github.com/sgl-project/sglang/pull/5337
misc: update sagemaker Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/5341
Fix: ensure tensors used in dist.broadcast are created on the correct… by @minleminzui in https://github.com/sgl-project/sglang/pull/5322
docs: update adoption and sponsorship list with Oracle by @zhyncs in https://github.com/sgl-project/sglang/pull/5343
chore: upgrade sgl-kernel 0.0.8.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/5342
Fix typo: infight -> inflight by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5357
[PD] Add transfer backend abstraction by @ByronHsu in https://github.com/sgl-project/sglang/pull/5328
fix MLATokenToKVPoolHost get_size_per_token bug by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/5161
fix [#5322] by @zhyncs in https://github.com/sgl-project/sglang/pull/5359
feat: update experiment_runner by @zhyncs in https://github.com/sgl-project/sglang/pull/5360
[DeepEP] Reduce routed scaling overhead by @yuleil in https://github.com/sgl-project/sglang/pull/5277
Free metadata_buffer_index after transfer finished by @jokerwyt in https://github.com/sgl-project/sglang/pull/5364
Fix DeepSeek DP Attention + torch compile by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5367
Support for Qwen2.5-VL Model in bitsandbytes Format by @yhyang201 in https://github.com/sgl-project/sglang/pull/5003
Fix PD disaggregation bugs by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5326
[PD Bug] fix MLA get_contiguous_buf_infos error by @whybeyoung in https://github.com/sgl-project/sglang/pull/5384
[perf] experimental enhance fp8 per-tensor quant by @Alcanderian in https://github.com/sgl-project/sglang/pull/5370
Apply deepseek cuda rope by @ispobock in https://github.com/sgl-project/sglang/pull/5385
apply fused moe gate in ds v3/r1 by @BBuf in https://github.com/sgl-project/sglang/pull/5371
fix: update test config by @zhyncs in https://github.com/sgl-project/sglang/pull/5392
[Fix] Turn off DeepGEMM by default by @Fridge003 in https://github.com/sgl-project/sglang/pull/5263
minor clean up of sgl-kernel/CMakeLists.txt by @merrymercy in https://github.com/sgl-project/sglang/pull/5393
Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @lambert0312 in https://github.com/sgl-project/sglang/pull/5368
Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 by @Ximingwang-09 in https://github.com/sgl-project/sglang/pull/5291
[fix/misc] remove duplicate row in deepseek v2 model by @yyccli in https://github.com/sgl-project/sglang/pull/5279
chore: upgrade DeepGEMM by @zhyncs in https://github.com/sgl-project/sglang/pull/5395
fix: update pr-test-sgl-kernel by @zhyncs in https://github.com/sgl-project/sglang/pull/5399
kernel: support slightly faster merge_state_v2 cuda kernel by @DefTruth in https://github.com/sgl-project/sglang/pull/5381
chore: bump sgl-kernel 0.0.9 by @zhyncs in https://github.com/sgl-project/sglang/pull/5400
chore: upgrade sgl-kernel 0.0.9 by @zhyncs in https://github.com/sgl-project/sglang/pull/5401
Tiny fix DeepseekScalingRotaryEmbedding always use forward_native by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5406
Fix bench_serving with random-ids by @guoyuhong in https://github.com/sgl-project/sglang/pull/5214
[misc] fix ci flaky case by @Alcanderian in https://github.com/sgl-project/sglang/pull/5352
[FIX] Fix concatenation error in capture_bs when --disable-cuda-graph-padding is enabled and MTP is not used by @Muuuchen in https://github.com/sgl-project/sglang/pull/5412
Support dynamic connection and TP 16 by @yuan-luo in https://github.com/sgl-project/sglang/pull/5351
Fix broadcast using the CUDA device, which led to unbalanced memory capacity by @lambert0312 in https://github.com/sgl-project/sglang/pull/5416
[PD] Fix dynamic port support and MLA buffer for Mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5415
Distinguish bootstrap key only in decode server by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5422
[PD] Remove unused bootstrap param and fix port table type by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5423
[minor] cleanup cmakelists.txt by @merrymercy in https://github.com/sgl-project/sglang/pull/5420
bugfix: fix merge_state_v2 cuda graph by @DefTruth in https://github.com/sgl-project/sglang/pull/5419
chore: bump sgl-kernel v0.0.9.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5430
fix: solve release issue by @zhyncs in https://github.com/sgl-project/sglang/pull/5434
Blackwell cutlass mla: Add check for bad page size/block num combinations by @trevor-m in https://github.com/sgl-project/sglang/pull/5431
feat: update model_specific_adjustment by @zhyncs in https://github.com/sgl-project/sglang/pull/5344
chore: upgrade sgl-kernel 0.0.9.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5436
Fix ignore_eos parameter when loading a chat template by @CatherineSue in https://github.com/sgl-project/sglang/pull/5264
add attention backend supporting matrix in the doc by @mRSun15 in https://github.com/sgl-project/sglang/pull/5211
Support BNB quantization for llama/mllama by @ryang-max in https://github.com/sgl-project/sglang/pull/5038
[Docs] Update start/install.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5398
[Minor] Move torch.compile patch to a better place by @merrymercy in https://github.com/sgl-project/sglang/pull/5397
[Bug fix] need record start time in pd mode by @whybeyoung in https://github.com/sgl-project/sglang/pull/5425
Support MHA with chunked prefix cache for DeepSeek chunked prefill by @Fridge003 in https://github.com/sgl-project/sglang/pull/5113
chore: bump v0.4.5.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/5445
Fix several minor issues in PD disaggregation by @ch-wan in https://github.com/sgl-project/sglang/pull/5444
[doc] Update benchmark_and_profiling.md by @BBuf in https://github.com/sgl-project/sglang/pull/5449
Update cutlass dependency. by @elfiegg in https://github.com/sgl-project/sglang/pull/5447
add multi-lora feature in README.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/5463
Clean up imports by @merrymercy in https://github.com/sgl-project/sglang/pull/5467
[verl] Modify the update_weights func to align with verl's resharding by @BearBiscuit05 in https://github.com/sgl-project/sglang/pull/5345
[Model Support] unsloth/Phi-4-mini bnb model by @yyihuang in https://github.com/sgl-project/sglang/pull/4982
Update attention_backend.md: plural form by @didier-durand in https://github.com/sgl-project/sglang/pull/5489
Add test for flash_attn_varlen_func kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/5484
Deprecate disable-mla by @Fridge003 in https://github.com/sgl-project/sglang/pull/5481
Deprecate enable-flashinfer-mla and enable-flashmla by @Fridge003 in https://github.com/sgl-project/sglang/pull/5480
Feat/support encoder model (like bert) by @woodx9 in https://github.com/sgl-project/sglang/pull/4887
Enable local attention during decode by @CatherineSue in https://github.com/sgl-project/sglang/pull/5479
Refactor DeepSeek decoder layer branches by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5205
Fix a link in sgl-kernel/README.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5493
[Bug fix] use correct func path in deepseek by @XucSh in https://github.com/sgl-project/sglang/pull/5496
Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B by @minleminzui in https://github.com/sgl-project/sglang/pull/5503
[Feat] Update sgl-kernel flashinfer to latest main version by @yinfan98 in https://github.com/sgl-project/sglang/pull/5500
Fix: Incorrect parameters passed to forward_batch_generation (#5506) by @u4lr451 in https://github.com/sgl-project/sglang/pull/5511
Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … by @minleminzui in https://github.com/sgl-project/sglang/pull/5426
[docs] Fix several consistency issues in sampling_params.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5373
Configuration qwen2_moe.py - qkv_bias now in transformers by @michaelfeil in https://github.com/sgl-project/sglang/pull/5512
Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/4836
Sgl kernel fused_moe_gate support n_shared_experts by @BBuf in https://github.com/sgl-project/sglang/pull/5440
chore: bump sgl-kernel 0.0.9.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5518
use sglang_per_token_group_quant_fp8 from sgl-kernel instead of triton kernel by @strgrb in https://github.com/sgl-project/sglang/pull/5473
fix kimi vl running bug after rebase main by @BBuf in https://github.com/sgl-project/sglang/pull/5461
fix bug of VLLM_AVAILABLE not defined by @liwenju0 in https://github.com/sgl-project/sglang/pull/5497
Avoid computing lse in Ragged Prefill when there's no prefix. by @Edenzzzz in https://github.com/sgl-project/sglang/pull/5476
[Model] Adding Qwen3 and Qwen3MoE by @yhyang201 in https://github.com/sgl-project/sglang/pull/4693
fix util import by @zhyncs in https://github.com/sgl-project/sglang/pull/5542
Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… by @zhyncs in https://github.com/sgl-project/sglang/pull/5544
chore: upgrade sgl-kernel 0.0.9.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5540
Fix DeepGEMM masked cannot be run on groups not being multiple of 4 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5340
Make profiler output file names consistent by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5548
[PD] Tiny fix timeout error when generate by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5545
[PD] Fix no cache connect for receiver by @whybeyoung in https://github.com/sgl-project/sglang/pull/5534
feat: use flashinfer jit package by @zhyncs in https://github.com/sgl-project/sglang/pull/5547
[PD] Remove the requirement of config file for mooncake backend by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5460
restruct compressed_tensors_w8a8_fp8 by @BBuf in https://github.com/sgl-project/sglang/pull/5475
simplify the control logic for using shared experts fusion by @BBuf in https://github.com/sgl-project/sglang/pull/5504
Remove one kernel in per_tensor_quant_mla_fp8 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5549
Fix sampler nan check when calling top_k_top_p_sampling_from_probs by @yubofredwang in https://github.com/sgl-project/sglang/pull/5546
[PD] Support page size > 1 by @ByronHsu in https://github.com/sgl-project/sglang/pull/5561
fix hicache write back by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5543
Minor update for ROCm variable style by @Lzy17 in https://github.com/sgl-project/sglang/pull/5562
Fix bench_one_batch producing unnatural results for expert parallel by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5149
[perf] introduce deep gemm group_gemm_masked as bmm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5432
[PD] Fix DeepSeek cannot be run on latest master by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5568
Fix BumpAllocator error when no input_ids by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5564
enable DeepSeek V3 shared_experts_fusion in sm90 by @BBuf in https://github.com/sgl-project/sglang/pull/5571
[Fix] fix outlines and xgrammar by @Alcanderian in https://github.com/sgl-project/sglang/pull/4947
[Doc]Add instruction for profiling with bench_one_batch by @Fridge003 in https://github.com/sgl-project/sglang/pull/5581
Release v0.4.5.post2 by @merrymercy in https://github.com/sgl-project/sglang/pull/5582
Fix bench_serving fail when zero warmup requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5574
Fix DeepEP cannot run on latest master by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5567
Fix torch memory saver not enabled in DP scenario by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5560
Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5559
Add document for LoRA serving by @Fridge003 in https://github.com/sgl-project/sglang/pull/5521
Tiny improve error message by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5526
[PD] Fix server crash when using batch requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5531
[Feat] upgrade pytorch2.6 by @sleepcoo in https://github.com/sgl-project/sglang/pull/5417
Fix enable chunked prefill for Llama4 by @tarinkk in https://github.com/sgl-project/sglang/pull/5575
fix: use fa3 for gemma2 by @zhyncs in https://github.com/sgl-project/sglang/pull/5586
Fix ChatCompletionMessageGenericParam to allow for None content by @Amadeus-Winarto in https://github.com/sgl-project/sglang/pull/5452
[PD] Fix large page size + chunk prefill by @ByronHsu in https://github.com/sgl-project/sglang/pull/5588
Add test config yamls for Deepseek v3 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5433
[Feature] Prefill assistant response - add continue_final_message parameter by @adarshxs in https://github.com/sgl-project/sglang/pull/4226
add function call parser for DeepSeek V3 by @finger92 in https://github.com/sgl-project/sglang/pull/5224
smaller and non gated models for docs by @simveit in https://github.com/sgl-project/sglang/pull/5378
Feat: Implement JSON Mode (response_format.type="json_object") by @kyle-pena-kuzco in https://github.com/sgl-project/sglang/pull/4733
check marlin format before attempting conversion by @qeternity in https://github.com/sgl-project/sglang/pull/4675
compressed_tensors: port w8a16 fp8 from vllm by @vhain in https://github.com/sgl-project/sglang/pull/4852
Fix one more issue reported by torchfix by @b8zhong in https://github.com/sgl-project/sglang/pull/4859
Add sanity check for max_running_requests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5016
Correct grafana heatmap. by @mac0ne in https://github.com/sgl-project/sglang/pull/5019
Perform Batch Tokenization. by @sundar24295s in https://github.com/sgl-project/sglang/pull/5141
Speedup shared expert weight construction by avoid cloning by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5188
Tiny add Engine.flush_cache API by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5241
[misc] remove is_cuda_available by @Alcanderian in https://github.com/sgl-project/sglang/pull/5319
Fix flush cache by @merrymercy in https://github.com/sgl-project/sglang/pull/5590
Add Speculative Decoding Eagle3 topk > 1 by @qingquansong in https://github.com/sgl-project/sglang/pull/5318
upstream hicache fixes by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/5570
Tiny add warning when cannot recognize bool env var by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5348
Modify metrics service endpoint by @lambert0312 in https://github.com/sgl-project/sglang/pull/3443
Update protocol.py to fix [#4589] by @relic-yuexi in https://github.com/sgl-project/sglang/pull/4590
[Feat.] Enable grafana to show metrics by @PopSoda2002 in https://github.com/sgl-project/sglang/pull/4718
[Fix] Enhance DP Attention for IPv6 Compatibility by @Lucius-THU in https://github.com/sgl-project/sglang/pull/4937
Support o1 model on Azure by @ChuyueSun in https://github.com/sgl-project/sglang/pull/4980
Tiny remove duplicated code by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5021
Tiny update error hint by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5037
Support PD bootstrap fields on /v1/chat/completions endpoint by @jokerwyt in https://github.com/sgl-project/sglang/pull/5488
[PD] Fix generate endpoint of mini_lb for PD by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5598
[PD] Fix edge case and simplify large page size + chunked prefill by @ByronHsu in https://github.com/sgl-project/sglang/pull/5589
[PD] Add NIXL transfer backend by @trevor-m in https://github.com/sgl-project/sglang/pull/5477
[PD] Support decode overlap schedule by @ByronHsu in https://github.com/sgl-project/sglang/pull/5608
[PD] Support prefill overlap + Ensure no race condition by @ByronHsu in https://github.com/sgl-project/sglang/pull/5609
Enhance GPU memory settings by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5604
[feature] enable pre compile jit deep_gemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5580
Clean up mem settings by @merrymercy in https://github.com/sgl-project/sglang/pull/5610
Support aiter RMSNorm in AMD by @michael-amd in https://github.com/sgl-project/sglang/pull/5510
chore: bump v0.4.5.post3 by @zhyncs in https://github.com/sgl-project/sglang/pull/5611
Remove extra copy in deepseek forward absorb by @ispobock in https://github.com/sgl-project/sglang/pull/5578
[Doc] Fix a 404 link to llama-405b by @windsonsea in https://github.com/sgl-project/sglang/pull/5615
[fix] force use deepgemm in compile_deep_gemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5618
[fix] fix compile_deep_gemm missing kv_b_proj by @Alcanderian in https://github.com/sgl-project/sglang/pull/5620
fix: gemma 3 not use softcap by @zhyncs in https://github.com/sgl-project/sglang/pull/5622
Fix FA3 DeepSeek prefill performance regression by @Alcanderian in https://github.com/sgl-project/sglang/pull/5624
[NFC] Remove duplicate compressed-tensors by @c8ef in https://github.com/sgl-project/sglang/pull/5640
Fix shared experts fusion error without quantization by @lambert0312 in https://github.com/sgl-project/sglang/pull/5632
[feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 by @saltyfish66 in https://github.com/sgl-project/sglang/pull/5641
fix flashmla bug by @sleepcoo in https://github.com/sgl-project/sglang/pull/5272
[fix] reduce dp capture bs by @Alcanderian in https://github.com/sgl-project/sglang/pull/5634
Remove q concat in FA3 backend for DeepSeek decode by @ispobock in https://github.com/sgl-project/sglang/pull/5638
Revert "Support aiter RMSNorm in AMD" by @HaiShaw in https://github.com/sgl-project/sglang/pull/5646
fix: update bench_speculative by @zhyncs in https://github.com/sgl-project/sglang/pull/5649
Turn on DeepGemm By Default and Update Doc by @Fridge003 in https://github.com/sgl-project/sglang/pull/5628
Fuse q_a_proj and kv_a_proj for DeepSeek models by @Fridge003 in https://github.com/sgl-project/sglang/pull/5619
Remove unnecessary torch.full in DeepSeek by @fzyzcjy in https://github.com/sgl-project/sglang/pull/5601
[1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell by @elfiegg in https://github.com/sgl-project/sglang/pull/5281
fix sgl-kernel unit tests by @zhyncs in https://github.com/sgl-project/sglang/pull/5666
fix awq_dequantize import by @zhyncs in https://github.com/sgl-project/sglang/pull/5669
Integrating PD disaggregation with DP attention and DeepEP by @ch-wan in https://github.com/sgl-project/sglang/pull/5435
fix gemma3 unit test by @zhyncs in https://github.com/sgl-project/sglang/pull/5670
fix torchvision::nms not exist by @zhyncs in https://github.com/sgl-project/sglang/pull/5671
[PD] Add support for dp attention with mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5530
tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py by @merrymercy in https://github.com/sgl-project/sglang/pull/5677
[Doc] Fix two 404 links caused by sglang typo by @windsonsea in https://github.com/sgl-project/sglang/pull/5667
fix: update truss bench_serving by @zhyncs in https://github.com/sgl-project/sglang/pull/5683
fix: only compile ApplyTokenBitmaskInplace cu124+ by @zhyncs in https://github.com/sgl-project/sglang/pull/5686
chore: bump sgl-kernel 0.1.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/5688
vlm: enable radix cache for qwen-vl models by @mickqian in https://github.com/sgl-project/sglang/pull/5349
[BugFix] Fix combination of MTP and --n-share-experts-fusion with R1 by @guoyuhong in https://github.com/sgl-project/sglang/pull/5707
Fix weight loading bug for Deepseek v3+nextn by @Fridge003 in https://github.com/sgl-project/sglang/pull/5684
Add example to use sgl engine with fastapi by @ravi03071991 in https://github.com/sgl-project/sglang/pull/5648
[Doc] Fix a link to Weilin Zhao by @windsonsea in https://github.com/sgl-project/sglang/pull/5706
Add MMMU benchmark results by @ravi03071991 in https://github.com/sgl-project/sglang/pull/4491
[Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct) by @b8zhong in https://github.com/sgl-project/sglang/pull/5078
[PD] Better logs by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5715
[PD] Add kvargs table and thread pool for kvcache sender of mooncake by @ShangmingCai in https://github.com/sgl-project/sglang/pull/5738
[PD]: Support Multi Prefill in one node by @hcyz33 in https://github.com/sgl-project/sglang/pull/5704
Fix: deepseek forward absorb by @michael-amd in https://github.com/sgl-project/sglang/pull/5723
Pin torch audio to 2.6.0 by @merrymercy in https://github.com/sgl-project/sglang/pull/5750
Revert "[Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct)" by @merrymercy in https://github.com/sgl-project/sglang/pull/5754
Disable flaky eagle tests by @merrymercy in https://github.com/sgl-project/sglang/pull/5753
update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. by @BBuf in https://github.com/sgl-project/sglang/pull/5740
[Docs] Update runtime/engine/readme.md by @windsonsea in https://github.com/sgl-project/sglang/pull/5737
Reorder loop in shared expert weight loading by @ispobock in https://github.com/sgl-project/sglang/pull/5719
fix: fix one more bug from merging mm_inputs by @mickqian in https://github.com/sgl-project/sglang/pull/5718
[Fix]: support deepseek-vl2-tiny model by @bppps in https://github.com/sgl-project/sglang/pull/5552
Bugfix for minicpmo vision test by @yizhang2077 in https://github.com/sgl-project/sglang/pull/5760
[Minor] fix documentations by @merrymercy in https://github.com/sgl-project/sglang/pull/5756
Add an assertion to enhance the robustness of the operator by @liwenju0 in https://github.com/sgl-project/sglang/pull/5736
fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 by @lkm2835 in https://github.com/sgl-project/sglang/pull/5733
Use device_id in dist init to reduce NCCL communicator warmup & creation overhead by @Edenzzzz in https://github.com/sgl-project/sglang/pull/5728
[fix] fix potential bumpy throughput with deepgemm by @Alcanderian in https://github.com/sgl-project/sglang/pull/5722
Resolves the 404 Not Found error when running compile_deep_gemm.py in multi-node setups by @guoyuhong in https://github.com/sgl-project/sglang/pull/5720
perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling by @saltyfish66 in https://github.com/sgl-project/sglang/pull/5716
Fix the nonexistent access of decrypted_config_file by @vincentzed in https://github.com/sgl-project/sglang/pull/5685
CI: rewrite test_vision_chunked_prefill to speedup by @mickqian in https://github.com/sgl-project/sglang/pull/5682
Fuse MLA set kv cache kernel by @ispobock in https://github.com/sgl-project/sglang/pull/5748
Update amd docker image to sglang:v0.4.5.post3-rocm630. by @saienduri in https://github.com/sgl-project/sglang/pull/5697
[feature] support for roberta embedding models by @DavidBao03 in https://github.com/sgl-project/sglang/pull/5730
[fix] fix bench_one_batch_server by @Alcanderian in https://github.com/sgl-project/sglang/pull/5607
support for the DeepSeek model by enabling streaming response parsing by @Frank-Jie in https://github.com/sgl-project/sglang/pull/5592
fix: Use is not None instead of != None for None checks. by @vincentzed in https://github.com/sgl-project/sglang/pull/5687
Add Llama 4 to FA3 test by @hebiao064 in https://github.com/sgl-project/sglang/pull/5509
[misc] more decode step log for batch_one_batch by @Alcanderian in https://github.com/sgl-project/sglang/pull/5565
Handle JSONDecodeError while processing request data by @yan97ao in https://github.com/sgl-project/sglang/pull/5599
fix(srt): check if sample_indices is not None before usage. by @aoshen524 in https://github.com/sgl-project/sglang/pull/5633
update llguidance to 0.7.11; adds StructTag by @mmoskal in https://github.com/sgl-project/sglang/pull/4870
Use sgl-kernel sgl_per_token_group_quant_int8 by @lambert0312 in https://github.com/sgl-project/sglang/pull/4971
Add memory_saver check by @kebe7jun in https://github.com/sgl-project/sglang/pull/4986
add switch to disable open api doc by @congcongke in https://github.com/sgl-project/sglang/pull/3744
Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" by @merrymercy in https://github.com/sgl-project/sglang/pull/5772
Fix eagle test case by @merrymercy in https://github.com/sgl-project/sglang/pull/5776
Split local attention test from fa3 test by @Fridge003 in https://github.com/sgl-project/sglang/pull/5774
Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" by @merrymercy in https://github.com/sgl-project/sglang/pull/5777
Simplify FA3 tests by @merrymercy in https://github.com/sgl-project/sglang/pull/5779
Revert "[fix] fix bench_one_batch_server" by @merrymercy in https://github.com/sgl-project/sglang/pull/5785
Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" by @merrymercy in https://github.com/sgl-project/sglang/pull/5786
[CI] Tune threshold by @merrymercy in https://github.com/sgl-project/sglang/pull/5787
[CI] fix port conflicts by @merrymercy in https://github.com/sgl-project/sglang/pull/5789
[CI] Fix ci tests by @merrymercy in https://github.com/sgl-project/sglang/pull/5769
[PD]Reduce kv transfer threads by @hnyls2002 in https://github.com/sgl-project/sglang/pull/5791
[CI] Fix test case by @merrymercy in https://github.com/sgl-project/sglang/pull/5790
Add 8-GPU Test for Deepseek-V3 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5691
Release v0.4.6 by @Fridge003 in https://github.com/sgl-project/sglang/pull/5795

New Contributors

@huangtingwei9988 made their first contribution in https://github.com/sgl-project/sglang/pull/5083
@yubofredwang made their first contribution in https://github.com/sgl-project/sglang/pull/4760
@dangkai4u made their first contribution in https://github.com/sgl-project/sglang/pull/5151
@ShangmingCai made their first contribution in https://github.com/sgl-project/sglang/pull/5155
@mingfeima made their first contribution in https://github.com/sgl-project/sglang/pull/5150
@yankay made their first contribution in https://github.com/sgl-project/sglang/pull/5110
@Muuuchen made their first contribution in https://github.com/sgl-project/sglang/pull/5196
@stmatengss made their first contribution in https://github.com/sgl-project/sglang/pull/4880
@zou3519 made their first contribution in https://github.com/sgl-project/sglang/pull/5213
@GaoYusong made their first contribution in https://github.com/sgl-project/sglang/pull/5292
@Lzy17 made their first contribution in https://github.com/sgl-project/sglang/pull/5299
@thyecust made their first contribution in https://github.com/sgl-project/sglang/pull/4884
@yitianlian made their first contribution in https://github.com/sgl-project/sglang/pull/4848
@yuleil made their first contribution in https://github.com/sgl-project/sglang/pull/5277
@jokerwyt made their first contribution in https://github.com/sgl-project/sglang/pull/5364
@yhyang201 made their first contribution in https://github.com/sgl-project/sglang/pull/5003
@yyccli made their first contribution in https://github.com/sgl-project/sglang/pull/5279
@DefTruth made their first contribution in https://github.com/sgl-project/sglang/pull/5381
@yuan-luo made their first contribution in https://github.com/sgl-project/sglang/pull/5351
@mRSun15 made their first contribution in https://github.com/sgl-project/sglang/pull/5211
@ryang-max made their first contribution in https://github.com/sgl-project/sglang/pull/5038
@BearBiscuit05 made their first contribution in https://github.com/sgl-project/sglang/pull/5345
@yyihuang made their first contribution in https://github.com/sgl-project/sglang/pull/4982
@u4lr451 made their first contribution in https://github.com/sgl-project/sglang/pull/5511
@liwenju0 made their first contribution in https://github.com/sgl-project/sglang/pull/5497
@Amadeus-Winarto made their first contribution in https://github.com/sgl-project/sglang/pull/5452
@finger92 made their first contribution in https://github.com/sgl-project/sglang/pull/5224
@kyle-pena-kuzco made their first contribution in https://github.com/sgl-project/sglang/pull/4733
@mac0ne made their first contribution in https://github.com/sgl-project/sglang/pull/5019
@sundar24295s made their first contribution in https://github.com/sgl-project/sglang/pull/5141
@relic-yuexi made their first contribution in https://github.com/sgl-project/sglang/pull/4590
@PopSoda2002 made their first contribution in https://github.com/sgl-project/sglang/pull/4718
@Lucius-THU made their first contribution in https://github.com/sgl-project/sglang/pull/4937
@michael-amd made their first contribution in https://github.com/sgl-project/sglang/pull/5510
@c8ef made their first contribution in https://github.com/sgl-project/sglang/pull/5640
@bppps made their first contribution in https://github.com/sgl-project/sglang/pull/5552
@vincentzed made their first contribution in https://github.com/sgl-project/sglang/pull/5685
@DavidBao03 made their first contribution in https://github.com/sgl-project/sglang/pull/5730
@Frank-Jie made their first contribution in https://github.com/sgl-project/sglang/pull/5592
@yan97ao made their first contribution in https://github.com/sgl-project/sglang/pull/5599
@mmoskal made their first contribution in https://github.com/sgl-project/sglang/pull/4870
@congcongke made their first contribution in https://github.com/sgl-project/sglang/pull/3744

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.4.5...v0.4.6

Source: README.md, updated 2025-04-27

SGLang Files

SGLang is a fast serving framework for large language models
