| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.md | 2025-08-01 | 23.4 kB | |
| v0.21.0 source code.tar.gz | 2025-08-01 | 353.7 MB | |
| v0.21.0 source code.zip | 2025-08-01 | 357.5 MB | |
| Totals: 3 Items | | 711.2 MB | 0 |
TensorRT-LLM Release 0.21.0
Key Features and Enhancements
- Model Support
  - Added Gemma3 VLM support
- Features
  - Added large-scale EP support
  - Integrated NIXL into the communication layer of the disaggregated service
  - Added fabric memory support for KV cache transfer
  - Added MCP in ScaffoldingLLM
  - Added support for w4a8_mxfp4_fp8 quantization
  - Added support for fp8 rowwise quantization
  - Added generation logits support in TRTLLM Sampler
  - Added log probs support in TRTLLM Sampler
  - Optimized TRTLLM Sampler performance for single beam, single step
  - Enabled disaggregated serving for Qwen-3
  - Added EAGLE3 support for Qwen-3
  - Fused finalize and allreduce for Qwen-MoE model
  - Refactored Fused MoE module
  - Added support for chunked attention on Blackwell and Hopper
  - Introduced sliding-window attention kernels for the generation phase on Blackwell
  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
  - Added FP8 block-scale GEMM support on SM89
  - Enabled overlap scheduler between draft forwards
  - Added piecewise CUDA graph support for MLA
  - Added model-agnostic one-engine EAGLE3
  - Enabled Finalize + Allreduce + add + rmsnorm fusion
  - Integrated TRT-LLM Gen FP8 block scale MoE with PyTorch workflow kernel autotuner
  - Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
  - Validated Llama 3.1 models on H200 NVL
- Benchmark:
  - Added all_reduce.py benchmark script for testing
  - Added beam width to trtllm-bench latency command
  - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
  - Enabled trtllm-bench to run LoRA and added basic e2e perf testing capability for LoRA
  - Supported post_proc for bench
  - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
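As a rough illustration of the fp8 rowwise quantization entry above, the sketch below computes one scale per row so that each row's largest magnitude maps to the FP8 E4M3 maximum of 448. It assumes that "rowwise" means one scale per row. It is not TensorRT-LLM's kernel, the function names are hypothetical, and the final cast to FP8 is deliberately left out.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 (e4m3fn) format


def rowwise_scales(matrix):
    """Return one scale per row; the row's largest magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        amax = max((abs(x) for x in row), default=0.0)
        # An all-zero (or empty) row gets a neutral scale to avoid dividing by zero.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(matrix, scales):
    """Divide each row by its scale; a real kernel would then cast the result to FP8."""
    return [[x / s for x in row] for row, s in zip(matrix, scales)]


def unscale_rows(scaled, scales):
    """Multiply each row by its scale to recover approximate original values."""
    return [[x * s for x in row] for row, s in zip(scaled, scales)]
```

Because each row carries its own scale, a row with small values is not forced to share a range with a row that contains large outliers, which is the usual motivation for per-row over per-tensor scaling.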
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
- The dependent public PyTorch version is updated to 2.7.1.
- The dependent TensorRT version is updated to 10.11.
- The dependent NVIDIA ModelOpt version is updated to 0.31.
- The dependent NCCL version is updated to 2.27.5.
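One way to compare a local environment against the versions listed above is a small check built on the standard library's `importlib.metadata`. This is a minimal sketch, not a tool shipped with TensorRT-LLM. The distribution names (`torch`, `tensorrt`, `nvidia-modelopt`) are assumptions and may differ in a given environment. NCCL is omitted because it is not necessarily installed as a Python distribution.

```python
from importlib import metadata

# Versions taken from the release notes. The distribution names on the left
# are assumptions and may not match every environment.
EXPECTED = {
    "torch": "2.7.1",
    "tensorrt": "10.11",
    "nvidia-modelopt": "0.31",
}


def version_tuple(text):
    """Turn '2.7.1+cu128' or '10.11.0.33' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def matches(installed, expected):
    """True when `installed` starts with the numeric components of `expected`."""
    want = version_tuple(expected)
    return version_tuple(installed)[: len(want)] == want


def report(expected=EXPECTED):
    """Return {name: (installed_version_or_None, ok)} for each expected package."""
    result = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = (None, False)
            continue
        result[name] = (found, matches(found, wanted))
    return result
```

Calling `report()` inside the container and printing the result gives a quick view of which packages are missing or differ from the listed versions.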
API Changes
- Set _AutoDeployLlmArgs as primary config object
- Removed decoder request from decoder interface
- Enhanced the torch_compile_config in llm args
- Removed the redundant use_kv_cache field from PytorchConfig
- Moved allreduce_strategy from committed api to reference
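Settings files written for an earlier release may still carry the `use_kv_cache` field that was removed from PytorchConfig. The sketch below strips removed fields from a plain dictionary before it is passed on. The helper name and the dictionary-based workflow are hypothetical and only illustrate the idea. How TensorRT-LLM itself reacts to the stale field is not described in these notes.

```python
# Field named in the release notes as removed from PytorchConfig.
REMOVED_FIELDS = {"use_kv_cache"}


def drop_removed_fields(settings):
    """Return (cleaned_settings, dropped_names) without mutating the input."""
    cleaned = {k: v for k, v in settings.items() if k not in REMOVED_FIELDS}
    dropped = sorted(k for k in settings if k in REMOVED_FIELDS)
    return cleaned, dropped


# Example with a made-up legacy settings dictionary.
legacy = {"use_kv_cache": True, "some_other_option": 4}
cleaned, dropped = drop_removed_fields(legacy)
```

Returning the list of dropped names makes it easy to log a note telling the user which stale settings were ignored.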
Fixed Issues
- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
- Fixed cuda graph padding for spec decoding (#4853)
- Fixed llama 4 long context issue (#4809)
- Fixed max_num_sequences calculation with overlap scheduling (#4532)
- Fixed chunked prefill + overlap scheduling (#5761)
- Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
- Fixed index out of bounds error in spec decoding (#5954)
- Fixed MTP illegal memory access in cuda graph warmup (#5947)
- Fixed no free slots error with spec decode + disagg (#5975)
- Fixed one-off attention window size for Gemma3 1B (#5564)
Known Issues
- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
- In 0.21, full chunked attention support was added so that LLaMA4 models can run functionally with sequence lengths greater than 8K. This enhancement causes a known performance regression on Hopper that affects only LLaMA4 models. The root cause has been identified, and the fix will be included in a future release.
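A deployment script can guard against the combination flagged above (disaggregated serving, MTP, and the overlap scheduler all enabled together). The following is a minimal sketch. The `ServingOptions` class and its field names are hypothetical and do not correspond to TensorRT-LLM configuration fields.

```python
from dataclasses import dataclass


@dataclass
class ServingOptions:
    # Hypothetical flags for illustration only; not TensorRT-LLM config fields.
    disaggregated: bool = False
    mtp: bool = False
    overlap_scheduler: bool = False


def known_issue_warnings(opts):
    """Return warnings for option combinations listed under Known Issues."""
    warnings = []
    if opts.disaggregated and opts.mtp and opts.overlap_scheduler:
        warnings.append(
            "Disaggregated serving, MTP, and the overlap scheduler are all "
            "enabled; the 0.21.0 release notes flag this combination as a "
            "possible source of accuracy problems."
        )
    return warnings
```

Returning a list instead of raising keeps the check advisory, which suits a known issue that only "can" lead to accuracy problems.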
What's Changed
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5221
- [test] split nemotron test cases from examples_test_list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5238
- Update DeepSeek R1 perf numbers to latest release/0.20 results by @litaotju in https://github.com/NVIDIA/TensorRT-LLM/pull/5235
- [feat] Add llm args to tune python gc threshold by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/5141
- [TRTLLM-5835][feat] Optimized Mamba2Mixer prefill by @tomeras91 in https://github.com/NVIDIA/TensorRT-LLM/pull/5128
- [TRTLLM-3456] Speculation: Draft Target in new FW by @IzzyPutterman in https://github.com/NVIDIA/TensorRT-LLM/pull/4558
- chore: Waive CI failure. by @SimengLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5252
- [infra] Make test_chunked_prefill faster by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/5248
- Update internal cutlass commit. by @Tracin in https://github.com/NVIDIA/TensorRT-LLM/pull/5228
- test: add more pytorch cases in perf test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5237
- Fix: https://nvbugs/5345720 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5259
- test: [CI] remove closed bugs by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5218
- [TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/5215
- fix mla test by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5240
- doc: add document of benchmarking for Qwen3 by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/5158
- update setup.py for special cases by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5227
- move some test cases of TensorRT backend back by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5232
- [feat] Add EAGLE3 support for Qwen3 by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/5206
- [TRTLLM-5786][https://nvbugspro.nvidia.com/bug/5310520][test] Add QA test cases by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5073
- CI: move multi-gpu test cases of tensorrt backend to h200 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5272
- refactor: Unify decoder test with e2e workflow by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5239
- [feat] Piecewise cuda graph support for MLA by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/4467
- chore: Mass integration of release/0.20 by @amirkl94 in https://github.com/NVIDIA/TensorRT-LLM/pull/5082
- [TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/5207
- None - Some clean-ups for the automation pipeline by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/5245
- Re-implement LlmResponse in Python to reduce host overhead of pybind by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5224
- delete cubins by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5274
- infra[TRTLLM-5635] remove package stage in CI build by @niukuo in https://github.com/NVIDIA/TensorRT-LLM/pull/5075
- [Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/4885
- [chore] Remove BaseDraftTokenManager by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/5251
- [infra] Report CI authorization errors to PR by @tburt-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5175
- Revert "[infra] Report CI authorization errors to PR" by @tburt-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5298
- refactor: Update decoder buffer and logits management by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/4450
- fix: only set _mpi_session if world_size is > 1 by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/5253
- update LlmRequest.is_dummy property by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5283
- test: update qa test list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5305
- CI: extend model weights load time for dsv3 in stress test. by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/5275
- [fix][test] move deepseek single gpu tests to post merge by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5280
- Waive L0 tests by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5308
- feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/4971
- chore: partition LLM class into TorchLLM and TrtLLM by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/4900
- [feat]: improve performance of XQA-MLA for sm120 by @lowsfer in https://github.com/NVIDIA/TensorRT-LLM/pull/5087
- doc:update contributing md for internal developers by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5250
- test: cherry-pick deepseek rcca cases in main branch by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5307
- [TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/5139
- CI: fix TensorRT H200 tests by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5301
- [TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/5159
- chore: Refine printed info of CHECK_TYPE. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5295
- refactor: Introduce ResourceManagerType enum for resource management by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5246
- chore: bump version to 0.21.0rc3 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5309
- test: correct unittest rerun behavior by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/5273
- Fix rerun step by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5319
- Waive L0 by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5311
- tests: add multi nodes tests by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5196
- feat: Add LLGuidance Support for PyTorch Backend by @jellysnack in https://github.com/NVIDIA/TensorRT-LLM/pull/5214
- [Infra]Update 5080 and 5090 case condition since we will upgrade driver by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5317
- chore: Update README.md to expose meet-up info by @juney-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/5329
- Remove duplicated test cases by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/5323
- Add disagg slurm scripts by @qiaoxj07 in https://github.com/NVIDIA/TensorRT-LLM/pull/5243
- Unwaive disaggregated serving accuracy tests by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/5095
- [feat] Multi-node CI testing support via Slurm by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/4771
- [fix][test] remove some cpp test cases from h100 by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5335
- [fix][test] remove duplicate test runs by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5241
- chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/5293
- [fix][test] clear cuda cache before unittests automatically by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5121
- fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/4727
- ci: Split long running jobs into multiple jobs by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5268
- [feat] Fusion finalize and allreduce for qwenmoe model by @zongfeijing in https://github.com/NVIDIA/TensorRT-LLM/pull/5223
- chore: remove torch_compile prefix for TorchCompileConfig field members by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5261
- [test] add nvfp4 DeepSeek-V3-Lite-mtp tests by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/5125
- Waive L0 test by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5349
- chore: bump version to 0.21.0 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5325
- tests: cherry-pick from main branch, add qwen3 test cases and amend test name in perf test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5357
- [Infra]cherry pick sanity check yml change for 5080 and 5090 from main by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5363
- doc: cherry pick [#5334] by @MartinMarciniszyn in https://github.com/NVIDIA/TensorRT-LLM/pull/5368
- fix: Fix skip by mpi size fixture by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5355
- Fix: missing clientId when serialize and deserialize response (cherry-pick [#5231]) by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/5378
- tests: fix typos in qa test by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5421
- nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5453
- feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/5364
- test: set enable_attention_dp=True in default deepseek settings by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5461
- tests: Set kv cache free memory fraction in test case by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/5462
- [Infra] - Waive failed tests on release/0.21 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5477
- Fix permission for local user issues in NGC docker container. by @MartinMarciniszyn in https://github.com/NVIDIA/TensorRT-LLM/pull/5373
- [nvbug 5273941] fix: broken cyclic reference detect by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5417
- [nvbug/5354825] Fix nougat test image url by @amukkara in https://github.com/NVIDIA/TensorRT-LLM/pull/5496
- fix: fix regression in LOCAL_USER by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/5517
- doc: Fix benchmark cmd in disagg scripts by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/5516
- fix: constrain grepping in docker/Makefile by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/5493
- [Infra][release/0.21] - waive failed tests by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5537
- ci: unwaive llmapi launch test by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5281
- [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/5490
- [cherry-pick] [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/5553
- [Infra][release/0.21]Update nccl to 2.27.5 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5539
- fix [nvbug5351244]: test_mpi_session submit sync/async by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5608
- fix:https://nvbugs/5362398 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5609
- [nvbug 5300551] test: increase block count in eviction test by @zhengd-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5465
- test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5397
- doc: Fix outdated config in DeepSeek best perf practice doc by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/5638
- fix: [https://nvbugs/5355219] Fix bug of Qwen3 235B CI on dgx_gb200 by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/5602
- [https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. by @FrankD412 in https://github.com/NVIDIA/TensorRT-LLM/pull/5625
- fix: Investigate Gemma3 1B decoder output discrepancy by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5564
- [Infra] - Waive failed cases on release/0.21 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5674
- Doc: Update invalid hugging face URLs by @Linda-Stadter in https://github.com/NVIDIA/TensorRT-LLM/pull/5683
- [NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 by @farazkh80 in https://github.com/NVIDIA/TensorRT-LLM/pull/5651
- [TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/5676
- [nvbug/5341178][fix] Fix OOM in Llama 4 accuracy test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5735
- test: Move some of the test from post merge to pre-merge, update dgx b200 test case by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5640
- [5321981] fix: Fix the Llama3.1 405B hanging issue. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/5698
- [Infra][nvbugs/5370968] - Unwaive l0 test by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5750
- [nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5463
- [nvbug/5337601][fix] Fix disagg + speculative decoding by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/5558
- [Infra] - Always use x86 image for the Jenkins agent by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/5756
- test: fix some test failure and add llama_nemotron models in perf sanity test, add more torch cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5693
- fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5773
- [nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/5201
- fix _pad_attention_dp_dummy_request by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5583
- Fix docker cache mount by @MartinMarciniszyn in https://github.com/NVIDIA/TensorRT-LLM/pull/5763
- [nvbug/5302638][nvbugs/5310314] fix _handle_cancelled_requests by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5532
- cherry pick [#5416] by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5776
- [nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/5541
- [nvbug5266240] chore: unwaive test_llm_with_dummy_weights by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5744
- [https://nvbugspro.nvidia.com/bug/5355054] fallback to cubins for fp8 fmha kernels on Ada. by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/5779
- fix: [https://nvbugspro.nvidia.com/bug/5345215] Unwaive for bug 5345215. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5606
- [nvbugs/5326453] Avoid nesting NCCL grouping in allgather OP by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5789
- fix: [https://nvbugs/5351130][https://nvbugs/5333654] Unwaive for bug 5351130 and 5333654. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5821
- doc: Update gb200 doc by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5840
- test: remove duplicate cases in perf sanity test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5870
- [nvbug 5327706][fix] fix mgmn postprocess error by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/5835
- [nvbugs/5345391] fix: chunked prefill + overlap scheduling by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5761
- cherry-pick: [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in https://github.com/NVIDIA/TensorRT-LLM/pull/5874
- [https://nvbugs/5355316] fix: update torch.compile option to fix triton store_cubin error by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/5865
- test: Add Gemma3 unit tests to CI in release/0.21 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5899
- tests: Fix lora perf test by @amirkl94 in https://github.com/NVIDIA/TensorRT-LLM/pull/5875
- fix: [nvbugs/5351130] Adjust DSV3-Lite tests free_gpu_memory_fraction to 0.75 to prevent OOM on CI. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5896
- chore: Port leftover 0.20 by @amirkl94 in https://github.com/NVIDIA/TensorRT-LLM/pull/5907
- fix [nvbug/5351244]: address remote mpi session submit by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5664
- fix: [5328141] increase tolerance for test_fp8_block_scale_gemm by @nekorobov in https://github.com/NVIDIA/TensorRT-LLM/pull/5849
- fix: timeout and broken pipe in disagg and worker tests by @zhengd-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5827
- [nvbugs/5333742] fix MTP illegal memory access in cuda graph warmup by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/5947
- fix: fix index out of bounds error in spec decoding by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/5954
- [nvbugs/5368410][fix] Disable moe allreduce for multi node by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5918
- [fix] Release slots with spec decode + disagg by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/5975
- [TRTLLM-6495] doc: add disclaimer for 3rd party software installation. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6039
- [None] - Waive L0 tests by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6082
- Cherry Pick: PR [#6076] by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6088
- add release notes for 0.21 release by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6049
- fix: Fix triton backend build [nvbug 5396469] by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/6098
- [None][infra] Cherry-pick [#6128] and [#6130] from main branch by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6151
- [Doc][Qwen3] update qwen3 into support-matrix by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/6161
- [fix]: Revert commit 388b491 by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/6143
- doc: update known issues by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6247
- [fix] Cherry pick "[TRTLLM-6262] Fix Llama4 Scout FP4 crash issue" by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/6267
- doc: update release notes by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6324
- test: Relax Gemma3 unit test thresholds by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6016
- tests: Add llama4 functional cases by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6392
- doc: update release notes by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6438
- [https://nvbugspro.nvidia.com/bug/5415268] fix illegal smem access with chunked attention by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/6401
- [doc] Update perf_overview.md for release 0.21 by @zbpatel in https://github.com/NVIDIA/TensorRT-LLM/pull/6270
- [None][infra] Pin the version for triton to 3.3.1 (#6508) by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6519
New Contributors
- @jellysnack made their first contribution in https://github.com/NVIDIA/TensorRT-LLM/pull/5214
Full Changelog: https://github.com/NVIDIA/TensorRT-LLM/compare/v0.21.0rc2...v0.21.0
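Entries under What's Changed follow the pattern `<title> by @<author> in <pull request URL>`, so they can be split into fields with a regular expression, for example to group changes by author. The sketch below assumes that single-line layout. The function name is hypothetical, and lines that do not fit the pattern return `None`.

```python
import re

# Expected layout: "- <title> by @<author> in <pull request URL>"
ENTRY = re.compile(
    r"^-\s*(?P<title>.*?)\s+by @(?P<author>[\w-]+) in "
    r"https://github\.com/NVIDIA/TensorRT-LLM/pull/(?P<pr>\d+)\s*$"
)


def parse_entry(line):
    """Split one changelog line into title, author, and PR number."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return {"title": m["title"], "author": m["author"], "pr": int(m["pr"])}


example = parse_entry(
    "- fix mla test by @qsang-nv in "
    "https://github.com/NVIDIA/TensorRT-LLM/pull/5240"
)
```

Here `example` holds the title `fix mla test`, the author `qsang-nv`, and the PR number `5240` as an integer.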