TensorRT LLM - Browse /v0.21.0 at SourceForge.net

Name	Modified	Size	Downloads / Week
README.md	2025-08-01	23.4 kB	0
v0.21.0 source code.tar.gz	2025-08-01	353.7 MB	0
v0.21.0 source code.zip	2025-08-01	357.5 MB	0
Totals: 3 Items		711.2 MB	0

TensorRT-LLM Release 0.21.0

Key Features and Enhancements

Model Support
Added Gemma3 VLM support
Features
Added large-scale EP support
Integrated NIXL into the communication layer of the disaggregated service
Added fabric memory support for KV cache transfer
Added MCP in ScaffoldingLLM
Added support for w4a8_mxfp4_fp8 quantization
Added support for fp8 rowwise quantization
Added generation logits support in TRTLLM Sampler
Added log probs support in TRTLLM Sampler
Optimized TRTLLM Sampler performance for the single-beam, single-step case
Enabled Disaggregated serving for Qwen-3
Added EAGLE3 support for Qwen-3
Fused finalize and allreduce for Qwen-MoE model
Refactored Fused MoE module
Added support for chunked attention on Blackwell and Hopper
Introduced sliding-window attention kernels for the generation phase on Blackwell
Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
Added FP8 block-scale GEMM support on SM89
Enabled overlap scheduler between draft forwards
Added piecewise CUDA graph support for MLA
Added model-agnostic one-engine EAGLE3
Enabled Finalize + Allreduce + add + rmsnorm fusion
Integrated TRT-LLM Gen FP8 block scale MoE with the PyTorch workflow kernel autotuner
Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
Validated Llama 3.1 models on H200 NVL
Benchmark
Added all_reduce.py benchmark script for testing
Added beam width to trtllm-bench latency command
Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
Enabled trtllm-bench to run LoRA and added basic end-to-end performance testing capability for LoRA
Added post_proc support for bench
Added no_kv_cache_reuse option and streaming support for trtllm serve bench

Infrastructure Changes

The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
The dependent public PyTorch version is updated to 2.7.1.
The dependent TensorRT version is updated to 10.11.
The dependent NVIDIA ModelOpt version is updated to 0.31.
The dependent NCCL version is updated to 2.27.5.
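
The pinned versions above can be compared against a local environment. The following is a minimal sketch, not part of the release: the distribution names used as dictionary keys (`torch`, `tensorrt`, `nvidia-modelopt`) are assumptions and may differ in your installation, and NCCL is left out because it is not necessarily installed as a Python distribution.

```python
from importlib import metadata

# Versions stated in the release notes. The distribution names used as keys
# are assumptions for illustration; adjust them to match your environment.
EXPECTED_PREFIXES = {
    "torch": "2.7.1",
    "tensorrt": "10.11",
    "nvidia-modelopt": "0.31",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def matches(version, prefix):
    """True if `version` equals `prefix` or extends it at a component boundary."""
    if version is None:
        return False
    # Drop a local build suffix such as "+cu128" before comparing.
    public = version.split("+", 1)[0]
    return public == prefix or public.startswith(prefix + ".")


def report(found, expected=EXPECTED_PREFIXES):
    """Return a list of (name, installed, expected, ok) tuples."""
    return [(n, found.get(n), p, matches(found.get(n), p)) for n, p in expected.items()]


if __name__ == "__main__":
    for name, have, want, ok in report(installed_versions(EXPECTED_PREFIXES)):
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} {status}")
```

The comparison is prefix-based at component boundaries, so `10.11.0.33` satisfies `10.11` while `10.1.0` does not.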

API Changes

Set _AutoDeployLlmArgs as primary config object
Removed decoder request from decoder interface
Enhanced the torch_compile_config in llm args
Removed the redundant use_kv_cache field from PytorchConfig
Moved allreduce_strategy from the committed API to the reference API

Fixed Issues

Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
Fixed CUDA graph padding for spec decoding (#4853)
Fixed Llama 4 long context issue (#4809)
Fixed max_num_sequences calculation with overlap scheduling (#4532)
Fixed chunked prefill + overlap scheduling (#5761)
Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
Fixed index out of bounds error in spec decoding (#5954)
Fixed MTP illegal memory access in CUDA graph warmup (#5947)
Fixed no free slots error with spec decode + disagg (#5975)
Fixed one-off attention window size for Gemma3 1B (#5564)
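
Each fixed-issue entry ends with a reference such as `(#4853)`. A small sketch of turning those references into links, assuming the numbers refer to pull requests in the NVIDIA/TensorRT-LLM repository (as #5761, #5947, #5954, #5975, and #5564 do in the "What's Changed" list); the helper name `pr_link` is hypothetical.

```python
import re

# URL pattern taken from the pull-request links in this release's changelog.
PR_URL = "https://github.com/NVIDIA/TensorRT-LLM/pull/{number}"

# Matches a trailing reference such as "(#4678)".
PR_REF = re.compile(r"\(#(\d+)\)\s*$")


def pr_link(entry):
    """Return the pull-request URL for a fixed-issue line, or None if it has no reference."""
    match = PR_REF.search(entry)
    if match is None:
        return None
    return PR_URL.format(number=int(match.group(1)))
```

For example, `pr_link("Fixed chunked prefill + overlap scheduling (#5761)")` yields the same URL that the changelog lists for that fix.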

Known Issues

accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
Release 0.21 adds full chunked attention support so that Llama 4 models can run functionally with sequence lengths above 8K. This enhancement causes a known performance regression on Hopper that affects only Llama 4 models. The root cause has been identified, and the fix will be part of a future release.
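
The second known issue concerns a specific combination of three features. A deployment script could guard against it along the following lines. This is only a sketch: `ServingOptions` and its field names are hypothetical and are not TensorRT-LLM parameter names, so map them onto whatever configuration your setup actually uses.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingOptions:
    # Hypothetical flags for illustration; not TensorRT-LLM parameter names.
    disaggregated_serving: bool = False
    mtp: bool = False
    overlap_scheduler: bool = False


def has_known_accuracy_risk(opts):
    """True when all three features from the known issue are enabled together."""
    return opts.disaggregated_serving and opts.mtp and opts.overlap_scheduler


def check(opts):
    """Emit a warning for the risky combination and return whether it was detected."""
    risky = has_known_accuracy_risk(opts)
    if risky:
        warnings.warn(
            "Disaggregated serving, MTP, and the overlap scheduler are all enabled; "
            "release 0.21.0 lists this combination as a known accuracy issue.",
            stacklevel=2,
        )
    return risky
```

Enabling any two of the three options passes the check; only the full combination named in the release notes triggers the warning.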

What's Changed

test: [CI] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5221
[test] split nemotron test cases from examples_test_list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5238
Update DeepSeek R1 perf numbers to latest release/0.20 results by @litaotju in https://github.com/NVIDIA/TensorRT-LLM/pull/5235
[feat] Add llm args to tune python gc threshold by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/5141
[TRTLLM-5835][feat] Optimized Mamba2Mixer prefill by @tomeras91 in https://github.com/NVIDIA/TensorRT-LLM/pull/5128
[TRTLLM-3456] Speculation: Draft Target in new FW by @IzzyPutterman in https://github.com/NVIDIA/TensorRT-LLM/pull/4558
chore: Waive CI failure. by @SimengLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5252
[infra] Make test_chunked_prefill faster by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/5248
Update internal cutlass commit. by @Tracin in https://github.com/NVIDIA/TensorRT-LLM/pull/5228
test: add more pytorch cases in perf test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5237
Fix: https://nvbugs/5345720 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5259
test: [CI] remove closed bugs by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5218
[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/5215
fix mla test by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5240
doc: add document of benchmarking for Qwen3 by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/5158
update setup.py for special cases by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5227
move some test cases of TensorRT backend back by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5232
[feat] Add EAGLE3 support for Qwen3 by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/5206
[TRTLLM-5786][https://nvbugspro.nvidia.com/bug/5310520][test] Add QA test cases by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5073
CI: move multi-gpu test cases of tensorrt backend to h200 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5272
refactor: Unify decoder test with e2e workflow by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5239
[feat] Piecewise cuda graph support for MLA by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/4467
chore: Mass integration of release/0.20 by @amirkl94 in https://github.com/NVIDIA/TensorRT-LLM/pull/5082
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/5207
None - Some clean-ups for the automation pipeline by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/5245
Re-implement LlmResponse in Python to reduce host overhead of pybind by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5224
delete cubins by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5274
infra[TRTLLM-5635] remove package stage in CI build by @niukuo in https://github.com/NVIDIA/TensorRT-LLM/pull/5075
[Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/4885
[chore] Remove BaseDraftTokenManager by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/5251
[infra] Report CI authorization errors to PR by @tburt-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5175
Revert "[infra] Report CI authorization errors to PR" by @tburt-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5298
refactor: Update decoder buffer and logits management by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/4450
fix: only set _mpi_session if world_size is > 1 by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/5253
update LlmRequest.is_dummy property by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5283
test: update qa test list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5305
CI: extend model weights load time for dsv3 in stress test. by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/5275
[fix][test] move deepseek single gpu tests to post merge by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5280
Waive L0 tests by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5308
feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/4971
chore: partition LLM class into TorchLLM and TrtLLM by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/4900
[feat]: improve performance of XQA-MLA for sm120 by @lowsfer in https://github.com/NVIDIA/TensorRT-LLM/pull/5087
doc:update contributing md for internal developers by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5250
test: cherry-pick deepseek rcca cases in main branch by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5307
[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/5139
CI: fix TensorRT H200 tests by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5301
[TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/5159
chore: Refine printed info of CHECK_TYPE. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5295
refactor: Introduce ResourceManagerType enum for resource management by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5246
chore: bump version to 0.21.0rc3 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5309
test: correct unittest rerun behavior by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/5273
Fix rerun step by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5319
Waive L0 by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5311
tests: add multi nodes tests by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5196
feat: Add LLGuidance Support for PyTorch Backend by @jellysnack in https://github.com/NVIDIA/TensorRT-LLM/pull/5214
[Infra]Update 5080 and 5090 case condition since we will upgrade driver by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5317
chore: Update README.md to expose meet-up info by @juney-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/5329
Remove duplicated test cases by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/5323
Add disagg slurm scripts by @qiaoxj07 in https://github.com/NVIDIA/TensorRT-LLM/pull/5243
Unwaive disaggregated serving accuracy tests by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/5095
[feat] Multi-node CI testing support via Slurm by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/4771
[fix][test] remove some cpp test cases from h100 by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5335
[fix][test] remove duplicate test runs by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5241
chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/5293
[fix][test] clear cuda cache before unittests automatically by @omera-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5121
fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/4727
ci: Split long running jobs into multiple jobs by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5268
[feat] Fusion finalize and allreduce for qwenmoe model by @zongfeijing in https://github.com/NVIDIA/TensorRT-LLM/pull/5223
chore: remove torch_compile prefix for TorchCompileConfig field members by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5261
[test] add nvfp4 DeepSeek-V3-Lite-mtp tests by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/5125
Waive L0 test by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5349
chore: bump version to 0.21.0 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5325
tests: cherry-pick from main branch, add qwen3 test cases and amend test name in perf test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5357
[Infra]cherry pick sanity check yml change for 5080 and 5090 from main by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5363
doc: cherry pick [#5334] by @MartinMarciniszyn in https://github.com/NVIDIA/TensorRT-LLM/pull/5368
fix: Fix skip by mpi size fixture by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5355
Fix: missing clientId when serialize and deserialize response (cherry-pick [#5231]) by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/5378
tests: fix typos in qa test by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/5421
nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5453
feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/5364
test: set enable_attention_dp=True in default deepseek settings by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5461
tests: Set kv cache free memory fraction in test case by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/5462
[Infra] - Waive failed tests on release/0.21 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5477
Fix permission for local user issues in NGC docker container. by @MartinMarciniszyn in https://github.com/NVIDIA/TensorRT-LLM/pull/5373
[nvbug 5273941] fix: broken cyclic reference detect by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5417
[nvbug/5354825] Fix nougat test image url by @amukkara in https://github.com/NVIDIA/TensorRT-LLM/pull/5496
fix: fix regression in LOCAL_USER by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/5517
doc: Fix benchmark cmd in disagg scripts by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/5516
fix: constrain grepping in docker/Makefile by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/5493
[Infra][release/0.21] - waive failed tests by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5537
ci: unwaive llmapi launch test by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5281
[TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/5490
[cherry-pick] [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/5553
[Infra][release/0.21]Update nccl to 2.27.5 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5539
fix [nvbug5351244]: test_mpi_session submit sync/async by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5608
fix:https://nvbugs/5362398 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5609
[nvbug 5300551] test: increase block count in eviction test by @zhengd-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5465
test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5397
doc: Fix outdated config in DeepSeek best perf practice doc by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/5638
fix: [https://nvbugs/5355219] Fix bug of Qwen3 235B CI on dgx_gb200 by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/5602
[https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. by @FrankD412 in https://github.com/NVIDIA/TensorRT-LLM/pull/5625
fix: Investigate Gemma3 1B decoder output discrepancy by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5564
[Infra] - Waive failed cases on release/0.21 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/5674
Doc: Update invalid hugging face URLs by @Linda-Stadter in https://github.com/NVIDIA/TensorRT-LLM/pull/5683
[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 by @farazkh80 in https://github.com/NVIDIA/TensorRT-LLM/pull/5651
[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/5676
[nvbug/5341178][fix] Fix OOM in Llama 4 accuracy test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5735
test: Move some of the test from post merge to pre-merge, update dgx b200 test case by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5640
[5321981] fix: Fix the Llama3.1 405B hanging issue. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/5698
[Infra][nvbugs/5370968] - Unwaive l0 test by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/5750
[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5463
[nvbug/5337601][fix] Fix disagg + speculative decoding by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/5558
[Infra] - Always use x86 image for the Jenkins agent by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/5756
test: fix some test failure and add llama_nemotron models in perf sanity test, add more torch cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5693
fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5773
[nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/5201
fix _pad_attention_dp_dummy_request by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5583
Fix docker cache mount by @MartinMarciniszyn in https://github.com/NVIDIA/TensorRT-LLM/pull/5763
[nvbug/5302638][nvbugs/5310314] fix _handle_cancelled_requests by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5532
cherry pick [#5416] by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5776
[nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/5541
[nvbug5266240] chore: unwaive test_llm_with_dummy_weights by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5744
[https://nvbugspro.nvidia.com/bug/5355054] fallback to cubins for fp8 fmha kernels on Ada. by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/5779
fix: [https://nvbugspro.nvidia.com/bug/5345215] Unwaive for bug 5345215. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5606
[nvbugs/5326453] Avoid nesting NCCL grouping in allgather OP by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/5789
fix: [https://nvbugs/5351130][https://nvbugs/5333654] Unwaive for bug 5351130 and 5333654. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5821
doc: Update gb200 doc by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5840
test: remove duplicate cases in perf sanity test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/5870
[nvbug 5327706][fix] fix mgmn postprocess error by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/5835
[nvbugs/5345391] fix: chunked prefill + overlap scheduling by @Funatiq in https://github.com/NVIDIA/TensorRT-LLM/pull/5761
cherry-pick: [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in https://github.com/NVIDIA/TensorRT-LLM/pull/5874
[https://nvbugs/5355316] fix: update torch.compile option to fix triton store_cubin error by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/5865
test: Add Gemma3 unit tests to CI in release/0.21 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5899
tests: Fix lora perf test by @amirkl94 in https://github.com/NVIDIA/TensorRT-LLM/pull/5875
fix: [nvbugs/5351130] Adjust DSV3-Lite tests free_gpu_memory_fraction to 0.75 to prevent OOM on CI. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/5896
chore: Port leftover 0.20 by @amirkl94 in https://github.com/NVIDIA/TensorRT-LLM/pull/5907
fix [nvbug/5351244]: address remote mpi session submit by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/5664
fix: [5328141] increase tolerance for test_fp8_block_scale_gemm by @nekorobov in https://github.com/NVIDIA/TensorRT-LLM/pull/5849
fix: timeout and broken pipe in disagg and worker tests by @zhengd-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5827
[nvbugs/5333742] fix MTP illegal memory access in cuda graph warmup by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/5947
fix: fix index out of bounds error in spec decoding by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/5954
[nvbugs/5368410][fix] Disable moe allreduce for multi node by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/5918
[fix] Release slots with spec decode + disagg by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/5975
[TRTLLM-6495] doc: add disclaimer for 3rd party software installation. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6039
[None] - Waive L0 tests by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6082
Cherry Pick: PR [#6076] by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6088
add release notes for 0.21 release by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6049
fix: Fix triton backend build [nvbug 5396469] by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/6098
[None][infra] Cherry-pick [#6128] and [#6130] from main branch by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6151
[Doc][Qwen3] update qwen3 into support-matrix by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/6161
[fix]: Revert commit 388b491 by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/6143
doc: update known issues by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6247
[fix] Cherry pick "[TRTLLM-6262] Fix Llama4 Scout FP4 crash issue" by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/6267
doc: update release notes by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6324
test: Relax Gemma3 unit test thresholds by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6016
tests: Add llama4 functional cases by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6392
doc: update release notes by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6438
[https://nvbugspro.nvidia.com/bug/5415268] fix illegal smem access with chunked attention by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/6401
[doc] Update perf_overview.md for release 0.21 by @zbpatel in https://github.com/NVIDIA/TensorRT-LLM/pull/6270
[None][infra] Pin the version for triton to 3.3.1 (#6508) by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6519
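
Every entry in the list above follows the shape `<title> by @<author> in <pull-request URL>`, which makes it easy to process mechanically. The sketch below parses that shape and tallies entries per author; the function names are illustrative, and the regular expression assumes the author handle contains only letters, digits, and hyphens.

```python
import re
from collections import Counter

# "<title> by @<author> in <PR URL>". The greedy title group means that a
# title which itself contains " by " or a URL is still split at the last
# " by @<author> in " before the pull-request link.
ENTRY = re.compile(
    r"^(?P<title>.+) by @(?P<author>[A-Za-z0-9-]+) in "
    r"(?P<url>https://github\.com/NVIDIA/TensorRT-LLM/pull/(?P<number>\d+))$"
)


def parse_entry(line):
    """Parse one changelog line into a dict, or return None if it does not match."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return {
        "title": m["title"],
        "author": m["author"],
        "number": int(m["number"]),
        "url": m["url"],
    }


def authors_by_count(lines):
    """Count parsed entries per author handle, skipping lines that do not match."""
    parsed = (parse_entry(line) for line in lines)
    return Counter(p["author"] for p in parsed if p is not None)
```

Lines that do not fit the pattern, such as section headings, are skipped rather than raising an error.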

New Contributors

@jellysnack made their first contribution in https://github.com/NVIDIA/TensorRT-LLM/pull/5214

Full Changelog: https://github.com/NVIDIA/TensorRT-LLM/compare/v0.21.0rc2...v0.21.0

Source: README.md, updated 2025-08-01
