TensorRT LLM - Browse /v1.0.0 at SourceForge.net


Name	Modified	Size	Downloads / Week
ATTRIBUTIONS-CPP-aarch64.md	2025-10-16	719.0 kB	0
ATTRIBUTIONS-Python.md	2025-10-15	2.1 MB	0
ATTRIBUTIONS-CPP-x86_64.md	2025-10-15	710.7 kB	0
README.md	2025-09-23	33.0 kB	0
v1.0.0 source code.tar.gz	2025-09-23	321.9 MB	0
v1.0.0 source code.zip	2025-09-23	325.6 MB	0
Totals: 6 Items		651.1 MB	0

TensorRT LLM Release 1.0

TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and is the default experience, and the LLM API is now stable. For more details on new developments in 1.0, see below.

Key Features and Enhancements

Model Support
Add Mistral3.1 VLM model support
Add TensorRT-Engine Qwen3 (dense) model support
Add phi-4-multimodal model support
Add EXAONE 4.0 model support
Add Qwen3 MoE support to TensorRT backend
Features
Add support for sm121
Add LoRA support for Gemma3
Support PyTorch LoRA adapter eviction
Add LoRA support for PyTorch backend in trtllm-serve
Add support for scheduling attention DP requests
Remove padding of FusedMoE in attention DP
Support torch compile for attention dp
Add KV events support for sliding window attention
Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
Add Piecewise CUDA Graph support for MLA
Support multiCtasKvMode for high-throughput MLA kernels
Enable kvcache to be reused during request generation
Add ADP schedule balance optimization
Add chunked prefill support for MLA (Blackwell)
Enable Multi-block mode for Hopper spec dec XQA kernel
Add vLLM KV Pool support for XQA kernel
Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
Add support for fused gate_up_proj scales for FP8 blockwise
Support FP8 row-wise dense GEMM in torch flow
Enable fp8 SwiGLU to minimize host overhead
Add Deepseek R1 FP8 Support on Blackwell
Add support for MXFP8xMXFP4 in PyTorch
Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
Open-source the MoE MXFP8-MXFP4 implementation
Add support for Modelopt fp8_pb_wo quantization scheme
Support deepEP fp4 post quant all2all dispatch
Fuse w4a8 moe pre-quant scale on Hopper
Support Weight-Only-Quantization in PyTorch Workflow
Add support for per expert activation scaling factors
Add ReDrafter support for Qwen
Enable CUDA Graph for Nemotron-H
Add support for YARN in NemotronNAS models
Switch to internal version of MMProjector in Gemma3
Disable add special tokens for Llama3.3 70B
Auto-enable ngram with concurrency <= 32
Support turning on/off spec decoding dynamically
Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
Add support for external multimodal embeddings
Add support for disaggregated serving with pipeline parallelism (PP) on the PyTorch backend
Add status tags to LLM API reference
Support JSON Schema in OpenAI-Compatible API
Support chunked prefill for two-model speculative decoding
Add KV cache reuse support for multimodal models
Support nanobind bindings
Add support for two-model engine KV cache reuse
Add Eagle-3 support for qwen3 dense model
Migrate Eagle-3 and draft/target speculation to Drafter
Enable guided decoding with overlap scheduler
Support n-gram speculative decoding with disagg
Add beam search support to the PyTorch Workflow
Add LLGuidance Support for PyTorch Backend
Add NGrams V2 support
Add MTP support for Online EPLB
Support disaggregated serving in TRTLLM Sampler
Add core infrastructure to enable loading of custom checkpoint formats
Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
Use huge page mapping for host accessible memory on GB200
Add user-provided speculative decoding support
Add streaming scaffolding_llm.generate_async support
Detokenize option in /v1/completions request
Integrate TRT-LLM Gen FP4 block scale MoE with the PyTorch workflow kernel autotuner
Remove support for llmapi + TRT backend in Triton
Add request_perf_metrics to triton LLMAPI backend
Add support for Triton request cancellation
Benchmark
Add support for benchmarking individual gemms in MOE benchmark (#6080)
Add speculative metrics for trtllm-bench
Add the ability to write a request timeline for trtllm-bench
Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
Add latency support for trtllm-bench
Add Acceptance Rate calculation to benchmark_serving
Add wide-ep benchmarking scripts
Update trtllm-bench to support the new PyTorch default
Add support for TRTLLM CustomDataset
Make benchmark_serving part of the library
Documentation
Refactored the doc structure to focus on the PyTorch workflow.
Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
Removed legacy documentation related to the TensorRT workflow.
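
One item in the feature list above mentions sending more than 2 GiB through MPI by using mpi4py.util.pkl5. As background, the traditional MPI interfaces express a message length as a signed 32-bit count, so a single pickled byte payload larger than 2**31 - 1 bytes does not fit in one call. The sketch below only illustrates that size limit using the standard library. The function name is hypothetical and this is not TensorRT LLM or mpi4py code.

```python
# Largest count a traditional MPI call can express (signed 32-bit int).
MPI_INT_MAX = 2**31 - 1


def fits_in_single_mpi_message(payload_nbytes: int) -> bool:
    """Return True if a byte payload fits the 32-bit count of one MPI call.

    Hypothetical helper for illustration only.
    """
    if payload_nbytes < 0:
        raise ValueError("payload size must be non-negative")
    return payload_nbytes <= MPI_INT_MAX


if __name__ == "__main__":
    one_gib = 2**30
    for size in (one_gib, 2 * one_gib, 3 * one_gib):
        print(size, fits_in_single_mpi_message(size))
```

Payloads that fail this check are the case the pkl5 item targets; how pkl5 transfers them internally is not described in these notes.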

Infrastructure Changes

The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.06-py3.
The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.06-py3.
The dependent NVIDIA ModelOpt version is updated to 0.33.
The dependent xgrammar version is updated to 0.1.21.
The dependent transformers version is updated to 4.53.1.
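
To see whether a local environment matches the dependency versions listed above, a small check along these lines may help. This is a minimal sketch: the distribution names (in particular nvidia-modelopt) are assumptions, the helper names are hypothetical, and the version strings are copied from the list above.

```python
from importlib import metadata

# Versions copied from the notes above; distribution names are assumptions.
EXPECTED_VERSIONS = {
    "transformers": "4.53.1",
    "xgrammar": "0.1.21",
    "nvidia-modelopt": "0.33",
}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Compare expected versions against what `lookup` reports.

    A version matches if it equals the expected string or extends it
    with a further dotted component (e.g. 0.33 matches 0.33.1).
    """
    report = {}
    for name, want in expected.items():
        have = lookup(name)
        if have is None:
            report[name] = "missing"
        elif have == want or have.startswith(want + "."):
            report[name] = "ok"
        else:
            report[name] = f"found {have}, notes list {want}"
    return report


if __name__ == "__main__":
    for name, status in compare_versions(EXPECTED_VERSIONS).items():
        print(f"{name}: {status}")
```

The lookup function is injectable so the comparison logic can be exercised without any of the packages installed.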

API Changes

BREAKING CHANGE: Promote PyTorch to be the default LLM backend
BREAKING CHANGE: Change default backend to PyTorch in trtllm-serve
BREAKING CHANGE: Unify KvCacheConfig in LLM class for pytorch backend
BREAKING CHANGE: Rename cuda_graph_config padding_enabled field
BREAKING CHANGE: Rename mixed_sampler to enable_mixed_sampler
BREAKING CHANGE: Rename LLM.autotuner_enabled to enable_autotuner
Add back allreduce_strategy parameter into TorchLlmArgs
Add LlmArgs option to force using dynamic quantization
Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config
Remove deprecated LoRA LLM args that are already specified in lora_config
Add request_perf_metrics to LLMAPI
Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
Remove TrtGptModelOptionalParams
Remove ptuning knobs from TorchLlmArgs
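
For code that still passes the old argument names, the two renames spelled out above (mixed_sampler to enable_mixed_sampler, and autotuner_enabled to enable_autotuner) could be applied mechanically before constructing the LLM. The following is a minimal sketch, assuming these options are passed as keyword arguments. The helper name and the example model value are hypothetical. The new name of the cuda_graph_config padding field is not given in these notes, so it is left out.

```python
# Old name -> new name, as listed in the API changes above.
RENAMED_LLM_ARGS = {
    "mixed_sampler": "enable_mixed_sampler",
    "autotuner_enabled": "enable_autotuner",
}


def migrate_llm_kwargs(kwargs):
    """Return a copy of kwargs with pre-1.0 names replaced by 1.0 names.

    Hypothetical helper; raises if an old and a new name are both given.
    """
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_LLM_ARGS.get(key, key)
        if new_key in migrated:
            raise ValueError(f"argument {new_key!r} given under two names")
        migrated[new_key] = value
    return migrated


if __name__ == "__main__":
    old_style = {"model": "placeholder-model", "autotuner_enabled": False}
    print(migrate_llm_kwargs(old_style))
```

Unknown keys pass through unchanged, so the helper can wrap an existing kwargs dictionary without listing every option.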

Fixed Issues

Fix illegal memory access in MLA (#6437)
Fix nemotronNAS loading for TP>1 (#6447)
Fix wide EP when using DeepEP with online EPLB (#6429)
Fix bugs caused by None attention_bias during Qwen3 model engine conversion (#6344)
Fix PD + MTP + overlap scheduler accuracy issue (#6136)
Fix bug of Qwen3 when using fp4 on sm120 (#6065)
Fix TMA error with GEMM+AR on TP=2 (#6075)
Fix scaffolding aime test in test_e2e (#6140)
Fix KV Cache overrides in trtllm-bench (#6103)
Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
Fix eagle3 two model disaggregated serving test (#6014)
Fix chunked prefill + overlap scheduling (#5761)
Fix mgmn postprocess error (#5835)
Fall back to cubins for fp8 fmha kernels on Ada (#5779)
Fix disagg + speculative decoding (#5558)
Fix test_generate_with_seed CI failure. (#5772)
Fix prompt adapter TP2 case (#5782)
Fix disaggregate serving with attention DP (#4993)
Fix a quote error introduced in [#5534] (#5816)
Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
Fix lost requests for disaggregated serving (#5815)
Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
Fix GEMM+AR fusion on blackwell (#5563)
Fix llama4 multimodal support (#5809)
Fix Llama4 Scout FP4 crash issue (#5925)
Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
Fix moe regression for sm120 (#5823)
Fix Qwen2.5VL FP8 support (#5029)
Fix the illegal memory access issue in moe gemm on SM120 (#5636)
Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
Fix incremental detokenization (#5825)
Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
Fix mistral unit tests due to transformers upgrade (#5904)
Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
Fix Gemma3 unit tests due to transformers upgrade (#5921)
Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
Remove SpecConfig and fix thread leak issues (#5931)
Fast redux detection in trtllm gen routing kernel (#5941)
Fix cancel request logic (#5800)
Fix errors in wide-ep scripts (#5992)
Fix error in post-merge-tests (#5949)
Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
Fix attention DP not working with embedding TP (#5642)
Fix broken cyclic reference detection (#5417)
Fix permission for local user issues in NGC docker container. (#5373)
Fix mtp vanilla draft inputs (#5568)
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
Fix the issue where MoE autotune fallback failed to query the default heuristic (#5520)
Fix the unexpected keyword argument 'streaming' (#5436)

Known Issues

When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse avoids the hang.
Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, set the environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1 (for example, export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1 in the shell).
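
The Llama pipeline-parallelism workaround can also be applied from Python when launching a worker process, rather than through the shell. This is a minimal sketch: the helper name is hypothetical, and setting the variable before TensorRT LLM is imported or launched is an assumption about when it is read.

```python
import os

# Variable name taken from the known issue above.
WORKAROUND_VAR = "TRTLLM_LLAMA_EAGER_FUSION_DISABLED"


def env_with_llama_pp_workaround(base_env=None):
    """Return a copy of an environment mapping with the workaround set.

    Hypothetical helper; the input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[WORKAROUND_VAR] = "1"
    return env


if __name__ == "__main__":
    # Assumption: the variable should be set before TensorRT LLM starts.
    os.environ[WORKAROUND_VAR] = "1"
    print(WORKAROUND_VAR, "=", os.environ[WORKAROUND_VAR])
```

The returned dictionary can be passed as the env argument of subprocess.run when the server is started as a child process.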

What's Changed

Qwen3: Fix eagle hidden states by @IzzyPutterman in https://github.com/NVIDIA/TensorRT-LLM/pull/6199
[None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/6506
[None][chore] update readme for perf release test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6664
[None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6662
[None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6663
[None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5995
[https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in https://github.com/NVIDIA/TensorRT-LLM/pull/6658
[TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6659
[None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6657
[None][chore] Bump version to 1.0.0 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6652
[None][test] Add Mistral Small 3.1 24B accuracy test to QA test list by @StanleySun639 in https://github.com/NVIDIA/TensorRT-LLM/pull/6682
[None][test] cherry-pick: correct test-db context for perf yaml file and add mistral cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6688
[None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6703
[TRTLLM-6656][chore] Validate FP8 support for Gemma3 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6678
[TRTLLM-5574][test] Add NIM required VLM models multi-gpu test by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6687
[TRTLLM-6675][infra] Nixl test completion by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6623
[None][test] fix yml condition error under qa folder by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6733
[None][doc] Add doc for multimodal feature support matrix (#6619) by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6739
[https://nvbugs/5344910][fix] Corrected memory position when setting buffers to 0 in standalone_stable_radix_topk_ by @stnie in https://github.com/NVIDIA/TensorRT-LLM/pull/6712
[https://nvbugs/5442608][fix] Update CUDA graph config for get_model_yaml_config. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/6693
[TRTLLM-4721][test] Add qa test for llm-api by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6727
[https://nvbugs/5409420][fix] Fix test_ptp_star_attention_example by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6584
[https://nvbugs/5444624][fix] Fix LLM_ROOT in triton_backend build.sh by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6744
[https://nvbugs/5429689][fix] Fix mllama model structure update with transformers issue by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/6699
[None][chore] remove out-of-date comment in star attention test by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6773
[https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6737
[None][infra] Waive failed tests on release branch 0811 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6782
[https://nvbugs/5444095][infra] waive test_ptp_quickstart_multimodal llava test by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/6795
[TRTLLM-5252][fix] Propagate mapping to intermediate layers (#6611) by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6765
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6786
[None][feat] adding support for disaggregated multi-instance tests by @raayandhar in https://github.com/NVIDIA/TensorRT-LLM/pull/6674
[None][infra] Avoid intermittent access broken to nvcr.io by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6715
[https://nvbugs/5383702][fix] error propagation in GenerationExecutor by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6793
[https://nvbugs/5445774][fix] Unwaive Gemma3 27B fp8 test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6799
[None][fix] fix CUDA graph config for test_llm_api_pytorch.py. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/6826
[TRTLLM-6975][test] Add multi-turn test cases for VLM models by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6749
[None][chore] waive GB300 known issues by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6812
[None][fix] fix Llama3 eagle3 test case OOM by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6832
[https://nvbugs/5375594][fix] fix oom issue on structural_tag test case by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6838
[https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6870
[TRTLLM-5252][feat] Add fp8 support for Mistral Small 3.1 by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6731
[None][infra] Setup the code review rule on the release branch by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6725
[TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/6820
[None][fix] Fix batching bug in Mistral3 model by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6841
[None][fix] Revert phi4-mm aggregate mode by @amukkara in https://github.com/NVIDIA/TensorRT-LLM/pull/6907
[None][fix] Complete the last missing allreduce op in Llama3/4. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/6850
[None][chore] Add docs for Gemma3 VLMs by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6880
[None][doc] add legacy section for tensorrt engine by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6724
[TRTLLM-7048][feat] add benchmark TRT flow test for MIG by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6884
[https://nvbugs/5451434][fix] Fix triton docker build by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/6898
[TRTLLM-6481][fix] Fix deepseek r1 accuracy issue by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6868
[None][ci] unwaive test_ptp_star_attention_example by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6943
[https://nvbugs/5455836][fix] Fix llama 4 FP4 by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/6911
[None][infra] update CODEOWNERS for release by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/6905
[https://nvbugs/5453667] [fix] reverting a breaking change: make trtllm-bench enable_chunked_context defaults backend-dependent by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/6956
[https://nvbugs/5405041][fix] Update wide ep doc by @qiaoxj07 in https://github.com/NVIDIA/TensorRT-LLM/pull/6950
[https://nvbugs/5412562][feat] Allocate MoE workspace only when necessary (release/1.0 retargeted) by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/6955
[TRTLLM-6835][fix] Fix potential hang caused by python multiprocessing when prefetching weights by @lancelly in https://github.com/NVIDIA/TensorRT-LLM/pull/6927
[https://nvbugs/5448525][fix] Mistral Small 3.1 accuracy tests by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6909
[https://nvbugs/5375646][fix] update waives.txt for nvbug 5375646 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6847
[None][fix] update skip config by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6891
[https://nvbugs/5449218][fix] Fix KvCacheConfig error in test_perf by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6937
[None][infra] Waive failed tests for release branch 0818 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6993
[None][chore] Remove duplicate test waives by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6999
[None][infra] Cherry-pick [#6836] from main branch and improve SSH connection by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6971
[https://nvbugs/5462007][ci] Unwaive Mistral Small 3.1 FP8 test by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7008
[https://nvbugs/5449155][fix] Fix DeepSeek R1 weight loading for TP16 by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/6913
[https://nvbugs/5374016][fix] improve error message by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6893
[https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels (release 1.0) by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/6946
[https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters… by @Naveassaf in https://github.com/NVIDIA/TensorRT-LLM/pull/6987
[https://nvbugs/5448579][fix] EXAONE-4.0 accuracy test bugfix by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/6888
[None][chore] Waive E2E GB200 tests for Gemma3 27B by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6916
[https://nvbugs/5451296][bug] Fix a thread leak in test_llm_args.py by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7017
[None][infra] Waive failed tests for release branch 08/19 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7036
[None][doc] add status labels to LLM class's api reference by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6899
[https://nvbugs/5448437][fix] fix some nixl tests by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6940
[https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6978
[https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in https://github.com/NVIDIA/TensorRT-LLM/pull/6975
[TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7053
[None][fix] Fix build of tritonbuild/tritonrelease image by @dbari in https://github.com/NVIDIA/TensorRT-LLM/pull/7003
[None][doc] update v1.0 doc for trtllm-serve by @hchings in https://github.com/NVIDIA/TensorRT-LLM/pull/7056
[https://nvbugs/5440241][fix] Fix 70B GSM8K Accuracy drop by @chenfeiz0326 in https://github.com/NVIDIA/TensorRT-LLM/pull/7075
[https://nvbugs/5451296][fix] zmq nonblock bug with retry by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7019
[https://nvbugs/5383702][fix] test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_4gpus by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6889
[https://nvbugs/5392414] [fix] For release 1.0 cherry pick. Add customized default routing method by @ChristinaZ in https://github.com/NVIDIA/TensorRT-LLM/pull/7068
[https://nvbugs/5464088] [fix] Guard against fp8 activations in lora forward; update perf test config by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/7014
[None][infra] Skip failed tests for release branch 08/21 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7130
[https://nvbugs/5448442][fix] Skip trtllm moe backend for sm120 by @pamelap-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/7010
[https://nvbugs/5449032][fix] Add more llm-args to llm_mgmn_trtllm_bench.sh by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7144
[https://nvbugs/5410391][bug] Support to share device buffers in attention meta by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/6557
[https://nvbugs/5467062][fix] pass logitsPostProcessorBatched by reference by @milesial in https://github.com/NVIDIA/TensorRT-LLM/pull/7110
[https://nvbugs/5450074][fix] Reduce the device memory requirements for testing by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/6990
[https://nvbugs/5474037][fix] Fix building tritonbuild/tritonrelease images by @dbari in https://github.com/NVIDIA/TensorRT-LLM/pull/7157
[https://nvbugs/5433545][fix] TestPhi4MiniInstruct::test_auto_dtype - Use max_seq_len=4096 to fallback to the short RoPE factor by @moraxu in https://github.com/NVIDIA/TensorRT-LLM/pull/6895
[https://nvbugs/5461712] [fix] Disable deep_gemm for Qwen3 due to accuracy issues by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/7170
[TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7149
[https://nvbugs/5448426][fix] Fix illegal memory access in cuda graph by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7127
[None][fix] Switch llm api quickstart example location per workflow. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7182
[https://nvbugs/5467232][fix] Fix load_torch_hf_lora to override lora_config.trtllm_modules_to_hf_modules with default only when it has no value by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7168
[None][doc] fix tensorrt legacy quickstart page by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7190
[TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/7191
[https://nvbugs/5470840][fix] Disaggregated unit test MPI Init handling by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/7139
[None][test] add kv cache size in bench metric and fix failed cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/7211
[None][fix] update skip case by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7193
[https://nvbugs/5409416][fix] test_openai_multi_chat_example by @Linda-Stadter in https://github.com/NVIDIA/TensorRT-LLM/pull/7174
[https://nvbugs/5473789][bug] install cuda-toolkit to fix sanity check by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7159
[None][fix] fix log_once usage by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7210
[None][infra] Waive failed cases for release/1.0 08/26 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7258
[https://nvbugs/5451342][fix] Use runtime max_batch_size when cuda_graph_config.max_batch_size is not provided in trtllm-bench by @jiaganc in https://github.com/NVIDIA/TensorRT-LLM/pull/7031
[None][feat] Skip prefetching consolidated safetensors when appropriate by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7225
[https://nvbugs/5430125][ci] Unwaive test case for mistral 3.1 small by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7265
[https://nvbugs/5478151][fix] Add missing spec for Llama-3.3 70B by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7267
[https://nvbugs/5451426][fix] Avoid torch compile on full eagle3 worker by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7245
[https://nvbugs/5448767][fix] fix mpi4py deadlocks in pp event-loop by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/6976
[https://nvbugs/5463720][fix] tp-split the inferred mlp_hidden_size for nemotron-nas by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/7231
[https://nvbugs/5480550][fix] Increase timeout for Gemma3 27B test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7271
[https://nvbugs/5434320][bug] Fix disagg pp bug by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7099
[https://nvbugs/5480415][fix] Fix phi4mm multi-gpu test by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7275
[TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7203
[https://nvbugs/5467548][fix] DeepSeek illegal memory access. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/7298
[https://nvbugs/5448767][fix] disable kv cache reuse for disagg pp>1 tests by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/7354
[https://nvbugs/5445466][fix] Eliminate race when loading HF dynamic modules (#7268) by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/7379
[https://nvbugs/5474169][fix]Adjust max seq len for kvcache for memory estimation by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7391
[https://nvbugs/5448754][fix] Download HF model for all nodes. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/6824
[None][infra] Waive failed tests on release branch 0901 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7448
[None][doc] add blackwell information into support matrix by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6740
[TRTLLM-7008][fix] cherrypick fix to 1.0 Add automatic shared memory delete if already exist by @dongxuy04 in https://github.com/NVIDIA/TensorRT-LLM/pull/7433
[https://nvbugs/5351244][fix] test_mpi_session by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7501
[https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7427
[TRTLLM-5930][doc] 1.0 Documentation. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6696
[https://nvbugs/5496960][fix] Fix Gemma model forward. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7509
[None][doc] Update kvcache part by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7549
[None][doc] Rename TensorRT-LLM to TensorRT LLM. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7554
[https://nvbugs/5416501][doc] add known issues to llmapi doc by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7560
[None][doc] Fix a invalid link. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7617
[https://nvbugs/5474169][fix] seq_len mismatch between kv cache manager and graph attn metadata by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7606
[https://nvbugs/5503423][waive] Waive Llama3.1-70B-FP8 test on RTX PRO 6000 by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7603
[https://nvbugs/5455140][fix] unwaive release/1.0 DS R1 test cases with bug already fixed by @lancelly in https://github.com/NVIDIA/TensorRT-LLM/pull/7432
[https://nvbugs/5470782][chore] Remove the skip statement in 1.0 rele… by @SimengLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7573
[None][doc] Fix a invalid link and a typo. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7634
[None][doc] Use hash id for external link by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7641
[https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larg… by @WeiHaocheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7671
[https://nvbugs/5436461][fix] Adjust free_gpu_memory_fraction of test_eagle3 by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/7673
[https://nvbugs/5474409][fix] Disable concurrent loading by default by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7663
[https://nvbugs/5501557][fix] Fix out-of-bounds vector access for model with multiple layer types by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7636
[https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/7681
[None][ci] Test waives for the release/1.0 branch 09/15 by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7700
[None][doc] Add labels description note into llm api section by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7696
[https://nvbugs/5437405][fix] cherry-pick PR 7000 (qwen3 235b eagle3 ci) by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/7702
[https://nvbugs/5512734][fix] Update kv cache config for maverick by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7710
[https://nvbugs/5355219][fix] Fix trtllm moe backend test config and Qwen3 MoE multi node by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7724
[None][doc] Fix the link in the doc by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/7754
[https://nvbugs/5519525][fix] fix doc invalid link for bug 5519525 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7753
[https://nvbugs/5509024][fix] Print full parsed outputs and update keywords for multimodal model by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7670
[None][doc] Enhance api reference doc by labeling stable APIs by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7751
[https://nvbugs/5468897][fix] fix invalid expression for disabling pa… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7762
[https://nvbugs/5517023][fix] Pass allreduce strategy and force NCCL on pre-Blackwell arch by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7768
[TRTLLM-7958][doc] add 1.0 release notes by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7605
[https://nvbugs/5522332][fix] Pin numpy version for Gemma. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7783
[None][doc] Update docker cmd in quick start guide and trtllm-serve … by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7787
[https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files by @litaotju in https://github.com/NVIDIA/TensorRT-LLM/pull/7813
[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7717
[None][doc] Replace the main in the examples' link with commit id. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7837
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7850
[None][doc] add a guide for modifying APIs by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7866
[None][doc] Update Perf-Overview.md for release/1.0 by @zbpatel in https://github.com/NVIDIA/TensorRT-LLM/pull/7848
[None][doc] add stable label to all the un-labelled arguments in LLM class by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7863
[None][fix] api stability bug in status label by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7861
[https://nvbugs/5427043][fix] cherrypick: request length exceeds max_num_tokens by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7718
[https://nvbugs/5531963][fix] cherry pick [#7725] by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7907
[None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7851
[None][doc] fix invalid links in perf benchmarking. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7933

Full Changelog: https://github.com/NVIDIA/TensorRT-LLM/compare/v1.0.0rc6...v1.0.0

Source: README.md, updated 2025-09-23
