| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| ATTRIBUTIONS-CPP-aarch64.md | 2025-10-16 | 719.0 kB | |
| ATTRIBUTIONS-Python.md | 2025-10-15 | 2.1 MB | |
| ATTRIBUTIONS-CPP-x86_64.md | 2025-10-15 | 710.7 kB | |
| README.md | 2025-09-23 | 33.0 kB | |
| v1.0.0 source code.tar.gz | 2025-09-23 | 321.9 MB | |
| v1.0.0 source code.zip | 2025-09-23 | 325.6 MB | |
| Totals: 6 Items | | 651.1 MB | 0 |
TensorRT LLM Release 1.0
TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and is the default experience, and the LLM API is now stable. For more details on new developments in 1.0, see below.
Key Features and Enhancements
- Model Support
- Add Mistral3.1 VLM model support
- Add TensorRT-Engine Qwen3 (dense) model support
- Add phi-4-multimodal model support
- Add EXAONE 4.0 model support
- Add Qwen3 MoE support to TensorRT backend
- Features
- Add support for sm121
- Add LoRA support for Gemma3
- Support PyTorch LoRA adapter eviction
- Add LoRA support for PyTorch backend in trtllm-serve
- Add support for scheduling attention DP requests
- Remove padding of FusedMoE in attention DP
- Support torch compile for attention dp
- Add KV events support for sliding window attention
- Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
- Add Piecewise CUDA Graph support for MLA
- Support multiCtasKvMode for high-throughput MLA kernels
- Enable kvcache to be reused during request generation
- Add ADP schedule balance optimization
- Add chunked prefill support for MLA (Blackwell)
- Enable Multi-block mode for Hopper spec dec XQA kernel
- Add vLLM KV Pool support for XQA kernel
- Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
- Add support for fused gate_up_proj scales for FP8 blockwise
- Support FP8 row-wise dense GEMM in torch flow
- Enable fp8 SwiGLU to minimize host overhead
- Add Deepseek R1 FP8 Support on Blackwell
- Add support for MXFP8xMXFP4 in pytorch
- Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
- Opensource MOE MXFP8-MXFP4 implementation
- Add support for Modelopt fp8_pb_wo quantization scheme
- Support deepEP fp4 post quant all2all dispatch
- Fuse w4a8 moe pre-quant scale on Hopper
- Support Weight-Only-Quantization in PyTorch Workflow
- Add support for per expert activation scaling factors
- Add ReDrafter support for Qwen
- Enable CUDA Graph for Nemotron-H
- Add support for YARN in NemotronNAS models
- Switch to internal version of MMProjector in Gemma3
- Disable add special tokens for Llama3.3 70B
- Auto-enable ngram with concurrency <= 32
- Support turning on/off spec decoding dynamically
- Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
- Add support for external multimodal embeddings
- Add support for disaggregation with pp with pytorch backend
- Add status tags to LLM API reference
- Support JSON Schema in OpenAI-Compatible API
- Support chunked prefill on spec decode 2 model
- Add KV cache reuse support for multimodal models
- Support nanobind bindings
- Add support for two-model engine KV cache reuse
- Add Eagle-3 support for qwen3 dense model
- Migrate Eagle-3 and draft/target speculation to Drafter
- Enable guided decoding with overlap scheduler
- Support n-gram speculative decoding with disagg
- Add beam search support to the PyTorch Workflow
- Add LLGuidance Support for PyTorch Backend
- Add NGrams V2 support
- Add MTP support for Online EPLB
- Support disaggregated serving in TRTLLM Sampler
- Add core infrastructure to enable loading of custom checkpoint formats
- Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
- Use huge page mapping for host accessible memory on GB200
- Add user-provided speculative decoding support
- Add streaming scaffolding_llm.generate_async support
- Detokenize option in /v1/completions request
- Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
- Remove support for llmapi + TRT backend in Triton
- Add request_perf_metrics to triton LLMAPI backend
- Add support for Triton request cancellation
- Benchmark:
- Add support for benchmarking individual gemms in MOE benchmark (#6080)
- Add speculative metrics for trtllm-bench
- Add the ability to write a request timeline for trtllm-bench
- Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
- Add latency support for trtllm-bench
- Add Acceptance Rate calculation to benchmark_serving
- Add wide-ep benchmarking scripts
- Update trtllm-bench to support new Pytorch default
- Add support for TRTLLM CustomDataset
- Make benchmark_serving part of the library
- Documentation:
- Refactored the doc structure to focus on the PyTorch workflow.
- Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
- Removed legacy documentation related to the TensorRT workflow.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.06-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.06-py3.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.53.1.
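The dependency versions listed above can be compared against a local environment with a short script. This is a minimal sketch using only the standard library; the distribution names passed to `importlib.metadata` (in particular `nvidia-modelopt`) are assumptions about how the packages are published and may need adjusting, and `check_dependencies` is a hypothetical helper, not part of TensorRT LLM.

```python
from importlib import metadata

# Versions as stated in the release notes. The keys are assumed distribution
# names; change them if your environment uses different package names.
EXPECTED = {
    "transformers": "4.53.1",
    "xgrammar": "0.1.21",
    "nvidia-modelopt": "0.33",
}


def check_dependencies(expected=EXPECTED):
    """Return {name: (installed_version_or_None, matches)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = (None, False)
            continue
        # Accept an exact match, or a longer version that extends the stated
        # one by further dotted components (so "0.33" accepts "0.33.1").
        matches = installed == wanted or installed.startswith(wanted + ".")
        report[name] = (installed, matches)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_dependencies().items():
        status = "ok" if ok else "mismatch or missing"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} ({status})")
```

A mismatch here is only a hint to look closer; the authoritative pins are the ones shipped with the release itself.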
API Changes
- BREAKING CHANGE Promote PyTorch to be the default LLM backend
- BREAKING CHANGE Change default backend to PyTorch in trtllm-serve
- BREAKING CHANGE Unify KvCacheConfig in LLM class for pytorch backend
- BREAKING CHANGE Rename cuda_graph_config padding_enabled field
- BREAKING CHANGE Rename mixed_sampler to enable_mixed_sampler
- BREAKING CHANGE Rename LLM.autotuner_enabled to enable_autotuner
- Add back allreduce_strategy parameter into TorchLlmArgs
- Add LlmArgs option to force using dynamic quantization
- Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to LLMAPI
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs
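For code that still passes the pre-1.0 argument names, the two renames spelled out in the list above can be applied mechanically. The sketch below is a hypothetical helper, not part of TensorRT LLM, and it assumes the arguments are collected in a plain dict before being passed on. The new name of the cuda_graph_config padding_enabled field is not given in these notes, so it is deliberately left out.

```python
# Old name -> new name, taken from the API Changes list.
RENAMED_LLM_ARGS = {
    "mixed_sampler": "enable_mixed_sampler",
    "autotuner_enabled": "enable_autotuner",
}


def migrate_llm_kwargs(kwargs):
    """Return a copy of kwargs with pre-1.0 argument names replaced.

    Raises ValueError if both the old and the new name are present, since it
    would be ambiguous which value should win.
    """
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_LLM_ARGS.get(key, key)
        if new_key != key and new_key in kwargs:
            raise ValueError(f"both {key!r} and {new_key!r} were given")
        migrated[new_key] = value
    return migrated


if __name__ == "__main__":
    old_style = {"model": "some/model", "mixed_sampler": True}
    print(migrate_llm_kwargs(old_style))
```

Raising on a conflict rather than silently preferring one value keeps a half-migrated configuration from hiding a stale setting.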
Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix nemotronNAS loading for TP>1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix bug of Qwen3 when using fp4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV Cache overrides in trtllm-bench (#6103)
- Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix eagle3 two model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fallback to cubins for fp8 fmha kernels on Ada (#5779)
- Fix disagg + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure. (#5772)
- Fix prompt adapter TP2 case (#5782)
- Fix disaggregate serving with attention DP (#4993)
- Fix a quote error introduced in [#5534] (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
- Fix GEMM+AR fusion on blackwell (#5563)
- Fix llama4 multimodal support (#5809)
- Fix Llama4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
- Fix moe regression for sm120 (#5823)
- Fix Qwen2.5VL FP8 support (#5029)
- Fix the illegal memory access issue in moe gemm on SM120 (#5636)
- Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
- Fix mistral unit tests due to transformers upgrade (#5904)
- Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
- Fix Gemma3 unit tests due to transformers upgrade (#5921)
- Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Fast redux detection in trtllm gen routing kernel (#5941)
- Fix cancel request logic (#5800)
- Fix errors in wide-ep scripts (#5992)
- Fix error in post-merge-tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detect (#5417)
- Fix permission for local user issues in NGC docker container. (#5373)
- Fix mtp vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
- Fix MoE autotune fallback failing to query the default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)
Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse will fix this issue.
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
- For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, set the environment variable with export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1.
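The workarounds listed under Known Issues can also be applied from a launcher script. The sketch below is one possible way to do it, under stated assumptions: the helper names are hypothetical, and the YAML key `kv_cache_config.enable_block_reuse` is assumed to be the setting that controls KV cache reuse, so check it against the LLM API reference for your version. Only the environment variable name comes from these notes.

```python
import os


def child_env_with_workaround(base_env=None):
    """Copy an environment mapping and set the Llama pipeline-parallel workaround."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name and value as given in the Known Issues list.
    env["TRTLLM_LLAMA_EAGER_FUSION_DISABLED"] = "1"
    return env


# Assumption: KV cache reuse is switched off through this option.
EXTRA_LLM_API_OPTIONS = "kv_cache_config:\n  enable_block_reuse: false\n"


def write_options_file(path):
    """Write the options that disable KV cache reuse to a YAML file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(EXTRA_LLM_API_OPTIONS)
    return path


if __name__ == "__main__":
    env = child_env_with_workaround()
    options_path = write_options_file("known_issue_workarounds.yaml")
    # The server or benchmark process would then be started with `env` and
    # pointed at `options_path`, for example through an extra-options flag of
    # the launcher you use (not shown here, since it depends on your setup).
    print(env["TRTLLM_LLAMA_EAGER_FUSION_DISABLED"], options_path)
```

Copying the environment instead of mutating `os.environ` keeps the workaround scoped to the child process that needs it.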
What's Changed
- Qwen3: Fix eagle hidden states by @IzzyPutterman in https://github.com/NVIDIA/TensorRT-LLM/pull/6199
- [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/6506
- [None][chore] update readme for perf release test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6664
- [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6662
- [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6663
- [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/5995
- [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in https://github.com/NVIDIA/TensorRT-LLM/pull/6658
- [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6659
- [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6657
- [None][chore] Bump version to 1.0.0 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6652
- [None][test] Add Mistral Small 3.1 24B accuracy test to QA test list by @StanleySun639 in https://github.com/NVIDIA/TensorRT-LLM/pull/6682
- [None][test] cherry-pick: correct test-db context for perf yaml file and add mistral cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6688
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6703
- [TRTLLM-6656][chore] Validate FP8 support for Gemma3 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6678
- [TRTLLM-5574][test] Add NIM required VLM models multi-gpu test by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6687
- [TRTLLM-6675][infra] Nixl test completion by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6623
- [None][test] fix yml condition error under qa folder by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/6733
- [None][doc] Add doc for multimodal feature support matrix (#6619) by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6739
- [https://nvbugs/5344910][fix] Corrected memory position when setting buffers to 0 in standalone_stable_radix_topk_ by @stnie in https://github.com/NVIDIA/TensorRT-LLM/pull/6712
- [https://nvbugs/5442608][fix] Update CUDA graph config for get_model_yaml_config. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/6693
- [TRTLLM-4721][test] Add qa test for llm-api by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6727
- [https://nvbugs/5409420][fix] Fix test_ptp_star_attention_example by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6584
- [https://nvbugs/5444624][fix] Fix LLM_ROOT in triton_backend build.sh by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6744
- [https://nvbugs/5429689][fix] Fix mllama model structure update with transformers issue by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/6699
- [None][chore] remove out-of-date comment in star attention test by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6773
- [https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6737
- [None][infra] Waive failed tests on release branch 0811 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6782
- [https://nvbugs/5444095][infra] waive test_ptp_quickstart_multimodal llava test by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/6795
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers (#6611) by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6765
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6786
- [None][feat] adding support for disaggregated multi-instance tests by @raayandhar in https://github.com/NVIDIA/TensorRT-LLM/pull/6674
- [None][infra] Avoid intermittent access broken to nvcr.io by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6715
- [https://nvbugs/5383702][fix] error propagation in GenerationExecutor by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6793
- [https://nvbugs/5445774][fix] Unwaive Gemma3 27B fp8 test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6799
- [None][fix] fix CUDA graph config for test_llm_api_pytorch.py. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/6826
- [TRTLLM-6975][test] Add multi-turn test cases for VLM models by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6749
- [None][chore] waive GB300 known issues by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6812
- [None][fix] fix Llama3 eagle3 test case OOM by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6832
- [https://nvbugs/5375594][fix] fix oom issue on structural_tag test case by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6838
- [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6870
- [TRTLLM-5252][feat] Add fp8 support for Mistral Small 3.1 by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6731
- [None][infra] Setup the code review rule on the release branch by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6725
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/6820
- [None][fix] Fix batching bug in Mistral3 model by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6841
- [None][fix] Revert phi4-mm aggregate mode by @amukkara in https://github.com/NVIDIA/TensorRT-LLM/pull/6907
- [None][fix] Complete the last missing allreduce op in Llama3/4. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/6850
- [None][chore] Add docs for Gemma3 VLMs by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6880
- [None][doc] add legacy section for tensorrt engine by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6724
- [TRTLLM-7048][feat] add benchmark TRT flow test for MIG by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6884
- [https://nvbugs/5451434][fix] Fix triton docker build by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/6898
- [TRTLLM-6481][fix] Fix deepseek r1 accuracy issue by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6868
- [None][ci] unwaive test_ptp_star_attention_example by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6943
- [https://nvbugs/5455836][fix] Fix llama 4 FP4 by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/6911
- [None][infra] update CODEOWNERS for release by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/6905
- [https://nvbugs/5453667] [fix] reverting a breaking change: make trtllm-bench enable_chunked_context defaults backend-dependent by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/6956
- [https://nvbugs/5405041][fix] Update wide ep doc by @qiaoxj07 in https://github.com/NVIDIA/TensorRT-LLM/pull/6950
- [https://nvbugs/5412562][feat] Allocate MoE workspace only when necessary (release/1.0 retargeted) by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/6955
- [TRTLLM-6835][fix] Fix potential hang caused by python multiprocessing when prefetching weights by @lancelly in https://github.com/NVIDIA/TensorRT-LLM/pull/6927
- [https://nvbugs/5448525][fix] Mistral Small 3.1 accuracy tests by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/6909
- [https://nvbugs/5375646][fix] update waives.txt for nvbug 5375646 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6847
- [None][fix] update skip config by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/6891
- [https://nvbugs/5449218][fix] Fix KvCacheConfig error in test_perf by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6937
- [None][infra] Waive failed tests for release branch 0818 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6993
- [None][chore] Remove duplicate test waives by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/6999
- [None][infra] Cherry-pick [#6836] from main branch and improve SSH connection by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/6971
- [https://nvbugs/5462007][ci] Unwaive Mistral Small 3.1 FP8 test by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7008
- [https://nvbugs/5449155][fix] Fix DeepSeek R1 weight loading for TP16 by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/6913
- [https://nvbugs/5374016][fix] improve error message by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/6893
- [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels (release 1.0) by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/6946
- [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters… by @Naveassaf in https://github.com/NVIDIA/TensorRT-LLM/pull/6987
- [https://nvbugs/5448579][fix] EXAONE-4.0 accuracy test bugfix by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/6888
- [None][chore] Waive E2E GB200 tests for Gemma3 27B by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6916
- [https://nvbugs/5451296][bug] Fix a thread leak in test_llm_args.py by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7017
- [None][infra] Waive failed tests for release branch 08/19 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7036
- [None][doc] add status labels to LLM class's api reference by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6899
- [https://nvbugs/5448437][fix] fix some nixl tests by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6940
- [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/6978
- [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in https://github.com/NVIDIA/TensorRT-LLM/pull/6975
- [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7053
- [None][fix] Fix build of tritonbuild/tritonrelease image by @dbari in https://github.com/NVIDIA/TensorRT-LLM/pull/7003
- [None][doc] update v1.0 doc for trtllm-serve by @hchings in https://github.com/NVIDIA/TensorRT-LLM/pull/7056
- [https://nvbugs/5440241][fix] Fix 70B GSM8K Accuracy drop by @chenfeiz0326 in https://github.com/NVIDIA/TensorRT-LLM/pull/7075
- [https://nvbugs/5451296][fix] zmq nonblock bug with retry by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7019
- [https://nvbugs/5383702][fix] test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_4gpus by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/6889
- [https://nvbugs/5392414] [fix] For release 1.0 cherry pick. Add customized default routing method by @ChristinaZ in https://github.com/NVIDIA/TensorRT-LLM/pull/7068
- [https://nvbugs/5464088] [fix] Guard against fp8 activations in lora forward; update perf test config by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/7014
- [None][infra] Skip failed tests for release branch 08/21 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7130
- [https://nvbugs/5448442][fix] Skip trtllm moe backend for sm120 by @pamelap-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/7010
- [https://nvbugs/5449032][fix] Add more llm-args to llm_mgmn_trtllm_bench.sh by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7144
- [https://nvbugs/5410391][bug] Support to share device buffers in attention meta by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/6557
- [https://nvbugs/5467062][fix] pass logitsPostProcessorBatched by reference by @milesial in https://github.com/NVIDIA/TensorRT-LLM/pull/7110
- [https://nvbugs/5450074][fix] Reduce the device memory requirements for testing by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/6990
- [https://nvbugs/5474037][fix] Fix building tritonbuild/tritonrelease images by @dbari in https://github.com/NVIDIA/TensorRT-LLM/pull/7157
- [https://nvbugs/5433545][fix] TestPhi4MiniInstruct::test_auto_dtype - Use max_seq_len=4096 to fallback to the short RoPE factor by @moraxu in https://github.com/NVIDIA/TensorRT-LLM/pull/6895
- [https://nvbugs/5461712] [fix] Disable deep_gemm for Qwen3 due to accuracy issues by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/7170
- [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7149
- [https://nvbugs/5448426][fix] Fix illegal memory access in cuda graph by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7127
- [None][fix] Switch llm api quickstart example location per workflow. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7182
- [https://nvbugs/5467232][fix] Fix load_torch_hf_lora to override lora_config.trtllm_modules_to_hf_modules with default only when it has no value by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7168
- [None][doc] fix tensorrt legacy quickstart page by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7190
- [TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/7191
- [https://nvbugs/5470840][fix] Disaggregated unit test MPI Init handling by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/7139
- [None][test] add kv cache size in bench metric and fix failed cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/7211
- [None][fix] update skip case by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7193
- [https://nvbugs/5409416][fix] test_openai_multi_chat_example by @Linda-Stadter in https://github.com/NVIDIA/TensorRT-LLM/pull/7174
- [https://nvbugs/5473789][bug] install cuda-toolkit to fix sanity check by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7159
- [None][fix] fix log_once usage by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7210
- [None][infra] Waive failed cases for release/1.0 08/26 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7258
- [https://nvbugs/5451342][fix] Use runtime max_batch_size when cuda_graph_config.max_batch_size is not provided in trtllm-bench by @jiaganc in https://github.com/NVIDIA/TensorRT-LLM/pull/7031
- [None][feat] Skip prefetching consolidated safetensors when appropriate by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7225
- [https://nvbugs/5430125][ci] Unwaive test case for mistral 3.1 small by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7265
- [https://nvbugs/5478151][fix] Add missing spec for Llama-3.3 70B by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7267
- [https://nvbugs/5451426][fix] Avoid torch compile on full eagle3 worker by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7245
- [https://nvbugs/5448767][fix] fix mpi4py deadlocks in pp event-loop by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/6976
- [https://nvbugs/5463720][fix] tp-split the inferred mlp_hidden_size for nemotron-nas by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/7231
- [https://nvbugs/5480550][fix] Increase timeout for Gemma3 27B test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7271
- [https://nvbugs/5434320][bug] Fix disagg pp bug by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7099
- [https://nvbugs/5480415][fix] Fix phi4mm multi-gpu test by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7275
- [TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7203
- [https://nvbugs/5467548][fix] DeepSeek illegal memory access. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/7298
- [https://nvbugs/5448767][fix] disable kv cache reuse for disagg pp>1 tests by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/7354
- [https://nvbugs/5445466][fix] Eliminate race when loading HF dynamic modules (#7268) by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/7379
- [https://nvbugs/5474169][fix]Adjust max seq len for kvcache for memory estimation by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7391
- [https://nvbugs/5448754][fix] Download HF model for all nodes. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/6824
- [None][infra] Waive failed tests on release branch 0901 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7448
- [None][doc] add blackwell information into support matrix by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6740
- [TRTLLM-7008][fix] cherrypick fix to 1.0 Add automatic shared memory delete if already exist by @dongxuy04 in https://github.com/NVIDIA/TensorRT-LLM/pull/7433
- [https://nvbugs/5351244][fix] test_mpi_session by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7501
- [https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7427
- [TRTLLM-5930][doc] 1.0 Documentation. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/6696
- [https://nvbugs/5496960][fix] Fix Gemma model forward. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7509
- [None][doc] Update kvcache part by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7549
- [None][doc] Rename TensorRT-LLM to TensorRT LLM. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7554
- [https://nvbugs/5416501][doc] add known issues to llmapi doc by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7560
- [None][doc] Fix a invalid link. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7617
- [https://nvbugs/5474169][fix] seq_len mismatch between kv cache manager and graph attn metadata by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7606
- [https://nvbugs/5503423][waive] Waive Llama3.1-70B-FP8 test on RTX PRO 6000 by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7603
- [https://nvbugs/5455140][fix] unwaive release/1.0 DS R1 test cases with bug already fixed by @lancelly in https://github.com/NVIDIA/TensorRT-LLM/pull/7432
- [https://nvbugs/5470782][chore] Remove the skip statement in 1.0 rele… by @SimengLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7573
- [None][doc] Fix a invalid link and a typo. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7634
- [None][doc] Use hash id for external link by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7641
- [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larg… by @WeiHaocheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7671
- [https://nvbugs/5436461][fix] Adjust free_gpu_memory_fraction of test_eagle3 by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/7673
- [https://nvbugs/5474409][fix] Disable concurrent loading by default by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7663
- [https://nvbugs/5501557][fix] Fix out-of-bounds vector access for model with multiple layer types by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7636
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/7681
- [None][ci] Test waives for the release/1.0 branch 09/15 by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7700
- [None][doc] Add labels description note into llm api section by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7696
- [https://nvbugs/5437405][fix] cherry-pick PR 7000 (qwen3 235b eagle3 ci) by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/7702
- [https://nvbugs/5512734][fix] Update kv cache config for maverick by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7710
- [https://nvbugs/5355219][fix] Fix trtllm moe backend test config and Qwen3 MoE multi node by @yizhang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7724
- [None][doc] Fix the link in the doc by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/7754
- [https://nvbugs/5519525][fix] fix doc invalid link for bug 5519525 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7753
- [https://nvbugs/5509024][fix] Print full parsed outputs and update keywords for multimodal model by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7670
- [None][doc] Enhance api reference doc by labeling stable APIs by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7751
- [https://nvbugs/5468897][fix] fix invalid expression for disabling pa… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7762
- [https://nvbugs/5517023][fix] Pass allreduce strategy and force NCCL on pre-Blackwell arch by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7768
- [TRTLLM-7958][doc] add 1.0 release notes by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7605
- [https://nvbugs/5522332][fix] Pin numpy version for Gemma. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7783
- [None][doc] Update docker cmd in quick start guide and trtllm-serve … by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7787
- [https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files by @litaotju in https://github.com/NVIDIA/TensorRT-LLM/pull/7813
- [https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7717
- [None][doc] Replace the main in the examples' link with commit id. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7837
- [None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7850
- [None][doc] add a guide for modifying APIs by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7866
- [None][doc] Update Perf-Overview.md for release/1.0 by @zbpatel in https://github.com/NVIDIA/TensorRT-LLM/pull/7848
- [None][doc] add stable label to all the un-labelled arguments in LLM class by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7863
- [None][fix] api stability bug in status label by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7861
- [https://nvbugs/5427043][fix] cherrypick: request length exceeds max_num_tokens by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7718
- [https://nvbugs/5531963][fix] cherry pick [#7725] by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7907
- [None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7851
- [None][doc] fix invalid links in perf benchmarking. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7933
Full Changelog: https://github.com/NVIDIA/TensorRT-LLM/compare/v1.0.0rc6...v1.0.0