PyTorch 2.8, CUDA 12.8, TensorRT 10.12, Python 3.13
Torch-TensorRT 2.8.0 targets PyTorch 2.8, TensorRT 10.12, CUDA 12.6/12.8/12.9, and Python 3.9–3.13 on standard Linux x86-64 and Windows.
- Linux x86-64 + Windows
  - CUDA 12.8 + Python 3.9–3.13: available via PyPI: https://pypi.org/project/torch-tensorrt/
  - CUDA 12.6/12.8/12.9 + Python 3.9–3.13: also available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Platform support
In addition to the standard Windows x86-64 and Linux x86-64 releases, we now provide binary builds for SBSA and Jetson:
- SBSA aarch64
  - CUDA 12.9 + Python 3.9–3.13 + Torch 2.8 + TensorRT 10.12
  - Available via PyPI: https://pypi.org/project/torch-tensorrt/
  - Available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
- Jetson Orin
  - CUDA 12.6 + Python 3.10 + Torch 2.8 + TensorRT 10.3.0
  - Available at https://pypi.jetson-ai-lab.dev/jp6/cu126/
Deprecations
- TensorRT implicit quantization support has been deprecated since TensorRT 10.1. Torch-TensorRT APIs related to the INT8Calibrator will be removed in Torch-TensorRT 2.9.0. Quantization users should move to a workflow based on the TensorRT Model Optimizer toolkit. See https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html for more information; a rough sketch of that workflow follows below.
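As orientation only, a Model Optimizer-based PTQ flow has roughly the shape sketched below. This is a minimal sketch assuming a `model` and a `calib_dataloader`, using the INT8 config as in the linked tutorial; defer to that tutorial for the full, authoritative workflow.
:::py
import modelopt.torch.quantization as mtq
import torch

# Assumed setup: `model` and `calib_dataloader` exist; any representative
# data works for calibration.
def calibrate_loop(m):
    with torch.no_grad():
        for batch in calib_dataloader:
            m(batch.cuda())

# Insert fake-quantization nodes and calibrate them (INT8 PTQ shown; other
# precisions follow the same pattern).
mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate_loop)

# The quantized module is then exported and compiled with Torch-TensorRT as
# shown in the linked vgg16_ptq tutorial.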
New Features
AOT-Inductor Pythonless Deployment
Stability: Beta
Historically, TorchScript has been used to run Torch-TensorRT programs outside of a Python interpreter. Both the dynamo/torch.compile frontend and the TorchScript frontend supported this TorchScript deployment workflow.
Old
:::py
import torch
import torch_tensorrt

trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(trt_model, example_inputs=[...])
ts_model.save("trt_model.ts")
Now you can achieve a similar result using AOT-Inductor. AOTInductor is a specialized version of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts are specifically crafted for deployment in non-Python environments.
Torch-TensorRT can embed TensorRT engines in AOTInductor libraries to accelerate models further. You can also combine Inductor kernels with TensorRT engines via this method. This allows users to deploy their models outside of Python using torch.compile-native technologies.
New
:::py
import os

import torch
import torch_tensorrt

# `model`, `example_inputs`, and `compile_settings` are defined as usual.
with torch.no_grad():
    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)
    torch_tensorrt.save(
        cg_trt_module,
        file_path=os.path.join(os.getcwd(), "model.pt2"),
        output_format="aot_inductor",
        retrace=True,
        arg_inputs=example_inputs,
    )
This `model.pt2` file can then be loaded in either Python or C++ using Torch APIs.
:::py
import os
import torch
import torch_tensorrt

model = torch._inductor.aoti_load_package(os.path.join(os.getcwd(), "model.pt2"))
outputs = model(torch.randn(8, 10, device="cuda"))  # example shape matching the C++ snippet below
:::c++
#include <iostream>
#include <vector>

#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main(int argc, const char* argv[]) {
    std::string trt_aoti_module_path = "model.pt2";
    c10::InferenceMode mode;

    torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
    std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
    std::vector<torch::Tensor> outputs = loader.run(inputs);

    std::cout << "Result from the first inference:" << std::endl;
    std::cout << outputs << std::endl;

    return 0;
}
More information can be found at https://docs.pytorch.org/TensorRT/user_guide/runtime.html, along with a code example at https://github.com/pytorch/TensorRT/blob/release/2.8/examples/torchtrt_aoti_example/inference.cpp.
PTX Plugins
Stability: Stable
In Torch-TensorRT 2.7.0 we introduced auto-generated plugins, which allow users to automatically wrap kernels / PyTorch custom operators in TensorRT plugins so their models run without a graph break. In 2.8.0 we extend this system to support PTX-based plugins, which let users serialize and run their TensorRT engines without requiring PyTorch, Triton, or Python in the runtime, or access to the original kernel implementation. This approach also has lower overhead than the auto-generated plugin system, helping achieve maximum performance.
The example at https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/aot_plugin.py shows how to register a custom operator, generate the necessary plugin, and integrate it into the TensorRT execution graph.
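For orientation, the PyTorch side of this workflow starts from a registered custom operator with a fake (meta) implementation so the model can be traced and exported; the operator below is purely illustrative, and the PTX plugin generation itself is covered by the linked example.
:::py
import torch

# Hypothetical custom op; the name, signature, and body are illustrative only.
@torch.library.custom_op("my_lib::scaled_add", mutates_args=())
def scaled_add(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    return a + scale * b

# Fake (meta) implementation so the op can be traced with symbolic shapes.
@scaled_add.register_fake
def _(a, b, scale):
    return torch.empty_like(a)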
Hierarchical Multi-backend Adjacency Partitioner
Stability: Experimental
The Hierarchical Multi-backend Adjacency Partitioner enables sophisticated partitioning strategies for distributing PyTorch models across multiple backends based on operator support and priority ordering. A prototype partitioner has been added to the package that allows graphs to be split across multiple backends (e.g., TensorRT, PyTorch Inductor) based on operator capabilities. Given a backend preference order, operators are assigned to the highest-priority backend that supports them.
Please refer to the example for usage; a conceptual sketch of the assignment rule follows below.
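To make the priority rule concrete, here is a toy sketch of the assignment logic only; the backend names and support sets are invented for illustration, and this is not the partitioner's actual API.
:::py
# Toy illustration of priority-based backend assignment (not the real API).
backend_priority = ["tensorrt", "inductor"]
backend_support = {
    "tensorrt": {"aten.convolution", "aten.relu"},
    "inductor": {"aten.convolution", "aten.relu", "aten.topk"},
}

def assign_backend(op_name: str) -> str:
    # Pick the highest-priority backend that supports the op; otherwise
    # fall back to eager PyTorch execution.
    for backend in backend_priority:
        if op_name in backend_support[backend]:
            return backend
    return "pytorch"

print(assign_backend("aten.relu"))  # -> tensorrt
print(assign_backend("aten.topk"))  # -> inductor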
Model Optimizer-Based NVFP4 Quantization (PTQ) Support for Linux
Stability: Stable
Introducing NVFP4 for efficient and accurate low-precision inference on the Blackwell GPU architecture. Currently, the workflow supports quantizing models from FP16 → NVFP4.
Directly quantizing from FP32 → NVFP4 is not recommended as it may lead to accuracy degradation. Instead, first convert or train the model in FP16, then quantize to NVFP4.
Full example: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/apps/flux_demo.py
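Schematically, the NVFP4 PTQ step follows the usual Model Optimizer pattern. The config name `NVFP4_DEFAULT_CFG` and the surrounding setup below are assumptions based on that pattern; the full example linked above is the authoritative reference.
:::py
import modelopt.torch.quantization as mtq
import torch

# Recommended order: convert (or train) the model in FP16 first, then
# quantize to NVFP4; quantizing directly from FP32 can hurt accuracy.
model = model.half().cuda()

def calibrate(m):
    with torch.no_grad():
        for batch in calib_data:
            m(batch)

# Config name assumed from Model Optimizer's naming convention.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)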
`run_llm` and KV Caching
Stability: Beta
We’ve introduced a KV caching implementation for Torch-TensorRT using native TensorRT operations, yielding significant improvements in inference performance for autoregressive large language models (LLMs). KV caching is a crucial optimization that reduces latency by reusing attention activations across decoding steps. In our approach, the KV cache is modeled as fixed-size tensor inputs and outputs, with outputs from each decoding step looped back as inputs to update the cache incrementally. This update is performed using TensorRT-supported operations such as `slice`, `concat`, and `pad`. The design allows step-wise cache updates while preserving compatibility with TensorRT’s optimization workflow and engine serialization.
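As an illustration of the fixed-shape update pattern only (not the actual lowering code), the sketch below keeps the cache tensor's shape constant while writing the current step's entry using nothing more than slicing and concatenation:
:::py
import torch

def update_static_kv_cache(cache: torch.Tensor, new_kv: torch.Tensor, step: int) -> torch.Tensor:
    # cache:  [batch, max_seq_len, n_heads, head_dim], fixed shape
    # new_kv: [batch, 1, n_heads, head_dim], the current decoding step
    head = cache[:, :step]        # entries already filled
    tail = cache[:, step + 1:]    # remaining (padding) region
    # Output shape equals input shape, so it can be looped back as the next
    # step's cache input, matching the fixed-size I/O design described above.
    return torch.cat([head, new_kv, tail], dim=1)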
We’ve also introduced a new utility, `run_llm.py`, to run inference on popular LLMs with KV caching enabled.
To run a Qwen3 model using KV caching with Torch-TensorRT, use the following command:
:::bash
python run_llm.py --model Qwen/Qwen3-8B --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
Please refer to Compiling LLM models from Huggingface for more details and limitations.
Debugger
We introduced a new debugger to improve usability and the debugging experience in Torch-TensorRT. The debugger centralizes all debugging settings, such as the logging level (from critical to info) and engine profiling. We also introduced FX graph visualization in the debugger, where you can specify the lowering pass before or after which the graph should be drawn. Moreover, the debugger can produce engine profiling and layer information compatible with TREX, an engine visualization tool developed by TensorRT, which better explains the engine structure.
Model Zoo
We have expanded support to include several popular models from the Qwen3 and Llama3 series. In this release, we’ve also addressed various performance and accuracy issues to improve overall stability. For a complete list of supported models, please refer to the Supported Models section.
Bug Fixes
Refit
- Refit has been re-enabled for Python 3.13 after being disabled in 2.7.0.
- Reduced memory overhead by offloading the model to the CPU.
Performance improvements
- The linear converter was reverted to its earlier implementation, which shows performance improvements in FP16 on some models (e.g., BERT).
- The group norm converter was simplified to reduce unnecessary TensorRT ILayers.
- The constants in the batch norm converter are now folded at compile time, leading to significant performance improvements.
- The SDPA op decomposition was optimized, resulting in performance equal to or better than ONNX-TensorRT for transformer-based diffusion models such as Stable Diffusion 3, WAN2.1, and FLUX.
What's Changed
- chore: bump torch to 2.8.0.dev by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3449
- Nccl ops correction changes by @apbose in https://github.com/pytorch/TensorRT/pull/3387
- fix: Change the translational layer from numpy to torch during conversion to handle additional data types by @peri044 in https://github.com/pytorch/TensorRT/pull/3445
- Fix grid_sample by @HolyWu in https://github.com/pytorch/TensorRT/pull/3340
- fix: Destory cuda graphs before setting weight streaming by @keehyuna in https://github.com/pytorch/TensorRT/pull/3461
- tool: uv setting to avoid the pip install -e by @narendasan in https://github.com/pytorch/TensorRT/pull/3468
- chore: reenable py313 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3455
- bf16 support for elementwise operation by @apbose in https://github.com/pytorch/TensorRT/pull/3462
- feat: rmsnorm lowering by @bowang007 in https://github.com/pytorch/TensorRT/pull/3440
- feat: Support flashinfer.rmsnorm by @bowang007 in https://github.com/pytorch/TensorRT/pull/3424
- fix: support masked_scatter by lowering path and corner case of maske… by @chohk88 in https://github.com/pytorch/TensorRT/pull/3476
- fix: index_put converter to handle multi-shape slicing with None by @chohk88 in https://github.com/pytorch/TensorRT/pull/3475
- slight code reorg and bug correction for cross_compile by @apbose in https://github.com/pytorch/TensorRT/pull/3472
- Enabled refit on Python 3.13 by @cehongwang in https://github.com/pytorch/TensorRT/pull/3481
- fix: l2_limit_for_tiling by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3479
- chore: test bf16 fixes in CI by @peri044 in https://github.com/pytorch/TensorRT/pull/3491
- add python3.13 into the final release artifact by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3499
- chore: remove pre-cxx11 abi by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3473
- disabling dla args for hope igx platform by @apbose in https://github.com/pytorch/TensorRT/pull/3487
- chore: remove pre-cxx11 abi references in doc by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3503
- Fix Windows CI for Release 2.7 (#3505) by @narendasan in https://github.com/pytorch/TensorRT/pull/3506
- upgrade modelopt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3511
- chore: miscellaneous fixes for handling graph breaks by @peri044 in https://github.com/pytorch/TensorRT/pull/3488
- add nspect ignore file by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3514
- Update mutable_torchtrt_module_example.py by @cehongwang in https://github.com/pytorch/TensorRT/pull/3519
- Add Linux CI build for aarch64 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3516
- chore: update the docstring for llama2 rmsnorm automatic plugin example by @bowang007 in https://github.com/pytorch/TensorRT/pull/3512
- chore(deps): bump undici from 5.28.5 to 5.29.0 in /.github/actions/assigner by @dependabot[bot] in https://github.com/pytorch/TensorRT/pull/3520
- fix docker build failure: add allow_empty to true by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3526
- Added CPU offloading by @cehongwang in https://github.com/pytorch/TensorRT/pull/3452
- chore(deps): bump setuptools from 70.2.0 to 78.1.1 in /toolchains/jp_workspaces by @dependabot[bot] in https://github.com/pytorch/TensorRT/pull/3523
- add feature gate for tensorrt plugin by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3518
- chore(deps): bump transformers from 4.48.0 to 4.50.0 in /examples/dynamo by @dependabot[bot] in https://github.com/pytorch/TensorRT/pull/3497
- Minor fix - check for DTensor on igpu platform by @apbose in https://github.com/pytorch/TensorRT/pull/3531
- fix: wrong dtype and device in aten.full_like decomposition by @junstar92 in https://github.com/pytorch/TensorRT/pull/3535
- feat: Implement SDPA op converter / lowering pass as extensions by @peri044 in https://github.com/pytorch/TensorRT/pull/3534
- nvidia-modelopt dependency fix by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3544
- Add jetson build on CI by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3524
- feat: TensorRT AOT Plugin by @bowang007 in https://github.com/pytorch/TensorRT/pull/3504
- Publish jetson wheel to pytorch nightly index by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3550
- fix: handle device in the same way as dtype in aten.full_like decomposition by @junstar92 in https://github.com/pytorch/TensorRT/pull/3538
- fix the jetson nightly build check bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3552
- fix int8/fp8 constant folding issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3543
- Upgrade to TensorRT 10.11 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3557
- Cross compile guard by @apbose in https://github.com/pytorch/TensorRT/pull/3486
- fix: Fix constant folding failure due to modelopt by @peri044 in https://github.com/pytorch/TensorRT/pull/3565
- add --no-deps for tests/py/requirements.txt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3569
- Add fp4 support by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3532
- fix: Fix a perf regression due to weights being ITensors by @peri044 in https://github.com/pytorch/TensorRT/pull/3568
- Added flux demo by @cehongwang in https://github.com/pytorch/TensorRT/pull/3418
- FX graph visualization by @cehongwang in https://github.com/pytorch/TensorRT/pull/3528
- fix main test failure bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3590
- Verify C++ tests, fix cuda graphs union issue by @narendasan in https://github.com/pytorch/TensorRT/pull/3589
- Fix: fix aot plugin example docstring issue by @bowang007 in https://github.com/pytorch/TensorRT/pull/3595
- feat: working uv pyproject.toml by @narendasan in https://github.com/pytorch/TensorRT/pull/3597
- remove torchvision dependency from build, optional for test by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3598
- Changed weight map to tensor and fix the refit bug by @cehongwang in https://github.com/pytorch/TensorRT/pull/3573
- test failed but displayed as green by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3599
- Import dllist only on linux by @HolyWu in https://github.com/pytorch/TensorRT/pull/3592
- feat: Hierarchical Partitioner to support multi-backends by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3539
- fix dynamo converter test case failure by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3594
- feat: Saving modules using the AOTI format by @narendasan in https://github.com/pytorch/TensorRT/pull/3567
- skip flashinfer-python for py3.9 due to upstream error by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3605
- fix enabled_precisions error in test cases by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3606
- debug flag is deprecated, remove it so that test won't complain by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3610
- fix: add prefix in hierarchical_partitioner_example by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3607
- fix: pre-commit issues by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3603
- py39 does not like | E TypeError: unsupported operand type(s) for |: 'type' and 'EnumMeta' by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3611
- fix cross compilation test bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3609
- TorchTensorRTModule Serialization Fix by @cehongwang in https://github.com/pytorch/TensorRT/pull/3572
- a few CI changes by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3612
- remove debug flag by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3618
- fix: Fix unbacked sym int not found issue by @peri044 in https://github.com/pytorch/TensorRT/pull/3617
- fix ts fe test error. by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3619
- disable test on aarch64 for now by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3623
- disable aoti format in windows by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3632
- release 2.8 branch cut by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3638
- cherry pick 3636 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3640
- cherry pick 3642 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3655
- Lluo/cherry pick 3629 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3656
- Lluo/cherry pick 3620 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3658
- cherry pick 3663: fix the int8 quantization error, remove duplicated lines by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3665
- cherry pick 3660 to release/2.8 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3661
- cherry pick 3685: disable jetson build in ci by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3688
- cherry pick 3680: fix refit test bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3687
- cherry-pick 3686: upgrade tensorrt from 10.11 to 10.12 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3690
- cherry pick 3689 to 2.8 release:flux fp4 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3696
- chore: cherry pick of KV cache PR (3527) by @peri044 in https://github.com/pytorch/TensorRT/pull/3667
- Cherrypick of PR 3513 by @apbose in https://github.com/pytorch/TensorRT/pull/3664
- Cherrypick of PR 3570 by @apbose in https://github.com/pytorch/TensorRT/pull/3662
- chore: cherry pick of bf16 cast PR (3643) by @peri044 in https://github.com/pytorch/TensorRT/pull/3666
- Cherrypick [#3719] for release/2.8 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3734
- Cherrypick [#3703] for release/2.8 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3735
- enable back jetpack build by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3720
- add typing_extensions as test dependencies which is required by modelopt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3743
- broadcast_remove - cherry pick 3700 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3757
- fix typing-extensions issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3761
- Fix Jetson FP4 gate issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3764
- fix build cancellation issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3768
Full Changelog: https://github.com/pytorch/TensorRT/compare/v2.7.0...v2.8.0-rc6