PyTorch 2.8, CUDA 12.8, TensorRT 10.12, Python 3.13
Torch-TensorRT 2.8.0 targets PyTorch 2.8, TensorRT 10.12, CUDA 12.6/12.8/12.9, and Python 3.9–3.13 on standard Linux x86-64 and Windows.
- Linux x86-64 + Windows
  - CUDA 12.8 + Python 3.9–3.13: available via PyPI: https://pypi.org/project/torch-tensorrt/
  - CUDA 12.6/12.8/12.9 + Python 3.9–3.13: also available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Platform support
In addition to the standard Windows x86-64 and Linux x86-64 releases, we now provide binary builds for SBSA and Jetson:
- SBSA aarch64
  - CUDA 12.9 + Python 3.9–3.13 + Torch 2.8 + TensorRT 10.12
  - Available via PyPI: https://pypi.org/project/torch-tensorrt/
  - Available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
- Jetson Orin
  - CUDA 12.6 + Python 3.10 + Torch 2.8 + TensorRT 10.3.0
  - Available at https://pypi.jetson-ai-lab.dev/jp6/cu126/
Deprecations
- TensorRT implicit quantization support has been deprecated since TensorRT 10.1. Torch-TensorRT APIs related to the INT8Calibrator will be removed in Torch-TensorRT 2.9.0. Quantization users should move to a workflow based on the TensorRT Model Optimizer toolkit. See https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html for more information; a rough sketch of that workflow follows below.
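As orientation only, a Model Optimizer-based PTQ flow has roughly the shape sketched below. This is a minimal sketch assuming a `model` and a `calib_dataloader`, using the INT8 config as in the linked tutorial; defer to that tutorial for the full, authoritative workflow.
:::py
import modelopt.torch.quantization as mtq
import torch

# Assumed setup: `model` and `calib_dataloader` exist; any representative
# data works for calibration.
def calibrate_loop(m):
    with torch.no_grad():
        for batch in calib_dataloader:
            m(batch.cuda())

# Insert fake-quantization nodes and calibrate them (INT8 PTQ shown; other
# precisions follow the same pattern).
mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate_loop)

# The quantized module is then exported and compiled with Torch-TensorRT as
# shown in the linked vgg16_ptq tutorial.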
New Features
AOT-Inductor Pythonless Deployment
Stability: Beta
Historically, TorchScript has been used to run Torch-TensorRT programs outside of a Python interpreter. Both the dynamo/torch.compile frontend and the TorchScript frontend supported this TorchScript deployment workflow.
Old
:::py
import torch
import torch_tensorrt

trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(trt_model, example_inputs=[...])
ts_model.save("trt_model.ts")
Now you can achieve a similar result using AOT-Inductor. AOTInductor is a specialized version of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts are specifically crafted for deployment in non-Python environments.
Torch-TensorRT can embed TensorRT engines in AOTInductor libraries to accelerate models further. You can also combine Inductor kernels with TensorRT engines via this method. This allows users to deploy their models outside of Python using torch.compile-native technologies.
New
:::py
import os

import torch
import torch_tensorrt

# `model`, `example_inputs`, and `compile_settings` are defined as usual.
with torch.no_grad():
    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)
    torch_tensorrt.save(
        cg_trt_module,
        file_path=os.path.join(os.getcwd(), "model.pt2"),
        output_format="aot_inductor",
        retrace=True,
        arg_inputs=example_inputs,
    )
This `model.pt2` file can then be loaded in either Python or C++ using Torch APIs.
:::py
import os
import torch
import torch_tensorrt

model = torch._inductor.aoti_load_package(os.path.join(os.getcwd(), "model.pt2"))
outputs = model(torch.randn(8, 10, device="cuda"))  # example shape matching the C++ snippet below
:::c++
#include <iostream>
#include <vector>

#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main(int argc, const char* argv[]) {
    std::string trt_aoti_module_path = "model.pt2";
    c10::InferenceMode mode;

    torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
    std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
    std::vector<torch::Tensor> outputs = loader.run(inputs);

    std::cout << "Result from the first inference:" << std::endl;
    std::cout << outputs << std::endl;

    return 0;
}
More information can be found at https://docs.pytorch.org/TensorRT/user_guide/runtime.html, along with a code example at https://github.com/pytorch/TensorRT/blob/release/2.8/examples/torchtrt_aoti_example/inference.cpp.
PTX Plugins
Stability: Stable
In Torch-TensorRT 2.7.0 we introduced auto-generated plugins, which allow users to automatically wrap kernels / PyTorch custom operators in TensorRT plugins so their models run without a graph break. In 2.8.0 we extend this system to support PTX-based plugins, which let users serialize and run their TensorRT engines without requiring PyTorch, Triton, or Python in the runtime, or access to the original kernel implementation. This approach also has lower overhead than the auto-generated plugin system, helping achieve maximum performance.
The example at https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/aot_plugin.py shows how to register a custom operator, generate the necessary plugin, and integrate it into the TensorRT execution graph.
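For orientation, the PyTorch side of this workflow starts from a registered custom operator with a fake (meta) implementation so the model can be traced and exported; the operator below is purely illustrative, and the PTX plugin generation itself is covered by the linked example.
:::py
import torch

# Hypothetical custom op; the name, signature, and body are illustrative only.
@torch.library.custom_op("my_lib::scaled_add", mutates_args=())
def scaled_add(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    return a + scale * b

# Fake (meta) implementation so the op can be traced with symbolic shapes.
@scaled_add.register_fake
def _(a, b, scale):
    return torch.empty_like(a)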
Hierarchical Multi-backend Adjacency Partitioner
Stability: Experimental
The Hierarchical Multi-backend Adjacency Partitioner enables sophisticated partitioning strategies for distributing PyTorch models across multiple backends based on operator support and priority ordering. A prototype partitioner has been added to the package that allows graphs to be split across multiple backends (e.g., TensorRT, PyTorch Inductor) based on operator capabilities. Given a backend preference order, operators are assigned to the highest-priority backend that supports them.
Please refer to the example for usage; a conceptual sketch of the assignment rule follows below.
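To make the priority rule concrete, here is a toy sketch of the assignment logic only; the backend names and support sets are invented for illustration, and this is not the partitioner's actual API.
:::py
# Toy illustration of priority-based backend assignment (not the real API).
backend_priority = ["tensorrt", "inductor"]
backend_support = {
    "tensorrt": {"aten.convolution", "aten.relu"},
    "inductor": {"aten.convolution", "aten.relu", "aten.topk"},
}

def assign_backend(op_name: str) -> str:
    # Pick the highest-priority backend that supports the op; otherwise
    # fall back to eager PyTorch execution.
    for backend in backend_priority:
        if op_name in backend_support[backend]:
            return backend
    return "pytorch"

print(assign_backend("aten.relu"))  # -> tensorrt
print(assign_backend("aten.topk"))  # -> inductor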
Model Optimizer-Based NVFP4 Quantization (PTQ) Support for Linux
Stability: Stable
Introducing NVFP4 for efficient and accurate low-precision inference on the Blackwell GPU architecture. Currently, the workflow supports quantizing models from FP16 → NVFP4.
Directly quantizing from FP32 → NVFP4 is not recommended as it may lead to accuracy degradation. Instead, first convert or train the model in FP16, then quantize to NVFP4.
Full example: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/apps/flux_demo.py
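Schematically, the NVFP4 PTQ step follows the usual Model Optimizer pattern. The config name `NVFP4_DEFAULT_CFG` and the surrounding setup below are assumptions based on that pattern; the full example linked above is the authoritative reference.
:::py
import modelopt.torch.quantization as mtq
import torch

# Recommended order: convert (or train) the model in FP16 first, then
# quantize to NVFP4; quantizing directly from FP32 can hurt accuracy.
model = model.half().cuda()

def calibrate(m):
    with torch.no_grad():
        for batch in calib_data:
            m(batch)

# Config name assumed from Model Optimizer's naming convention.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)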
`run_llm` and KV Caching
Stability: Beta
We’ve introduced a KV caching implementation for Torch-TensorRT using native TensorRT operations, yielding significant improvements in inference performance for autoregressive large language models (LLMs). KV caching is a crucial optimization that reduces latency by reusing attention activations across decoding steps. In our approach, the KV cache is modeled as fixed-size tensor inputs and outputs, with outputs from each decoding step looped back as inputs to update the cache incrementally. This update is performed using TensorRT-supported operations such as `slice`, `concat`, and `pad`. The design allows step-wise cache updates while preserving compatibility with TensorRT’s optimization workflow and engine serialization.
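As an illustration of the fixed-shape update pattern only (not the actual lowering code), the sketch below keeps the cache tensor's shape constant while writing the current step's entry using nothing more than slicing and concatenation:
:::py
import torch

def update_static_kv_cache(cache: torch.Tensor, new_kv: torch.Tensor, step: int) -> torch.Tensor:
    # cache:  [batch, max_seq_len, n_heads, head_dim], fixed shape
    # new_kv: [batch, 1, n_heads, head_dim], the current decoding step
    head = cache[:, :step]        # entries already filled
    tail = cache[:, step + 1:]    # remaining (padding) region
    # Output shape equals input shape, so it can be looped back as the next
    # step's cache input, matching the fixed-size I/O design described above.
    return torch.cat([head, new_kv, tail], dim=1)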
We’ve also introduced a new utility, `run_llm.py`, to run inference on popular LLMs with KV caching enabled.
To run a Qwen3 model using KV caching with Torch-TensorRT, use the following command:
:::bash
python run_llm.py --model Qwen/Qwen3-8B --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
Please refer to Compiling LLM models from Huggingface for more details and limitations.
Debugger
We introduced a new debugger to improve usability and the debugging experience in Torch-TensorRT. The debugger centralizes all debugging settings, such as the logging level (from critical to info) and engine profiling. We also introduced FX graph visualization in the debugger, where you can specify the lowering pass before or after which the graph should be drawn. Moreover, the debugger can produce engine profiling and layer information compatible with TREX, an engine visualization tool developed by TensorRT, which better explains the engine structure.
Model Zoo
We have expanded support to include several popular models from the Qwen3 and Llama3 series. In this release, we’ve also addressed various performance and accuracy issues to improve overall stability. For a complete list of supported models, please refer to the Supported Models section.
Bug Fixes
Refit
- Refit has been re-enabled for Python 3.13 after being disabled in 2.7.0.
- Reduced memory overhead by offloading the model to the CPU.
Performance improvements
- The linear converter was reverted to its earlier implementation, which shows performance improvements in FP16 on some models (e.g., BERT).
- The group norm converter was simplified to reduce unnecessary TensorRT ILayers.
- The constants in the batch norm converter are now folded at compile time, leading to significant performance improvements.
- The SDPA op decomposition was optimized, resulting in performance equal to or better than ONNX-TensorRT for transformer-based diffusion models such as Stable Diffusion 3, WAN2.1, and FLUX.
What's Changed
- chore: bump torch to 2.8.0.dev by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3449
- Nccl ops correction changes by @apbose in https://github.com/pytorch/TensorRT/pull/3387
- fix: Change the translational layer from numpy to torch during conversion to handle additional data types by @peri044 in https://github.com/pytorch/TensorRT/pull/3445
- Fix grid_sample by @HolyWu in https://github.com/pytorch/TensorRT/pull/3340
- fix: Destory cuda graphs before setting weight streaming by @keehyuna in https://github.com/pytorch/TensorRT/pull/3461
- tool: uv setting to avoid the pip install -e by @narendasan in https://github.com/pytorch/TensorRT/pull/3468
- chore: reenable py313 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3455
- bf16 support for elementwise operation by @apbose in https://github.com/pytorch/TensorRT/pull/3462
- feat: rmsnorm lowering by @bowang007 in https://github.com/pytorch/TensorRT/pull/3440
- feat: Support flashinfer.rmsnorm by @bowang007 in https://github.com/pytorch/TensorRT/pull/3424
- fix: support masked_scatter by lowering path and corner case of maske… by @chohk88 in https://github.com/pytorch/TensorRT/pull/3476
- fix: index_put converter to handle multi-shape slicing with None by @chohk88 in https://github.com/pytorch/TensorRT/pull/3475
- slight code reorg and bug correction for cross_compile by @apbose in https://github.com/pytorch/TensorRT/pull/3472
- Enabled refit on Python 3.13 by @cehongwang in https://github.com/pytorch/TensorRT/pull/3481
- fix: l2_limit_for_tiling by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3479
- chore: test bf16 fixes in CI by @peri044 in https://github.com/pytorch/TensorRT/pull/3491
- add python3.13 into the final release artifact by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3499
- chore: remove pre-cxx11 abi by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3473
- disabling dla args for hope igx platform by @apbose in https://github.com/pytorch/TensorRT/pull/3487
- chore: remove pre-cxx11 abi references in doc by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3503
- Fix Windows CI for Release 2.7 (#3505) by @narendasan in https://github.com/pytorch/TensorRT/pull/3506
- upgrade modelopt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3511
- chore: miscellaneous fixes for handling graph breaks by @peri044 in https://github.com/pytorch/TensorRT/pull/3488
- add nspect ignore file by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3514
- Update mutable_torchtrt_module_example.py by @cehongwang in https://github.com/pytorch/TensorRT/pull/3519
- Add Linux CI build for aarch64 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3516
- chore: update the docstring for llama2 rmsnorm automatic plugin example by @bowang007 in https://github.com/pytorch/TensorRT/pull/3512
- chore(deps): bump undici from 5.28.5 to 5.29.0 in /.github/actions/assigner by @dependabot[bot] in https://github.com/pytorch/TensorRT/pull/3520
- fix docker build failure: add allow_empty to true by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3526
- Added CPU offloading by @cehongwang in https://github.com/pytorch/TensorRT/pull/3452
- chore(deps): bump setuptools from 70.2.0 to 78.1.1 in /toolchains/jp_workspaces by @dependabot[bot] in https://github.com/pytorch/TensorRT/pull/3523
- add feature gate for tensorrt plugin by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3518
- chore(deps): bump transformers from 4.48.0 to 4.50.0 in /examples/dynamo by @dependabot[bot] in https://github.com/pytorch/TensorRT/pull/3497
- Minor fix - check for DTensor on igpu platform by @apbose in https://github.com/pytorch/TensorRT/pull/3531
- fix: wrong dtype and device in aten.full_like decomposition by @junstar92 in https://github.com/pytorch/TensorRT/pull/3535
- feat: Implement SDPA op converter / lowering pass as extensions by @peri044 in https://github.com/pytorch/TensorRT/pull/3534
- nvidia-modelopt dependency fix by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3544
- Add jetson build on CI by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3524
- feat: TensorRT AOT Plugin by @bowang007 in https://github.com/pytorch/TensorRT/pull/3504
- Publish jetson wheel to pytorch nightly index by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3550
- fix: handle device in the same way as dtype in aten.full_like decomposition by @junstar92 in https://github.com/pytorch/TensorRT/pull/3538
- fix the jetson nightly build check bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3552
- fix int8/fp8 constant folding issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3543
- Upgrade to TensorRT 10.11 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3557
- Cross compile guard by @apbose in https://github.com/pytorch/TensorRT/pull/3486
- fix: Fix constant folding failure due to modelopt by @peri044 in https://github.com/pytorch/TensorRT/pull/3565
- add --no-deps for tests/py/requirements.txt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3569
- Add fp4 support by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3532
- fix: Fix a perf regression due to weights being ITensors by @peri044 in https://github.com/pytorch/TensorRT/pull/3568
- Added flux demo by @cehongwang in https://github.com/pytorch/TensorRT/pull/3418
- FX graph visualization by @cehongwang in https://github.com/pytorch/TensorRT/pull/3528
- fix main test failure bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3590
- Verify C++ tests, fix cuda graphs union issue by @narendasan in https://github.com/pytorch/TensorRT/pull/3589
- Fix: fix aot plugin example docstring issue by @bowang007 in https://github.com/pytorch/TensorRT/pull/3595
- feat: working uv pyproject.toml by @narendasan in https://github.com/pytorch/TensorRT/pull/3597
- remove torchvision dependency from build, optional for test by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3598
- Changed weight map to tensor and fix the refit bug by @cehongwang in https://github.com/pytorch/TensorRT/pull/3573
- test failed but displayed as green by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3599
- Import dllist only on linux by @HolyWu in https://github.com/pytorch/TensorRT/pull/3592
- feat: Hierarchical Partitioner to support multi-backends by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3539
- fix dynamo converter test case failure by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3594
- feat: Saving modules using the AOTI format by @narendasan in https://github.com/pytorch/TensorRT/pull/3567
- skip flashinfer-python for py3.9 due to upstream error by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3605
- fix enabled_precisions error in test cases by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3606
- debug flag is deprecated, remove it so that test won't complain by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3610
- fix: add prefix in hierarchical_partitioner_example by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3607
- fix: pre-commit issues by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3603
- py39 does not like | E TypeError: unsupported operand type(s) for |: 'type' and 'EnumMeta' by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3611
- fix cross compilation test bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3609
- TorchTensorRTModule Serialization Fix by @cehongwang in https://github.com/pytorch/TensorRT/pull/3572
- a few CI changes by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3612
- remove debug flag by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3618
- fix: Fix unbacked sym int not found issue by @peri044 in https://github.com/pytorch/TensorRT/pull/3617
- fix ts fe test error. by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3619
- disable test on aarch64 for now by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3623
- disable aoti format in windows by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3632
- release 2.8 branch cut by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3638
- cherry pick 3636 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3640
- cherry pick 3642 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3655
- Lluo/cherry pick 3629 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3656
- Lluo/cherry pick 3620 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3658
- cherry pick 3663: fix the int8 quantization error, remove duplicated lines by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3665
- cherry pick 3660 to release/2.8 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3661
- cherry pick 3685: disable jetson build in ci by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3688
- cherry pick 3680: fix refit test bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3687
- cherry-pick 3686: upgrade tensorrt from 10.11 to 10.12 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3690
- cherry pick 3689 to 2.8 release:flux fp4 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3696
- chore: cherry pick of KV cache PR (3527) by @peri044 in https://github.com/pytorch/TensorRT/pull/3667
- Cherrypick of PR 3513 by @apbose in https://github.com/pytorch/TensorRT/pull/3664
- Cherrypick of PR 3570 by @apbose in https://github.com/pytorch/TensorRT/pull/3662
- chore: cherry pick of bf16 cast PR (3643) by @peri044 in https://github.com/pytorch/TensorRT/pull/3666
- Cherrypick [#3719] for release/2.8 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3734
- Cherrypick [#3703] for release/2.8 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3735
- enable back jetpack build by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3720
- add typing_extensions as test dependencies which is required by modelopt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3743
- broadcast_remove - cherry pick 3700 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3757
- fix typing-extensions issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3761
- Fix Jetson FP4 gate issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3764
- fix build cancellation issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3768
Full Changelog: https://github.com/pytorch/TensorRT/compare/v2.7.0...v2.8.0-rc6