Name  Modified  Size
torch_tensorrt-2.8.0-cp313-cp313-manylinux_2_28_aarch64.whl 2025-08-09 3.5 MB
torch_tensorrt-2.8.0-cp312-cp312-manylinux_2_28_aarch64.whl 2025-08-09 3.5 MB
torch_tensorrt-2.8.0-cp311-cp311-manylinux_2_28_aarch64.whl 2025-08-09 3.5 MB
torch_tensorrt-2.8.0-cp310-cp310-manylinux_2_28_aarch64.whl 2025-08-09 3.5 MB
torch_tensorrt-2.8.0-cp39-cp39-manylinux_2_28_aarch64.whl 2025-08-09 3.5 MB
torch_tensorrt-2.8.0+cu126-cp310-cp310-linux_aarch64.whl 2025-08-08 3.3 MB
torch_tensorrt-2.8.0-cp312-cp312-win_amd64.whl 2025-08-08 1.8 MB
torch_tensorrt-2.8.0-cp313-cp313-win_amd64.whl 2025-08-08 1.8 MB
torch_tensorrt-2.8.0-cp310-cp310-win_amd64.whl 2025-08-08 1.8 MB
torch_tensorrt-2.8.0-cp311-cp311-win_amd64.whl 2025-08-08 1.8 MB
torch_tensorrt-2.8.0-cp39-cp39-win_amd64.whl 2025-08-08 1.8 MB
libtorchtrt-2.8.0-tensorrt10.12.0-cuda128-libtorch2.8.0-x86_64-linux.tar.gz 2025-08-08 2.7 MB
torch_tensorrt-2.8.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_34_x86_64.whl 2025-08-08 15.1 MB
torch_tensorrt-2.8.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_34_x86_64.whl 2025-08-08 15.1 MB
torch_tensorrt-2.8.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_34_x86_64.whl 2025-08-08 15.1 MB
torch_tensorrt-2.8.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_34_x86_64.whl 2025-08-08 15.0 MB
torch_tensorrt-2.8.0-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_34_x86_64.whl 2025-08-08 15.0 MB
README.md 2025-08-08 20.0 kB
Torch-TensorRT v2.8.0 source code.tar.gz 2025-08-08 67.5 MB
Torch-TensorRT v2.8.0 source code.zip 2025-08-08 73.8 MB
Totals: 20 items, 249.3 MB

PyTorch 2.8, CUDA 12.8, TensorRT 10.12, Python 3.13

Torch-TensorRT 2.8.0 targets PyTorch 2.8, TensorRT 10.12, CUDA 12.6/12.8/12.9, and Python 3.9 through 3.13 on standard Linux x86-64 and Windows.

Platform support

In addition to the standard Windows x86-64 and Linux x86-64 releases, we now provide binary builds for SBSA (ARM server) and Jetson platforms.

Deprecations

New Features

AOT-Inductor Pythonless Deployment

Stability: Beta

Historically, TorchScript has been used to run Torch-TensorRT programs outside of a Python interpreter. Both the Dynamo/torch.compile frontend and the TorchScript frontend supported this TorchScript deployment workflow.

Old
:::py
import torch
import torch_tensorrt

# Compile with the Dynamo frontend, then trace and save as TorchScript
trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(trt_model, inputs=[...])
ts_model.save("trt_model.ts")

Now you can achieve a similar result using AOTInductor, a specialized version of TorchInductor designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts are specifically crafted for deployment in non-Python environments.

Torch-TensorRT can embed TensorRT engines in AOTInductor libraries to accelerate models further. You can also combine Inductor kernels with TensorRT engines via this method, allowing users to deploy their models outside of Python using torch.compile-native technologies.

New
:::py
import os

import torch
import torch_tensorrt

# `model`, `compile_settings`, and `example_inputs` are defined elsewhere
with torch.no_grad():
    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)
    torch_tensorrt.save(
        cg_trt_module,
        file_path=os.path.join(os.getcwd(), "model.pt2"),
        output_format="aot_inductor",
        retrace=True,
        arg_inputs=example_inputs,
    )

This model.pt2 file can then be loaded in either Python or C++ using Torch APIs.

:::py
import os

import torch
import torch_tensorrt

# Load the AOTInductor package containing the embedded TensorRT engines
model = torch._inductor.aoti_load_package(os.path.join(os.getcwd(), "model.pt2"))
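
For illustration, the loaded package behaves like a callable module; the (8, 10) CUDA input shape below is hypothetical and simply mirrors the C++ example that follows.

:::py
inputs = [torch.randn(8, 10, device="cuda")]
with torch.inference_mode():
    outputs = model(*inputs)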

:::c++
#include <iostream>
#include <vector>

#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main(int argc, const char* argv[]) {
  std::string trt_aoti_module_path = "model.pt2";
  c10::InferenceMode mode;

  // Load the AOTInductor package and run inference on a random CUDA tensor
  torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
  std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
  std::vector<torch::Tensor> outputs = loader.run(inputs);
  std::cout << "Result from the first inference:" << std::endl;
  std::cout << outputs << std::endl;

  return 0;
}

More information can be found in the runtime documentation (https://docs.pytorch.org/TensorRT/user_guide/runtime.html), as well as in a code example here: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/torchtrt_aoti_example/inference.cpp

PTX Plugins

Stability: Stable

In Torch-TensorRT 2.7.0 we introduced auto-generated plugins, which allow users to automatically wrap kernels / PyTorch custom operators in TensorRT plugins so their models run without a graph break. In 2.8.0 we extend this system to support PTX-based plugins, which enables users to serialize and run their TensorRT engines without requiring PyTorch, Triton, or Python in the runtime, or access to the original kernel implementation. This approach also has lower overhead than the auto-generated plugin system, helping achieve maximum performance.

The example (https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/aot_plugin.py) shows how to register a custom operator, generate the necessary plugin, and integrate it into the TensorRT execution graph.
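
As a rough sketch only (the operator name, Triton kernel, and shapes are hypothetical, and the plugin generation and converter registration themselves follow the linked aot_plugin.py example), the custom-operator half of the workflow looks like this:

:::py
import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(y_ptr + offsets, x + 1.0, mask=mask)

# Register the kernel as a PyTorch custom operator so the Dynamo frontend can trace it
@torch.library.custom_op("my_ops::add_one", mutates_args=())
def add_one(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x)
    n = x.numel()
    add_one_kernel[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)
    return y

# Shape/dtype propagation for tracing; generating the PTX-based TensorRT plugin
# is done as in the linked example and is not reproduced here
@add_one.register_fake
def _(x):
    return torch.empty_like(x)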

Hierarchical Multi-backend Adjacency Partitioner

Stability: Experimental

The Hierarchical Multi-backend Adjacency Partitioner enables sophisticated model partitioning strategies for distributing PyTorch models across multiple backends based on operator support and priority ordering. A prototype partitioner has been added to the package that allows graphs to be split across multiple backends (e.g., TensorRT, PyTorch Inductor) based on operator capabilities. Given a backend preference order, each operator is assigned to the highest-priority backend that supports it.
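
To make the priority rule concrete, here is a purely conceptual sketch (not the library API; the backend names and operator sets are illustrative):

:::py
# Conceptual illustration of priority-based backend assignment
BACKEND_PRIORITY = ["tensorrt", "inductor"]  # highest priority first
SUPPORTED_OPS = {
    "tensorrt": {"aten.convolution", "aten.relu"},
    "inductor": {"aten.convolution", "aten.relu", "aten.nonzero"},
}

def assign_backend(op_name: str) -> str:
    # Each operator goes to the highest-priority backend that supports it,
    # falling back to eager PyTorch when no backend claims it.
    for backend in BACKEND_PRIORITY:
        if op_name in SUPPORTED_OPS[backend]:
            return backend
    return "pytorch"

assert assign_backend("aten.relu") == "tensorrt"
assert assign_backend("aten.nonzero") == "inductor"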

Please refer to the example for usage.

Model Optimizer-Based NVFP4 Quantization (PTQ) Support for Linux

Stability: Stable

Introducing NVFP4 for efficient and accurate low-precision inference on the Blackwell GPU architecture. Currently, the workflow supports quantizing models from FP16 → NVFP4.

Directly quantizing from FP32 → NVFP4 is not recommended, as it may lead to accuracy degradation. Instead, first convert or train the model in FP16, then quantize to NVFP4.
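
A minimal sketch of the FP16 → NVFP4 flow, assuming the NVIDIA TensorRT Model Optimizer (modelopt) is used for PTQ; the preset config name NVFP4_DEFAULT_CFG, the calib_loader, and the input shape are assumptions and may differ in your setup:

:::py
import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq  # NVIDIA TensorRT Model Optimizer

model = model.half().cuda()  # convert to FP16 first, per the recommendation above

def forward_loop(m):
    # Hypothetical calibration loop over a small dataset
    for batch in calib_loader:
        m(batch.half().cuda())

# Preset config name is an assumption; check the Model Optimizer docs for your version
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

with torch.no_grad():
    trt_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        arg_inputs=[torch.randn(1, 3, 224, 224, dtype=torch.half, device="cuda")],
    )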

Full example: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/apps/flux_demo.py

run_llm and KV Caching

Stability: Beta

We’ve introduced a KV caching implementation for Torch-TensorRT using native TensorRT operations, yielding significant improvements in inference performance for autoregressive large language models (LLMs). KV caching is a crucial optimization that reduces latency by reusing attention activations across decoding steps. In our approach, the KV cache is modeled as fixed-size tensor inputs and outputs, with outputs from each decoding step looped back as inputs to update the cache incrementally. This update is performed using TensorRT-supported operations such as slice, concat, and pad. The design allows step-wise cache updates while preserving compatibility with TensorRT’s optimization workflow and engine serialization.
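
For intuition only (this is not the library implementation), a fixed-size cache can be updated at each decoding step with exactly these slice, concat, and pad operations, so tensor shapes never change:

:::py
import torch
import torch.nn.functional as F

def update_kv_cache(cache: torch.Tensor, new_kv: torch.Tensor,
                    seq_len: int, max_seq: int) -> torch.Tensor:
    # cache:  [batch, heads, max_seq, head_dim] (fixed size)
    # new_kv: [batch, heads, 1, head_dim] for the current decoding step
    prefix = cache[:, :, :seq_len, :]               # slice: keep the valid entries
    updated = torch.cat([prefix, new_kv], dim=2)    # concat: append the new token
    pad_len = max_seq - seq_len - 1
    return F.pad(updated, (0, 0, 0, pad_len))       # pad: restore the fixed length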

We’ve also introduced a new utility, run_llm.py, to run inference on popular LLMs with KV caching enabled.

To run a Qwen3 model using KV caching with Torch-TensorRT, use the following command:

:::py
python run_llm.py --model Qwen/Qwen3-8B --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark

Please refer to Compiling LLM models from Huggingface for more details and limitations.

Debugger

We introduced a new debugger to improve usability and the debugging experience for Torch-TensorRT. The debugger centralizes all debugging settings, such as the logging level (from critical to info) and engine profiling. We also introduced FX graph visualization in the debugger, where you can specify the lowering pass before or after which you want to draw the graph. Moreover, the debugger can produce engine profiling and layer information compatible with TREX, an engine visualization tool developed by the TensorRT team, to better explain the engine structure.

Model Zoo

We have expanded support to include several popular models from the Qwen3 and Llama3 series. In this release, we’ve also addressed various performance and accuracy issues to improve overall stability. For a complete list of supported models, please refer to the Supported Models section.

Bug Fixes

Refit

Refit has been re-enabled for Python 3.13 after being disabled in 2.7.0.

  • Reduced memory overhead by offloading the model to the CPU

Performance improvements

  • The linear converter was reverted to the earlier implementation because it shows performance improvements in FP16 on some models (e.g., BERT)
  • The GroupNorm converter was simplified to remove unnecessary TensorRT ILayers
  • The constants in the BatchNorm converter are now folded at compile time, leading to significant performance improvements
  • The SDPA op decomposition was optimized, resulting in the same or better performance than ONNX-TensorRT for transformer-based diffusion models such as Stable Diffusion 3, WAN2.1, and FLUX

What's Changed

Full Changelog: https://github.com/pytorch/TensorRT/compare/v2.7.0...v2.8.0-rc6
