TensorRT-LLM is an open-source, high-performance inference library designed to optimize and accelerate large language model (LLM) deployment on NVIDIA GPUs. It provides a Python API built on top of PyTorch that lets developers define, customize, and deploy LLMs efficiently across hardware configurations ranging from a single GPU to large multi-node clusters. The library maximizes throughput and minimizes latency through techniques such as quantization, custom attention kernels, and optimized memory management. It also supports advanced inference methods like speculative decoding and in-flight batching, enabling real-time and large-scale AI applications, and it integrates with NVIDIA’s broader inference ecosystem, including Triton Inference Server and distributed deployment frameworks, making it suitable for production environments.
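
As a rough illustration of that workflow, the sketch below follows the pattern from the project's quickstart for its high-level LLM API; the model name and sampling settings are placeholders, not recommendations:

    from tensorrt_llm import LLM, SamplingParams

    # Load a Hugging Face checkpoint; TensorRT-LLM builds an optimized
    # engine for the detected GPU behind the scenes.
    # (Model name is a placeholder.)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sampling settings are illustrative.
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # Batched generation; the runtime schedules requests with
    # in-flight batching.
    for out in llm.generate(["What is TensorRT-LLM?"], params):
        print(out.outputs[0].text)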

Features

  • Advanced quantization support including FP8, FP4, and INT8 (a configuration sketch follows this list)
  • Custom attention kernels for optimized inference
  • Support for multi-GPU and multi-node deployments
  • In-flight batching and paged KV cache optimization
  • Modular PyTorch-based API for customization
  • Integration with Triton Inference Server and NVIDIA ecosystem
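
As a sketch of how the quantization and multi-GPU items above are typically wired together through the same high-level LLM API (the QuantConfig/QuantAlgo names and the tensor_parallel_size keyword follow the library's documented llmapi module, though exact spellings can vary between releases; the model name is a placeholder):

    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

    # FP8 quantization; assumes hardware with native FP8 support
    # (e.g. Hopper-class GPUs).
    quant = QuantConfig(quant_algo=QuantAlgo.FP8)

    # tensor_parallel_size shards the model's weights across two GPUs.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        quant_config=quant,
        tensor_parallel_size=2,
    )

    for out in llm.generate(["Summarize TensorRT-LLM in one sentence."]):
        print(out.outputs[0].text)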

License

Apache License 2.0

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Artificial Intelligence Software

Registered

2026-03-18