TensorRT-LLM v1.1.0

| Name | Modified | Size |
|------|----------|------|
| README.md | 2025-12-11 | 125.0 kB |
| v1.1.0 source code.tar.gz | 2025-12-11 | 338.9 MB |
| v1.1.0 source code.zip | 2025-12-11 | 343.1 MB |
| Totals: 3 items | | 682.2 MB |

Known Issue

  • If a project declares `tensorrt-llm==1.1.0` as a dependency in its `pyproject.toml`:

```toml
dependencies = [
    "tensorrt-llm==1.1.0",
]
```

then installing the project's dependencies with `uv sync` fails with the following message:

```
No solution found when resolving dependencies for split (markers: python_full_version >= '3.13' and sys_platform == 'darwin'):
╰─▶ Because patchelf==0.18.0.0 was yanked (reason: https://github.com/mayeut/patchelf-pypi/issues/87)
    and tensorrt-llm==1.1.0 depends on patchelf==0.18.0, we can conclude that tensorrt-llm==1.1.0
    cannot be used. And because your project depends on tensorrt-llm==1.1.0, we can conclude that
    your project's requirements are unsatisfiable.
```

This happens because `patchelf` 0.18.0 was yanked by its author.

A valid workaround is to add the following block to `pyproject.toml`:

```toml
[tool.uv]
override-dependencies = [
    "patchelf==0.17.2.4",
]
```

Key Features and Enhancements

  • Model Support
      • Added support for GPT-OSS, Hunyuan-Dense (contributed by @sorenwu), Hunyuan-MoE (contributed by @qianbiaoxiang), and Seed-OSS (contributed by @Nekofish-Li).
  • Features
      • Connector API: Introduced a new KV Cache Connector API for state transfer in disaggregated serving.
      • Reuse & Offloading: Enabled KV cache reuse for MLA (Multi-Head Latent Attention) and added examples for host offloading.
      • Salting: Implemented KV cache salting for secure cache reuse.
      • Guided Decoding Integration: Enabled guided decoding to work in conjunction with speculative decoding (including 2-model and draft-model chunked prefill).
      • Eagle: Added multi-layer Eagle support and optimizations.
      • CuteDSL: Integrated CuteDSL NVFP4 grouped GEMM for Blackwell.
      • B300/GB300: Added support for B300/GB300.
  • Documentation
      • Deployment Guides: Added comprehensive deployment guides for GPT-OSS, DeepSeek-R1, and VDR 1.0.
      • Feature Documentation: Created new documentation for KV Cache Connector, LoRA feature usage, and AutoDeploy.
      • Tech Blogs: Published blogs on “Combining Guided Decoding and Speculative Decoding” and “ADP Balance Strategy”.
      • Quick Start: Refined Quick Start guides with new links to ModelOpt checkpoints and updated installation steps (Linux/Windows).
      • API Reference: Enhanced LLM API documentation by explicitly labeling stable vs. unstable APIs.
      • Performance: Updated online benchmarking documentation and performance overview pages.
      • Examples: Refined Slurm examples and added K2 tool calling examples.
  • Infrastructure Changes
      • The base Docker image for TensorRT-LLM: nvcr.io/nvidia/pytorch:25.10-py3.
      • The base Docker image for TensorRT-LLM Backend: nvcr.io/nvidia/tritonserver:25.10-py3.
      • The dependent public PyTorch version: 2.9.0.
      • The dependent NVIDIA ModelOpt version: 0.37.
      • The dependent xgrammar version: 0.1.25.
      • The dependent transformers version: 4.56.0.
      • The dependent NIXL version: 0.5.0.
  • API Changes
      • Breaking Change: The C++ TRTLLM sampler is now enabled by default, replacing the legacy implementation.
      • KV Cache Connector API: Introduced a new KV Cache Connector API.
      • Standardized topk logprob returns across TRT and PyTorch backends.
      • Added stable labels to arguments in the LLM class to better indicate API stability.
      • Wait and Cancel API: Added tests and support for handling non-existent and completed request cancellations in the executor.
  • Fixed Issues
      • Fixed an illegal memory access, weight-loading issues for DeepSeek-R1 W4A8, CUDA graph warmup issues with speculative decoding, and memory leaks.
  • Known Issues
      • GB300 Multi-Node: Support for GB300 in multi-node configurations is in beta and not fully validated in this release; it has been validated in 1.2.0rc4 and later.
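The KV cache salting feature listed above mixes a per-request or per-tenant salt into the hash that identifies reusable cache blocks, so blocks are only shared between requests that present the same salt. The following is a minimal sketch of the idea; the function name and hashing scheme are illustrative assumptions, not TensorRT-LLM's implementation:

```python
import hashlib

def cache_block_key(token_ids, salt):
    """Hash a block of token ids together with a salt.

    Two requests map to the same cache key (and can therefore reuse each
    other's KV cache blocks) only if they share both the tokens and the salt.
    """
    h = hashlib.sha256()
    h.update(salt.encode())
    for t in token_ids:
        h.update(t.to_bytes(4, "little"))
    return h.hexdigest()

# Same tokens, same salt -> same key (reuse allowed).
k1 = cache_block_key([101, 7592, 2088], salt="tenant-a")
k2 = cache_block_key([101, 7592, 2088], salt="tenant-a")
# Same tokens, different salt -> different key (no cross-tenant reuse).
k3 = cache_block_key([101, 7592, 2088], salt="tenant-b")
```

Without the salt, any request with a matching token prefix could hit another tenant's cached blocks; salting restricts reuse to a trust domain.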
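To illustrate the kind of output standardized by the "topk logprob returns" change above, here is a minimal pure-Python sketch of computing top-k logprobs for one decoding step. The function name and return shape are illustrative assumptions, not the library's API:

```python
import math

def topk_logprobs(logits, k):
    """Return the top-k (token_id, logprob) pairs for one decoding step.

    Logprobs come from a numerically stable log-softmax over the full
    vocabulary; the k highest entries are then selected.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]
    ranked = sorted(range(len(logits)), key=lambda i: logprobs[i], reverse=True)
    return [(i, logprobs[i]) for i in ranked[:k]]

# Example: a tiny 4-token vocabulary.
pairs = topk_logprobs([2.0, 0.5, 1.0, -1.0], k=2)
```

Standardizing on a shape like this means client code can consume logprobs identically whether the TRT or PyTorch backend produced them.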

What's Changed

New Contributors

Full Changelog: https://github.com/NVIDIA/TensorRT-LLM/compare/v1.0.0...v1.1.0

Source: README.md, updated 2025-12-11