TensorRT-LLM v0.17 Release #2725

Merged 1 commit into rel on Jan 30, 2025
Conversation

@schetlur-nv (Collaborator) commented Jan 30, 2025

TensorRT-LLM Release 0.17.0 [UPDATED 1/31]

Key Features and Enhancements

  • Blackwell support
    • NOTE: pip installation of TRT-LLM 0.17 is not supported on Blackwell platforms (other platforms are unaffected). Instead, it is recommended to build from source using the NVIDIA NGC 25.01 PyTorch container.
    • Added support for B200.
    • Added support for GeForce RTX 50 series using Windows Subsystem for Linux (WSL) for limited models.
    • Added NVFP4 Gemm support for Llama and Mixtral models.
    • Added NVFP4 support for the LLM API and the trtllm-bench command (a hedged LLM API sketch follows this list).
    • GB200 NVL is not fully supported.
    • Added a benchmark script to measure the performance benefits of KV cache host offload, with expected runtime improvements on GH200.
  • PyTorch workflow
    • The PyTorch workflow is an experimental feature in tensorrt_llm._torch. The following are the supported infrastructure, models, and features for the PyTorch workflow; a minimal usage sketch follows this list.
    • Added support for H100/H200/B200.
    • Added support for Llama models, Mixtral, QWen, Vila.
    • Added support for FP16/BF16/FP8/NVFP4 Gemm and fused Mixture-Of-Experts (MOE), FP16/BF16/FP8 KVCache.
    • Added custom context and decoding attention kernels support via PyTorch custom op.
    • Added support for chunked context (default off).
    • Added CudaGraph support for decoding only.
    • Added overlap scheduler support to overlap input preparation and model forward by decoding one extra token.
  • Added FP8 context FMHA support for the W4A8 quantization workflow.
  • Added ModelOpt quantized checkpoint support for the LLM API.
  • Added FP8 support for the Llama-3.2 VLM model. Refer to the “MLLaMA” section in examples/multimodal/README.md.
  • Added PDL support for userbuffer based AllReduce-Norm fusion kernel.
  • Added runtime support for seamless lookahead decoding.
  • Added token-aligned arbitrary output tensors support for the C++ executor API.
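As a hedged illustration of the NVFP4 item above: the snippet below requests NVFP4 weight quantization through the LLM API. `QuantConfig` and `QuantAlgo` are part of the LLM API, but the exact `NVFP4` enum member name and the need for Blackwell-class hardware are assumptions drawn from this release note, not verified against the 0.17 code.

```python
# Sketch only: NVFP4 quantization via the LLM API (assumes QuantAlgo.NVFP4 exists
# and a Blackwell GPU is available, per the release notes above).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

quant_config = QuantConfig(quant_algo=QuantAlgo.NVFP4)  # assumed member name

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder Llama checkpoint
    quant_config=quant_config,
)

for output in llm.generate(["NVFP4 quantization sketch: "]):
    print(output.outputs[0].text)
```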

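The PyTorch workflow sketch referenced above, assuming the experimental `tensorrt_llm._torch` entry point mirrors the standard LLM API (the checkpoint and prompt are placeholders):

```python
# Minimal sketch of the experimental PyTorch workflow; not an official example.
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM  # experimental PyTorch backend per the notes above

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder checkpoint
sampling_params = SamplingParams(max_tokens=32)

outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```
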
API Changes

  • [BREAKING CHANGE] KV cache reuse is enabled automatically when paged_context_fmha is enabled (see the sketch below for opting out explicitly).
  • Added --concurrency support for the throughput subcommand of trtllm-bench.
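A minimal sketch of opting out of the new KV cache reuse default, assuming the LLM API's `KvCacheConfig` exposes `enable_block_reuse` (the model name is a placeholder):

```python
# Sketch only: disable KV cache block reuse explicitly, since reuse is now
# enabled automatically when paged context FMHA is on.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(enable_block_reuse=False)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    kv_cache_config=kv_cache_config,
)

print(llm.generate(["KV cache reuse is disabled for this run: "])[0].outputs[0].text)
```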

Fixed Issues

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.01-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.01-py3.
  • The dependent TensorRT version is updated to 10.8.0.
  • The dependent CUDA version is updated to 12.8.0.
  • The dependent ModelOpt version is updated to 0.23 for Linux platform, while 0.17 is still used on Windows platform.

@zeroepoch merged commit 71213f7 into rel on Jan 30, 2025
@zeroepoch deleted the preview/rel branch on January 30, 2025 at 21:32
@aikitoria

This release mentions the min_p parameter, but it is nowhere to be found in the actual code.

@juney-nvidia (Collaborator)

> This release mentions the min_p parameter, but it is nowhere to be found in the actual code.

Sorry for the confusion. min_p support is already done in our internal repo but was not merged into the 0.17 release in time. It will be part of the next release. I have updated the release announcement to make it more precise. Thanks for the reminder!

June

@aikitoria

I see. Perhaps the main branch could be updated with the new preview code ahead of the next release? 👀

@MahmoudAshraf97 (Contributor)

Are you sure enc-dec FP8 is included in the README? It's not in the 0.17 branch or the main branch.

@juney-nvidia (Collaborator)

> I see. Perhaps the main branch could be updated with the new preview code ahead of the next release? 👀

Thanks for the suggestion. The main branch will be updated later in February to incorporate the min_p change (by the way, thanks for contributing min_p to TRT-LLM; it is the foundation of this feature support).

The reason the release/0.17 branch contains more new features than main this time is the Blackwell release, which makes the GitHub update process a little special :)

Thanks
June

@juney-nvidia (Collaborator)

> Are you sure enc-dec FP8 is included in the README? It's not in the 0.17 branch or the main branch.

Thanks for catching this, @MahmoudAshraf97. I checked with the team and there was a problem with the doc update process for the 0.17 release. We have just updated the 0.17 release info. Please review it and let us know if anything else is incorrect.

Thanks again for flagging this doc issue.

June

kaiyux pushed a commit that referenced this pull request on Feb 11, 2025

open source f8c0381a2bc50ee2739c3d8c2be481b31e5f00bd (#2736)

Co-authored-by: Kaiyu Xie <[email protected]>

Add note for blackwell (#2742)
Update the docs to workaround the extra-index-url issue (#2744)
update README.md (#2751)
Fix github io pages (#2761)