The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, to accelerate models.
- [Input] Model Optimizer currently accepts Hugging Face, PyTorch, and ONNX models as input.
- [Optimize] Model Optimizer provides Python APIs that let users easily compose the above model optimization techniques and export an optimized, quantized checkpoint. Model Optimizer also integrates with NVIDIA NeMo, Megatron-LM, and Hugging Face Accelerate for optimization techniques that require training.
- [Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated by Model Optimizer is ready for deployment in downstream inference frameworks such as SGLang, TensorRT-LLM, TensorRT, or vLLM.
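To make the core idea behind the quantization workflow concrete, here is a minimal, self-contained sketch of symmetric per-tensor INT8 weight quantization in plain Python. The function names are hypothetical illustrations only, not the ModelOpt API; see the linked examples and docs for the actual `modelopt` usage.

```python
# Illustrative sketch of symmetric per-tensor INT8 quantization, the basic
# mechanism behind post-training quantization. NOT the ModelOpt API.

def quantize_int8(weights):
    """Map float weights to INT8 values plus a single scale factor."""
    amax = max(abs(w) for w in weights)           # calibration: absolute max
    scale = amax / 127.0 if amax > 0 else 1.0     # 127 = INT8 positive range
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights for computation."""
    return [v * scale for v in q]

weights = [0.5, -1.25, 0.03, 2.0]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Each recovered weight lies within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
```

Storing INT8 values instead of FP32 is where the 2x-4x size reduction comes from; real deployments additionally use per-channel or block scales and formats such as FP8 and NVFP4.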
- [2025/08/29] Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
- [2025/08/01] Optimizing LLMs for Performance and Accuracy with Post-Training Quantization
- [2025/06/24] Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
- [2025/05/14] NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs
- [2025/04/21] Adobe optimized deployment using TensorRT-Model-Optimizer + TensorRT, achieving a 60% reduction in diffusion latency and a 40% reduction in total cost of ownership.
- [2025/04/05] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. Check out how to quantize Llama4 for deployment acceleration here
- [2025/03/18] World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, Llama-3.1-405B-Instruct-FP4
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ here.
- [2025/01/28] Model Optimizer is now open source!
- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: 8B, 70B, 405B.
- [2024/09/10] Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer.
Previous News
- [2024/08/28] Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs
- [2024/08/28] Up to 1.9X Higher Llama 3.1 Performance with Medusa
- [2024/08/15] New features in recent releases: Cache Diffusion, QLoRA workflow with NVIDIA NeMo, and more. Check out our blog for details.
- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow here
- [2024/05/08] Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance
- [2024/03/27] Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records
- [2024/03/18] GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT
- [2024/03/07] Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster
- [2024/02/01] Speed up inference with Model Optimizer quantization techniques in TRT-LLM
To install stable release packages for Model Optimizer with pip from PyPI:

```bash
pip install nvidia-modelopt[all]
```
To install from source in editable mode with all development dependencies, or to test the latest changes, run:

```bash
# Clone the Model Optimizer repository
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
pip install -e .[dev]
```
Visit our installation guide for finer-grained control over installed dependencies, or see our pre-made dockerfiles for more information.
Technique | Description | Examples | Docs |
---|---|---|---|
Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs] |
Quantization Aware Training | Refine accuracy even further with a few training steps! | [NeMo] [Hugging Face] | [docs] |
Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [PyTorch] | [docs] |
Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [NeMo] [Hugging Face] | [docs] |
Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs] |
Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations! | [PyTorch] | [docs] |
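To illustrate the sparsity row above, the sketch below shows 2:4 structured sparsity in plain Python: keep the 2 largest-magnitude weights in every group of 4, then store only the non-zero values and their locations. The helper names are hypothetical, not the ModelOpt API.

```python
# Illustrative sketch (NOT the ModelOpt API): 2:4 structured sparsity keeps
# the 2 largest-magnitude weights in every group of 4, then stores only the
# surviving values plus their positions, roughly halving weight storage.

def prune_2_4(weights):
    """Zero out all but the 2 largest-magnitude weights per group of 4."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(g if j in keep else 0.0 for j, g in enumerate(group))
    return pruned

def compress(pruned):
    """Store only non-zero values and their locations."""
    return [(i, v) for i, v in enumerate(pruned) if v != 0.0]

w = [0.1, -2.0, 0.05, 1.5, 3.0, 0.2, -0.1, -4.0]
sparse = prune_2_4(w)      # two survivors per group of four
packed = compress(sparse)  # (index, value) pairs for the non-zeros
```

The fixed 2-of-4 pattern is what lets NVIDIA GPU sparse tensor cores skip the zeroed weights at inference time, unlike arbitrary unstructured sparsity.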
- Ready-to-deploy checkpoints [🤗 Hugging Face - NVIDIA TensorRT Model Optimizer Collection]
- Deployable on TensorRT-LLM, vLLM and SGLang
- More models coming soon!
- Roadmap
- Documentation
- Benchmarks
- Release Notes
- File a bug
- File a feature request
Model Type | Support Matrix |
---|---|
LLM Quantization | View Support Matrix |
Diffusers Quantization | View Support Matrix |
VLM Quantization | View Support Matrix |
ONNX Quantization | View Support Matrix |
Windows Quantization | View Support Matrix |
Quantization Aware Training | View Support Matrix |
Pruning | View Support Matrix |
Distillation | View Support Matrix |
Speculative Decoding | View Support Matrix |
Model Optimizer is now open source! We welcome any feedback, feature requests, and PRs. Please read our Contributing guidelines for details on how to contribute to this project.
Happy optimizing!