Update TensorRT-LLM (NVIDIA#1315)
kaiyux authored Mar 19, 2024
1 parent 4bb65f2 commit 66ca337
Showing 341 changed files with 18,906 additions and 17,300 deletions.
21 changes: 10 additions & 11 deletions README.md
@@ -206,17 +206,16 @@ architectures as well as important features implemented in TensorRT-LLM.

### Devices

-TensorRT-LLM is rigorously tested on the following GPUs:
+TensorRT-LLM supports the following architectures:

-* [H100](https://www.nvidia.com/en-us/data-center/h100/)
-* [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
-* [A100](https://www.nvidia.com/en-us/data-center/a100/)
-* [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
-* [V100](https://www.nvidia.com/en-us/data-center/v100/) (experimental)
+* [NVIDIA Hopper](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) (SM90), for example, H200, H100, H20
+* [NVIDIA Ada Lovelace](https://www.nvidia.com/en-us/geforce/ada-lovelace-architecture/) (SM89), for example, L40S, L20, L4
+* [NVIDIA Ampere](https://www.nvidia.com/en-us/data-center/ampere-architecture/) (SM80, SM86), for example, A100, A30, A10G
+* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) (SM75), for example, T4
+* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) (SM70 - experimental), for example, V100

-If a GPU is not listed above, it is important to note that TensorRT-LLM is
-expected to work on GPUs based on the Volta, Turing, Ampere, Hopper and Ada
-Lovelace architectures. Certain limitations may, however, apply.
+It is important to note that TensorRT-LLM is expected to work on all GPUs based on the Volta, Turing, Ampere, Hopper, and Ada Lovelace architectures. Certain limitations may apply.
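
If you want to check which architecture family a given GPU falls into, you can query its compute capability directly. A minimal sketch, assuming a reasonably recent NVIDIA driver (the `compute_cap` field is not reported by older `nvidia-smi` releases):

```
# Print each GPU's name and compute capability (e.g. "NVIDIA A100, 8.0")
nvidia-smi --query-gpu=name,compute_cap --format=csv
```

Anything reporting 7.0 (SM70) or newer falls into one of the architecture families listed above.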

### Precision

@@ -273,7 +272,7 @@ The list of supported models is:
* [Blip2](examples/blip2)
* [BLOOM](examples/bloom)
* [ChatGLM](examples/chatglm)
-* [FairSeq NMT](examples/nmt)
+* [FairSeq NMT](examples/enc_dec/nmt)
* [Falcon](examples/falcon)
* [Flan-T5](examples/enc_dec)
* [GPT](examples/gpt)
@@ -406,7 +405,7 @@ As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
dedicated MPI environment, not the one provided by your Slurm allocation.

-For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
+For example: `mpirun -n 1 python3 examples/run.py ...`

## Release notes

36 changes: 28 additions & 8 deletions benchmarks/cpp/README.md
@@ -22,7 +22,7 @@ instead, and be sure to set DLL paths as specified in

Before you launch C++ benchmarking, please make sure that you have already built the engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

-You can use the [`build.py`](source:benchmarks/python/build.py) script to build the engine(s). Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
+Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python runtime, you can reuse the engine(s) built previously; see that [`document`](../python/README.md).
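
For illustration only, a minimal `trtllm-build` sketch; the checkpoint path and flag values here are placeholders, and the available flags vary across models and TensorRT-LLM versions, so treat `trtllm-build --help` as the source of truth:

```
# Sketch: build an engine from a converted checkpoint (paths are placeholders)
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engines/my-model \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 256
```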

#### Launch benchmarking

@@ -73,19 +73,39 @@ This tool can be used in 2 different modes of traffic generation.

##### 1 – Dataset

“Prompt”, “Instruction” (optional) and “Answer” are specified as sentences in a JSON file.

The tool will tokenize the words and instruct the model to generate a specified number of output tokens for a request.

```
python3 prepare_dataset.py \
    --tokenizer <path/to/tokenizer> \
    --output preprocessed_dataset.json
-    --request-rate 10 \
-    --time-delay-dist exponential_dist \
+    [--request-rate 10] \
+    [--time-delay-dist exponential_dist] \
+    dataset
+    --dataset-name <name of the dataset> \
+    --dataset-input-key <dataset dictionary key for input> \
+    --dataset-prompt-key <dataset dictionary key for prompt> \
+    --dataset-output-key <dataset dictionary key for output> \
+    [--num-requests 100] \
+    [--max-input-len 1000] \
+    [--output-len-dist 100,10]
```
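
The bracketed arguments are optional. The two values passed to `--output-len-dist` appear to follow the same mean/standard-deviation convention as the token distribution mode described below, i.e. output lengths averaging 100 tokens with a standard deviation of 10.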

For datasets that don't have a prompt key, set `--dataset-prompt` instead.
Take [cnn_dailymail dataset](https://huggingface.co/datasets/cnn_dailymail) for example:
```
python3 prepare_dataset.py \
    --tokenizer <path/to/tokenizer> \
    --output cnn_dailymail.json
-    --dataset <path/to/dataset> \
-    --max-input-len 300
+    dataset
+    --dataset-name cnn_dailymail \
+    --dataset-config-name 3.0.0 \
+    --dataset-input-key article \
+    --dataset-prompt "Summarize the following article:" \
+    --dataset-output-key "highlights" \
+    [--num-requests 100] \
+    [--max-input-len 1000] \
+    [--output-len-dist 100,10]
```
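
As a quick sanity check that the key names above match the dataset schema, you can peek at one record with the Hugging Face `datasets` library (assumed to be installed; the split is illustrative):

```
# Confirm that cnn_dailymail 3.0.0 exposes the "article" and "highlights" keys
python3 -c "from datasets import load_dataset; \
print(load_dataset('cnn_dailymail', '3.0.0', split='validation')[0].keys())"
# -> dict_keys(['article', 'highlights', 'id'])
```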

##### 2 – Normal token length distribution
@@ -94,7 +114,7 @@ This mode allows the user to generate normal token length distributions with a mean and std deviation specified by the user.
For example, setting mean=100 and std dev=10 would generate requests where 95.4% of values are in the <80,120> range, following the normal probability distribution. Setting std dev=0 will generate all requests with the same mean number of tokens.
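
The 95.4% figure is simply the probability mass within two standard deviations of the mean of a normal distribution, which can be confirmed with the standard library:

```
# P(80 <= X <= 120) for X ~ N(100, 10), i.e. within two standard
# deviations of the mean: erf(2/sqrt(2)) ~= 0.9545
python3 -c "import math; print(math.erf(2 / math.sqrt(2)))"
```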

```
python prepare_dataset.py \
    --output token-norm-dist.json \
    --request-rate 10 \
    --time-delay-dist constant \
    ...
```
