Update TensorRT-LLM (NVIDIA#1315)
kaiyux authored Mar 19, 2024
1 parent 4bb65f2 commit 66ca337
Showing 341 changed files with 18,906 additions and 17,300 deletions.
21 changes: 10 additions & 11 deletions README.md
@@ -206,17 +206,16 @@ architectures as well as important features implemented in TensorRT-LLM.

### Devices

-TensorRT-LLM is rigorously tested on the following GPUs:
+TensorRT-LLM supports the following architectures:

-* [H100](https://www.nvidia.com/en-us/data-center/h100/)
-* [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
-* [A100](https://www.nvidia.com/en-us/data-center/a100/)
-* [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
-* [V100](https://www.nvidia.com/en-us/data-center/v100/) (experimental)
+* [NVIDIA Hopper](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) (SM90), for example, H200, H100, H20
+* [NVIDIA Ada Lovelace](https://www.nvidia.com/en-us/geforce/ada-lovelace-architecture/) (SM89), for example, L40S, L20, L4
+* [NVIDIA Ampere](https://www.nvidia.com/en-us/data-center/ampere-architecture/) (SM80, SM86), for example, A100, A30, A10G
+* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) (SM75), for example, T4
+* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) (SM70 - experimental), for example, V100

-If a GPU is not listed above, it is important to note that TensorRT-LLM is
-expected to work on GPUs based on the Volta, Turing, Ampere, Hopper and Ada
-Lovelace architectures. Certain limitations may, however, apply.
+It is important to note that TensorRT-LLM is expected to work on all GPUs based on the Volta, Turing, Ampere, Hopper, and Ada Lovelace architectures. Certain limitations may apply.
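
If you want to check which architecture family a given GPU falls into, you can query its compute capability directly. A minimal sketch, assuming a reasonably recent NVIDIA driver (the `compute_cap` field is not reported by older `nvidia-smi` releases):

```
# Print each GPU's name and compute capability (e.g. "NVIDIA A100, 8.0")
nvidia-smi --query-gpu=name,compute_cap --format=csv
```

Anything reporting 7.0 (SM70) or newer falls into one of the architecture families listed above.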

### Precision

@@ -273,7 +272,7 @@ The list of supported models is:
* [Blip2](examples/blip2)
* [BLOOM](examples/bloom)
* [ChatGLM](examples/chatglm)
-* [FairSeq NMT](examples/nmt)
+* [FairSeq NMT](examples/enc_dec/nmt)
* [Falcon](examples/falcon)
* [Flan-T5](examples/enc_dec)
* [GPT](examples/gpt)
@@ -406,7 +405,7 @@ As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
dedicated MPI environment, not the one provided by your Slurm allocation.

-For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
+For example: `mpirun -n 1 python3 examples/run.py ...`

## Release notes

36 changes: 28 additions & 8 deletions benchmarks/cpp/README.md
@@ -22,7 +22,7 @@ instead, and be sure to set DLL paths as specified in

Before you launch C++ benchmarking, please make sure that you have already built the engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

-You can use the [`build.py`](source:benchmarks/python/build.py) script to build the engine(s). Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
+Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python runtime, you can reuse the engine(s) built previously; see that [`document`](../python/README.md).
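
For illustration only, a minimal `trtllm-build` sketch; the checkpoint path and flag values here are placeholders, and the available flags vary across models and TensorRT-LLM versions, so treat `trtllm-build --help` as the source of truth:

```
# Sketch: build an engine from a converted checkpoint (paths are placeholders)
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engines/my-model \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 256
```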

#### Launch benchmarking

@@ -73,19 +73,39 @@ This tool can be used in 2 different modes of traffic generation.

##### 1 – Dataset

“Prompt”, “Instruction” (optional) and “Answer” are specified as sentences in a JSON file.

The tool will tokenize the words and instruct the model to generate a specified number of output tokens for a request.

```
python3 prepare_dataset.py \
    --tokenizer <path/to/tokenizer> \
    --output preprocessed_dataset.json
-    --request-rate 10 \
-    --time-delay-dist exponential_dist \
+    [--request-rate 10] \
+    [--time-delay-dist exponential_dist] \
+    dataset
+    --dataset-name <name of the dataset> \
+    --dataset-input-key <dataset dictionary key for input> \
+    --dataset-prompt-key <dataset dictionary key for prompt> \
+    --dataset-output-key <dataset dictionary key for output> \
+    [--num-requests 100] \
+    [--max-input-len 1000] \
+    [--output-len-dist 100,10]
```
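
The bracketed arguments are optional. The two values passed to `--output-len-dist` appear to follow the same mean/standard-deviation convention as the token distribution mode described below, i.e. output lengths averaging 100 tokens with a standard deviation of 10.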

For datasets that don't have a prompt key, set `--dataset-prompt` instead.
Take [cnn_dailymail dataset](https://huggingface.co/datasets/cnn_dailymail) for example:
```
python3 prepare_dataset.py \
    --tokenizer <path/to/tokenizer> \
    --output cnn_dailymail.json
-    --dataset <path/to/dataset> \
-    --max-input-len 300
+    dataset
+    --dataset-name cnn_dailymail \
+    --dataset-config-name 3.0.0 \
+    --dataset-input-key article \
+    --dataset-prompt "Summarize the following article:" \
+    --dataset-output-key "highlights" \
+    [--num-requests 100] \
+    [--max-input-len 1000] \
+    [--output-len-dist 100,10]
```
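
As a quick sanity check that the key names above match the dataset schema, you can peek at one record with the Hugging Face `datasets` library (assumed to be installed; the split is illustrative):

```
# Confirm that cnn_dailymail 3.0.0 exposes the "article" and "highlights" keys
python3 -c "from datasets import load_dataset; \
print(load_dataset('cnn_dailymail', '3.0.0', split='validation')[0].keys())"
# -> dict_keys(['article', 'highlights', 'id'])
```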

##### 2 – Normal token length distribution
@@ -94,7 +114,7 @@ This mode allows the user to generate normal token length distributions with a mean and std deviation specified by the user.
For example, setting mean=100 and std dev=10 would generate requests where 95.4% of values are in the <80,120> range, following the normal probability distribution. Setting std dev=0 will generate all requests with the same mean number of tokens.
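
The 95.4% figure is simply the probability mass within two standard deviations of the mean of a normal distribution, which can be confirmed with the standard library:

```
# P(80 <= X <= 120) for X ~ N(100, 10), i.e. within two standard
# deviations of the mean: erf(2/sqrt(2)) ~= 0.9545
python3 -c "import math; print(math.erf(2 / math.sqrt(2)))"
```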

```
python prepare_dataset.py \
    --output token-norm-dist.json \
    --request-rate 10 \
    --time-delay-dist constant \
    ...
```
