add chunked context/prefill runtime option to trtllm-serve #2731

tsnyder-sps · 2025-01-31T17:30:08Z

Based on the newest v0.17.0 tag:

before:

-# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8 engines/Llama-3.1-8B-Instruct-FP8
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM][INFO] Engine version 0.17.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
-[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8738 MiB

after:

+# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8/ engines/Llama-3.1-8B-Instruct-FP8 --chunked_context
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM][INFO] Engine version 0.17.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
+[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 131072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8738 MiB
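
The change in `maxInputLen` between the two logs can be reproduced with a few lines of arithmetic. The sketch below is derived only from the log messages above, not from the TensorRT-LLM source, and it covers the decoder case only; `max_input_len` is a hypothetical helper name used for illustration.

```python
def max_input_len(max_seq_len: int, max_num_tokens: int, chunked_context: bool) -> int:
    """Return the input-length limit as described by the log messages above.

    Hypothetical helper: it mirrors the two formulas printed in the logs,
    not the actual TensorRT-LLM implementation.
    """
    if chunked_context:
        # "maxSequenceLen - 1 since chunked context is enabled"
        return max_seq_len - 1
    # "min(maxSequenceLen - 1, maxNumTokens) since context FMHA and
    # usePackedInput are enabled"
    return min(max_seq_len - 1, max_num_tokens)


# Values taken from the logs: maxSequenceLen 131072, maxNumTokens 8192.
before = max_input_len(131072, 8192, chunked_context=False)
after = max_input_len(131072, 8192, chunked_context=True)
print(before, after)
```

With the values from the logs, the limit goes from 8192 tokens without `--chunked_context` to 131071 tokens with it, which matches the two `maxInputLen` lines.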

add chunked context/prefill runtime option to trtllm-serve

e6c9695
