[bug] forwardAsync assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192) #2494

Open · akhoroshev opened this issue Nov 25, 2024 · 19 comments
Labels: Generic Runtime, triaged (Issue has been triaged by maintainers)

akhoroshev (Contributor) commented Nov 25, 2024

My version

Assertion fails under load

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192). (/sources/contrib/tensorrt-llm/cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp:444)
1       0x7fa8df465992 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 78
2       0x7fa8df66b693 tensorrt_llm::runtime::GptDecoderBatched::newRequest(int, tensorrt_llm::runtime::decoder_batch::Request const&, tensorrt_llm::runtime::SamplingConfig const&) + 4307
3       0x7fa8df66d7cc tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&) + 172
4       0x7fa8e15f93c5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 725
5       0x7fa8e15fbb90 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 3792
6       0x7fa8e1625a71 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 353
7       0x7fa8e162a97f tensorrt_llm::executor::Executor::Impl::executionLoop() + 895
8       0x7fa8bafaba80 /opt/wmcore/lib/libtensorrt_llm_nvrtc_wrapper.so(+0x32c5a80) [0x7fa8bafaba80]
9       0x7fa8720d01ca /lib64/libpthread.so.0(+0x81ca) [0x7fa8720d01ca]
10      0x7fa87140de73 clone + 67
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191  = maxSequenceLen - 1 since chunked context is enabled

I don't know how this is possible, because:

  1. for all my requests, input_length <= 7168
  2. for all my requests, max_new_tokens = min(4096, 8192 - input_length)

Moreover, the Executor additionally checks this invariant.

My only guess is that tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep sets the wrong max_new_tokens on the decoder_batch::Request under certain conditions.
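
For reference, the clamping in point 2 above amounts to something like this (a minimal sketch in Python; the constant names are illustrative, not taken from my actual client code):

```python
MAX_SEQ_LEN = 8192        # engine --max_seq_len
MAX_OUTPUT_TOKENS = 4096  # per-request cap on generated tokens

def clamp_max_new_tokens(input_length: int) -> int:
    """Clamp max_new_tokens so input_length + max_new_tokens never exceeds max_seq_len."""
    return min(MAX_OUTPUT_TOKENS, MAX_SEQ_LEN - input_length)

# Example: for the failing request's input length of 6973 tokens,
# the clamp yields 1219, i.e. 6973 + 1219 == 8192.
assert clamp_max_new_tokens(6973) == 1219
```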

hello-11 added the triaged (Issue has been triaged by maintainers) and runtime labels on Nov 25, 2024
nekorobov (Collaborator) commented:

Hi @akhoroshev, thank you for taking the time to report the issue. From just looking at the code, the logic seems correct to me; I see no way max_new_tokens could end up equal to 4095. The check in GenericLlmRequest::validate is called only via the Executor API; the old GptManager API does not call it.

Could you share a reproducer, please?

nekorobov self-assigned this on Nov 25, 2024
akhoroshev (Contributor, Author) commented Nov 25, 2024

@nekorobov

> From just looking at the code, the logic seems correct to me; I see no way max_new_tokens could end up equal to 4095.

It happens under load; for example, it's possible to have two (or more) requests in flight:

  1. input_length=4097, max_new_tokens=4095
  2. input_length=6973, max_new_tokens=1219

Both are valid (GenericLlmRequest::validate was called, since I use the Executor API), but the assertion still fails.
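
To spell out the arithmetic (the pairing below is my assumption about what goes wrong, not something I have confirmed in the code):

```python
req1 = {"input_length": 4097, "max_new_tokens": 4095}
req2 = {"input_length": 6973, "max_new_tokens": 1219}

MAX_SEQ_LEN = 8192

# Each request on its own satisfies input_length + max_new_tokens <= 8192:
for r in (req1, req2):
    assert r["input_length"] + r["max_new_tokens"] <= MAX_SEQ_LEN

# But the failing assertion reports input length 6973 together with max new tokens 4095,
# i.e. req2's input combined with req1's max_new_tokens:
print(req2["input_length"] + req1["max_new_tokens"])  # 11068, well above 8192
```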

akhoroshev (Contributor, Author) commented:

> Could you share a reproducer, please?

I can't because it's a closed model.

akhoroshev (Contributor, Author) commented:

Hi! Any updates here?

akhoroshev (Contributor, Author) commented:

@nekorobov @nv-guomingz

TriLoo commented Dec 10, 2024

I hit an "Encountered an error in forwardAsync function: std::bad_cast" error when running BERT/RoBERTa; TensorRT-LLM was installed from source:

  • commit id: 340a1b6
  • GPU: A100
  • CUDA-12.4

File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
File "/home/aiscuser/DistillLLM/inference/tensorrt_llm/timing_rc2.py", line 137, in timing
    outputs = runner.generate(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 747, in generate
    return self._initialize_and_fill_output(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 886, in _initialize_and_fill_output
    return self._fill_output(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 990, in _fill_output
    raise RuntimeError(response.error_msg)
RuntimeError: Encountered an error in forwardAsync function: std::bad_cast

remusao commented Dec 23, 2024

Hi,

We are also encountering a similar issue under load, but it is very hard to reproduce consistently (sometimes the issue does not appear until after a few hours). Unfortunately this makes TensorRT-LLM a no-go for any production usage at the moment, which is a shame given the performance uplift compared to alternatives.

We were able to reproduce this issue a few times using https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct running on an 8xH100 node (using an in-house load-testing script which simulates arbitrary concurrency; even with this script it is hard to tell what exactly triggers the issue on Triton's side, apart from "high enough concurrency"). The engine is built with the following two commands:

MAX_BATCH_SIZE=64
TENSOR_PARALLELISM=8
MAX_NUM_TOKENS=16384
MAX_SEQ_LENGTH=131072

# Quantize model to fp8
python tensorrt_llm/examples/quantization/quantize.py \
    --model_dir ./Llama-3.3-70B-Instruct \
    --dtype auto \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --tp_size ${TENSOR_PARALLELISM} \
    --device cuda \
    --tokenizer_max_seq_length ${MAX_SEQ_LENGTH} \
    --batch_size ${MAX_BATCH_SIZE} \
    --output_dir ./checkpoint

# Build TensorRT engine (fp8)
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --max_batch_size=${MAX_BATCH_SIZE} \
    --workers=8 \
    --context_fmha=enable \
    --kv_cache_type=paged \
    --use_paged_context_fmha=enable \
    --multiple_profiles=enable \
    --use_fp8_context_fmha=enable \
    --gpt_attention_plugin=auto \
    --remove_input_padding=enable \
    --use_fused_mlp=enable \
    --reduce_fusion=enable \
    --user_buffer=enable \
    --max_num_tokens=${MAX_NUM_TOKENS} \
    --output_dir ./engine

The configs are generated with the following:

python3 fill_template.py --in_place triton_models/preprocessing/config.pbtxt tokenizer_dir:./Llama-3.3-70B-Instruct ,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${MAX_BATCH_SIZE},add_special_tokens:False
python3 fill_template.py --in_place triton_models/postprocessing/config.pbtxt tokenizer_dir:./Llama-3.3-70B-Instruct ,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${MAX_BATCH_SIZE},skip_special_tokens:True
python3 fill_template.py --in_place triton_models/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,bls_instance_count:${MAX_BATCH_SIZE},accumulate_tokens:False,logits_datatype:TYPE_FP32
python3 fill_template.py --in_place triton_models/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:TYPE_FP32
python3 fill_template.py --in_place triton_models/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:./engine,kv_cache_free_gpu_mem_fraction:0.95,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:20000,encoder_input_features_data_type:TYPE_FP16,enable_chunked_context:True,multi_block_mode:True,cuda_graph_mode:True,cuda_graph_cache_size:50000,logits_datatype:TYPE_FP32

We were using the configs from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm as shown in the official tutorial for running Llama with TensorRT-LLM.

TensorRT-LLM was built from the latest version on main: be17881 (but the issue was also observed on the previous commit, so it does not necessarily seem to be a new regression).

You may need to adjust the paths for the Llama model, checkpoints, engines, etc. to fully reproduce the setup.

Unfortunately we are not able to provide a minimal repro case, as the issue only seems to happen randomly, with high enough concurrency and inflight batching enabled (sometimes Triton will run fine for a few hours and then start misbehaving, perhaps indicating that specific concurrent inputs are required to trigger this particular issue). Nonetheless, here is an example of the trace we get from the client (reproduced with both the gRPC and HTTP clients) when it happens:

in ensemble 'ensemble', Executor failed process requestId 6648 due to the following error: Encountered an error in forwardAsync function: Tensor 'input_ids' has invalid shape (19124), expected in range min (1024), max (16384) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:511)
1       0x7fafc10d98f1 tensorrt_llm::runtime::TllmRuntime::setInputTensorsImpl(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&, bool) + 1665
2       0x7fafc10da8e7 tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 71
3       0x7fafc150c56a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::prepareBuffers(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 218
4       0x7fafc150cacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 1071
5       0x7fafc150d31e tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 222
6       0x7fafc150da37 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1719
7       0x7fafc15a4486 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 486
8       0x7fafc15aaec1 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1281
9       0x7fb38206edb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fb38206edb4]
10      0x7fb381e0ca94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fb381e0ca94]
11      0x7fb381e99a34 __clone + 68

We also checked the Triton logs in verbose mode and saw messages like the following before the traceback:

Set dimension [24626] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[64]. Valid range for profile 1: [64]..[128]. Valid range for profile 2: [128]..[256]. Valid range for profile 3: [256]..[512]. Valid range for profile 4: [512]..[1024]. Valid range for profile 5: [1024]..[16384].)
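
For context on how we read this message (an assumption on our side, not something we have verified in the code): with remove_input_padding enabled, input_ids appears to be the packed concatenation of all tokens scheduled in one step, so its length should stay within max_num_tokens (16384, the upper bound of profile 5). A rough illustration with made-up per-request chunk sizes:

```python
MAX_NUM_TOKENS = 16384  # --max_num_tokens at build time, upper bound of the last optimization profile

# Hypothetical context-chunk lengths scheduled into a single step (made-up numbers):
scheduled_chunk_lengths = [8000, 6500, 4624]

packed_input_len = sum(scheduled_chunk_lengths)
print(packed_input_len)  # 19124, matching the invalid input_ids shape in the traceback above

if packed_input_len > MAX_NUM_TOKENS:
    # This is the condition the engine's optimization profiles reject at runtime.
    print(f"packed input_ids length {packed_input_len} exceeds max_num_tokens {MAX_NUM_TOKENS}")
```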

Lastly, when these errors happen, it is likely that many requests will start failing the same way (the traceback is repeated many times); maybe for the whole batch?

I hope this can help investigate the issue further and come up with a fix. We would love to use TensorRT-LLM in production but cannot consider doing so until stability issues such as this one are addressed.

akhoroshev (Contributor, Author) commented Dec 23, 2024

> Unfortunately this makes TensorRT-LLM a no-go for any production usage at the moment

It is absolutely not clear to me why the auxiliary code that runs the engines (the Executor, for example) is closed source. If it were open, the community would find and resolve bugs faster.

remusao commented Dec 23, 2024

@nekorobov let us know if you need any further information; we can try building the engine in different ways and run experiments to pinpoint the root cause. We'd love your insights in order to figure out a way forward.

akhoroshev (Contributor, Author) commented:

@remusao try to set --max_seq_len ${MAX_SEQ_LENGTH} for trtllm-build

remusao commented Dec 23, 2024

> @remusao try to set --max_seq_len ${MAX_SEQ_LENGTH} for trtllm-build

It already seems to deduce the correct value:

[12/23/2024-16:26:18] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 131072

remusao commented Dec 23, 2024

@akhoroshev Would you be able to share the flags you used to build/optimize your own engines? And maybe the config.pbtxt values that are not the defaults? Are you using any quantization?

akhoroshev (Contributor, Author) commented:

@remusao

python examples/llama/convert_checkpoint.py --model_dir HF_MODEL_PATH  --dtype float16   --output_dir TRT_CHECKPOINT --tp_size 1

trtllm-build --checkpoint_dir TRT_CHECKPOINT --output_dir TRT_MODEL --max_input_len 7168 --max_seq_len 8192 --max_num_tokens 4096 --max_batch_size 256 --gemm_plugin float16 --use_paged_context_fmha enable

I use TensorRT-LLM directly, without Triton.

remusao commented Dec 23, 2024

That seems to rule out any of the extra optimizations we enabled (e.g. --reduce_fusion or --user_buffer, which are Llama-specific). I don't know enough about the internals of TensorRT-LLM, but maybe something at the junction of the inflight batcher and chunked context (not sure if that's enabled in your case)?

akhoroshev (Contributor, Author) commented:

Chunked context is enabled in my case.

remusao commented Dec 23, 2024

Thanks, I guess we'll need someone from Nvidia to chime in here to make progress. Given that it seems to happen in quite different setups on a very common model (Llama), I would assume any production setup can potentially hit this bug, so this seems pretty serious. Cc @byshiue @kaiyux

akhoroshev changed the title from "[bug] forwardAsync assertion failed" to "[bug] forwardAsync assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192)" on Dec 25, 2024
remusao commented Jan 8, 2025

Hi, is there any update? This issue alone makes it pretty much impossible to use TensorRT-LLM for any serious production load (unless the inflight batcher is not in use).

remusao commented Jan 31, 2025

Hi, it seems like a new (pretty big) update was released yesterday: triton-inference-server/tensorrtllm_backend#687 + #2725

Skimming through the diff, I did not see any changes to the inflight batcher, so I assume the issue here might not be addressed yet. Is there any update or new insight about this? Cc @akhoroshev @nv-guomingz

Thanks in advance,

remusao commented Mar 13, 2025

@hypdeb do you have any insights on this issue by any chance? I see you have commented on similar-looking issues recently.
