[bug] forwardAsync assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192) #2494
Comments
Hi @akhoroshev, thank you for taking the time to report the issue. From just looking at the code, the logic seems correct to me; I see no way how this could happen. Could you share a reproducer, please?
It happens under load. For example, it is possible to have two (or more) requests in flight at once. They are both valid on their own, but the assertion still fails.
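To make that concrete, here is a minimal sketch (the request sizes are hypothetical, not taken from the report) of two requests that each satisfy the bound the failing assertion enforces:

# Hypothetical illustration: two concurrent requests that each satisfy the
# same per-request bound the forwardAsync assertion enforces, yet the runtime
# still reports an assertion failure under concurrent load.
MAX_SEQ_LENGTH = 8192
DRAFT_TOKENS = 0

def is_valid(input_length: int, max_new_tokens: int) -> bool:
    # The invariant from the assertion message, checked per request.
    return input_length + max_new_tokens + DRAFT_TOKENS < MAX_SEQ_LENGTH

requests = [(6973, 1218), (5120, 3000)]  # (input_length, max_new_tokens)
assert all(is_valid(n, m) for n, m in requests)  # both requests are valid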
I can't because it's a closed model.
Hi! Any updates here?
meet "Encountered an error in forwardAsync function: std::bad_cast" error when running BERT/Roberta,
|
Hi, we are also encountering a similar issue under load, but it is very hard to reproduce consistently (sometimes the issue does not appear before a few hours have passed). Unfortunately, this makes TensorRT-LLM a no-go for any production usage at the moment, which is a shame given the performance uplift compared to alternatives. We were able to reproduce the issue a few times with https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct running on an 8xH100 node, using an in-house load-testing script that simulates arbitrary concurrency; even with this script it is hard to understand what exactly triggers the issue on Triton's side, apart from "high enough concurrency". The engine is built with the following two commands:

MAX_BATCH_SIZE=64
TENSOR_PARALLELISM=8
MAX_NUM_TOKENS=16384
MAX_SEQ_LENGTH=131072
# Quantize model to fp8
python tensorrt_llm/examples/quantization/quantize.py \
--model_dir ./Llama-3.3-70B-Instruct \
--dtype auto \
--qformat fp8 \
--kv_cache_dtype fp8 \
--calib_size 512 \
--tp_size ${TENSOR_PARALLELISM} \
--device cuda \
--tokenizer_max_seq_length ${MAX_SEQ_LENGTH} \
--batch_size ${MAX_BATCH_SIZE} \
--output_dir ./checkpoint
# Build TensorRT engine (fp8)
trtllm-build \
--checkpoint_dir ./checkpoint \
--max_batch_size=${MAX_BATCH_SIZE} \
--workers=8 \
--context_fmha=enable \
--kv_cache_type=paged \
--use_paged_context_fmha=enable \
--multiple_profiles=enable \
--use_fp8_context_fmha=enable \
--gpt_attention_plugin=auto \
--remove_input_padding=enable \
--use_fused_mlp=enable \
--reduce_fusion=enable \
--user_buffer=enable \
--max_num_tokens=${MAX_NUM_TOKENS} \
--output_dir ./engine

The configs are generated with the following:

python3 fill_template.py --in_place triton_models/preprocessing/config.pbtxt tokenizer_dir:./Llama-3.3-70B-Instruct,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${MAX_BATCH_SIZE},add_special_tokens:False
python3 fill_template.py --in_place triton_models/postprocessing/config.pbtxt tokenizer_dir:./Llama-3.3-70B-Instruct,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${MAX_BATCH_SIZE},skip_special_tokens:True
python3 fill_template.py --in_place triton_models/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,bls_instance_count:${MAX_BATCH_SIZE},accumulate_tokens:False,logits_datatype:TYPE_FP32
python3 fill_template.py --in_place triton_models/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:TYPE_FP32
python3 fill_template.py --in_place triton_models/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:./engine,kv_cache_free_gpu_mem_fraction:0.95,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:20000,encoder_input_features_data_type:TYPE_FP16,enable_chunked_context:True,multi_block_mode:True,cuda_graph_mode:True,cuda_graph_cache_size:50000,logits_datatype:TYPE_FP32

We were using the configs from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm, as shown in the official tutorial for running Llama with TensorRT-LLM. TensorRT-LLM was built from the latest version on main, be17881 (but the issue was also observed on the previous commit, so it does not necessarily seem to be a new issue). You may need to adjust the paths for the Llama model, checkpoints, engines, etc. to fully reproduce the setup. Unfortunately, we are not able to provide a minimal repro case, as the issue only seems to happen randomly and under high enough concurrency with inflight batching enabled (sometimes Triton will run fine for a few hours and then start misbehaving, which may indicate that specific concurrent inputs are required to trigger this particular issue). Nonetheless, here is an example of the trace we get from the client (reproduced with both the gRPC and HTTP clients) when it happens:
We have also checked the logs from Triton in verbose mode and saw messages like the following before the traceback:
Lastly, when these errors happen, many requests are likely to start failing the same way (the traceback is repeated many times), possibly for the whole batch. I hope this helps investigate the issue further and come up with a fix. We would love to use TensorRT-LLM in production, but cannot consider doing so until stability issues such as this one are addressed.
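As a sanity check on this kind of limit mismatch, the limits actually baked into the built engine can be read back from the config.json that trtllm-build writes into the engine directory. A minimal sketch, assuming that file layout (the path and key names are assumptions and may differ between TensorRT-LLM versions):

import json
from pathlib import Path

# Read back the limits recorded at build time. "./engine" matches the
# --output_dir used above; key names are assumptions about the usual
# trtllm-build output and may vary across versions.
config = json.loads(Path("./engine/config.json").read_text())
build_config = config.get("build_config", {})
for key in ("max_batch_size", "max_num_tokens", "max_seq_len"):
    print(key, "=", build_config.get(key))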
It is absolutely not clear to me why the auxiliary code that launches the engines (the Executor, for example) is closed source. If it were open, the community would find and resolve bugs faster.
@nekorobov let us know if you need any further information; we can try to build the engine in different ways and run experiments to pinpoint the root cause. We'd love your insights in order to figure out a way forward.
@remusao try to set
It already seems to deduce the correct value:
@akhoroshev Would you be able to share the flags you used to build/optimize your own engines? And maybe the
I use TensorRT-LLM directly, without Triton.
That seems to rule out any of the extra optimizations we enabled (e.g.
Chunked context is enabled in my case.
Hi, is there any update? This issue alone makes it pretty much impossible to use TensorRT-LLM for any serious production load (unless the inflight batcher is not in use).
Hi, it seems a new (pretty big) update was released yesterday: triton-inference-server/tensorrtllm_backend#687 + #2725. Skimming through the diff, I did not see any changes to the inflight batcher, so I assume the issue here might not be addressed yet. Is there any update, or are there new insights about this? Cc @akhoroshev @nv-guomingz. Thanks in advance.
@hypdeb do you have any insights on this issue by any chance? I see you have commented on similar-looking issues recently.
My version:

The assertion fails under load, and I don't know how this is possible, because:
input_length <= 7168
max_new_tokens=min(4096, 8192 - input_length)
Moreover, Executor additionally checks this invariant.
The only idea I have is that tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep is setting a wrong max_new_tokens for the decoder_batch::Request (under certain conditions).
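For reference, a minimal sketch of the client-side clamping described above; the 8192 limit and the draft-token count of 0 come from the assertion message in the title, and the formula is tightened by one token so the sum stays strictly below the limit, as the failing check requires:

# Sketch of the client-side clamping described in the report. With this clamp
# in place, the forwardAsync invariant should hold unless max_new_tokens is
# modified somewhere on the server side.
MAX_SEQ_LENGTH = 8192
DRAFT_TOKENS = 0  # no speculative decoding in this setup

def clamp_max_new_tokens(input_length: int, requested: int = 4096) -> int:
    # Keep input_length + max_new_tokens + draft_tokens strictly below
    # MAX_SEQ_LENGTH, matching the invariant checked in forwardAsync.
    budget = MAX_SEQ_LENGTH - input_length - DRAFT_TOKENS - 1
    return max(0, min(requested, budget))

# For the failing request from the title (input length 6973) this allows at
# most 1218 new tokens, far below the 4095 the runtime reported.
assert clamp_max_new_tokens(6973) == 1218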