[bug] forwardAsync assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192) #2494

Open · akhoroshev opened this issue Nov 25, 2024 · 19 comments
Labels: Generic Runtime, triaged (Issue has been triaged by maintainers)

akhoroshev (Contributor) commented Nov 25, 2024

My version

Assertion fails under load

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192). (/sources/contrib/tensorrt-llm/cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp:444)
1       0x7fa8df465992 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 78
2       0x7fa8df66b693 tensorrt_llm::runtime::GptDecoderBatched::newRequest(int, tensorrt_llm::runtime::decoder_batch::Request const&, tensorrt_llm::runtime::SamplingConfig const&) + 4307
3       0x7fa8df66d7cc tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&) + 172
4       0x7fa8e15f93c5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 725
5       0x7fa8e15fbb90 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 3792
6       0x7fa8e1625a71 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 353
7       0x7fa8e162a97f tensorrt_llm::executor::Executor::Impl::executionLoop() + 895
8       0x7fa8bafaba80 /opt/wmcore/lib/libtensorrt_llm_nvrtc_wrapper.so(+0x32c5a80) [0x7fa8bafaba80]
9       0x7fa8720d01ca /lib64/libpthread.so.0(+0x81ca) [0x7fa8720d01ca]
10      0x7fa87140de73 clone + 67
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191  = maxSequenceLen - 1 since chunked context is enabled

I don't know how this is possible, because:

  1. for all my requests, input_length <= 7168
  2. for all my requests, max_new_tokens = min(4096, 8192 - input_length)

Moreover, the Executor additionally checks this invariant.

My only guess is that tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep sets the wrong max_new_tokens on the decoder_batch::Request under certain conditions.
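
For reference, the clamping in point 2 above amounts to something like this (a minimal sketch in Python; the constant names are illustrative, not taken from my actual client code):

```python
MAX_SEQ_LEN = 8192        # engine --max_seq_len
MAX_OUTPUT_TOKENS = 4096  # per-request cap on generated tokens

def clamp_max_new_tokens(input_length: int) -> int:
    """Clamp max_new_tokens so input_length + max_new_tokens never exceeds max_seq_len."""
    return min(MAX_OUTPUT_TOKENS, MAX_SEQ_LEN - input_length)

# Example: for the failing request's input length of 6973 tokens,
# the clamp yields 1219, i.e. 6973 + 1219 == 8192.
assert clamp_max_new_tokens(6973) == 1219
```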

hello-11 added the triaged (Issue has been triaged by maintainers) and runtime labels on Nov 25, 2024
nekorobov (Collaborator) commented:

Hi @akhoroshev, thank you for taking the time to report the issue. From just looking at the code, the logic seems correct to me; I see no way max_new_tokens could end up equal to 4095. The check in GenericLlmRequest::validate is called only via the Executor API; the old GptManager API does not call it.

Could you share a reproducer, please?

nekorobov self-assigned this on Nov 25, 2024
akhoroshev (Contributor, Author) commented Nov 25, 2024

@nekorobov

> From just looking at the code, the logic seems correct to me; I see no way max_new_tokens could end up equal to 4095.

It happens under load; for example, it's possible to have two (or more) requests in flight:

  1. input_length=4097, max_new_tokens=4095
  2. input_length=6973, max_new_tokens=1219

Both are valid (GenericLlmRequest::validate was called, since I use the Executor API), but the assertion still fails.
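
To spell out the arithmetic (the pairing below is my assumption about what goes wrong, not something I have confirmed in the code):

```python
req1 = {"input_length": 4097, "max_new_tokens": 4095}
req2 = {"input_length": 6973, "max_new_tokens": 1219}

MAX_SEQ_LEN = 8192

# Each request on its own satisfies input_length + max_new_tokens <= 8192:
for r in (req1, req2):
    assert r["input_length"] + r["max_new_tokens"] <= MAX_SEQ_LEN

# But the failing assertion reports input length 6973 together with max new tokens 4095,
# i.e. req2's input combined with req1's max_new_tokens:
print(req2["input_length"] + req1["max_new_tokens"])  # 11068, well above 8192
```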

akhoroshev (Contributor, Author) commented:

> Could you share a reproducer, please?

I can't because it's a closed model.

akhoroshev (Contributor, Author) commented:

Hi! Any updates here?

akhoroshev (Contributor, Author) commented:

@nekorobov @nv-guomingz

TriLoo commented Dec 10, 2024

I hit an "Encountered an error in forwardAsync function: std::bad_cast" error when running BERT/RoBERTa; TensorRT-LLM was installed from source:

  • commit id: 340a1b6
  • GPU: A100
  • CUDA-12.4

File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
File "/home/aiscuser/DistillLLM/inference/tensorrt_llm/timing_rc2.py", line 137, in timing
    outputs = runner.generate(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 747, in generate
    return self._initialize_and_fill_output(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 886, in _initialize_and_fill_output
    return self._fill_output(
File "/home/aiscuser/.conda/envs/py10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 990, in _fill_output
    raise RuntimeError(response.error_msg)
RuntimeError: Encountered an error in forwardAsync function: std::bad_cast

remusao commented Dec 23, 2024

Hi,

We are also encountering a similar issue under load, but it is very hard to reproduce consistently (sometimes the issue does not appear until after a few hours). Unfortunately this makes TensorRT-LLM a no-go for any production usage at the moment, which is a shame given the performance uplift compared to alternatives.

We were able to reproduce this issue a few times using https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct running on an 8xH100 node (using an in-house load-testing script which simulates arbitrary concurrency; even with this script it is hard to tell what exactly triggers the issue on Triton's side, apart from "high enough concurrency"). The engine is built with the following two commands:

MAX_BATCH_SIZE=64
TENSOR_PARALLELISM=8
MAX_NUM_TOKENS=16384
MAX_SEQ_LENGTH=131072

# Quantize model to fp8
python tensorrt_llm/examples/quantization/quantize.py \
    --model_dir ./Llama-3.3-70B-Instruct \
    --dtype auto \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --tp_size ${TENSOR_PARALLELISM} \
    --device cuda \
    --tokenizer_max_seq_length ${MAX_SEQ_LENGTH} \
    --batch_size ${MAX_BATCH_SIZE} \
    --output_dir ./checkpoint

# Build TensorRT engine (fp8)
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --max_batch_size=${MAX_BATCH_SIZE} \
    --workers=8 \
    --context_fmha=enable \
    --kv_cache_type=paged \
    --use_paged_context_fmha=enable \
    --multiple_profiles=enable \
    --use_fp8_context_fmha=enable \
    --gpt_attention_plugin=auto \
    --remove_input_padding=enable \
    --use_fused_mlp=enable \
    --reduce_fusion=enable \
    --user_buffer=enable \
    --max_num_tokens=${MAX_NUM_TOKENS} \
    --output_dir ./engine

The configs are generated with the following:

python3 fill_template.py --in_place triton_models/preprocessing/config.pbtxt tokenizer_dir:./Llama-3.3-70B-Instruct ,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${MAX_BATCH_SIZE},add_special_tokens:False
python3 fill_template.py --in_place triton_models/postprocessing/config.pbtxt tokenizer_dir:./Llama-3.3-70B-Instruct ,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${MAX_BATCH_SIZE},skip_special_tokens:True
python3 fill_template.py --in_place triton_models/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,bls_instance_count:${MAX_BATCH_SIZE},accumulate_tokens:False,logits_datatype:TYPE_FP32
python3 fill_template.py --in_place triton_models/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:TYPE_FP32
python3 fill_template.py --in_place triton_models/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:./engine,kv_cache_free_gpu_mem_fraction:0.95,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:20000,encoder_input_features_data_type:TYPE_FP16,enable_chunked_context:True,multi_block_mode:True,cuda_graph_mode:True,cuda_graph_cache_size:50000,logits_datatype:TYPE_FP32

We were using the configs from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm as shown in the official tutorial for running Llama with TensorRT-LLM.

TensorRT-LLM was built from the latest version on main: be17881 (but the issue was also observed on the previous commit, so it does not necessarily seem to be a new regression).

You may need to adjust the paths for the Llama model, checkpoints, engines, etc. to fully reproduce the setup.

Unfortunately we are not able to provide a minimal repro case, as the issue only seems to happen randomly, with high enough concurrency and inflight batching enabled (sometimes Triton will run fine for a few hours and then start misbehaving, perhaps indicating that specific concurrent inputs are required to trigger this particular issue). Nonetheless, here is an example of the trace we get from the client (reproduced with both the gRPC and HTTP clients) when it happens:

in ensemble 'ensemble', Executor failed process requestId 6648 due to the following error: Encountered an error in forwardAsync function: Tensor 'input_ids' has invalid shape (19124), expected in range min (1024), max (16384) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:511)
1       0x7fafc10d98f1 tensorrt_llm::runtime::TllmRuntime::setInputTensorsImpl(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&, bool) + 1665
2       0x7fafc10da8e7 tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 71
3       0x7fafc150c56a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::prepareBuffers(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 218
4       0x7fafc150cacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 1071
5       0x7fafc150d31e tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 222
6       0x7fafc150da37 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1719
7       0x7fafc15a4486 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 486
8       0x7fafc15aaec1 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1281
9       0x7fb38206edb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fb38206edb4]
10      0x7fb381e0ca94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fb381e0ca94]
11      0x7fb381e99a34 __clone + 68

We also checked the Triton logs in verbose mode and saw messages like the following before the traceback:

Set dimension [24626] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[64]. Valid range for profile 1: [64]..[128]. Valid range for profile 2: [128]..[256]. Valid range for profile 3: [256]..[512]. Valid range for profile 4: [512]..[1024]. Valid range for profile 5: [1024]..[16384].)
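
For context on how we read this message (an assumption on our side, not something we have verified in the code): with remove_input_padding enabled, input_ids appears to be the packed concatenation of all tokens scheduled in one step, so its length should stay within max_num_tokens (16384, the upper bound of profile 5). A rough illustration with made-up per-request chunk sizes:

```python
MAX_NUM_TOKENS = 16384  # --max_num_tokens at build time, upper bound of the last optimization profile

# Hypothetical context-chunk lengths scheduled into a single step (made-up numbers):
scheduled_chunk_lengths = [8000, 6500, 4624]

packed_input_len = sum(scheduled_chunk_lengths)
print(packed_input_len)  # 19124, matching the invalid input_ids shape in the traceback above

if packed_input_len > MAX_NUM_TOKENS:
    # This is the condition the engine's optimization profiles reject at runtime.
    print(f"packed input_ids length {packed_input_len} exceeds max_num_tokens {MAX_NUM_TOKENS}")
```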

Lastly, when these errors happen, it is likely that many requests will start failing the same way (the traceback is repeated many times); maybe for the whole batch?

I hope this can help investigate the issue further and come up with a fix. We would love to use TensorRT-LLM in production but cannot consider doing so until stability issues such as this one are addressed.

akhoroshev (Contributor, Author) commented Dec 23, 2024

> Unfortunately this makes TensorRT-LLM a no-go for any production usage at the moment

It is absolutely not clear to me why the auxiliary code that runs the engines (the Executor, for example) is closed source. If it were open, the community would find and resolve bugs faster.

remusao commented Dec 23, 2024

@nekorobov let us know if you need any further information; we can try building the engine in different ways and run experiments to pinpoint the root cause. We'd love your insights in order to figure out a way forward.

akhoroshev (Contributor, Author) commented:

@remusao try to set --max_seq_len ${MAX_SEQ_LENGTH} for trtllm-build

remusao commented Dec 23, 2024

> @remusao try to set --max_seq_len ${MAX_SEQ_LENGTH} for trtllm-build

It already seems to deduce the correct value:

[12/23/2024-16:26:18] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 131072

remusao commented Dec 23, 2024

@akhoroshev Would you be able to share the flags you used to build/optimize your own engines? And maybe the config.pbtxt values that are not the defaults? Are you using any quantization?

akhoroshev (Contributor, Author) commented:

@remusao

python examples/llama/convert_checkpoint.py --model_dir HF_MODEL_PATH  --dtype float16   --output_dir TRT_CHECKPOINT --tp_size 1

trtllm-build --checkpoint_dir TRT_CHECKPOINT --output_dir TRT_MODEL --max_input_len 7168 --max_seq_len 8192 --max_num_tokens 4096 --max_batch_size 256 --gemm_plugin float16 --use_paged_context_fmha enable

I use TensorRT-LLM directly, without Triton.

remusao commented Dec 23, 2024

That seems to rule out any of the extra optimizations we enabled (e.g. --reduce_fusion or --user_buffer, which are Llama-specific). I don't know enough about the internals of TensorRT-LLM, but maybe something at the junction of the inflight batcher and chunked context (not sure if that's enabled in your case)?

akhoroshev (Contributor, Author) commented:

Chunked context is enabled in my case.

remusao commented Dec 23, 2024

Thanks, I guess we'll need someone from Nvidia to chime in here to make progress. Given that it seems to happen in quite different setups on a very common model (Llama), I would assume any production setup can potentially hit this bug, so this seems pretty serious. Cc @byshiue @kaiyux

akhoroshev changed the title from "[bug] forwardAsync assertion failed" to "[bug] forwardAsync assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192)" on Dec 25, 2024
remusao commented Jan 8, 2025

Hi, is there any update? This issue alone makes it pretty much impossible to use TensorRT-LLM for any serious production load (unless the inflight batcher is not in use).

remusao commented Jan 31, 2025

Hi, it seems like a new (pretty big) update was released yesterday: triton-inference-server/tensorrtllm_backend#687 + #2725

Skimming through the diff, I did not see any changes to the inflight batcher, so I assume the issue here might not be addressed yet. Is there any update or new insight about this? Cc @akhoroshev @nv-guomingz

Thanks in advance,

remusao commented Mar 13, 2025

@hypdeb do you have any insights on this issue by any chance? I see you have commented on similar-looking issues recently.
