Name and Version
6586 (835b2b9)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m "Qwen_Qwen3-30B-A3B-Q6_K.gguf" --port 7861 -c 16384 -b 2048 --gpu-layers 99 --flash-attn on --no-mmap --main-gpu 1 --tensor-split 0,100
Problem description & steps to reproduce
The Web UI overestimates the prompt's token count, yet it entirely blocks prompts whose estimated token count exceeds the context size. For example, with a 16,384-token context window, I provided an 8,210-token prompt; the UI estimated 23,199 tokens and showed the "Message Too Long" dialog without sending any HTTP request. After a few retries the prompt did send, but once that chat reached 16,384 tokens, my next new chat estimated the same prompt at exactly 16,384 tokens and showed the warning dialog again. (That happened on build 6692.)
Easiest solution: make the dialog just a warning and send the request anyway. (I considered a "Send Anyway" button, but since sending is easy to cancel, it's probably better to just attempt it.)
Better solution: if the estimate exceeds the context window (but not by more than ~5x), send the prompt plus system prompt to /tokenize for an accurate count before deciding to block. Alternatively, don't call /tokenize automatically, but provide a button in the dialog to do so.
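A minimal sketch of what that check could look like on the Web UI side, assuming the server's POST /tokenize endpoint accepts {"content": "..."} and returns {"tokens": [...]} (shape taken from the server docs; the helper names and the 5x cutoff are just illustrative):

```ts
// Hypothetical helper: ask llama-server for an exact token count
// instead of trusting the client-side estimate.
async function countTokens(baseUrl: string, text: string): Promise<number> {
  const res = await fetch(`${baseUrl}/tokenize`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ content: text }),
  });
  if (!res.ok) throw new Error(`tokenize failed: ${res.status}`);
  const data = await res.json();
  return data.tokens.length;
}

// Only fall back to the server when the rough estimate says "too long"
// but is not absurdly over budget (illustrative 5x cutoff from above).
async function shouldBlockPrompt(
  baseUrl: string,
  prompt: string,
  estimate: number,
  ctxSize: number,
): Promise<boolean> {
  if (estimate <= ctxSize) return false;       // estimate fits, send as usual
  if (estimate > 5 * ctxSize) return true;     // clearly too long, keep the dialog
  const exact = await countTokens(baseUrl, prompt);
  return exact > ctxSize;                      // block only if the real count exceeds the context
}
```

This keeps the dialog for genuinely oversized prompts while avoiding false positives like the 8,210-token prompt above.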