Misc. bug: The KV cache is sometimes truncated incorrectly when making v1/chat/completions API calls #11970
Comments
I suspect that for some reason the text with the new request tokenizes slightly differently around the |
OK, if there is anything else you want me to test to find the cause, I'm available.
In your example, what was the token after the |
Two new lines ('\n\n'), between two sentences:
|
I just repeated my last test. It was "skip" instead of "vivid" this time, but again followed by a dot, 16, then by 271 (two new lines). This time it wasn't even during the thinking, and it was the second instance of the 271 token. However, it was the first instance of a 16 followed by a 271.
|
Yes, as I expected it tokenizes differently on the way back to the server:

./bin/llama-tokenize -m r1.gguf -p " more vivid."
init: model is vocab-only -- no computation will be performed
0 -> '<|begin▁of▁sentence|>'
850 -> ' more'
33949 -> ' vivid'
16 -> '.'

./bin/llama-tokenize -m r1.gguf -p " more vivid.\n\n"
init: model is vocab-only -- no computation will be performed
0 -> '<|begin▁of▁sentence|>'
850 -> ' more'
33949 -> ' vivid'
339 -> '.

'

This means that either we have a bug in the pre-processor regexes of the R1 tokenizer, or this is simply a limitation of the tokenizer. If the latter, then the client would have to start storing the raw token ids along with the text and send the ids for the new requests.
Does the |
After reading the READMEs, the source code, and experimenting a bit, it seems receiving/sending arrays of token IDs instead of strings is not supported by the API. I did manage to get the generated token IDs in the API response. Then I tried sending the token IDs back with the next request, and the server didn't like such a request at all; I got
Looking at the source code, aside from a simple string, the content can be an array of objects, and each object must have a |
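For reference, here is a minimal sketch of the plain-string form of a /v1/chat/completions request, which is the form that does work. It is written in Go, the same language as the CLI client mentioned in the problem description below; the server address, model name, and prompt are placeholders, not values from this report.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

type chatMessage struct {
    Role string `json:"role"`
    // Plain string content; an array of token IDs in this field is what the
    // server rejected in the experiment described above.
    Content string `json:"content"`
}

type chatRequest struct {
    Model    string        `json:"model"`
    Messages []chatMessage `json:"messages"`
}

func main() {
    // Placeholder endpoint and model name.
    body, _ := json.Marshal(chatRequest{
        Model:    "r1",
        Messages: []chatMessage{{Role: "user", Content: "Hello!"}},
    })
    resp, err := http.Post("http://localhost:8080/v1/chat/completions",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}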
I would first check if the tokenizer works correctly by comparing it to a reference implementation.
While I have no experience with LLM tokenizers, my feeling is that there is nothing wrong with the tokenizer. The problematic text was generated initially from 2 tokens: the dot as the first token, and the two new lines as a single, separate token. When you tried to tokenize that text, it got converted to a single token consisting of the dot and the two new lines. That seems normal to me. For example, if for some strange reason a model generates the text "abc" as 3 different tokens, one for each character, and there is also a token that matches the entire sequence, I would expect the tokenizer to return that single token for "abc", not 3 tokens for "a", "b" and "c". If the tokenizer didn't behave like that, everything would always be tokenized with one token per character, which doesn't make any sense.

I don't think it's impossible, in general, for a model to generate 2 (or more) non-equal sequences of tokens that represent the same text, depending on the seed, the temperature and other factors. In that case, if you try to tokenize the text, at best it would match one of the two sequences, so it would behave "incorrectly" in 50% of the cases for such sequences, and there would be no way to "fix" that. In our particular case, involving new lines, maybe handling new lines differently in the tokenizer might solve the issue. But I'm not sure whether that would be an actual fix or just a workaround for this particular situation, and there is also the risk that fixing this case might introduce issues in other cases.

So, I have an idea that might make the differences between the initial generation and the tokenization irrelevant when determining how much of the cache to keep, but I'm not sure how feasible it would be to implement. Basically, if I understand correctly, right now we have the cache as an array of token IDs, the prompt received is converted (tokenized) to another array of token IDs, then the two arrays are compared item by item until either a mismatch is found or one of the two arrays ends, and the cache is truncated at that position. (There is also the similarity check for the situation when there are multiple slots, the minimum similarity to match a slot, but for simplicity I'll ignore it for now.)

A possible way to fix this would be to do it the other way around: initially, do not tokenize the input prompt, just leave it as a string. Then convert the cached token IDs to text, one by one, and for each one check whether the input prompt text matches, advancing in the input prompt by the token's text length if it does. When a difference is found between the text corresponding to a token ID from the cache and the text from the input prompt, the cache can be truncated at that position. The rest of the prompt string, starting at the position that didn't match the cache, can then be tokenized (if needed) and processed. I'm not very familiar with how this works, but I'm guessing you would need to tokenize at least the "new" part of the prompt (the part that doesn't match the cache) to be able to process it and then generate the assistant reply.
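To make that proposal a bit more concrete, here is a rough sketch of the text-based matching, again in Go and purely as an illustration of the idea rather than actual llama-server code; tokenToText is a hypothetical helper standing in for llama.cpp's token-to-piece conversion.

package main

import "strings"

// commonPrefixByText sketches the proposed matching: walk the cached token IDs,
// convert each one back to text, and advance through the prompt string as long
// as the text matches. It returns how many cached tokens can be kept and the
// byte offset in the prompt where the "new" part (which still needs to be
// tokenized and processed) begins.
func commonPrefixByText(cache []int32, prompt string, tokenToText func(int32) string) (keepTokens, newTextStart int) {
    pos := 0
    for i, id := range cache {
        piece := tokenToText(id)
        if !strings.HasPrefix(prompt[pos:], piece) {
            // First divergence: truncate the cache after the first i tokens.
            return i, pos
        }
        pos += len(piece)
    }
    // The whole cache matches a prefix of the prompt text.
    return len(cache), pos
}

With something like this, the difference between the cached pair 16 + 271 ('.' followed by '\n\n') and the re-tokenized single token 339 ('.\n\n') would no longer matter, because both detokenize to the same text.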
Name and Version
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
When using the llama-server and its Web UI, sometimes parts of the KV cache are truncated when they shouldn't be. Steps to reproduce:
This is the 1.58bit quantized version of the DeepSeek-R1 model by unsloth. I've been able to reproduce the issue with the 2.22bit version too.
However, I've NOT been able to reproduce it with DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf and Qwen2.5-7B-Instruct-1M-Q4_K_M.gguf.
In this way, the prompts sent should match the KV cache entirely within the same conversation, since the thinking that is included in the cache won't be excluded from the prompt.
Side note: In my opinion, having the UI include the thought process in the prompts should be the default, as in my experience the quality of long conversations is negatively affected by excluding the thinking from the prompts. Also, not including the thinking means the cache needs to be recomputed starting from the end of the previous user input each time the user enters something new in the chat, which slows down the assistant replies.
Basically, this causes long pauses before the assistant starts generating new output after the new user input, because it needs to reprocess the previous assistant output (without the thinking) as part of the prompt. When the previous assistant reply is quite long, even without the thinking, this can take a long time (minutes, or even tens of minutes in extreme cases). I understand the advantage of removing the thinking, since you can fit a longer conversation into a smaller context if you keep removing it from the context, but I'm not sure this outweighs the disadvantages.
So, in this case, the cache contained 1185 tokens after the assistant replied to my initial prompt.
This means that the cache from position 488 to position 1185 has been discarded, for some reason.
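For illustration, the truncation point falls out of the token-by-token prefix comparison discussed in the comments above; here is a simplified sketch of that comparison (not the actual llama-server code), in Go like the other sketches here:

package main

// commonPrefixLen illustrates comparing the cached token IDs against the
// re-tokenized prompt: the reusable prefix ends at the first mismatch, and
// every cached token after that position is discarded and recomputed.
func commonPrefixLen(cache, prompt []int32) int {
    n := 0
    for n < len(cache) && n < len(prompt) && cache[n] == prompt[n] {
        n++
    }
    return n
}

A single token that merges differently when the conversation text is re-tokenized, like the '.' + '\n\n' pair shown above, is enough to cut the reusable prefix at that position, in this case 488, even though the text itself is identical.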
In my opinion, this shouldn't happen; the server should keep the entire content of the cache and not remove anything, since the new prompt is a continuation of the same conversation.
During my test, I tried identifying exactly what was previously in the cache at position 488, and it was a word in a sentence towards the end of the thinking, but it doesn't seem special in any way. Just the word "vivid" before the end of a sentence, and that sentence wasn't even the last sentence in the thinking section of the reply:
I've even coded my own command-line API client in Go, and I was still able to replicate the issue. So it doesn't seem to be a bug in the Web UI, but an issue with the /v1/chat/completions API itself.
I have NOT been able to replicate this using llama-cli.exe; it works properly without discarding any parts of the cache during such conversations.
Currently, I'm forced to use this CLI, because otherwise a 2-hour conversation with DeepSeek can easily turn into a 3-4 hour conversation due to the caching issues.
I attached the log from my latest test.
DeepSeek-R1-UD-IQ1_S.zip
First Bad Commit
No response
Relevant log output