
Misc. bug: The KV cache is sometimes truncated incorrectly when making v1/chat/completions API calls #11970

Open
vnicolici opened this issue Feb 20, 2025 · 10 comments

Comments

@vnicolici

Name and Version

.\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4743 (d07c621)
built with MSVC 19.29.30158.0 for

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

> .\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log

Problem description & steps to reproduce

When using the llama-server and its Web UI, sometimes parts of the KV cache are truncated when they shouldn't be. Steps to reproduce:

  1. Start llama-server with a command such as:
.\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log

This is the 1.58-bit quantized version of the DeepSeek-R1 model by unsloth. I've been able to reproduce the issue with the 2.22-bit version too.
However, I have NOT been able to reproduce it with DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf or Qwen2.5-7B-Instruct-1M-Q4_K_M.gguf.

  2. Open the Web UI, and disable "Exclude thought process when sending request to API (Recommended for DeepSeek-R1)"

This way, the prompts sent should match the KV cache entirely within the same conversation, since the thinking included in the cache is also kept in the prompt.

Side note: In my opinion, including the thought processes in the prompts should be the UI's default, as in my experience the quality of long conversations suffers when the thinking is excluded from the prompts. Also, not including the thinking means the cache has to be recomputed starting from the end of the previous user input every time the user enters something new in the chat, which slows down the assistant's replies.

Basically, this causes long pauses before the assistant starts generating new output after new user input, because the previous assistant output (without the thinking) has to be reprocessed as part of the prompt. When the previous assistant reply is long, even without the thinking, this can take a long time (minutes, or even tens of minutes in extreme cases). I understand the advantage of removing the thinking: you can fit a longer conversation into a smaller context. But I'm not sure this outweighs the disadvantages.

  3. Start a new conversation from the Web UI and enter a user prompt that is likely to cause significant amounts of assistant output, including "thinking", for example:
Tell me a romantic story, please.
  4. Wait for the reply to be generated, and check the log to see how much information has accumulated in the cache. An example from my latest test:
8.45.064.978 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 1185, n_cache_tokens = 1185, truncated = 0

So, in this case, the cache contained 1185 tokens after the assistant replied to my initial prompt.

  5. Add some new user input to the conversation. This time it doesn't necessarily need to generate a lot of output or cause a lot of thinking. For example:
Thank you, that's all.
  6. Check the log again to see how much of the cache has been truncated; you will find something like this:
12.30.856.271 I slot update_slots: id  0 | task 1170 | kv cache rm [488, end)

This means that the cache from position 488 to position 1185 has been discarded, for some reason.

In my opinion, this shouldn't happen: the entire content of the cache should be kept and nothing removed, since the new prompt is a continuation of the same conversation.

During my test, I tried to identify exactly what was previously in the cache at position 488. It was a word in a sentence towards the end of the thinking, but it doesn't seem special in any way: just the word "vivid" before the end of a sentence, and that sentence wasn't even the last sentence in the thinking section of the reply:

4.11.215.220 D slot process_toke: id  0 | task 0 | n_decoded = 470, n_remaining = -1, next token:   850 ' more'
...
4.11.215.232 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 487, n_cache_tokens = 487, truncated = 0
...
4.11.578.617 D slot process_toke: id  0 | task 0 | n_decoded = 471, n_remaining = -1, next token: 33949 ' vivid'
...
4.11.578.630 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 488, n_cache_tokens = 488, truncated = 0
...
4.11.934.032 D slot process_toke: id  0 | task 0 | n_decoded = 472, n_remaining = -1, next token:    16 '.'
...
4.11.934.047 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 489, n_cache_tokens = 489, truncated = 0

I've even coded my own command line API client in Go, and I was still able to replicate the issue. So it doesn't seem to be a bug in the Web UI, but an issue with the /v1/chat/completions API itself.
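
For reference, a minimal sketch of what such a two-turn client can look like (not my exact code; it assumes a server running at http://localhost:8080 with default settings, and only exercises how the second request is handled):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// chat sends the whole conversation so far to /v1/chat/completions and
// returns the assistant reply text.
func chat(msgs []message) (string, error) {
	body, err := json.Marshal(map[string]any{"messages": msgs})
	if err != nil {
		return "", err
	}
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Choices []struct {
			Message message `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("no choices in response")
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	msgs := []message{{Role: "user", Content: "Tell me a romantic story, please."}}

	// First turn: this fills the KV cache with the prompt and the full reply.
	reply, err := chat(msgs)
	if err != nil {
		panic(err)
	}

	// Second turn: resend the whole conversation, including the previous
	// assistant reply, so the new prompt should be an exact prefix of what
	// is already in the cache.
	msgs = append(msgs,
		message{Role: "assistant", Content: reply},
		message{Role: "user", Content: "Thank you, that's all."})
	if _, err := chat(msgs); err != nil {
		panic(err)
	}

	// The "kv cache rm [N, end)" line in the server log shows how much of
	// the cache was discarded for the second request.
	fmt.Println("done")
}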

I have NOT been able to replicate this using llama-cli.exe; it works properly, without discarding any parts of the cache during such conversations.

Currently, I'm forced to use the CLI, because otherwise a 2-hour conversation with DeepSeek can easily turn into a 3-4 hour one due to the caching issues.

I attached the log from my latest test.

DeepSeek-R1-UD-IQ1_S.zip

First Bad Commit

No response

Relevant log output

@ggerganov
Member

I suspect that for some reason the text in the new request tokenizes slightly differently around the "vivid" word compared to what was generated.

@vnicolici
Author

OK, if there is anything else you want me to test to find the cause, I'm available.

@ggerganov
Member

In your example, what was the token after the . (id = 16)?

@vnicolici
Author

Two new lines ('\n\n'), between two sentences:

4.12.294.383 D slot process_toke: id  0 | task 0 | n_decoded = 473, n_remaining = -1, next token:   271 '

'

@vnicolici
Author

I just repeated my last test. It was "skip" instead of "vivid" this time, but again followed by a dot (16) and then by 271 (two new lines).

This time it wasn't even during the thinking, and it was the second instance of the 271 token. However, it was the first instance of a 16 followed by a 271.

9.23.600.369 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 1129, n_cache_tokens = 1129, truncated = 0
....
10.01.107.702 I slot update_slots: id  0 | task 1114 | kv cache rm [539, end)
...
5.35.683.014 D slot process_toke: id  0 | task 0 | n_decoded = 521, n_remaining = -1, next token:  4082 ' heart'
...
5.36.040.830 D slot process_toke: id  0 | task 0 | n_decoded = 522, n_remaining = -1, next token: 21429 ' skip'
...
5.36.407.388 D slot process_toke: id  0 | task 0 | n_decoded = 523, n_remaining = -1, next token:    16 '.'
...
5.36.764.273 D slot process_toke: id  0 | task 0 | n_decoded = 524, n_remaining = -1, next token:   271 '

'

@ggerganov
Member

Yes, as I expected, it tokenizes differently on the way back to the server:

./bin/llama-tokenize -m r1.gguf -p " more vivid."

init: model is vocab-only -- no computation will be performed
     0 -> '<|begin▁of▁sentence|>'
   850 -> ' more'
 33949 -> ' vivid'
    16 -> '.'
./bin/llama-tokenize -m r1.gguf -p " more vivid.\n\n"

init: model is vocab-only -- no computation will be performed
     0 -> '<|begin▁of▁sentence|>'
   850 -> ' more'
 33949 -> ' vivid'
   339 -> '.

'

This means that either we have a bug in the pre-processor regexes of the R1 tokenizer, or this is simply a limitation of the tokenizer. If the latter, then the client would have to start storing the raw token ids along with the text and send the ids for the new requests.

@vnicolici
Author

If the latter, then the client would have to start storing the raw token ids along with the text and send the ids for the new requests.

Does the /v1/chat/completions API support this at the moment? I mean, returning a list of token IDs, not just the tokens, and accepting a list of token IDs instead of regular text content as the input? Also, is there documentation for this API somewhere? If there is, I'm having trouble finding it. I only found links to the OpenAI documentation. I know they are compatible, but I assume there are differences.

@vnicolici
Author

After reading the READMEs, the source code, and experimenting a bit, it seems receiving/sending arrays of token IDs instead of strings is not supported by the API.

I managed to get the generated token IDs in the API response under "__verbose" / "tokens" using "return_tokens": true, but it only works when the server is started with --verbose, and it doesn't work if I enable streaming with "stream": true.

Then I tried sending the token IDs back with the next request, and the server didn't like such a request at all; I got Failed to parse messages: Missing content part type: 128798 for this request:

{
  "messages":[
    {"content":"You are a helpful assistant.","role":"system"},
    {"content":"Hello.","role":"user"},
    {"content":[128798,271,128799,271,19923,3,1730,588,342,8233,440,4316,33,53769,235],"role":"assistant"},
    {"content":"Who are you?","role":"user"}],
  "return_tokens":true
}

Looking at the source code, aside from a simple string, the content can be an array of objects, and each object must have a type attribute equal to text, in which case it is expected to also have another attribute named text. Looking further into the code, the purpose of that seems to be sending multiple pieces of text as an array, so nothing to do with tokens.
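
In other words, the only structured form the server seems to accept is something like this (text parts only, still no way to pass token IDs; this is just an illustration of the format described above, not a working token-based request):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
      {"type": "text", "text": "Hello."},
      {"type": "text", "text": "Who are you?"}
    ]}
  ]
}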

@ggerganov
Member

I would first check if the tokenizer works correctly by comparing it to a reference implementation.

@vnicolici
Author

While I have no experience with LLM tokenizers, my feeling is that there is nothing wrong with the tokenizer. The problematic text was originally generated as 2 tokens: the dot as the first token, and the two new lines as a single separate token. When you tokenized that text, it got converted to a single token consisting of the dot and the two new lines. That seems normal to me.

For example, if for some strange reason a model generates the text "abc" as 3 different tokens, one for each character, and there is also a token that matches the entire sequence, I would expect the tokenizer to return that single token, for "abc", not 3 tokens for "a", "b" and "c". If the tokenizer didn't behave like that, everything would always be tokenized with a token for each character, which doesn't make any sense.

I don't think it's impossible, in general, for a model to generate 2 (or more) non-equal sequences of tokens that represent the same text, depending on the seed, temperature and other factors. In that case, if you tokenize the text, it would at best match one of the two sequences, so it would behave "incorrectly" in 50% of such cases, and there would be no way to "fix" that.

In our particular case, involving new lines, maybe handling new lines differently in the tokenizer might solve the issue. But I'm not sure if that would be an actual fix, or just a workaround for this particular situation. And there is also the risk that fixing this case might introduce issues in other cases.

So, I have an idea that might make the differences between the initial generation and the re-tokenization irrelevant when determining how much of the cache to keep, but I'm not sure how feasible it would be to implement.

Basically, if I understand correctly, right now the cache holds an array of token IDs, the received prompt is converted to another array of token IDs (tokenized), and the two arrays are compared item by item until either a mismatch is found or one of the arrays ends; the cache is then truncated at that position. (There is also the similarity check for the situation when there are multiple slots, i.e. the minimum similarity required to match a slot, but for simplicity I'll ignore it for now.)

A possible way to fix this would be to do it the other way around: do not tokenize the input prompt initially, just leave it as a string. Then convert the cached token IDs to text, one by one, and for each one check whether the input prompt matches at the current position, advancing in the prompt by the token's text length if it does. When a difference is found between the text of a cached token and the text of the input prompt, the cache can be truncated at that position.

The rest of the prompt string, starting at the position that didn't match the cache, can then be tokenized (if needed) and processed. I'm not very familiar with how this works, but I'm guessing you would need to tokenize at least the "new" part of the prompt (the part that doesn't match the cache) to be able to process it and generate the assistant reply.
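
To make the idea more concrete, here is a rough Go sketch of both the current matching (as I understand it, not the actual llama.cpp code) and the proposed string-based matching; tokenToText is a hypothetical helper that returns the text of a single cached token:

package main

import (
	"fmt"
	"strings"
)

// Current behaviour, as I understand it: tokenize the incoming prompt and
// keep only the common prefix of the two token ID arrays.
func commonPrefixLen(cache, prompt []int) int {
	n := 0
	for n < len(cache) && n < len(prompt) && cache[n] == prompt[n] {
		n++
	}
	return n
}

// Proposed behaviour: detokenize the cached tokens one by one and advance
// through the raw prompt string instead of comparing token IDs.
func matchCacheAgainstPrompt(cache []int, prompt string, tokenToText func(int) string) int {
	pos := 0 // byte position in the prompt string
	for i, tok := range cache {
		text := tokenToText(tok)
		if !strings.HasPrefix(prompt[pos:], text) {
			return i // first cached position whose text no longer matches
		}
		pos += len(text)
	}
	return len(cache) // the whole cache matches; only prompt[pos:] needs tokenizing
}

func main() {
	// Toy example based on the tokens from this issue: the reply was generated
	// as ' more', ' vivid', '.', '\n\n' (850, 33949, 16, 271), but re-tokenizing
	// the same text merges '.' and '\n\n' into a single token 339.
	toText := map[int]string{850: " more", 33949: " vivid", 16: ".", 271: "\n\n"}
	cache := []int{850, 33949, 16, 271}
	retokenized := []int{850, 33949, 339}

	fmt.Println(commonPrefixLen(cache, retokenized)) // 2: the cache is cut right after ' vivid'
	fmt.Println(matchCacheAgainstPrompt(cache, " more vivid.\n\n",
		func(t int) string { return toText[t] })) // 4: the whole cache is kept
}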
