I may not be too familiar with either the KV cache or llama.cpp, but I'm currently unsure how the KV cache is used when running inference with a model.
Consider the following flow (Swift):
The LLM is initialized and "warmed up" with a system prompt. The warmup can be seen below (simplified):
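Here `tokenize` is a small helper around `llama_tokenize` (along the lines of the one in the llama.swiftui example), and `n_past` is just my running count of how many tokens are already in the KV cache:

```swift
// Simplified warmup: decode the system prompt once so its K/V entries end up in the cache.
// model, context, batch and n_past are class variables; the batch is assumed to have been
// created with llama_batch_init(...) with enough capacity for the prompt.
private func warmup(system_prompt: String) {
    let tokens = tokenize(system_prompt, addBOS: true)

    batch.n_tokens = 0
    for (i, tok) in tokens.enumerated() {
        let idx = Int(batch.n_tokens)
        batch.token[idx]      = tok
        batch.pos[idx]        = llama_pos(i)   // positions 0..<n for the system prompt
        batch.n_seq_id[idx]   = 1
        batch.seq_id[idx]![0] = 0
        batch.logits[idx]     = 0              // no sampling during warmup, so no logits requested
        batch.n_tokens += 1
    }

    // llama_decode runs the forward pass and stores the K/V entries for these tokens
    if llama_decode(context, batch) != 0 {
        print("llama_decode() failed during warmup")
    }
    n_past = tokens.count                      // the system prompt now lives in the KV cache
}
```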
The user now requests something, e.g. "Tell me about quantum mechanics". I run completion_init solely on the user input and then run the sampler to generate tokens (completion_init itself is sketched after this snippet):
```swift
public func run_inference(input: String) {
    completion_init(input)
    completion_stream()
}

private func completion_stream() {
    // sampler, context and batch exist as class variables and are managed internally
    let new_token_id: Token = llama_sampler_sample(sampler, context, batch.n_tokens - 1)
    // other code...
}
```
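For reference, completion_init boils down to the same append-and-decode pattern, just for the new user tokens only (again simplified, with `tokenize` and `n_past` as above):

```swift
// Simplified completion_init: append only the new user input to the existing cache.
private func completion_init(_ input: String) {
    let tokens = tokenize(input, addBOS: false)

    batch.n_tokens = 0
    for (i, tok) in tokens.enumerated() {
        let idx = Int(batch.n_tokens)
        batch.token[idx]      = tok
        batch.pos[idx]        = llama_pos(n_past + i)  // positions continue after the cached system prompt
        batch.n_seq_id[idx]   = 1
        batch.seq_id[idx]![0] = 0
        batch.logits[idx]     = 0
        batch.n_tokens += 1
    }
    batch.logits[Int(batch.n_tokens) - 1] = 1          // request logits only for the last token, for sampling

    // llama_decode attends to everything already stored in the KV cache (the system prompt)
    // and appends the K/V entries of these new tokens to it
    if llama_decode(context, batch) != 0 {
        print("llama_decode() failed in completion_init")
    }
    n_past += tokens.count
}
```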
Reasoning:
The reason I don't clear the KV cache on every inference is that I believe when llama_decode is called it updates the KV cache with the new tokens, so all I have to do is append the new user request rather than clearing the cache entirely and recomputing the entire system prompt + user request.
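In other words, I'm trying to avoid doing something like this on every request (sketch; `system_prompt` / `user_input` stand for the strings above, and the cache-clearing call is `llama_kv_cache_clear` in the build I'm on; the name may differ between versions):

```swift
// The alternative I'm trying to avoid: wipe the cache and recompute everything each time.
llama_kv_cache_clear(context)              // drops all cached K/V entries
n_past = 0
warmup(system_prompt: system_prompt)       // re-decode the system prompt from position 0...
completion_init(user_input)                // ...and then the user request on top of it
```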
Question:
When I run llama_sampler_sample on the second step (the new user input), is the KV cache used to sample the next token? That is, does the inference run over the system prompt that was decoded during the "warmup" plus the new user input, or does it operate solely on the current batch state?