I may not be too familiar with either the KV cache or llama.cpp, but I'm currently unsure how the KV cache is used when running inference with a model.
Consider the following flow (Swift):
The LLM is initialized and "warmed up" with a system prompt. The warmup can be seen below (simplified):
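Here `tokenize` is a small helper around `llama_tokenize` (along the lines of the one in the llama.swiftui example), and `n_past` is just my running count of how many tokens are already in the KV cache:

```swift
// Simplified warmup: decode the system prompt once so its K/V entries end up in the cache.
// model, context, batch and n_past are class variables; the batch is assumed to have been
// created with llama_batch_init(...) with enough capacity for the prompt.
private func warmup(system_prompt: String) {
    let tokens = tokenize(system_prompt, addBOS: true)

    batch.n_tokens = 0
    for (i, tok) in tokens.enumerated() {
        let idx = Int(batch.n_tokens)
        batch.token[idx]      = tok
        batch.pos[idx]        = llama_pos(i)   // positions 0..<n for the system prompt
        batch.n_seq_id[idx]   = 1
        batch.seq_id[idx]![0] = 0
        batch.logits[idx]     = 0              // no sampling during warmup, so no logits requested
        batch.n_tokens += 1
    }

    // llama_decode runs the forward pass and stores the K/V entries for these tokens
    if llama_decode(context, batch) != 0 {
        print("llama_decode() failed during warmup")
    }
    n_past = tokens.count                      // the system prompt now lives in the KV cache
}
```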
The user now requests something, e.g. "Tell me about quantum mechanics". I run completion_init solely on the user input and then run the sampler to generate tokens (completion_init itself is sketched after this snippet):
```swift
public func run_inference(input: String) {
    completion_init(input)
    completion_stream()
}

private func completion_stream() {
    // sampler, context and batch exist as class variables and are managed internally
    let new_token_id: Token = llama_sampler_sample(sampler, context, batch.n_tokens - 1)
    // other code...
}
```
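For reference, completion_init boils down to the same append-and-decode pattern, just for the new user tokens only (again simplified, with `tokenize` and `n_past` as above):

```swift
// Simplified completion_init: append only the new user input to the existing cache.
private func completion_init(_ input: String) {
    let tokens = tokenize(input, addBOS: false)

    batch.n_tokens = 0
    for (i, tok) in tokens.enumerated() {
        let idx = Int(batch.n_tokens)
        batch.token[idx]      = tok
        batch.pos[idx]        = llama_pos(n_past + i)  // positions continue after the cached system prompt
        batch.n_seq_id[idx]   = 1
        batch.seq_id[idx]![0] = 0
        batch.logits[idx]     = 0
        batch.n_tokens += 1
    }
    batch.logits[Int(batch.n_tokens) - 1] = 1          // request logits only for the last token, for sampling

    // llama_decode attends to everything already stored in the KV cache (the system prompt)
    // and appends the K/V entries of these new tokens to it
    if llama_decode(context, batch) != 0 {
        print("llama_decode() failed in completion_init")
    }
    n_past += tokens.count
}
```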
Reasoning:
The reason I don't clear the KV cache on every inference is that I believe when llama_decode is called it updates the KV cache with the new tokens, so all I have to do is append the new user request rather than clearing the cache entirely and recomputing the entire system prompt + user request.
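In other words, I'm trying to avoid doing something like this on every request (sketch; `system_prompt` / `user_input` stand for the strings above, and the cache-clearing call is `llama_kv_cache_clear` in the build I'm on; the name may differ between versions):

```swift
// The alternative I'm trying to avoid: wipe the cache and recompute everything each time.
llama_kv_cache_clear(context)              // drops all cached K/V entries
n_past = 0
warmup(system_prompt: system_prompt)       // re-decode the system prompt from position 0...
completion_init(user_input)                // ...and then the user request on top of it
```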
Question:
When I run llama_sampler_sample on the second step (the new user input), is the KV cache used to sample the next token? That is, does the inference run over the system prompt that was decoded during the "warmup" plus the new user input, or does it operate solely on the current batch state?