Hi,
I've read through Discussion #8947, but I'm still unclear on how to properly use prompt caching in llama.cpp.
We are running llama.cpp directly on the device, so we're not using llama-server. Our use case follows the common chat-style template, like:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "system prompt"
    },
    {
      "role": "user",
      "content": "xxxx"
    }
  ]
}
```
I've looked into llama-cli, and I see that it can store the evaluated prompt state in a custom .bin file (via --prompt-cache). However, that file seems fixed/static, so it's not clear how to dynamically append a user-specific prompt on top of that cached context.
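My rough understanding (possibly wrong) is that this file is just the serialized context state, which the C API also exposes. The snippet below is how I imagine the cache file would be produced programmatically from the fixed part of the prompt; it follows a recent llama.h (function names have changed between releases), and the model path, prompt text, and file name are only placeholders:

```cpp
// Sketch only: evaluate the fixed system prompt once and save the context state.
#include "llama.h"
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    llama_context * ctx = llama_init_from_model(model, llama_context_default_params());
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // Fixed part of the prompt, already formatted with the model's chat template.
    const std::string system_prompt = "<|system|>\nsystem prompt\n";  // placeholder text

    // Tokenize the system prompt (token count never exceeds byte count by much).
    std::vector<llama_token> tokens(system_prompt.size() + 16);
    const int n = llama_tokenize(vocab, system_prompt.c_str(), (int) system_prompt.size(),
                                 tokens.data(), (int) tokens.size(),
                                 /*add_special*/ true, /*parse_special*/ true);
    tokens.resize(n);

    // Evaluate it once so its KV-cache entries exist in the context.
    llama_decode(ctx, llama_batch_get_one(tokens.data(), (int) tokens.size()));

    // Persist the context state together with the token list it corresponds to.
    llama_state_save_file(ctx, "custom.bin", tokens.data(), tokens.size());

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```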
What I'm trying to do:

We have a fixed prompt template (e.g., the system prompt), and we want to:

1. Cache the tokens from the fixed part (the system prompt),
2. At runtime, append the user's input,
3. Then perform inference, ideally streaming the output.
The goal is to reuse the already-evaluated system prompt tokens to reduce total inference time, since the same system prompt is used for every interaction.
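Concretely, this is the shape of what I'm imagining for the per-request path, built on llama_state_load_file: restore the cached system-prompt state, evaluate only the new user suffix, then stream tokens. Again this is just a sketch against a recent llama.h (greedy sampling, hard-coded limits, and the answer() helper are my own placeholders):

```cpp
// Sketch only: per request, restore the cached state, append the user turn, stream the reply.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

// model, ctx, and vocab set up exactly as in the previous snippet.
void answer(llama_context * ctx, const llama_vocab * vocab, const std::string & user_part) {
    // 1) Restore the KV cache saved from the system prompt (capacity must cover the saved tokens).
    std::vector<llama_token> cached(4096);
    size_t n_cached = 0;
    llama_state_load_file(ctx, "custom.bin", cached.data(), cached.size(), &n_cached);

    // 2) Tokenize and evaluate only the user-specific suffix.
    //    user_part should be the chat-template-formatted user turn plus the assistant header.
    std::vector<llama_token> user_tokens(user_part.size() + 16);
    const int n = llama_tokenize(vocab, user_part.c_str(), (int) user_part.size(),
                                 user_tokens.data(), (int) user_tokens.size(),
                                 /*add_special*/ false, /*parse_special*/ true);
    user_tokens.resize(n);

    llama_decode(ctx, llama_batch_get_one(user_tokens.data(), (int) user_tokens.size()));

    // 3) Stream tokens until end-of-generation (greedy sampling for simplicity).
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    for (int i = 0; i < 512; ++i) {
        const llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;

        char piece[128];
        const int len = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        fwrite(piece, 1, len, stdout);
        fflush(stdout);

        // Feed the sampled token back in to continue generation.
        llama_decode(ctx, llama_batch_get_one(const_cast<llama_token *>(&tok), 1));
    }
    llama_sampler_free(smpl);
}
```

An alternative I've considered is keeping one long-lived context and, between requests, removing only the KV-cache entries that come after the system prompt, but I'm not sure which of the two patterns the prompt cache is meant for.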
So my main question is:
How can I use prompt caching in llama.cpp to speed up inference for a chat-style prompt that combines a fixed system prompt with dynamic user input?
I’m not sure how to structure the token input, or whether this is supported in the current design of the prompt cache. Any guidance or examples would be appreciated.
Thanks!