Hi,
I've read through Discussion #8947, but I'm still unclear on how to properly use prompt caching in llama.cpp.
We are running llama.cpp directly on the device, so we're not using llama-server. Our use case follows the common chat-style template, like:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "system prompt"
    },
    {
      "role": "user",
      "content": "xxxx"
    }
  ]
}
```
I've looked into llama-cli, and I see that it can store the evaluated prompt state in a custom .bin file (via --prompt-cache). However, that file seems fixed/static, so it's not clear how to dynamically append a user-specific prompt on top of that cached context.
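My rough understanding (possibly wrong) is that this file is just the serialized context state, which the C API also exposes. The snippet below is how I imagine the cache file would be produced programmatically from the fixed part of the prompt; it follows a recent llama.h (function names have changed between releases), and the model path, prompt text, and file name are only placeholders:

```cpp
// Sketch only: evaluate the fixed system prompt once and save the context state.
#include "llama.h"
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    llama_context * ctx = llama_init_from_model(model, llama_context_default_params());
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // Fixed part of the prompt, already formatted with the model's chat template.
    const std::string system_prompt = "<|system|>\nsystem prompt\n";  // placeholder text

    // Tokenize the system prompt (token count never exceeds byte count by much).
    std::vector<llama_token> tokens(system_prompt.size() + 16);
    const int n = llama_tokenize(vocab, system_prompt.c_str(), (int) system_prompt.size(),
                                 tokens.data(), (int) tokens.size(),
                                 /*add_special*/ true, /*parse_special*/ true);
    tokens.resize(n);

    // Evaluate it once so its KV-cache entries exist in the context.
    llama_decode(ctx, llama_batch_get_one(tokens.data(), (int) tokens.size()));

    // Persist the context state together with the token list it corresponds to.
    llama_state_save_file(ctx, "custom.bin", tokens.data(), tokens.size());

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```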
What I'm trying to do:

We have a fixed prompt template (e.g., the system prompt), and we want to:

1. Cache the tokens from the fixed part (the system prompt),
2. At runtime, append the user's input,
3. Then perform inference, ideally streaming the output.
The goal is to reuse the already-evaluated system prompt tokens to reduce total inference time, since the same system prompt is used for every interaction.
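Concretely, this is the shape of what I'm imagining for the per-request path, built on llama_state_load_file: restore the cached system-prompt state, evaluate only the new user suffix, then stream tokens. Again this is just a sketch against a recent llama.h (greedy sampling, hard-coded limits, and the answer() helper are my own placeholders):

```cpp
// Sketch only: per request, restore the cached state, append the user turn, stream the reply.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

// model, ctx, and vocab set up exactly as in the previous snippet.
void answer(llama_context * ctx, const llama_vocab * vocab, const std::string & user_part) {
    // 1) Restore the KV cache saved from the system prompt (capacity must cover the saved tokens).
    std::vector<llama_token> cached(4096);
    size_t n_cached = 0;
    llama_state_load_file(ctx, "custom.bin", cached.data(), cached.size(), &n_cached);

    // 2) Tokenize and evaluate only the user-specific suffix.
    //    user_part should be the chat-template-formatted user turn plus the assistant header.
    std::vector<llama_token> user_tokens(user_part.size() + 16);
    const int n = llama_tokenize(vocab, user_part.c_str(), (int) user_part.size(),
                                 user_tokens.data(), (int) user_tokens.size(),
                                 /*add_special*/ false, /*parse_special*/ true);
    user_tokens.resize(n);

    llama_decode(ctx, llama_batch_get_one(user_tokens.data(), (int) user_tokens.size()));

    // 3) Stream tokens until end-of-generation (greedy sampling for simplicity).
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    for (int i = 0; i < 512; ++i) {
        const llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;

        char piece[128];
        const int len = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        fwrite(piece, 1, len, stdout);
        fflush(stdout);

        // Feed the sampled token back in to continue generation.
        llama_decode(ctx, llama_batch_get_one(const_cast<llama_token *>(&tok), 1));
    }
    llama_sampler_free(smpl);
}
```

An alternative I've considered is keeping one long-lived context and, between requests, removing only the KV-cache entries that come after the system prompt, but I'm not sure which of the two patterns the prompt cache is meant for.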
So my main question is:
How can I use prompt caching in llama.cpp to speed up inference for a chat-style prompt that combines a fixed system prompt with dynamic user input?
I’m not sure how to structure the token input, or whether this is supported in the current design of the prompt cache. Any guidance or examples would be appreciated.
Thanks!