llama : reuse compute graphs #14482
Conversation
Force-pushed the branch: 2f577c5 → 30b4d4e, f61b0f7 → d9e1781, 0d9c3d4 → fc4fdf6 (ggml-ci), fc4fdf6 → 76681e3.
This should be ready for review. Currently, there is some small gain for Metal. It would be interesting to try to reuse the Metal command buffers to speed this up even further on the backend side. Currently, we use […]
res &= self_kq_mask->ne[0] == mctx->get_n_kv();
res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);

res &= mctx->get_supports_set_rows(); // TODO: tmp
If `update()` is implemented for the recurrent cache, I think it could work even without adapting it to `ggml_set_rows`, because the `head` offset tends to be the same for similar consecutive ubatches in `find_slot`.

That might not work as well once multiple recurrent state cells per sequence are implemented (because they won't get re-used as much), but at that point it should be possible to use `ggml_set_rows`.
Yes, it has to work. As long as the check for the `head` is correctly added to the `update()` of the respective inputs, it should be good.
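For illustration, a minimal sketch of the kind of check being discussed, assuming a hypothetical recurrent-state input that records the `head` offset it was built with (the struct and member names are illustrative, not the actual implementation):

```cpp
// Illustrative sketch only: the previous graph stays reusable as long as find_slot
// picks the same head offset for the new ubatch as for the one the graph was built with.
#include <cstdint>

struct llm_graph_input_recurrent_sketch {
    uint32_t head_prev = 0; // head offset captured when the graph was constructed

    bool update(uint32_t head_new) const {
        bool res = true;

        // same head -> the views into the recurrent state buffer are unchanged,
        // so the previously built graph can be evaluated again as-is
        res &= head_new == head_prev;

        return res;
    }
};
```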
An update on this - it seems that […] Edit: prototyped this in #14570. Does not seem worth pursuing, as the gains are microscopic.
I tested this on CUDA on various models I had lying around, and I see a perf regression on a larger model there - not sure if I'm doing something wrong. I cherry-picked this PR on top of #14551 and compared with commit1 = 14551 and commit2 = 14551 + this PR. Also interesting is a 2x(?) speed-up on qwen2lvl 3B.
To be sure, I ran it again.
But before benchmarking, I think you should make sure that the generated results when graph reuse is enabled are coherent (see the […]).

And also, from the results that you posted, it seems that there is a lot of variability on your system for this […]
@ggerganov I re-ran with r=100, and tg 64 and 128. I see quite a bit of variability at tg128, but tg64 is pretty tight (<1% variability). Also confirming the […]
add_opt(common_arg(
    {"--graph-reuse", "-gr"},
    string_format("reuse previous compute graphs when possible (default: %s)"
        "[(more info)](https://github.com/ggml-org/llama.cpp/pull/14482)", params.graph_reuse ? "true" : "false"),
    [](common_params & params) {
        params.graph_reuse = true;
    }
).set_env("LLAMA_ARG_GRAPH_REUSE"));
Are there any downsides of enabling this option, besides the removal of the preemptive scheduler reset?
@@ -34,6 +34,31 @@ struct llama_ubatch {
    llama_seq_id * seq_id_unq; // [n_seqs_unq]    | s | seq_id
    int32_t      * seq_idx;    // [LLAMA_MAX_SEQ] | - | seq_idx
    int8_t       * output;     // [n_tokens]      | i | -

    bool is_same(const llama_ubatch & other) const {
The name of this function doesn't tell me much. I would expect it to be an `operator==`, but it clearly isn't. Why is it necessary for `hparams` and `cparams`? Shouldn't the graph result be associated to a `llama_context`?
// return true if the resulting input tensors using the provided graph parameters would be
// the same as the previous input tensors that we have currently stored in the object
virtual bool update(const llm_graph_params & params) {
`need_update`, `can_reuse`, ..
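For reference, a hedged sketch of how a concrete input might implement this hook, following the same pattern as the `self_kq_mask` checks shown earlier (the struct and member names are illustrative, not the actual llama.cpp classes):

```cpp
// Illustrative sketch only - mirrors the update() contract from the comment above.
// Assumed internal header: "llama-graph.h" declares llm_graph_params (with the
// 'ubatch' member used in the checks shown earlier) and the ggml tensor type.
#include "llama-graph.h"

struct llm_graph_input_sketch {
    ggml_tensor * inp = nullptr; // tensor created when the graph was first built

    bool update(const llm_graph_params & params) {
        bool res = true;

        // reusable only if the new ubatch would produce exactly the same tensor shape
        res &= inp->ne[1] == (int64_t) params.ubatch.n_tokens;

        return res;
    }
};
```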
target #14285
Reuse computation graphs from the previous ubatch when possible. Works with any batch size and any model.
Note: The functionality currently requires the `LLAMA_SET_ROWS` environment variable from #14285.

This functionality requires the `ggml_set_rows()` operator to be supported (see #14285). In order to be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. `inp_embd`, `inp_pos`, `inp_attn`, etc.).

This PR adds logic to update a previous `llm_graph_result` by verifying that the new `llm_graph_params` would result in the same tensor shapes. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new `ubatch` is compatible. See the new `llm_graph_result::update()` method (llama.cpp/src/llama-graph.h, lines 506 to 525 in fc4fdf6).
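A rough sketch of the reuse check described above, with assumed names (the free-function form and the `inputs()` accessor are illustrative; the actual method is `llm_graph_result::update()` at the lines referenced above):

```cpp
// Rough illustration of the graph-reuse check described above (not the actual llama.cpp code).
// Assumed internal header: "llama-graph.h" declares llm_graph_result and llm_graph_params.
#include "llama-graph.h"

// A previously built graph can be reused only if every input reports that the new
// parameters would produce the same tensor shapes as before.
static bool graph_can_be_reused(llm_graph_result & res_prev, const llm_graph_params & params_new) {
    bool res = true;

    // 'inputs()' is an assumed accessor over all graph input objects
    for (auto & input : res_prev.inputs()) {
        res &= input->update(params_new); // each input validates its own shapes
    }

    // the scheduler is no longer reset preemptively after a batch, so when this
    // returns true the buffers of the previous graph are still allocated and the
    // graph can simply be evaluated again with fresh input data
    return res;
}
```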
The other change that is needed is to introduce a way to swap the `llama_memory_context` of all graph inputs, so that the new call to `llm_graph_result_i::set_inputs()` uses the correct context from the current `ubatch`. This is performed by calling the `llm_graph_input_i::update()` method of all input tensors.

To enable this feature, define the `LLAMA_SET_ROWS` environment variable and add the new `--graph-reuse` CLI argument to the llama.cpp tools.

API Changes
- `bool llama_context_params::graph_reuse`. Default is `false`.
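A hedged usage sketch of the new parameter: the `graph_reuse` field is the addition from this PR, the surrounding calls are the standard llama.cpp C API, and the model path is just a placeholder.

```cpp
// Minimal usage sketch (assumes this PR is applied; error handling omitted).
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.graph_reuse = true; // opt in to reusing compute graphs between ubatches

    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... evaluate batches as usual; compatible consecutive ubatches can now reuse the graph ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```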
Tests
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa -gr

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128 -gr
Benchmark on M2 Ultra:
TODO
- `is_same` methods?

Next PRs
- `llama_graph_result_i` interface - does not seem to have any purpose
- Stop passing `ggml_cgraph * gf` everywhere - simply move it to `llm_graph_context`