How does top-k select a single expert in Llama 4? #14161
Unanswered
josemonsalve2 asked this question in Q&A
I've been studying the Llama 4 implementation, and there is something that is bugging me. However, since I am no expert on models, I am posting this as a question. If there is something to improve in the llama.cpp implementation, I will move it to an issue.
The question concerns expert selection in the MoE part of Llama 4, in particular the selection node (topk). My understanding is that this node is supposed to select only the top-1 expert. However, when I follow the tensors coming out of the switch (with the model built for more than one token), it seems to pass each token to all the experts.
The Switch
This is the current block diagram for the switch (obtained via dot printing):

In red is the tensor in question (ffn_moe_topk). I am printing the values that are coming out of this tensor, and I was expecting this to be a sparse tensor with most values being 0.
This is because I am assuming that the mat_mul_id (where id is the red arrow) would select only a single expert.
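To make the selection step concrete, here is a minimal sketch in plain Python, not ggml code. It assumes the ids operand holds expert indices (one per selected slot) rather than a 0/1 mask over experts; under that assumption the tensor would not look sparse even when only one expert is used per token. The function name mul_mat_id is borrowed from the graph, while the helper matvec and all the data are hypothetical.

```python
def matvec(w, x):
    # Plain matrix-vector product: one output value per row of w.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mul_mat_id(experts, ids, xs):
    # For each token, apply only the experts whose indices are listed in ids.
    # ids[t] is a list of expert indices for token t (length 1 for top-1).
    return [[matvec(experts[e], x) for e in token_ids]
            for token_ids, x in zip(ids, xs)]

# Three hypothetical 2x2 experts.
experts = [
    [[1, 0], [0, 1]],  # expert 0: identity
    [[2, 0], [0, 2]],  # expert 1: scale by 2
    [[0, 1], [1, 0]],  # expert 2: swap components
]

xs = [[1, 2], [3, 4]]  # two tokens
ids = [[1], [2]]       # top-1 expert index per token (assumed layout)

out = mul_mat_id(experts, ids, xs)
print(out)
```

With this layout, a value such as 2 in ids means "use expert 2", so non-zero entries do not imply that several experts are active.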
The tensor values
However, here are the values I obtain from the tensor:

The order, which is determined by the argsort node in the switch block, determines the best expert. Hence, the first element of each of the vectors represents the top-1. However, I was expecting that after the ffn_moe_topk node, the rest of the tensor would be 0. In the figure, we're printing the src2 from ffn_moe_up and the data from ffn_moe_topk (these are the same, hence the duplicated lines).
So basically my question is: how does this result in a top-1 selection?
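One possible explanation, sketched below in plain Python, is that the top-k node is a strided view over the full argsort output rather than a separate, smaller buffer. This is an assumption about how the node is built, not something confirmed in this post. If it holds, dumping the raw data behind the tensor would show every expert index per token, while a consumer that respects the view's shape and row stride would read only the first k entries of each row. The helper names and the scores are hypothetical.

```python
def argsort_desc(row):
    # Indices of `row` ordered from highest to lowest score.
    return sorted(range(len(row)), key=lambda i: -row[i])

def topk_view(buffer, n_experts, n_tokens, k):
    # Read only the first k entries of each row, stepping by the full
    # row stride (n_experts), the way a strided view would be read.
    return [buffer[t * n_experts : t * n_experts + k] for t in range(n_tokens)]

# Hypothetical router scores: 3 tokens, 4 experts.
scores = [
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.10, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
n_tokens, n_experts, k = 3, 4, 1

# Flat buffer holding the complete argsort result, row after row.
buffer = [i for row in scores for i in argsort_desc(row)]

# What a raw dump of the underlying data would show: all indices.
print("raw buffer:", buffer)

# What a stride-aware reader of the top-1 view would see: one index per token.
selected = topk_view(buffer, n_experts, n_tokens, k)
print("top-1 view:", selected)
```

Under this assumption the values after the first element of each row are leftovers of the sort, not additional selected experts, so a debug print should follow the tensor's shape and strides rather than the size of the backing buffer.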
How am I running?
Just for the sake of completeness, I am running on the CPU via:
with default parameters.
Single token is not an issue
Additionally, whenever we have a single token, the selection is clear because the resulting tensors are specific to a single expert (as seen below).
How am I printing
Finally, I am printing this by modifying the ggml_compute_forward function:
Thanks in advance.