How does top-k select a single expert in Llama 4? #14161

josemonsalve2 · 2025-06-13T03:41:34Z

I've been studying the Llama 4 implementation, and something is bugging me. Since I am no expert on models, I am posting this as a question first; if it turns out there is something to improve in the llama.cpp implementation, I will move it to an issue.

The question is about expert selection in the MoE part of Llama 4, in particular the selection node (topk). My understanding is that this node is supposed to select only the top-1 expert. However, when I follow the tensors coming out of the switch (with the model built for more than one token), it appears to pass each token to all the experts.

The Switch

This is the current block diagram for the switch (obtained via dot printing):

The tensor in question (ffn_moe_topk) is shown in red. I am printing the values coming out of this tensor, and I expected a sparse tensor with most values being 0.

This is because I am assuming that mul_mat_id (whose ids input is the red arrow) would select only a single expert.
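For reference, here is a minimal sketch of what an id-driven matmul could look like. It is illustrative only and is not the ggml kernel; the function name `moe_matmul_by_id`, its parameters, and the memory layout are all assumptions made for this example. The point it shows is that such an op can treat the ids tensor as a list of expert indices (reading only the first `k` entries of each row) rather than as a sparse mask, so non-zero values in the ids are expected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (NOT the ggml implementation; names are hypothetical).
 * experts: n_expert matrices, each n_out x n_in, stored back to back (row-major).
 * x:       n_tokens input rows of n_in floats.
 * ids:     expert indices; row t starts at ids + t * ids_stride, and only the
 *          first k entries of each row are read.
 * y:       output, n_tokens x k x n_out floats.
 */
static void moe_matmul_by_id(const float *experts, int n_expert, int n_out, int n_in,
                             const float *x, int n_tokens,
                             const int32_t *ids, size_t ids_stride, int k,
                             float *y) {
    for (int t = 0; t < n_tokens; ++t) {
        for (int j = 0; j < k; ++j) {
            /* Look up which expert this token was routed to. */
            const int32_t e = ids[(size_t)t * ids_stride + j];
            assert(e >= 0 && e < n_expert);
            const float *w = experts + (size_t)e * n_out * n_in;
            float *out = y + ((size_t)t * k + j) * n_out;
            /* Plain matrix-vector product with the selected expert only. */
            for (int o = 0; o < n_out; ++o) {
                float acc = 0.0f;
                for (int i = 0; i < n_in; ++i) {
                    acc += w[(size_t)o * n_in + i] * x[(size_t)t * n_in + i];
                }
                out[o] = acc;
            }
        }
    }
}
```

With `k = 1`, each token is multiplied by exactly one expert matrix, whatever else is stored in the ids buffer beyond the first entry of each row.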

The tensor values

However, here are the values I obtain from the tensor:

The order, which is determined by the argsort node in the switch block, ranks the experts, so the first element of each vector is the top-1 expert. However, I expected the rest of the tensor to be 0 after the ffn_moe_topk node. In the figure, we print both src2 of ffn_moe_up and the data of ffn_moe_topk; they are the same tensor, hence the duplicated lines.

So basically my question is: how does this result in a top-1 selection?
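One way a full ranking and a top-1 selection can coexist is if the top-k node is a strided view over the argsort output instead of a zero-padded copy. The sketch below shows that idea under stated assumptions: `argsort_desc`, `topk_view`, `make_topk_view`, and `topk_at` are hypothetical names for this example, and whether llama.cpp builds ffn_moe_topk exactly this way is not confirmed in this thread (the print hook further down does fire under GGML_OP_VIEW, which is at least consistent with it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rank the experts of one token by descending logit (simple insertion sort). */
static void argsort_desc(const float *logits, int n_expert, int32_t *order) {
    for (int i = 0; i < n_expert; ++i) {
        order[i] = i;
    }
    for (int i = 1; i < n_expert; ++i) {
        const int32_t cur = order[i];
        int j = i - 1;
        while (j >= 0 && logits[order[j]] < logits[cur]) {
            order[j + 1] = order[j];
            --j;
        }
        order[j + 1] = cur;
    }
}

/* Hypothetical strided view: exposes k of the n_expert sorted ids per token. */
typedef struct {
    const int32_t *data;  /* points into the argsort buffer, nothing is copied */
    int k;                /* ids visible per token */
    int n_tokens;
    size_t row_stride;    /* elements between rows: n_expert, not k */
} topk_view;

static topk_view make_topk_view(const int32_t *sorted, int n_expert, int n_tokens, int k) {
    assert(k >= 1 && k <= n_expert);
    topk_view v = { sorted, k, n_tokens, (size_t)n_expert };
    return v;
}

/* Read entry j of a token through the view, honoring the row stride. */
static int32_t topk_at(topk_view v, int token, int j) {
    assert(token >= 0 && token < v.n_tokens);
    assert(j >= 0 && j < v.k);
    return v.data[(size_t)token * v.row_stride + j];
}
```

Under this reading, the underlying buffer still holds the complete ranking for every token, and only a consumer that honors the view's shape and stride sees a single id per token.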

How am I running?

For the sake of completeness, I am running on the CPU via:

llama-cli -hf unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF:Q8_0

with default parameters.

Single token is not an issue

Additionally, with a single token the selection is clear, because the resulting tensors are specific to a single expert (as seen below).

How am I printing

Finally, I am printing this by modifying the ggml_compute_forward function:

            case GGML_OP_VIEW:
            {
                // Check if name starts with ffn_moe_topk. Print name and value of the tensor
                if (params->ith == 0) {
                    if (strncmp(tensor->name, "ffn_moe_topk", 12) == 0) {
                        GGML_PRINT_TENSOR(int32_t, name, tensor);
                    }
                }

                ggml_compute_forward_view(params, tensor);
            }
            break;
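The definition of the print macro used in the hook is not shown, so how it walks the tensor is unknown. If it reads the underlying buffer contiguously, a view would appear to contain its parent's data. Below is a minimal sketch of a shape- and stride-aware formatter for a 2-D int32 view. The struct `i32_view2d` and the function `format_i32_view` are hypothetical stand-ins written for this example, modelled on ggml's convention of `ne` as element counts and `nb` as strides in bytes; they are not part of ggml.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the few tensor fields needed here. */
typedef struct {
    const void *data;
    int64_t ne[2];  /* number of elements per dimension */
    size_t  nb[2];  /* stride per dimension, in bytes */
} i32_view2d;

/*
 * Write the logical contents of the view into buf, one row per line.
 * Returns the number of characters written (without the terminator),
 * or -1 if buf is too small.
 */
static int format_i32_view(const i32_view2d *v, char *buf, size_t cap) {
    size_t used = 0;
    assert(cap > 0);
    buf[0] = '\0';
    for (int64_t i1 = 0; i1 < v->ne[1]; ++i1) {
        for (int64_t i0 = 0; i0 < v->ne[0]; ++i0) {
            /* Address each element via the byte strides, not via a flat index. */
            const char *p = (const char *)v->data
                          + (size_t)i1 * v->nb[1] + (size_t)i0 * v->nb[0];
            int32_t val;
            memcpy(&val, p, sizeof val);
            int n = snprintf(buf + used, cap - used, "%s%d", i0 ? " " : "", (int)val);
            if (n < 0 || (size_t)n >= cap - used) {
                return -1;
            }
            used += (size_t)n;
        }
        int n = snprintf(buf + used, cap - used, "\n");
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

For a view with one column per row and a row stride of a full ranking, this emits only the first id of each row, even though the parent buffer holds more values.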

Thanks in advance

Replies: 0 comments

Select a reply

Uh oh!

How does Topk selects a single expert in llama 4? #14161

Uh oh!

Uh oh!

josemonsalve2 Jun 13, 2025

The Switch

The tensor values

How am I running?

Single token is not an issue

How am I printing

Replies: 0 comments

josemonsalve2
Jun 13, 2025