Does llama.cpp place one expert entirely on a single device? #11784
Unanswered
Imagium719 asked this question in Q&A
Hi there! I have a question about the behavior of llama.cpp during multi-GPU inference with MoE models. Specifically, does llama.cpp aim to place each expert entirely on a single device as much as possible, or does it slice experts and distribute them across multiple devices (tensor parallelism)? The distinction matters because the former approach has minimal requirements for GPU interconnect bandwidth, while the latter requires significantly higher bandwidth. Thanks in advance for your insights!

Replies: 1 comment

- The default behavior is to split the model by whole layers, so it doesn't slice the experts. There's no tensor parallelism in this case.
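For reference, here is a minimal sketch of how that split behavior can be selected through llama.cpp's C API (`llama.h`). The field and enum names (`split_mode`, `tensor_split`, `LLAMA_SPLIT_MODE_LAYER`, `LLAMA_SPLIT_MODE_ROW`) match recent versions of the header, but exact names can shift between releases, and the model path is a placeholder; treat this as an illustration, not a definitive reference.

```cpp
#include "llama.h"

#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();

    // Offload all layers to the available GPUs.
    params.n_gpu_layers = 999;

    // Default behavior: assign whole layers to devices. Each expert tensor
    // lives on exactly one GPU; only activations cross the interconnect.
    params.split_mode = LLAMA_SPLIT_MODE_LAYER;

    // Tensor parallelism would instead split individual weight matrices by
    // rows across GPUs, which needs far more interconnect bandwidth:
    //     params.split_mode = LLAMA_SPLIT_MODE_ROW;

    // Optional: fraction of the model to place on each device (e.g. 60/40).
    static const float tensor_split[] = { 0.6f, 0.4f };
    params.tensor_split = tensor_split;

    // "model.gguf" is a placeholder path.
    llama_model * model = llama_model_load_from_file("model.gguf", params);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and run inference as usual ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

The same choice is exposed on the `llama-cli` / `llama-server` command line in recent builds via `-sm/--split-mode {none,layer,row}` and `-ts/--tensor-split`.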