Does llama.cpp place one expert entirely on a single device? #11784
Unanswered
Imagium719 asked this question in Q&A
Hi there! I have a question about the behavior of llama.cpp during multi-GPU inference with MoE models. Specifically, does llama.cpp aim to place each expert entirely on a single device as much as possible, or does it slice experts and distribute them across multiple devices (tensor parallelism)? The distinction matters because the former approach has minimal requirements for GPU interconnect bandwidth, while the latter requires significantly higher bandwidth. Thanks in advance for your insights!

Replies: 1 comment

- The default behavior is to split the model by whole layers, so it doesn't slice the experts. There's no tensor parallelism in this case.
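For reference, here is a minimal sketch of how that split behavior can be selected through llama.cpp's C API (`llama.h`). The field and enum names (`split_mode`, `tensor_split`, `LLAMA_SPLIT_MODE_LAYER`, `LLAMA_SPLIT_MODE_ROW`) match recent versions of the header, but exact names can shift between releases, and the model path is a placeholder; treat this as an illustration, not a definitive reference.

```cpp
#include "llama.h"

#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();

    // Offload all layers to the available GPUs.
    params.n_gpu_layers = 999;

    // Default behavior: assign whole layers to devices. Each expert tensor
    // lives on exactly one GPU; only activations cross the interconnect.
    params.split_mode = LLAMA_SPLIT_MODE_LAYER;

    // Tensor parallelism would instead split individual weight matrices by
    // rows across GPUs, which needs far more interconnect bandwidth:
    //     params.split_mode = LLAMA_SPLIT_MODE_ROW;

    // Optional: fraction of the model to place on each device (e.g. 60/40).
    static const float tensor_split[] = { 0.6f, 0.4f };
    params.tensor_split = tensor_split;

    // "model.gguf" is a placeholder path.
    llama_model * model = llama_model_load_from_file("model.gguf", params);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and run inference as usual ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

The same choice is exposed on the `llama-cli` / `llama-server` command line in recent builds via `-sm/--split-mode {none,layer,row}` and `-ts/--tensor-split`.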