-
According to our evaluation with KTransformers, the distribution of experts in Mixtral and Qwen2-57B-A14 is very imbalanced, so it would be beneficial to store only the most frequently used experts on the GPU. In contrast, this strategy does not bring significant benefits for DeepSeek-V2, because it has a huge number of experts and is trained with a more balanced recipe. We plan to implement this strategy in KTransformers to determine appropriate parameters, which could then inform a future implementation in llama.cpp. We are not very familiar with the llama.cpp codebase, so we are unable to upstream such modifications ourselves, but we are willing to help put together a PR if there is interest in the llama.cpp community.
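For concreteness, here is a minimal sketch of the placement idea under the stated assumptions (per-expert activation counts from a profiling run, a fixed per-expert size, and a VRAM budget). The names and structure are purely illustrative and do not correspond to any existing KTransformers or llama.cpp API.

```cpp
// Illustrative sketch (not a real KTransformers/llama.cpp API): given per-expert
// activation counts gathered during a profiling run, greedily pin the most
// frequently selected experts on the GPU until a VRAM budget is exhausted.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct ExpertPlacement {
    std::vector<int> gpu_experts;  // experts to keep resident on the GPU
    std::vector<int> cpu_experts;  // experts left in host memory
};

ExpertPlacement plan_expert_offload(const std::vector<uint64_t> &activation_counts,
                                    size_t bytes_per_expert,
                                    size_t vram_budget_bytes) {
    const int n_experts = (int) activation_counts.size();

    // Rank experts from most to least frequently selected.
    std::vector<int> order(n_experts);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return activation_counts[a] > activation_counts[b];
    });

    // Greedily assign the hottest experts to the GPU until the budget is spent.
    ExpertPlacement plan;
    size_t used = 0;
    for (int e : order) {
        if (used + bytes_per_expert <= vram_budget_bytes) {
            plan.gpu_experts.push_back(e);
            used += bytes_per_expert;
        } else {
            plan.cpu_experts.push_back(e);
        }
    }
    return plan;
}
```

This is where the imbalance matters: with a skewed router, a small GPU budget captures most of the traffic, while for a balanced model like DeepSeek-V2 the same budget covers only a proportional slice of activations.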
-
Out of the box llama.cpp with
-
Bump as this seems more relevant than ever (and sadly the KTransformers project seems to have died). Also linking this thread: as NUMA expert distribution is probably worth considering at the same time.
-
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/deepseek-v2-injection.md
Looks particularly interesting - could even go one step further and run something similar to `llama-imatrix` to find the frequency of expert selections, then rank experts from "most selected" to "least selected" and allow a variable offload like `-ngpu`.
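As a rough sketch of what such an imatrix-style counter could look like: the `ExpertUsage` struct and its `record` hook below are assumptions for illustration only; `llama-imatrix` does not currently expose router decisions, so this is not existing llama.cpp functionality.

```cpp
// Illustrative sketch: accumulate how often the router selects each expert so
// experts can later be ranked "most selected" -> "least selected" and the top N
// offloaded to the GPU, analogous to a variable -ngpu. The hook points are
// hypothetical; nothing in llama.cpp currently calls record() like this.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct ExpertUsage {
    // counts[layer][expert] = number of tokens the router sent to that expert
    std::vector<std::vector<uint64_t>> counts;

    ExpertUsage(int n_layers, int n_experts)
        : counts(n_layers, std::vector<uint64_t>(n_experts, 0)) {}

    // Hypothetical hook: called once per token per MoE layer with the top-k
    // expert ids chosen by the router.
    void record(int layer, const std::vector<int> &selected_expert_ids) {
        for (int id : selected_expert_ids) {
            counts[layer][id]++;
        }
    }

    // Experts of one layer ranked from most to least selected; the first N
    // entries would be the candidates to keep resident on the GPU.
    std::vector<int> ranked(int layer) const {
        std::vector<int> order(counts[layer].size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return counts[layer][a] > counts[layer][b];
        });
        return order;
    }
};
```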