-
According to our evaluation with KTransformers, the distribution of experts in Mixtral and Qwen2-57B-A14 is very imbalanced, so it would be beneficial to store only the most frequently used experts on the GPU. In contrast, this strategy does not bring significant benefits for DeepSeek-V2, because it has a huge number of experts and is trained with a more balanced recipe. We plan to implement this strategy in KTransformers to determine appropriate parameters, which could then inform a future implementation in llama.cpp. We are not very familiar with the llama.cpp codebase, so we are unable to upstream such modifications ourselves, but we are willing to help put together a PR if there is interest in the llama.cpp community.
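For concreteness, here is a minimal sketch of the placement idea under the stated assumptions (per-expert activation counts from a profiling run, a fixed per-expert size, and a VRAM budget). The names and structure are purely illustrative and do not correspond to any existing KTransformers or llama.cpp API.

```cpp
// Illustrative sketch (not a real KTransformers/llama.cpp API): given per-expert
// activation counts gathered during a profiling run, greedily pin the most
// frequently selected experts on the GPU until a VRAM budget is exhausted.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct ExpertPlacement {
    std::vector<int> gpu_experts;  // experts to keep resident on the GPU
    std::vector<int> cpu_experts;  // experts left in host memory
};

ExpertPlacement plan_expert_offload(const std::vector<uint64_t> &activation_counts,
                                    size_t bytes_per_expert,
                                    size_t vram_budget_bytes) {
    const int n_experts = (int) activation_counts.size();

    // Rank experts from most to least frequently selected.
    std::vector<int> order(n_experts);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return activation_counts[a] > activation_counts[b];
    });

    // Greedily assign the hottest experts to the GPU until the budget is spent.
    ExpertPlacement plan;
    size_t used = 0;
    for (int e : order) {
        if (used + bytes_per_expert <= vram_budget_bytes) {
            plan.gpu_experts.push_back(e);
            used += bytes_per_expert;
        } else {
            plan.cpu_experts.push_back(e);
        }
    }
    return plan;
}
```

This is where the imbalance matters: with a skewed router, a small GPU budget captures most of the traffic, while for a balanced model like DeepSeek-V2 the same budget covers only a proportional slice of activations.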
-
Out of the box llama.cpp with
-
Bump as this seems more relevant than ever (and sadly the KTransformers project seems to have died). Also linking this thread: as NUMA expert distribution is probably worth considering at the same time.
-
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/deepseek-v2-injection.md
Looks particularly interesting - could even go one step further and run something similar to `llama-imatrix` to find the frequency of expert selections, then rank experts from "most selected" to "least selected" and allow a variable offload like `-ngpu`.
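As a rough sketch of what such an imatrix-style counter could look like: the `ExpertUsage` struct and its `record` hook below are assumptions for illustration only; `llama-imatrix` does not currently expose router decisions, so this is not existing llama.cpp functionality.

```cpp
// Illustrative sketch: accumulate how often the router selects each expert so
// experts can later be ranked "most selected" -> "least selected" and the top N
// offloaded to the GPU, analogous to a variable -ngpu. The hook points are
// hypothetical; nothing in llama.cpp currently calls record() like this.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct ExpertUsage {
    // counts[layer][expert] = number of tokens the router sent to that expert
    std::vector<std::vector<uint64_t>> counts;

    ExpertUsage(int n_layers, int n_experts)
        : counts(n_layers, std::vector<uint64_t>(n_experts, 0)) {}

    // Hypothetical hook: called once per token per MoE layer with the top-k
    // expert ids chosen by the router.
    void record(int layer, const std::vector<int> &selected_expert_ids) {
        for (int id : selected_expert_ids) {
            counts[layer][id]++;
        }
    }

    // Experts of one layer ranked from most to least selected; the first N
    // entries would be the candidates to keep resident on the GPU.
    std::vector<int> ranked(int layer) const {
        std::vector<int> order(counts[layer].size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return counts[layer][a] > counts[layer][b];
        });
        return order;
    }
};
```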