[Roadmap] CPU Performance Optimization for SGLang and Flashinfer 24'Q4 #1

Open · 19 of 36 tasks · mingfeima opened this issue Dec 16, 2024 · 6 comments

mingfeima (Owner) commented Dec 16, 2024

🚀 The feature, motivation and pitch

The goal of this project is to optimize the performance of SGLang on Intel Xeon Scalable Processors. It targets SPR (4th gen), EMR (5th gen), and GNR (6th gen) with Intel® Advanced Matrix Extensions (AMX) support.

  • Optimize the flashinfer backend on the CPU device (SGLang currently supports the triton and flashinfer backends).
  • Target good efficiency, i.e. performance per dollar.
  • Upstream the optimizations to the main branch and ship them with releases.
  • Focus on AVX-512 and AMX; provide fallbacks for other ISAs.

For the current stage, focus on customer requests first and then gradually increase model coverage:

  • DeepSeek - MHA
  • DeepSeekV2 - MLA

1. Flashinfer Kernel Optimizations on CPU

layernorm

  • rmsnorm
  • fused_add_rmsnorm
  • gemma_rmsnorm
  • gemma_fused_add_rmsnorm
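
For reference, the math these norm kernels implement is small enough to state in a few lines of PyTorch. This is a semantics-only sketch (the actual flashinfer signatures and in-place behavior may differ, and the gemma_* variants scale by 1 + weight):

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # y = x / sqrt(mean(x^2) + eps) * weight, reduced over the hidden dimension
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps)).to(x.dtype) * weight

def fused_add_rmsnorm_ref(x: torch.Tensor, residual: torch.Tensor,
                          weight: torch.Tensor, eps: float = 1e-6):
    # the fused variant adds the residual first, then normalizes the sum;
    # a CPU kernel would do the add, reduction, and scaling in one pass over the row
    residual = residual + x
    return rmsnorm_ref(residual, weight, eps), residual
```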

activations

  • silu_and_mul
  • gelu_and_mul
  • gelu_tanh_and_mul
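
These fused activation kernels split the gate/up projection output in half and apply the activation to the gate before the elementwise multiply, so a CPU kernel only streams the projected tensor once. A semantics-only sketch:

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # x: [..., 2 * d] with the gate in the first half and the up projection in the second
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

def gelu_tanh_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    gate, up = x.chunk(2, dim=-1)
    return F.gelu(gate, approximate="tanh") * up
```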

sampling

  • min_p_sampling_from_probs
  • top_k_renorm_prob
  • top_k_top_p_sampling_from_probs
  • top_p_renorm_prob
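
For orientation, a sort-based reference of the top-k/top-p semantics is sketched below. The real kernels avoid the full sort (e.g. via rejection sampling) and have different signatures, but the filtering math is the same:

```python
import torch

def top_p_renorm_prob_ref(probs: torch.Tensor, top_p: float) -> torch.Tensor:
    # keep the smallest prefix of sorted probs whose mass reaches top_p, then renormalize
    sorted_probs, idx = probs.sort(dim=-1, descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    mask = (cumsum - sorted_probs) < top_p       # keep tokens while mass before them < top_p
    kept = torch.zeros_like(probs).scatter_(-1, idx, sorted_probs * mask)
    return kept / kept.sum(dim=-1, keepdim=True)

def top_k_renorm_prob_ref(probs: torch.Tensor, top_k: int) -> torch.Tensor:
    vals, idx = probs.topk(top_k, dim=-1)
    kept = torch.zeros_like(probs).scatter_(-1, idx, vals)
    return kept / kept.sum(dim=-1, keepdim=True)

def top_k_top_p_sampling_from_probs_ref(probs, top_k, top_p, generator=None):
    filtered = top_p_renorm_prob_ref(top_k_renorm_prob_ref(probs, top_k), top_p)
    return torch.multinomial(filtered, num_samples=1, generator=generator).squeeze(-1)
```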

attention

  • BatchDecodeWithPagedKVCacheWrapper
  • BatchPrefillWithPagedKVCacheWrapper
  • BatchPrefillWithRaggedKVCacheWrapper

[NOTE]: DeepSeekV2 will choose the triton backend for its MLA structure and won't go through flashinfer; this is defined in python/sglang/srt/model_executor/model_runner.py.
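
As a reference point for what the decode wrapper has to compute on CPU, the sketch below spells out single-query paged-KV attention in plain PyTorch. Shapes and the per-request Python loop are illustrative only; the optimized kernel would instead tile heads and KV pages into AMX/AVX-512 friendly gemms:

```python
import torch

def paged_decode_attention_ref(q, k_cache, v_cache, page_table, kv_lens, scale):
    """Reference decode attention over a paged KV cache (one query token per request).

    q:                [batch, num_heads, head_dim]
    k_cache/v_cache:  [num_pages, page_size, num_kv_heads, head_dim]
    page_table:       [batch, max_pages_per_req] integer physical page ids
    kv_lens:          [batch] valid KV length per request
    """
    batch, num_heads, head_dim = q.shape
    page_size = k_cache.shape[1]
    num_kv_heads = k_cache.shape[2]
    out = torch.empty_like(q)
    for b in range(batch):
        n = int(kv_lens[b])
        n_pages = (n + page_size - 1) // page_size
        pages = page_table[b, :n_pages]
        # gather the pages for this request and trim to the valid KV length
        k = k_cache[pages].reshape(-1, num_kv_heads, head_dim)[:n]   # [n, kv_heads, d]
        v = v_cache[pages].reshape(-1, num_kv_heads, head_dim)[:n]
        # broadcast KV heads for GQA/MQA
        rep = num_heads // num_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        attn = torch.einsum("hd,nhd->hn", q[b].float(), k.float()) * scale
        w = attn.softmax(dim=-1)
        out[b] = torch.einsum("hn,nhd->hd", w, v.float()).to(q.dtype)
    return out
```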

2. SGLang CPU Device Enabling

benchmarking

  • Fixed-input benchmarking: use a 1K-token prompt and 128 output tokens (a measurement sketch follows this list).
  • ShareGPT benchmark: default dataset, serving mode.
  • ShareGPT benchmark: default dataset, offline mode (TBD; this should be a focus later on).
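
For the fixed-input case, the measurement itself can be a small script against the server's OpenAI-compatible streaming endpoint. A rough sketch; the URL, the "default" model name, and the one-token-per-chunk approximation are assumptions, not SGLang-specific guarantees:

```python
import time
import requests

def measure_ttft_tpot(url: str, prompt: str, max_tokens: int = 128):
    """Measure time-to-first-token and time-per-output-token against an
    OpenAI-compatible streaming /v1/completions endpoint."""
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    with requests.post(url, json={
        "model": "default", "prompt": prompt,
        "max_tokens": max_tokens, "stream": True,
    }, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                break
            if ttft is None:
                ttft = time.perf_counter() - start   # first streamed chunk
            n_tokens += 1                            # ~one token per chunk (approximation)
    total = time.perf_counter() - start
    if ttft is None:
        ttft = total
    tpot = (total - ttft) / max(n_tokens - 1, 1)
    return ttft, tpot
```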

3. DeepSeekV2 Optimization

The SGLang v0.3 release introduced several optimizations for this model.

  • MLA decoding kernel optimization with weight absorption. It is written in triton, and the performance gain comes from better blocking. For CPU, prototype it in flashinfer first and then merge with the existing APIs if possible. The optimized kernel handles MHA and MQA/GQA/MLA with different parallelization and tiling strategies to reduce memory access to the KV cache.
  • fp8 kv_cache: enable bmm_fp8 with flashinfer. The feature is optional but practically a must on CPU; without it, the path falls back to torch.bmm, which is not very performant. The scheme the GPU currently uses is actually fp8 dynamic quantization, so we may also fuse the quant_A kernel into bmm_fp8. (The CPU may not be able to do this due to the lack of a fast E4M3 conversion implementation; check later 👀. A reference sketch of this path follows the list.)
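
A rough sketch of the fp8 dynamic-quant bmm path described above. The names and the per-tensor scaling granularity are assumptions; torch.float8_e4m3fn (available in recent PyTorch) is storage-only on CPU, so the reference dequantizes to bf16 for the matmul, which is exactly the overhead a fused kernel would try to hide:

```python
import torch

def bmm_fp8_ref(a: torch.Tensor, b_fp8: torch.Tensor, b_scale: torch.Tensor) -> torch.Tensor:
    """a:       [batch, m, k] bf16/fp16 activations
    b_fp8:   [batch, k, n] weights stored as float8_e4m3fn
    b_scale: scalar (or broadcastable) dequant scale for b
    """
    # dynamic quantization of A: pick a scale so the max maps to the e4m3 max (448)
    a_scale = a.abs().amax() / 448.0
    a_fp8 = (a / a_scale).to(torch.float8_e4m3fn)
    # CPU has no native fp8 matmul, so dequantize to bf16 and use torch.bmm;
    # a fused kernel would keep the A quantization and the matmul in one pass
    out = torch.bmm(a_fp8.to(torch.bfloat16), b_fp8.to(torch.bfloat16))
    return out * (a_scale * b_scale)
```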

Additional optimizations we need:

  • MHA prefilling kernel optimization (map the triton kernel implementation? check later 👀).
  • FusedMoE kernel enabling on CPU.
  • [Nice to have]: introduce a brgemm micro-kernel or hard-code AMX for the prefilling kernels and the MLA decoding kernel.
  • [Nice to have]: block planning. The current scheme with the KV split from flash-decoding has a load-imbalance issue when serving requests with very different KV lengths (needs input from the profiler; check later 👀). See the split-KV sketch after this list.
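
To make the block-planning point concrete, the sketch below shows the flash-decoding split-KV pattern for a single head: each KV chunk yields a partial output plus its log-sum-exp, and the partials are combined at the end. With a fixed chunk size, requests with long KV get many chunks while short ones get few, which is the load imbalance a planner would need to smooth out. Single-head shapes are assumed for brevity:

```python
import torch

def split_kv_decode_ref(q, k, v, num_splits: int, scale: float):
    """q: [d], k/v: [n, d].  KV is split into `num_splits` chunks; each chunk
    produces a partial output plus its log-sum-exp, combined at the end."""
    n, d = k.shape
    chunk = (n + num_splits - 1) // num_splits
    partial_out, partial_lse = [], []
    for s in range(0, n, chunk):
        ks, vs = k[s:s + chunk].float(), v[s:s + chunk].float()
        logits = (ks @ q.float()) * scale                  # [chunk]
        lse = torch.logsumexp(logits, dim=0)
        partial_out.append((logits - lse).exp() @ vs)      # [d] chunk-local softmax @ V
        partial_lse.append(lse)
    lse = torch.stack(partial_lse)                         # [num_chunks]
    weights = (lse - torch.logsumexp(lse, dim=0)).exp()    # renormalize across chunks
    return (torch.stack(partial_out) * weights[:, None]).sum(dim=0)
```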

4. ⭐ First Token latency reduction

To make Xeon actually useful, reduce the first-token latency as much as possible for long prompt lengths:

  • ⭐⭐ GEMM efficiency: weight prepacking (torch.compile? dynamic packing?).
  • ⭐⭐ Tensor parallel: run large GEMMs with tensor parallelism (TP) across SNC=3 domains (a sketch follows this list).
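
The TP idea for SNC=3 is essentially a column-parallel GEMM across sub-NUMA domains: each domain owns one column shard of the weight so it streams only local memory, and the shard outputs are concatenated. A toy sketch using Python threads (torch.mm releases the GIL); a real implementation would additionally pin each worker and its shard to one SNC domain and give each its own OpenMP thread pool, which this sketch does not do:

```python
import torch
from typing import List
from concurrent.futures import ThreadPoolExecutor

def column_parallel_mm(x: torch.Tensor, weight_shards: List[torch.Tensor]) -> torch.Tensor:
    # x: [m, k]; weight_shards: column shards of a [k, n] weight, one per SNC domain
    with ThreadPoolExecutor(max_workers=len(weight_shards)) as pool:
        outs = list(pool.map(lambda w: torch.mm(x, w), weight_shards))
    return torch.cat(outs, dim=-1)

# illustrative shapes: three shards for SNC=3
w = torch.randn(4096, 12288, dtype=torch.bfloat16)
x = torch.randn(1024, 4096, dtype=torch.bfloat16)
y = column_parallel_mm(x, list(w.chunk(3, dim=-1)))
```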

5. Upstreaming

  • Add a context for CPU to manage dynamic memory allocation and cache intermediate results if necessary (in flashinfer).
  • Remove the dependency on the at::vec::Vectorized<> wrapper. The decision needs to be made after screening the types of operations.
  • Add fallbacks for ISAs other than AVX-512 and AMX, so other vendors can run it.
  • Introduce oneDNN brgemm micro-kernels in the attention calculation.

6. TODO

  • Extend the CPU-backend optimizations from flashinfer to other LLM engines.
  • Extend flashinfer CPU from AOT to JIT mode.
  • Wrap up distributed GEMM with an OSS-acceptable approach.
  • Extend quantization support.
mingfeima (Owner, Author) commented Feb 21, 2025

  • DeepSeek V3 - fp8 block gemm

mingfeima (Owner, Author) commented Mar 24, 2025

Serving mode tuning

  • serving threading runtime debug
  • serving benchmark configs
  • compute_position_torch, clamp_position C++ implementation

TODO: (in priority order)

  • Fuse rope and bmm in the absorb path: rotary_embedding takes 0.284 ms × 60 = 17 ms; bmm takes 9 ms.
  • The shared MoE, with mul and add, takes 0.2 ms × 60 = 12 ms.
  • Enable FlashMLA; the current decode_attn takes 0.13 ms × 60 = 8 ms.
  • Fuse bmm and int8_scaled_mm after GQA in decode: 3.6 ms.
  • at::zeros takes the majority of the time in fused_add_rms_norm when allocating and zeroing the temp buffer: each run takes 0.77 ms, 3.2 ms in total.
  • Fuse set_kv_buffer with decode attn; the index_put is super slow (~3 ms). Move attn_logits into C++ and tune the config for [BLOCK_H, num_kv_splits].
  • Fuse per_row_quant_int8 with int8_scaled_mm (a reference of the two steps follows this list).
  • Replace torch.empty at the Python level to remove Python overhead; call it from C++.
  • Implement all-gather: the whole module takes about 2.2 ms, consisting of narrow and gather, while all-reduce takes 4.8 ms for 120 runs.
  • Add native kernels for compute_position_torch; check the args to see if we need to vectorize this.
  • Remove the hollow Python wrappers when calling C++ kernels.
  • weight_packed_linear for small OC: OC=256 takes 2.6 ms in total and only achieves a memory bandwidth of 486 GB/s.
  • All-reduce performance spikes every few dozen runs; the normal value is 20 us, but the spikes can reach 100 us or even over 200 us.
  • Refactor the shared-memory all-reduce code in the ATen coding style.
  • Remove the IPEX code snippets from "adapt vllm distributed module to sglang" sgl-project/sglang#2244.
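
For the per_row_quant_int8 + int8_scaled_mm fusion item above, here is a reference of what the two separate steps compute today (the fusion would fold the row-wise activation quantization into the GEMM prologue). Kernel names are taken from the list; shapes and scale granularity are assumptions:

```python
import torch

def per_row_quant_int8_ref(a: torch.Tensor):
    # symmetric per-row dynamic quantization of the activation
    scale = a.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    a_q = torch.round(a / scale).clamp(-128, 127).to(torch.int8)
    return a_q, scale.float()                      # a_q: [m, k] int8, scale: [m, 1]

def int8_scaled_mm_ref(a_q, a_scale, w_q, w_scale, bias=None, out_dtype=torch.bfloat16):
    # w_q: [k, n] int8, w_scale: [n] per-output-channel scale.
    # A real kernel accumulates in int32 (VNNI / AMX-int8); float64 here just keeps
    # the reference numerically exact.
    acc = torch.mm(a_q.double(), w_q.double())
    out = acc * a_scale.double() * w_scale.double()
    if bias is not None:
        out = out + bias.double()
    return out.to(out_dtype)
```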

gau-nernst commented:

Hi @mingfeima. At my company, I'm also working on optimizing LLM inference for CPU servers. Can I get involved with your team so that we can join efforts together? Recently I wrote an MLA decode kernel for CPU in vLLM vllm-project/vllm#14744. I was not aware of your existing efforts. It will be interesting to benchmark against your triton version. I'm also new to CPU optimization, hope to learn from everyone.

mingfeima (Owner, Author) commented:

Great job! Sure, we are open to more CPU contributors :)

For the MLA decoding part, my optimizations are to a) fold H to change the gemv into a gemm, b) apply avx512-bf16 and amx-bf16, and c) use the flash-mla algorithm. I have done a) and b), but c) is still on my TODO list. Right now, for the MLA decoding part you referred to, each run on DeepSeek R1 takes ~4 ms on our machine. A short sketch of point a) follows.
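
A minimal sketch of what a) means for the absorbed MLA decode path (shapes are illustrative): with weight absorption, the per-token queries of all heads share one compressed KV cache that has no head dimension, so stacking the heads turns num_heads separate gemvs against the cache into a single gemm that AVX512-BF16/AMX can tile:

```python
import torch

def mla_decode_scores_folded(q_nope: torch.Tensor, kv_latent: torch.Tensor) -> torch.Tensor:
    # q_nope:    [num_heads, kv_lora_rank]  absorbed queries for one decode token
    # kv_latent: [kv_len, kv_lora_rank]     compressed KV cache shared by all heads
    #
    # per-head gemv: for h in range(num_heads): kv_latent @ q_nope[h]   -> H tiny gemvs
    # folded gemm:   one [num_heads, rank] x [rank, kv_len] matmul      -> one blocked gemm
    return q_nope @ kv_latent.T          # [num_heads, kv_len] attention scores
```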

We also optimized IPEX; this will help improve vLLM once IPEX is used as the attention backend. Right now the overall performance is roughly a TPOT of 60 ms for 1K input and 1K output with DeepSeek R1 671B in int8 on a single-node Xeon CPU. The optimization work is still ongoing!

gau-nernst commented:

Is there a channel where we can discuss things in more detail? We could create one under the SGLang Slack if there isn't one already.

mingfeima (Owner, Author) commented:

We do have a Slack channel for the Intel-SGLang collaboration, but I am not sure whether it is appropriate to invite you there, since we may share some non-public information in it.

Could you please send an email to [email protected] and outline your proposals? We can then discuss how to combine our efforts.
