[Roadmap] CPU Performance Optimization for SGLang and Flashinfer 24'Q4 #1
Comments
Serving mode tuning
TODO: (in priority order)
Hi @mingfeima. At my company, I'm also working on optimizing LLM inference for CPU servers. Can I get involved with your team so we can join efforts? Recently I wrote an MLA decode kernel for CPU in vLLM (vllm-project/vllm#14744); I was not aware of your existing efforts, and it will be interesting to benchmark against your triton version. I'm also new to CPU optimization and hope to learn from everyone.
Great job! Sure, we are open to more CPU contributors :) For the MLA decoding part, my optimizations are to a) fold H to change GEMV into GEMM; b) apply AVX512-BF16 and AMX-BF16; c) use the FlashMLA algorithm. I have done a) and b), but c) is still on my TODO list. Right now, for the MLA decoding kernel you referred to, each run on DeepSeek R1 takes ~4 ms on our machine. We also optimized IPEX, which will help improve vLLM once IPEX is used as the attention backend. The overall performance is currently roughly 60 ms TPOT for 1k input and 1k output with DeepSeek R1 671B in int8 dtype on a single-node Xeon CPU. The optimization work is still ongoing!
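As a rough illustration of item a) (folding the head dimension so the per-head GEMVs become one GEMM during decode), here is a small PyTorch sketch; the shapes are illustrative and the latent-cache layout is an assumption, not the actual kernel.

```python
import torch

# Illustrative shapes: H heads, latent dim d, cached sequence length S (decode step, query length 1).
H, d, S = 16, 512, 1024
q = torch.randn(H, 1, d, dtype=torch.float32)        # one decode query per head
kv_latent = torch.randn(S, d, dtype=torch.float32)   # latent KV cache, shared by all heads in MLA

# Naive: H separate GEMVs (one matrix-vector product per head).
scores_gemv = torch.stack([q[h, 0] @ kv_latent.T for h in range(H)])

# Folded: treat the head dim as the M dim of a single [H, d] x [d, S] GEMM.
scores_gemm = q.squeeze(1) @ kv_latent.T              # shape [H, S]

print((scores_gemv - scores_gemm).abs().max())        # ~0 up to accumulation-order differences
```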
Is there a channel where we can discuss things in more detail? We can create one under the SGLang Slack if there isn't one already.
We do have a Slack channel for the Intel-SGLang collaboration, but I am not sure whether it is appropriate to invite you there, since we may share some non-public information. Could you please send an email to [email protected] describing your proposals? We can then discuss how to combine efforts.
🚀 The feature, motivation and pitch
The target of this project is to optimize the performance of SGLang on Intel Xeon Scalable Processors. The feature targets SPR (4th gen), EMR (5th gen), and GNR (6th gen) with Intel® Advanced Matrix Extensions (AMX) support.
- Add a `flashinfer` backend on the CPU device (SGLang currently supports `triton` and `flashinfer` backends).
- Optimize with `avx512` and `amx`, and provide fallbacks for other ISAs (a minimal runtime-dispatch sketch follows this list).
- For the current stage, focus on customer requests first and then gradually increase model coverage.
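Below is a minimal sketch of the kind of ISA-based dispatch described above, reading feature flags from `/proc/cpuinfo` on Linux; the backend names (`cpu_amx`, `cpu_avx512`, `cpu_generic`) are hypothetical placeholders, not existing SGLang options.

```python
import platform

def cpu_isa_flags() -> set[str]:
    # Linux-only sketch: read ISA feature flags from /proc/cpuinfo.
    if platform.system() != "Linux":
        return set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def pick_cpu_backend() -> str:
    # Hypothetical dispatch: prefer AMX-BF16, then AVX512-BF16, else a generic fallback.
    flags = cpu_isa_flags()
    if "amx_bf16" in flags:
        return "cpu_amx"
    if "avx512_bf16" in flags:
        return "cpu_avx512"
    return "cpu_generic"

if __name__ == "__main__":
    print(pick_cpu_backend())
```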
1. Flashinfer Kernel Optimizations on CPU
Kernels to cover (a PyTorch reference for two of them is sketched after the note below):
- `layernorm`
- `activations`
- `sampling`
- `attention`
[NOTES]: DeepSeekV2 will choose the `triton` backend because of its MLA structure, so it won't go to `flashinfer`; this is defined in `python/sglang/srt/model_executor/model_runner.py`.
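As a correctness baseline for the `layernorm` and `activations` families above, here is a minimal PyTorch reference assuming the usual RMSNorm and SiLU-and-mul semantics; exact kernel names and signatures in flashinfer/SGLang may differ, so treat this as a sketch rather than the actual API.

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Root-mean-square norm over the hidden dim, then scale by the learned weight.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # Fused SwiGLU-style activation: split the last dim in half, SiLU(gate) * up.
    gate, up = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * up

# Compare an optimized CPU kernel against these references, e.g.:
x = torch.randn(4, 4096, dtype=torch.bfloat16)
w = torch.ones(4096, dtype=torch.bfloat16)
print(rmsnorm_ref(x, w).shape, silu_and_mul_ref(torch.randn(4, 8192, dtype=torch.bfloat16)).shape)
```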
2. SGLang CPU device enabling
- Enable the CPU device in `python/sglang/srt/server_args.py` #3
- `FusedMoE`. The job is to: a) profile hotspots; b) get the shapes for each of the kernels (see the profiling sketch after this list).
- `benchmarking`
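A minimal hotspot/shape-collection sketch with `torch.profiler`; the toy `model` below stands in for an SGLang forward pass and is purely illustrative.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Stand-in for a model forward pass (hypothetical shapes, not the real SGLang runner).
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 14336),
    torch.nn.SiLU(),
    torch.nn.Linear(14336, 4096),
)
inputs = torch.randn(8, 4096)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Hotspots sorted by CPU time; record_shapes exposes the GEMM shapes per operator.
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=20))
```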
3. DeepSeekV2 Optimization
The SGLang v0.3 release includes several optimizations for this model.
- `bmm_fp8` with flashinfer. The feature is optional but pretty much a must on CPU: without it, execution falls back to `torch.bmm`, which is not that performant. The scheme the GPU path currently uses is actually `fp8 dynamic quant`; we may also fuse the `quant_A` kernel into `bmm_fp8`. (Feels like the CPU can't do this yet due to the lack of a fast E4M3 conversion implementation, check later 👀.) A reference for the quantization semantics is sketched after this list.

Additional optimizations we need:
- `brgemm` micro kernel or hard-coded AMX for the prefilling kernels and the MLA decoding kernel.
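A minimal sketch of what per-token `fp8 dynamic quant` of A plus `bmm_fp8` computes, as a numerical reference for a fused CPU kernel. It assumes a recent PyTorch with `torch.float8_e4m3fn` available; since fp8 matmul is not generally supported on CPU, the sketch upcasts for the actual `bmm`.

```python
import torch

def dynamic_quant_fp8(a: torch.Tensor):
    # Per-row (per-token) dynamic quantization to E4M3: scale so the row max maps to the fp8 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = a.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    return (a / scale).to(torch.float8_e4m3fn), scale

def bmm_fp8_ref(a: torch.Tensor, b_fp8: torch.Tensor, b_scale: torch.Tensor) -> torch.Tensor:
    # Reference semantics of a fused quant_A + bmm_fp8: quantize A on the fly, multiply, rescale.
    a_fp8, a_scale = dynamic_quant_fp8(a)
    out = torch.bmm(a_fp8.to(torch.float32), b_fp8.to(torch.float32))
    return (out * a_scale * b_scale).to(torch.bfloat16)

a = torch.randn(2, 8, 64, dtype=torch.bfloat16)    # activations, quantized dynamically
b = torch.randn(2, 64, 32, dtype=torch.bfloat16)   # weights, assumed quantized offline here
b_fp8, b_scale = dynamic_quant_fp8(b.transpose(-1, -2))
out = bmm_fp8_ref(a, b_fp8.transpose(-1, -2), b_scale.transpose(-1, -2))
print(out.shape)  # torch.Size([2, 8, 32])
```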
4. ⭐ First Token latency reduction
To make Xeon actually useful, reduce the first token latency as much as possible for long prompt lengths; a simple way to measure TTFT against a running server is sketched below.
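A minimal TTFT measurement sketch against SGLang's OpenAI-compatible endpoint; the server command, port, and model name below are assumptions and depend on the local setup.

```python
import time
from openai import OpenAI

# Assumes a server is already running, e.g.:
#   python -m sglang.launch_server --model-path <model> --port 30000
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

long_prompt = "word " * 8000  # stand-in for a long prompt

start = time.perf_counter()
stream = client.chat.completions.create(
    model="default",  # model name depends on the server configuration
    messages=[{"role": "user", "content": long_prompt}],
    stream=True,
    max_tokens=32,
)
for chunk in stream:
    # TTFT = time until the first streamed chunk with content arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f} s")
        break
```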
5. Upstreaming
- Whether to wrap kernels with the `at::vec::Vectorized<>` wrapper. The decision needs to be made after screening the types of operations.
- Provide fallbacks for `avx512` and `amx`, so that other vendors can run.
6. TODO