[Kernel][CPU] CPU MLA #14744

Merged: 48 commits into vllm-project:main on Mar 25, 2025

Conversation

@gau-nernst (Contributor) commented on Mar 13, 2025:

In this PR, I add preliminary support for MLA on CPU. I'm opening it now to get feedback and comments from the maintainers on the high-level design. The MLA kernel itself is not optimized yet; I plan to optimize it further (either in this PR or in a follow-up PR).

The main changes can be summarized as follows:

  • Add a concat_and_cache_mla CPU kernel (a pure-Python sketch of its intended semantics is shown after this list).
  • Add the mla_decode_kvcache_cpu kernel. It currently does not follow any existing API (see the code for details) and only supports decoding a single query token.
  • Add CPUMLABackend: this largely follows TorchSDPABackend for the metadata handling and reuses MLACommonBackend for the MLA-related logic. IPEX's varlen_attention is used for prefill, and the new custom kernel is used for decode.
  • Fix various import-related logic so that importing MLACommon works on a CPU build, plus other minor fixes.
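
For reference, the intended semantics of concat_and_cache_mla can be sketched in pure Python roughly as follows (my own illustration; the argument names and the exact op signature may differ from the actual kernel). Each token's compressed KV latent and its RoPE'd key are concatenated along the last dimension and written into the paged cache at the slot given by slot_mapping:

import torch

def concat_and_cache_mla_ref(
    kv_c: torch.Tensor,         # (num_tokens, kv_lora_rank), e.g. 512 for DeepSeek-V2
    k_pe: torch.Tensor,         # (num_tokens, rope_dim), e.g. 64
    kv_cache: torch.Tensor,     # (num_blocks, block_size, kv_lora_rank + rope_dim)
    slot_mapping: torch.Tensor, # (num_tokens,), flat slot index of each token
) -> None:
    block_size = kv_cache.shape[1]
    kv_lora_rank = kv_c.shape[1]
    for i, slot in enumerate(slot_mapping.tolist()):
        block_idx, block_off = divmod(slot, block_size)
        kv_cache[block_idx, block_off, :kv_lora_rank] = kv_c[i]
        kv_cache[block_idx, block_off, kv_lora_rank:] = k_pe[i]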

Other areas that I'm also looking into (but not yet implemented):

  • Chunked prefill: this requires merge_attn_states, which could be implemented as a CPU kernel, or perhaps torch.compile() could codegen it? (A sketch of the underlying merge is shown below.)
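
For context, merging partial attention states reduces to a log-sum-exp-weighted combination of the partial outputs; here is a minimal Python sketch of the math (my own illustration, not vLLM's merge_attn_states API):

import torch

def merge_attn_states_ref(out1, lse1, out2, lse2):
    # out1/out2: (num_heads, v_head_dim) partial attention outputs
    # lse1/lse2: (num_heads,) log-sum-exp of the attention logits of each partial
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)  # rescale both partials to a common maximum
    w2 = torch.exp(lse2 - max_lse)
    out = (out1 * w1[:, None] + out2 * w2[:, None]) / (w1 + w2)[:, None]
    lse = max_lse + torch.log(w1 + w2)
    return out, lse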

I have tested this code with deepseek-ai/DeepSeek-V2-Lite-Chat and the outputs look coherent, even though they differ from the outputs produced with MLA disabled (VLLM_MLA_DISABLE=1).

@bigPYJ1151 @Isotr0py I hope to hear your feedback 🙏

Update 1: I have added some optimizations for the kernel. I'm no expert in optimizing CPU code, so any feedback and advice is welcome. An outline of my approach:

  • Multi-threading is parallelized only across the context dimension. The main reason is that decode attention is only slow (relative to the MLP and linear projections) when the context length is long; when the context length is short, most of the runtime is spent in the MLP and linear projections, so parallelizing across batch and query heads is less important, which simplifies the implementation a bit.
  • Convert the KV cache from BF16/FP16 to FP32 once ahead of time, since BF16/FP16->FP32 conversion is slow on CPU. The converted data can be reused across query heads and between K and V (since V overlaps with K in MLA). Some special care is taken for AVX512 with the BF16 dot product. Thanks to the chunking along the context dimension, the KV cache is converted to FP32 one block at a time, so it does not consume too much memory (and the FP32 KV block hopefully stays in the CPU cache). A simplified Python sketch of this scheme follows.
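
Roughly, the decode loop per (sequence, query head) is then a chunked online softmax over context blocks, converting one KV block to FP32 at a time. Below is a simplified single-head Python reference of this scheme (my own sketch under these assumptions, not the actual C++ kernel, which additionally parallelizes the loop across the context dimension):

import math
import torch

def mla_decode_one_head_ref(q, kv_cache, block_table, seq_len, scale, v_head_dim):
    # q: (head_dim,) one query head; kv_cache: (num_blocks, block_size, head_dim) in BF16/FP16
    block_size = kv_cache.shape[1]
    acc = torch.zeros(v_head_dim)             # FP32 output accumulator
    run_max, run_sum = float("-inf"), 0.0     # running softmax statistics
    for start in range(0, seq_len, block_size):
        n = min(block_size, seq_len - start)
        block_id = int(block_table[start // block_size])
        kv = kv_cache[block_id, :n].float()   # convert one KV block to FP32, reused as both K and V
        logits = (kv @ q.float()) * scale     # (n,) attention logits for this block
        new_max = max(run_max, logits.max().item())
        corr = math.exp(run_max - new_max) if run_max != float("-inf") else 0.0
        p = torch.exp(logits - new_max)       # (n,) unnormalized softmax weights
        acc = acc * corr + p @ kv[:, :v_head_dim]   # V is the first v_head_dim dims of the cache entry
        run_sum = run_sum * corr + p.sum().item()
        run_max = new_max
    return acc / run_sum                      # (v_head_dim,)
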
Benchmark script
# modified from https://github.com/vllm-project/vllm/blob/main/tests/kernels/test_flashmla.py
import random
import os
import argparse

tcmalloc_path = "/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
os.environ["LD_PRELOAD"] = f"{tcmalloc_path}:{os.environ.get('LD_PRELOAD', '')}"

import torch
from torch.utils.benchmark import Timer
from dataclasses import dataclass
import torch.nn.functional as F
from torch import Tensor

import vllm._custom_ops as ops


def cal_diff(x: torch.Tensor, y: torch.Tensor, name: str) -> None:
    # cosine-style relative difference: 1 - 2<x, y> / (||x||^2 + ||y||^2)
    x, y = x.double(), y.double()
    cos_diff = 1 - 2 * (x * y).sum().item() / max((x * x + y * y).sum().item(), 1e-12)
    assert cos_diff < 1e-5


def cdiv(a, b):
    return (a + b - 1) // b


def ref_mla(
    out: Tensor,  # (bs, num_heads, v_head_dim)
    query: Tensor,  # (bs, num_heads, head_dim)
    kv_cache: Tensor,  # (num_blocks, block_size, head_dim)
    scale: float,
    block_tables: Tensor,  # (bs, max_num_blocks)
    seq_lens: Tensor,  # (bs,)
):
    bs, num_heads, v_head_dim = out.shape
    head_dim = query.shape[2]

    for i in range(bs):
        # gather and flatten KV-cache
        kv = kv_cache[block_tables[i]]  # (max_num_blocks, block_size, head_dim)
        kv = kv.view(1, -1, head_dim)[:, : seq_lens[i]]  # (1, seq_len, head_dim)
        v = kv[:, :, :v_head_dim]

        q = query[i].view(num_heads, 1, head_dim)
        o = F.scaled_dot_product_attention(q, kv, v, scale=scale, enable_gqa=True)
        out[i] = o.view(num_heads, v_head_dim)

    return out


@dataclass
class ProblemShape:
    bs: int = 1
    seq_len: int = 256
    num_heads: int = 16
    head_dim: int = 576
    v_head_dim: int = 512
    block_size: int = 16


def test_cpu_mla(args: ProblemShape, perf: bool = False):
    dtype = torch.bfloat16
    torch.set_default_dtype(dtype)
    torch.manual_seed(0)
    random.seed(0)

    bs = args.bs
    head_dim = args.head_dim
    v_head_dim = args.v_head_dim
    scale = head_dim ** (-0.5)

    print(args)
    seq_lens = torch.full((bs,), args.seq_len, dtype=torch.int32)
    seqlen_pad = cdiv(args.seq_len, 256) * 256

    q = torch.randn(bs, args.num_heads, head_dim)
    block_table = torch.arange(bs * seqlen_pad // args.block_size, dtype=torch.int32)
    block_table = block_table.view(bs, seqlen_pad // args.block_size)

    kv_cache = torch.randn(block_table.numel(), args.block_size, head_dim)
    for i in range(bs):
        kv_cache.view(bs, seqlen_pad, head_dim)[i, args.seq_len :] = float("nan")

    out_mla = q.new_zeros(bs, args.num_heads, v_head_dim)
    ops.mla_decode_kvcache_cpu(out_mla, q, kv_cache, scale, block_table, seq_lens)

    if perf:
        return

    out_ref = q.new_zeros(bs, args.num_heads, v_head_dim)
    ref_mla(out_ref, q, kv_cache, scale, block_table, seq_lens)

    torch.testing.assert_close(out_mla, out_ref)
    cal_diff(out_mla, out_ref, "out")

    num_elems = (
        bs * args.seq_len * head_dim  # kv cache
        + bs * args.num_heads * head_dim  # query
        + bs * args.num_heads * v_head_dim  # output
    )
    num_gb = num_elems * dtype.itemsize / 1e9
    print(
        f"Input size: {num_gb * 1e3 : .2f} MB. Make sure this is larger than 2x L3 cache size for accurate benchmark."
    )

    a = torch.randn(num_elems // 2)
    b = torch.randn(num_elems // 2)
    t = (
        Timer(
            "a.copy_(b)",
            globals={**globals(), **locals()},
            num_threads=torch.get_num_threads(),
        )
        .blocked_autorange(min_run_time=1)
        .median
    )
    print(f"Copy: {num_gb / t:.4f} GB/s")

    t = (
        Timer(
            "ops.mla_decode_kvcache_cpu(out_mla, q, kv_cache, scale, block_table, seq_lens)",
            globals={**globals(), **locals()},
            num_threads=torch.get_num_threads(),
        )
        .blocked_autorange(min_run_time=1)
        .median
    )
    print(f"CPU MLA: {num_gb / t:.4f} GB/s")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--perf", action="store_true")
    args = parser.parse_args()

    print(f"No. of threads: {torch.get_num_threads()}")
    test_cpu_mla(ProblemShape(bs=1, seq_len=64_000), perf=args.perf)
    if args.perf:
        return
    test_cpu_mla(ProblemShape(bs=1, seq_len=54_321), perf=args.perf)
    test_cpu_mla(ProblemShape(bs=1, seq_len=1243, num_heads=5), perf=args.perf)
    test_cpu_mla(ProblemShape(bs=3, seq_len=1234), perf=args.perf)
    test_cpu_mla(ProblemShape(bs=30, seq_len=2048), perf=args.perf)


if __name__ == "__main__":
    main()


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Mar 15, 2025

mergify bot commented Mar 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @gau-nernst.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 21, 2025
@mergify mergify bot removed the needs-rebase label Mar 21, 2025
Comment on lines +41 to +42
pytest -v -s tests/kernels/test_cache.py -m cpu_model
pytest -v -s tests/kernels/test_mla_decode_cpu.py -m cpu_model
gau-nernst (Contributor, Author):

pytest -v -s tests/kernels -m cpu_model doesn't work due to Triton imports (there are probably other issues as well). We can have a separate PR to make pytest -v -s tests/kernels -m cpu_model work (as well as improve test coverage for CPU kernels)

@pytest.mark.cpu_model
@pytest.mark.skipif(not current_platform.is_cpu(), reason="CPU only")
@torch.inference_mode()
def test_concat_and_cache_mla_cpu(
gau-nernst (Contributor, Author):

In the future, we can merge this with the CUDA test of the same op (i.e., select the correct device at runtime).

@LucasWilkinson (Collaborator) left a comment:

Apologies for the delay! Overall I think this is quite close to mergeable; I've just left a few comments.


# for chunked-prefill
if self.chunked_prefill:
prefill_block_tables = make_tensor_with_pad(
LucasWilkinson (Collaborator):

Should we assert here, since chunked_prefill is not supported?

gau-nernst (Contributor, Author):

I have an assert in __init__(). That should be sufficient?

torch.cumsum(kv_lens_tensor,
dim=0,
dtype=torch.int32,
out=kv_start_loc[1:])
LucasWilkinson (Collaborator):

I think all of the above is fine for now, but we should see what we need to do to reuse more from the common builder, since we may be refactoring that in the future and this may cause issues.

gau-nernst (Contributor, Author):

I have made some changes. Let me know if they address your concerns. Thank you; I hope to get this PR merged soon.

@LucasWilkinson (Collaborator) left a comment:

LGTM, thanks for the contribution!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) March 25, 2025 01:25
@github-actions github-actions bot added the ready label Mar 25, 2025
@LucasWilkinson LucasWilkinson merged commit 4f044b1 into vllm-project:main Mar 25, 2025
59 checks passed
@gau-nernst gau-nernst deleted the cpu_mla branch March 25, 2025 11:19