change kv cache layout #5
Conversation
Force-pushed from a3b2270 to 15d2772
vllm/worker/tpu_worker.py (outdated)

```diff
@@ -130,7 +130,7 @@ def determine_num_available_blocks(self) -> Tuple[int, int]:
         # Calculate the TPU KV cache size based on profiling.
         usable_memory_size = int(total_memory_size *
                                  self.cache_config.gpu_memory_utilization)
-        tpu_kv_cache_bytes = max(usable_memory_size - profiled, 0)
+        tpu_kv_cache_bytes = max(usable_memory_size - profiled, 0) // 4
```
Why the `// 4`?
NVM, I will remove it. This was only used to understand the cost of the torch.permute on the pre-allocated KV cache buffer. Anyway, once the new Pallas attention kernel is finished, the torch.permute will disappear.
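For context, a minimal sketch (with hypothetical shapes) of why that permute is worth profiling: permute itself is a zero-copy view, but materializing the permuted layout allocates a second buffer of the same size.

```python
import torch

# Hypothetical KV cache shape: (num_blocks, block_size, num_heads, head_size).
kv_cache = torch.zeros(1024, 16, 8, 128)

# permute() returns a zero-copy view with strided access...
permuted = kv_cache.permute(0, 2, 1, 3)

# ...but materializing it allocates a second buffer of equal size,
# transiently doubling the memory footprint. Shrinking the cache
# (the "// 4" above) leaves headroom to profile this copy in isolation.
materialized = permuted.contiguous()
```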
vllm/worker/tpu_model_runner.py (outdated)

```diff
@@ -172,14 +173,14 @@ def _dummy_run(
     ) -> None:
         exec_mode = ExecutionMode(exec_mode)
         if exec_mode.is_prefill():
-            seq_len = (seq_len + 15) // 16 * 16
+            seq_len = (seq_len + 15) // MIN_PREFILL_SEQ_LEN * MIN_PREFILL_SEQ_LEN
```
Replace 15 with `MIN_PREFILL_SEQ_LEN - 1`?
good catch!
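For reference, a minimal sketch of the round-up idiom under discussion; with `MIN_PREFILL_SEQ_LEN - 1` in place of the literal 15, the expression stays correct for any value of the constant:

```python
MIN_PREFILL_SEQ_LEN = 16

def round_up(seq_len: int, multiple: int = MIN_PREFILL_SEQ_LEN) -> int:
    # Round seq_len up to the nearest multiple of `multiple`.
    return (seq_len + multiple - 1) // multiple * multiple

assert round_up(1) == 16
assert round_up(16) == 16
assert round_up(17) == 32
```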
vllm/attention/backends/pallas.py (outdated)

```python
key_cache = key_cache.flatten(0, 1)
value_cache = value_cache.flatten(0, 1)
slot_mapping = slot_mapping.flatten()
key_cache.index_copy_(0, slot_mapping, key)
```
IIUC, you are copying block by block (of size hn x hs), hence the speedup?
It's now block_size x hn x hs per copy.
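A minimal sketch of the idea, with hypothetical shapes (the real layout lives in pallas.py): flattening the leading dims lets a single index_copy_ move an entire block of block_size x hn x hs elements per index, rather than one hn x hs slice per token.

```python
import torch

# Hypothetical sizes for illustration only.
num_blocks, block_size, hn, hs = 32, 16, 8, 128
num_new_blocks = 4

# Cache laid out so dim 0 indexes whole blocks of shape (block_size, hn, hs).
key_cache = torch.zeros(num_blocks, block_size, hn, hs)

# New keys for four blocks, plus the destination block index of each.
key = torch.randn(num_new_blocks, block_size, hn, hs)
slot_mapping = torch.tensor([3, 7, 11, 20])

# One call copies block_size * hn * hs elements per index.
key_cache.index_copy_(0, slot_mapping, key)
```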
vllm/attention/backends/pallas.py (outdated)

```python
value = value.flatten(0, 1)
key_cache = key_cache.flatten(0, 1)
value_cache = value_cache.flatten(0, 1)
slot_mapping = slot_mapping.flatten()
```
I wonder why you need to flatten the slot_mapping here.
It's a 2D tensor before that.
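A small illustration (hypothetical values) of why the flatten is needed: torch.Tensor.index_copy_ expects a 1-D index tensor, so a per-sequence slot_mapping of shape (num_seqs, seq_len) must be flattened before use.

```python
import torch

cache = torch.zeros(8, 3)
src = torch.randn(4, 3)

# 2-D slot mapping, e.g. (num_seqs=2, seq_len=2); values are hypothetical.
slot_mapping = torch.tensor([[0, 1], [4, 5]])

# index_copy_ requires a 1-D index, hence the flatten().
cache.index_copy_(0, slot_mapping.flatten(), src)
```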
Force-pushed from ec1fb6d to 8754c7e
vllm/attention/backends/pallas.py (outdated)

```diff
@@ -11,6 +11,8 @@
                               AttentionMetadata, AttentionType)
 from vllm.attention.backends.utils import CommonAttentionState
 
+MIN_PREFILL_SEQ_LEN = 16
```
Can we clarify which TPU generations work optimally with this seq-len? Does it make sense to move it into a config file?
Wondering if v5p or older generations would require a different threshold; same question for future TPU generations. :)
Basically, the larger the better. It has two limits:
- the new size cannot be smaller than the actual size
- the new size cannot be larger than the page size

I'm actually thinking about directly setting MIN_PREFILL_SEQ_LEN to the page size for simplicity; then we can control the size via the page size (sketched below).
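A hedged sketch of that simplification (names assumed, not the final design): derive the prefill padding granularity from the cache config's block (page) size instead of a standalone constant, so one knob controls both limits.

```python
# Hypothetical sketch: tie the prefill padding multiple to the page size.
def prefill_pad_multiple(cache_config) -> int:
    # cache_config.block_size is the KV cache page size in tokens.
    return cache_config.block_size

def pad_seq_len(seq_len: int, multiple: int) -> int:
    return (seq_len + multiple - 1) // multiple * multiple
```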
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Force-pushed from e21696a to eb584a1
Force-pushed from d0e50eb to 1f7c757
Force-pushed from 1f7c757 to fff63b2