Add paged attention support #1355

cyanguwa · 2024-12-04T05:03:04Z

Description

This PR adds paged attention support for FusedAttention, FlashAttention, and UnfusedDotProductAttention.

The KV cache is maintained in 'bshd' format and supports FP32, FP16, and BF16; FP8 is not supported yet
Context parallelism is not supported with KV cache
FusedAttention and UnfusedDotProductAttention support page_size >= 16, and FlashAttention supports page_size >= 256
All backends support both pure generation and mixed generation/context in the same batch
FlashAttention supports paged attention through flash_attn_varlen_func rather than flash_attn_with_kvcache, due to some numerical issues
UnfusedDotProductAttention supports paged attention by converting paged cache tensors to non-paged tensors before computing attention
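
To illustrate the last point, here is a minimal sketch of turning a paged cache back into a contiguous, non-paged one. It is not the code from this PR: the function name `gather_paged_cache`, the `page_table` layout, and the use of plain Python lists in place of tensors are all assumptions made for illustration, and the toy page size of 4 is far below the minimums listed above.

```python
def gather_paged_cache(pages, page_table, seq_lens, page_size):
    """Rebuild contiguous per-sequence caches from a paged cache.

    pages:      list of pages; each page is a list of `page_size` token slots
    page_table: page_table[b] lists the page indices owned by sequence b, in order
    seq_lens:   number of valid tokens cached for each sequence
    (All names and layouts here are hypothetical, not the PR's API.)
    """
    contiguous = []
    for b, seq_len in enumerate(seq_lens):
        tokens = []
        for pos in range(seq_len):
            page_idx = page_table[b][pos // page_size]  # page holding this token
            slot = pos % page_size                      # offset inside that page
            tokens.append(pages[page_idx][slot])
        contiguous.append(tokens)
    return contiguous


# Toy example: page_size is 4 only to keep the data readable.
page_size = 4
pages = [
    ["a4", "a5", None, None],  # page 0: second page of sequence A
    ["b0", "b1", "b2", None],  # page 1: only page of sequence B
    ["a0", "a1", "a2", "a3"],  # page 2: first page of sequence A
]
page_table = [[2, 0], [1]]     # sequence A owns pages 2 then 0; B owns page 1
seq_lens = [6, 3]

result = gather_paged_cache(pages, page_table, seq_lens, page_size)
print(result)
```

The point of the sketch is that pages belonging to one sequence need not be adjacent or in order in memory; the page table alone determines the logical token order.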

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Add paged attention support for FusedAttention, FlashAttention, and UnfusedDotProductAttention

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

cyanguwa · 2024-12-04T05:45:41Z

/te-ci pytorch L0

cyanguwa · 2025-01-06T12:27:19Z

/te-ci pytorch L0

transformer_engine/pytorch/attention.py

tests/pytorch/fused_attn/test_paged_attn.py

sudhakarsingh27 · 2025-01-21T19:08:34Z

transformer_engine/pytorch/kv_cache_manager_non_paged.py

+        v_cache: torch.Tensor
+            The value cache tensor containing previous and the current tokens
+        """
+        k_cache, v_cache = self.cache[layer_number]


Suggested change

-        k_cache, v_cache = self.cache[layer_number]
+        assert layer_number in self.cache
+        k_cache, v_cache = self.cache[layer_number]

sudhakarsingh27 · 2025-01-21T21:39:33Z

transformer_engine/pytorch/attention.py

+    def __init__(
+        self,
+        max_batch_size: int,
+        max_seqlen_kv: int,


The corresponding docstring says max_sequence_length; we should change one of them so the names match.

sudhakarsingh27 · 2025-01-22T22:27:02Z

transformer_engine/pytorch/kv_cache_manager_non_paged.py

+            seq_s = self.sequences[seq] - step_dict[seq]
+            seq_e = self.sequences[seq]
+            if qkv_format == "bshd":
+                new_k_cache[i, seq_s:seq_e, :, :] = k[i, : step_dict[seq], :, :]


k[i, : step_dict[seq], :, :]

k isn't supposed to have any tokens beyond step_dict[seq], right?

same for v
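
One situation where the slice would matter is sketched below. It assumes, as a guess rather than something confirmed in this thread, that in 'bshd' format each row of k is padded to the longest step in the batch. Plain Python lists stand in for tensors, and `append_step` is a hypothetical name.

```python
def append_step(cache_row, seq_len_after, step_len, k_row):
    """Copy the first `step_len` tokens of k_row into cache_row[seq_s:seq_e].

    seq_len_after plays the role of self.sequences[seq] and step_len the role
    of step_dict[seq] in the hunk above. Hypothetical helper for illustration.
    """
    seq_s = seq_len_after - step_len
    seq_e = seq_len_after
    # Source and destination windows both have exactly step_len entries.
    cache_row[seq_s:seq_e] = k_row[:step_len]
    return cache_row


# The sequence already holds 2 tokens and receives 1 new token this step,
# while another sequence in the batch receives 3, so k_row is padded to 3.
cache_row = ["t0", "t1", None, None, None, None]
k_row = ["t2", "pad", "pad"]

row = append_step(cache_row, seq_len_after=3, step_len=1, k_row=k_row)
print(row)
```

If k never carries tokens beyond step_dict[seq], the slice is a no-op and could be dropped; under the padding assumption it is what keeps padding out of the cache.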

sudhakarsingh27 · 2025-01-22T22:42:26Z

transformer_engine/pytorch/kv_cache_manager_non_paged.py

+            seq_s = self.sequences[seq] - step_dict[seq]
+            seq_e = self.sequences[seq]


These could be moved into a method, since the same computation could be reused elsewhere, for example to get the start positions when applying RoPE embeddings.
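
A minimal sketch of what such a method might look like, written as a free function so it runs on its own. The name `get_step_range` and the dictionary-based inputs are assumptions for illustration; the real `self.sequences` and `step_dict` may be structured differently.

```python
def get_step_range(sequences, step_dict, seq):
    """Return (seq_s, seq_e): the cache positions covered by this step's tokens.

    sequences: maps sequence id -> total tokens cached after this step
    step_dict: maps sequence id -> tokens added in this step
    (Hypothetical helper, not part of the PR.)
    """
    seq_e = sequences[seq]
    seq_s = seq_e - step_dict[seq]
    return seq_s, seq_e


sequences = {7: 10, 9: 4}  # sequence 7 is mid-generation; sequence 9 is new
step_dict = {7: 1, 9: 4}   # one generated token vs. a 4-token context

ranges = {seq: get_step_range(sequences, step_dict, seq) for seq in step_dict}
# seq_s doubles as the position of the first new token, which is the kind of
# start offset a RoPE application would need.
rope_starts = {seq: start for seq, (start, _) in ranges.items()}
print(ranges, rope_starts)
```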

cyanguwa and others added 10 commits December 3, 2024 17:01

add paged attention; test_kv_cache_accuray and test_paged_attn pass

44f6ff2

Signed-off-by: Charlene Yang <[email protected]>

remove unnecessary change from last commit

06605e5

Signed-off-by: Charlene Yang <[email protected]>

test_fused_attn pass

0b2eb88

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'main' into paged_attention

d243b79

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b0a5da4

for more information, see https://pre-commit.ci

remove unnecessary import in test_numerics

b4efd71

Signed-off-by: Charlene Yang <[email protected]>

add license for test

e637a07

Signed-off-by: Charlene Yang <[email protected]>

fix lint

767c8f5

Signed-off-by: Charlene Yang <[email protected]>

add to L0 test

a3bb14f

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d65933c

for more information, see https://pre-commit.ci

cyanguwa requested review from sudhakarsingh27 and ptrendx December 4, 2024 16:45

cyanguwa added 3 commits January 6, 2025 04:16

Merge branch 'main' into paged_attention

cd626b8

Signed-off-by: Charlene Yang <[email protected]>

update license for test_paged_attn

7c23b96

Signed-off-by: Charlene Yang <[email protected]>

update kv_cache_manager license

2dbf2e1

Signed-off-by: Charlene Yang <[email protected]>

sudhakarsingh27 reviewed Jan 7, 2025

transformer_engine/pytorch/attention.py

tests/pytorch/fused_attn/test_paged_attn.py

cyanguwa added 2 commits January 6, 2025 17:09

fix build issue from previous merge

d2f1549

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'main' into paged_attention

81a07e0

sudhakarsingh27 reviewed Jan 28, 2025

cyanguwa added 2 commits January 29, 2025 07:47

Merge branch 'main' into paged_attention

76282cf

Merge branch 'NVIDIA:main' into paged_attention

366fa65
