change kv cache layout #5
Conversation
Force-pushed from a3b2270 to 15d2772
vllm/worker/tpu_worker.py (outdated)

```diff
@@ -130,7 +130,7 @@ def determine_num_available_blocks(self) -> Tuple[int, int]:
         # Calculate the TPU KV cache size based on profiling.
         usable_memory_size = int(total_memory_size *
                                  self.cache_config.gpu_memory_utilization)
-        tpu_kv_cache_bytes = max(usable_memory_size - profiled, 0)
+        tpu_kv_cache_bytes = max(usable_memory_size - profiled, 0) // 4
```
Why the `// 4`?
NVM, I will remove it. This was only used to understand the cost of the torch.permute on the pre-allocated KV cache buffer. Anyway, once the new Pallas attention kernel is finished, the torch.permute will disappear.
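For context, a minimal sketch (with hypothetical shapes) of why that permute is worth profiling: permute itself is a zero-copy view, but materializing the permuted layout allocates a second buffer of the same size.

```python
import torch

# Hypothetical KV cache shape: (num_blocks, block_size, num_heads, head_size).
kv_cache = torch.zeros(1024, 16, 8, 128)

# permute() returns a zero-copy view with strided access...
permuted = kv_cache.permute(0, 2, 1, 3)

# ...but materializing it allocates a second buffer of equal size,
# transiently doubling the memory footprint. Shrinking the cache
# (the "// 4" above) leaves headroom to profile this copy in isolation.
materialized = permuted.contiguous()
```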
vllm/worker/tpu_model_runner.py (outdated)

```diff
@@ -172,14 +173,14 @@ def _dummy_run(
     ) -> None:
         exec_mode = ExecutionMode(exec_mode)
         if exec_mode.is_prefill():
-            seq_len = (seq_len + 15) // 16 * 16
+            seq_len = (seq_len + 15) // MIN_PREFILL_SEQ_LEN * MIN_PREFILL_SEQ_LEN
```
Replace 15 with `MIN_PREFILL_SEQ_LEN - 1`?
good catch!
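For reference, a minimal sketch of the round-up idiom under discussion; with `MIN_PREFILL_SEQ_LEN - 1` in place of the literal 15, the expression stays correct for any value of the constant:

```python
MIN_PREFILL_SEQ_LEN = 16

def round_up(seq_len: int, multiple: int = MIN_PREFILL_SEQ_LEN) -> int:
    # Round seq_len up to the nearest multiple of `multiple`.
    return (seq_len + multiple - 1) // multiple * multiple

assert round_up(1) == 16
assert round_up(16) == 16
assert round_up(17) == 32
```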
vllm/attention/backends/pallas.py (outdated)

```python
key_cache = key_cache.flatten(0, 1)
value_cache = value_cache.flatten(0, 1)
slot_mapping = slot_mapping.flatten()
key_cache.index_copy_(0, slot_mapping, key)
```
IIUC, you are copying block by block (of size hn x hs), hence the speedup?
It's now block_size x hn x hs per copy.
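A minimal sketch of the idea, with hypothetical shapes (the real layout lives in pallas.py): flattening the leading dims lets a single index_copy_ move an entire block of block_size x hn x hs elements per index, rather than one hn x hs slice per token.

```python
import torch

# Hypothetical sizes for illustration only.
num_blocks, block_size, hn, hs = 32, 16, 8, 128
num_new_blocks = 4

# Cache laid out so dim 0 indexes whole blocks of shape (block_size, hn, hs).
key_cache = torch.zeros(num_blocks, block_size, hn, hs)

# New keys for four blocks, plus the destination block index of each.
key = torch.randn(num_new_blocks, block_size, hn, hs)
slot_mapping = torch.tensor([3, 7, 11, 20])

# One call copies block_size * hn * hs elements per index.
key_cache.index_copy_(0, slot_mapping, key)
```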
vllm/attention/backends/pallas.py (outdated)

```python
value = value.flatten(0, 1)
key_cache = key_cache.flatten(0, 1)
value_cache = value_cache.flatten(0, 1)
slot_mapping = slot_mapping.flatten()
```
I wonder why you need to flatten the slot_mapping here.
It's a 2D tensor before that.
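A small illustration (hypothetical values) of why the flatten is needed: torch.Tensor.index_copy_ expects a 1-D index tensor, so a per-sequence slot_mapping of shape (num_seqs, seq_len) must be flattened before use.

```python
import torch

cache = torch.zeros(8, 3)
src = torch.randn(4, 3)

# 2-D slot mapping, e.g. (num_seqs=2, seq_len=2); values are hypothetical.
slot_mapping = torch.tensor([[0, 1], [4, 5]])

# index_copy_ requires a 1-D index, hence the flatten().
cache.index_copy_(0, slot_mapping.flatten(), src)
```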
Force-pushed from ec1fb6d to 8754c7e
vllm/attention/backends/pallas.py (outdated)

```diff
@@ -11,6 +11,8 @@
                               AttentionMetadata, AttentionType)
 from vllm.attention.backends.utils import CommonAttentionState
 
+MIN_PREFILL_SEQ_LEN = 16
```
Can we clarify which TPU generations work optimally with this seq-len? Does it make sense to move it into a config file?
Wondering if v5p or older generations would require a different threshold; same question for future TPU generations. :)
Basically, the larger the better. It has two limits:
- the new size cannot be smaller than the actual size
- the new size cannot be larger than the page size

I'm actually thinking about directly setting MIN_PREFILL_SEQ_LEN to the page size for simplicity; then we can control the size via the page size (sketched below).
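A hedged sketch of that simplification (names assumed, not the final design): derive the prefill padding granularity from the cache config's block (page) size instead of a standalone constant, so one knob controls both limits.

```python
# Hypothetical sketch: tie the prefill padding multiple to the page size.
def prefill_pad_multiple(cache_config) -> int:
    # cache_config.block_size is the KV cache page size in tokens.
    return cache_config.block_size

def pad_seq_len(seq_len: int, multiple: int) -> int:
    return (seq_len + multiple - 1) // multiple * multiple
```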
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Force-pushed from e21696a to eb584a1
Force-pushed from d0e50eb to 1f7c757
Force-pushed from 1f7c757 to fff63b2