Importance weight based sparse attention implementation for auto-regressive decoding. #2
Transformers are powerful sequence models, but their time and memory requirements grow quadratically with the sequence length. To support longer input contexts, many research efforts have focused on reducing the KV cache size and speeding up model inference.
This PR implements a relatively simple way to limit the KV cache size, inspired by the findings in https://arxiv.org/abs/2305.17118. It adds a weight-based eviction policy on top of the existing circular (sliding-window) cache eviction policy. Instead of keeping only the most recent k keys and values, we also make sure the k highest-weighted keys and values are not dropped when the cache reaches its limit. The weight is simply the Q*K attention score computed in the previous decoding step.
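To make the selection rule concrete, here is a minimal sketch of how the retained cache positions could be chosen. The function and parameter names (`select_kv_to_keep`, `local_k`, `heavy_k`) are hypothetical and only illustrate the policy described above; they are not the identifiers used in this PR.

```python
import torch

def select_kv_to_keep(scores: torch.Tensor, cache_len: int,
                      local_k: int, heavy_k: int) -> torch.Tensor:
    """Illustrative sketch (not the PR's actual code): return the indices of
    cached positions to retain when the KV cache is full.

    scores  -- per-position importance weights from the previous decoding
               step (e.g. the Q*K attention scores), shape [cache_len]
    local_k -- number of most recent positions that are always kept
    heavy_k -- number of additional highest-weighted older positions kept
    """
    assert scores.shape[-1] == cache_len
    assert 0 < local_k <= cache_len
    keep = torch.zeros(cache_len, dtype=torch.bool)
    # Always retain the most recent local_k entries (the circular/local window).
    keep[-local_k:] = True
    # Among the older entries, retain the heavy_k with the largest
    # importance weights so they are not evicted.
    older_scores = scores.clone()
    older_scores[-local_k:] = float("-inf")  # exclude the local window
    k = max(0, min(heavy_k, cache_len - local_k))
    keep[torch.topk(older_scores, k=k).indices] = True
    return keep.nonzero(as_tuple=False).squeeze(-1)

# Example: a cache of 8 positions, keeping the 3 most recent entries plus
# the 2 highest-weighted older entries.
scores = torch.tensor([0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.3, 0.6])
print(select_kv_to_keep(scores, cache_len=8, local_k=3, heavy_k=2))
# tensor([0, 4, 5, 6, 7])
```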
Empirical results measured on a few public datasets show that this simple sparse attention policy can greatly improve completion speed while retaining most of the completion quality. Please feel free to contact me if you are interested in the details.