@FengDSP commented Oct 24, 2023

Transformers are powerful sequence models, but their time and memory requirements grow quadratically with the sequence length. To support longer input contexts, many research efforts have been made to reduce the KV cache size and speed up model inference.

This PR implements a relatively simple way to limit the KV cache size, inspired by the findings in https://arxiv.org/abs/2305.17118. It adds weight-based cache eviction on top of the circular cache eviction policy: instead of keeping only the most recent k keys and values, we also ensure that the k highest-weighted keys and values are not dropped when the cache reaches its limit. The weight is simply the Q*K attention score computed in the previous decoding step.
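
For concreteness, here is a minimal sketch of the eviction step in PyTorch. The function name, tensor shapes, and the `k_local` / `k_heavy` split are illustrative assumptions and not the exact interface used in this PR.

```python
# Illustrative sketch only: names and shapes are assumptions, not the PR's API.
import torch

def evict_kv_cache(keys, values, attn_weights, k_local, k_heavy):
    """Keep the most recent k_local entries plus the k_heavy entries with the
    highest attention weight from the previous step; drop everything else.

    keys, values: [seq_len, head_dim]
    attn_weights: [seq_len] -- Q*K attention scores from the previous decode step
    """
    seq_len = keys.shape[0]
    budget = k_local + k_heavy
    if seq_len <= budget:
        # Cache is still under the limit; nothing to evict.
        return keys, values, attn_weights

    # Always retain the local window (the circular / recency part of the policy).
    local_idx = torch.arange(seq_len - k_local, seq_len)

    # Among the older entries, retain the k_heavy highest-weighted ones.
    older_weights = attn_weights[: seq_len - k_local]
    heavy_idx = torch.topk(older_weights, k_heavy).indices

    keep = torch.cat([heavy_idx.sort().values, local_idx])
    return keys[keep], values[keep], attn_weights[keep]
```

In an actual decoding loop this would run per head after each step, with the kept weights refreshed from the latest Q*K scores.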

Empirical results on a few public datasets show that this simple sparse-attention policy greatly improves completion speed while retaining most of the completion quality. Please feel free to contact me if you are interested in the details.
