Paper | Installation | Eviction | Quantization
We provide three implementations: ThinK_eager
contains the code for eager attention, ThinK_flash
utilizes FlashAttention, and ThinK_KIVI
integrates with KV cache quantization. Please note that the current implementations may not be fully optimized; we are actively working on improving their efficiency. We use LongBench to evaluate performance.
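The core idea behind all three implementations is query-driven pruning of key-cache channels. As a rough sketch (not the paper's exact scoring criterion), one can score each channel of the head dimension by how much it contributes to the Q·K^T attention logits and keep only the highest-scoring fraction; the function name and the magnitude-product heuristic below are illustrative assumptions:

```python
import numpy as np

def prune_key_channels(query, key, pruning_ratio=0.4):
    """Drop the lowest-scoring key channels for one attention head.

    query: (q_len, head_dim)  recent query states
    key:   (kv_len, head_dim) cached key states
    Returns the pruned key cache and the indices of the kept channels.
    """
    head_dim = key.shape[-1]
    num_keep = head_dim - int(head_dim * pruning_ratio)
    # Score each channel by its contribution to the Q.K^T logits,
    # approximated here by the product of per-channel magnitudes.
    scores = np.abs(query).sum(axis=0) * np.abs(key).sum(axis=0)  # (head_dim,)
    # Keep the top-scoring channels, preserving their original order.
    keep_idx = np.sort(np.argsort(scores)[-num_keep:])
    return key[:, keep_idx], keep_idx
```

With `pruning_ratio=0.4`, 40% of the key channels are dropped, shrinking the key cache proportionally while the query is sliced to the same channels at attention time.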
- Support More Models
- Support Multi-GPUs
- Optimize Efficiency
Step 1: Clone this repository
Step 2: Setup Environments
conda create -n think python=3.10
conda activate think
pip install -r requirements.txt
Evaluate on LongBench: first modify the hyperparameters in scripts/scripts_longBench/eval.sh (e.g., pruning_ratio), then run:
cd ThinK_flash
sh ./scripts/scripts_longBench/eval.sh
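As a back-of-the-envelope guide to choosing pruning_ratio, the key-cache memory saved scales linearly with the fraction of channels dropped. The sketch below is an illustrative estimate (assuming 16-bit keys and LLaMA-2-7B-like shapes), not a measurement from this repo:

```python
def key_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                    pruning_ratio=0.0, bytes_per_elem=2):
    """Approximate key-cache size after dropping a fraction of channels."""
    kept_dim = head_dim - int(head_dim * pruning_ratio)
    return num_layers * num_heads * seq_len * kept_dim * bytes_per_elem

# LLaMA-2-7B-like shapes: 32 layers, 32 heads, head_dim 128, 4k context
full = key_cache_bytes(32, 32, 128, 4096)                     # ~1 GiB
pruned = key_cache_bytes(32, 32, 128, 4096, pruning_ratio=0.4)
```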
Results:
sh ./scripts/scripts_longBench/metrics.sh
cd ThinK_kivi
Set up the environment as per the instructions from KIVI, adding an additional argument, pruning_ratio. Currently, only LLaMA-2 is supported.
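Conceptually, combining pruning with quantization means quantizing only the channels that survive pruning. The sketch below is a simplified illustration (min-max asymmetric quantization per channel); KIVI's actual scheme is group-wise per-channel quantization with a full-precision residual window, and the function name is an assumption:

```python
import numpy as np

def prune_then_quantize(key, keep_idx, num_bits=2):
    """Quantize only the surviving key channels to num_bits integers.

    key:      (kv_len, head_dim) cached key states
    keep_idx: indices of channels kept by pruning
    Returns quantized values plus the per-channel scale and zero point
    needed for dequantization.
    """
    k = key[:, keep_idx]
    qmax = 2 ** num_bits - 1
    kmin = k.min(axis=0, keepdims=True)
    scale = (k.max(axis=0, keepdims=True) - kmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    q = np.round((k - kmin) / scale).astype(np.uint8)
    return q, scale, kmin
```

Dequantization is `q * scale + kmin`, so the memory cost per kept channel falls from 16 bits to num_bits plus the shared scale/zero-point metadata.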
Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This repository is being released for research purposes only.
@article{xu2024think,
title={ThinK: Thinner Key Cache by Query-Driven Pruning},
author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},
journal={arXiv preprint arXiv:2407.21018},
year={2024}
}
This repo builds on the SnapKV, PyramidKV, and KIVI repos.
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.