Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang,
Huawei Technologies Canada
- CASP proposes a 2-bit compression method for VLMs that is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks.
- Install the requirements: `pip install -r requirements.txt`
- Build and install the CUDA inference kernels: `cd quip-sharp/quiptools && python setup.py install && cd ../`
- Install the `fast-hadamard-transform` package from its GitHub repo.
- Install AQLM: `pip install aqlm[gpu,cpu]`
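For convenience, the whole setup can be run as one script. This is a sketch assuming a CUDA-enabled machine; the PyPI package name for `fast-hadamard-transform` is an assumption (the step above installs it from its GitHub repo, so substitute a source build if the name differs).

```bash
# Environment setup for CASP (sketch; run from the repo root)
pip install -r requirements.txt

# Build and install the QuIP# CUDA inference kernels
(cd quip-sharp/quiptools && python setup.py install)

# fast-hadamard-transform: PyPI name assumed; build from its GitHub repo if needed
pip install fast-hadamard-transform

# AQLM with GPU and CPU extras (quoted so the shell does not glob-expand the brackets)
pip install "aqlm[gpu,cpu]"
```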
Follow the steps below to prepare CASP + QuIP# for LLaVA-1.5-7B. To quantize LLaVA-1.5-13B or LLaVA-Next, set `--model` in the scripts accordingly. To quantize LLaMA-7B, use `svd_llama.sh`, `hfize_llama.sh`, and `quantize_finetune_llama.sh` in the steps below. A combined run is sketched after the list.
- To prepare LLaVA-1.5-7B with low-rank compressed Wq and Wk:
  `bash SVD/scripts/svd_llava.sh`
- To prepare Hessians for QuIP#:
  `bash quip-sharp/scripts/hfize_llava.sh`
- Quantization:
  `bash quip-sharp/scripts/quantize_finetune_llava.sh`
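Putting the three steps together, an end-to-end run looks like this. The LLaVA commands are exactly the ones above; the LLaMA-7B script paths are assumptions based on the script names and the same directory layout.

```bash
# CASP + QuIP# pipeline for LLaVA-1.5-7B (run from the repo root)
bash SVD/scripts/svd_llava.sh                       # 1. low-rank compress Wq and Wk
bash quip-sharp/scripts/hfize_llava.sh              # 2. prepare Hessians for QuIP#
bash quip-sharp/scripts/quantize_finetune_llava.sh  # 3. quantize and fine-tune

# LLaMA-7B variant (script names from this README; their paths are assumed)
# bash SVD/scripts/svd_llama.sh
# bash quip-sharp/scripts/hfize_llama.sh
# bash quip-sharp/scripts/quantize_finetune_llama.sh
```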
Follow the steps below to prepare CASP + AQLM for LLaVA-1.5-7B. To quantize LLaVA-1.5-13B or LLaVA-Next, set `--model` in the scripts accordingly. To quantize LLaMA-7B, use `svd_llama.sh` and `quantize_llama.sh` in the steps below. A combined run is sketched after the list.
- To prepare LLaVA with low-rank compressed Wq and Wk:
  `bash SVD/scripts/svd_llava.sh`
- Quantization:
  `bash AQLM/scripts/quantize_llava.sh`
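The two steps chain the same way; the LLaMA-7B paths below are assumptions following the layout above.

```bash
# CASP + AQLM pipeline for LLaVA-1.5-7B (run from the repo root)
bash SVD/scripts/svd_llava.sh        # low-rank compress Wq and Wk
bash AQLM/scripts/quantize_llava.sh  # 2-bit AQLM quantization

# LLaMA-7B variant (paths assumed)
# bash SVD/scripts/svd_llama.sh
# bash AQLM/scripts/quantize_llama.sh
```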
Follow the steps below to prepare CASP + GPTQ for LLaVA-1.5-7B. To quantize LLaVA-1.5-13B or LLaVA-Next, set `--model` in the scripts accordingly. To quantize LLaMA-7B, use `svd_llama.sh` and `quantize_llama.sh` in the steps below. A combined run is sketched after the list.
- To prepare LLaVA with low-rank compressed Wq and Wk:
  `bash SVD/scripts/svd_llava.sh`
- Quantization:
  `bash GPTQ/scripts/quantize_llava.sh`
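As with the other pipelines, the GPTQ variant is a two-command chain:

```bash
# CASP + GPTQ pipeline for LLaVA-1.5-7B (run from the repo root)
bash SVD/scripts/svd_llava.sh        # low-rank compress Wq and Wk
bash GPTQ/scripts/quantize_llava.sh  # GPTQ quantization
# For LLaVA-1.5-13B or LLaVA-Next, edit --model inside both scripts first.
```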
If you find CASP useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:
@misc{gholami2025caspcompressionlargemultimodal,
      title={CASP: Compression of Large Multimodal Models Based on Attention Sparsity},
      author={Mohsen Gholami and Mohammad Akbari and Kevin Cannons and Yong Zhang},
      year={2025},
      eprint={2503.05936},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.05936},
}