If you know of related papers that are not included in this list, please let us know via an Issue!
If this repository has been helpful to you, please consider giving it a ⭐️. Your support helps this resource reach more researchers and keeps it growing. Thank you!
We summarize awesome token merge / reduce / resample methods in vision models for multi-modal large language models. The list of token merge, reduce, drop, and resample methods is organized in chronological order and is continuously updated. For quick intuition, a minimal token-merging sketch follows this introduction, and a training-free token-pruning sketch appears after the list.
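Before the list, here is a minimal, hypothetical PyTorch sketch of similarity-based token merging in the spirit of bipartite soft matching: tokens are split into two alternating sets, each token in the first set is matched to its most similar counterpart in the second, and the `r` most similar pairs are averaged. The function name, shapes, and example sizes are illustrative assumptions, not the implementation of any paper below.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs. x: (N, D) -> (N - r, D)."""
    # Alternately assign tokens to sets A and B (bipartite matching).
    a, b = x[0::2], x[1::2]
    # Cosine similarity between every A-token and every B-token.
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # (|A|, |B|)
    # For each A-token, find its best match in B.
    best_sim, best_idx = sim.max(dim=-1)
    # Select the r A-tokens whose best match is strongest (assumes r <= |A|).
    src = best_sim.topk(r).indices
    keep = torch.ones(a.shape[0], dtype=torch.bool)
    keep[src] = False
    # Average each selected A-token into its matched B-token
    # (sources sharing a destination are reduced by mean).
    b = b.clone()
    b.index_reduce_(0, best_idx[src], a[src], reduce="mean", include_self=True)
    # Unmerged A-tokens plus updated B-tokens form the shorter sequence.
    return torch.cat([a[keep], b], dim=0)

# Example: halve a 576-token visual sequence (e.g., ViT-L/14 at 336px).
tokens = torch.randn(576, 1024)
print(merge_similar_tokens(tokens, r=288).shape)  # torch.Size([288, 1024])
```

Deployed merging schemes typically also track how many tokens each merged token represents and weight subsequent averages and attention accordingly; this sketch omits that bookkeeping.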
- Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
NeurIPS'2023 (oral) [Paper] [Code]
- Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
CVPR'2024 [Paper] [Code]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
ICML'2023 [Paper] [Code]
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
CVPR'2024 [Paper] [Code]
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
arXiv'2024 [Paper]
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
arXiv'2024 [Paper] [Code]
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
ECCV'2024 (oral) [Paper] [Code]
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
arXiv'2024 [Paper] [Code]
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
arXiv'2024 [Paper] [Code]
- DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
arXiv'2024 [Paper] [Code]
- Efficient Large Multi-modal Models via Visual Context Compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
NeurIPS'2024 [Paper] [Code]
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
arXiv'2024 [Paper] [Code]
- TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang
arXiv'2024 [Paper] [Code]
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie
arXiv'2024 [Paper] [Code]
- HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
arXiv'2024 [Paper]
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
arXiv'2024 [Paper]
- Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu
arXiv'2024 [Paper]
- TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
arXiv'2024 [Paper]
- Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
arXiv'2024 [Paper] [Code]
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
arXiv'2024 [Paper] [Code]
- Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu
arXiv'2024 [Paper]
- Retrieval Replace Reduction: An effective visual token reduction method via semantic match
Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li
arXiv'2024 [Paper]
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi
arXiv'2024 [Paper]
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
arXiv'2024 [Paper] [Code]
- Inference Optimal VLMs Need Only One Visual Token but Larger Models
Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
arXiv'2024 [Paper] [Code]
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni
NeurIPS'2024 (Spotlight) [Paper] [Code]
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang
arXiv'2024 [Paper] [Code]
- FoPru: Focal Pruning for Efficient Large Vision-Language Models
Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu
arXiv'2024 [Paper]
- FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
arXiv'2024 [Paper]
- LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia
arXiv'2024 [Paper]
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
arXiv'2024 [Paper] [Code]
- freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan
arXiv'2024 [Paper]
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
arXiv'2024 [Paper]
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang
arXiv'2024 [Paper] [Code]
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
arXiv'2024 [Paper]
- Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
arXiv'2024 [Paper] [Code]
- Negative Token Merging: Image-based Adversarial Feature Guidance
Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
arXiv'2024 [Paper] [Code]
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
arXiv'2024 [Paper] [Code]
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
arXiv'2024 [Paper] [Code]
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
arXiv'2024 [Paper] [Code]
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
arXiv'2024 [Paper] [Code]
- iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng
arXiv'2024 [Paper] [Code]
- Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
arXiv'2024 [Paper]
- DocVLM: Make Your VLM an Efficient Reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman
arXiv'2024 [Paper]
- LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
Ke Wang, Hong Xuan
arXiv'2024 [Paper]
- Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
arXiv'2024 [Paper]
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
arXiv'2024 [Paper] [Code]
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
arXiv'2024 [Paper] [Code]
- FastVLM: Efficient Vision Encoding for Vision Language Models
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
arXiv'2024 [Paper]
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang, Hao Zhou, Kai Han
arXiv'2024 [Paper] [Code]
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
arXiv'2024 [Paper] [Code]
- FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
arXiv'2024 [Paper] [Code]
- What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
arXiv'2024 [Paper] [Code]
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
arXiv'2024 [Paper] [Code]
- Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
arXiv'2024 [Paper] [Code]
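Many of the training-free methods above score visual tokens by the attention they receive and drop the least-attended ones. The sketch below is a minimal, hypothetical PyTorch illustration of that general recipe, not the exact procedure of any listed paper: given one layer's hidden states and the previous layer's attention map, keep the top-k most-attended visual tokens and reassemble the sequence. The function name, shapes, and example positions are illustrative assumptions.

```python
import torch

def prune_visual_tokens(hidden: torch.Tensor,
                        attn: torch.Tensor,
                        visual_slice: slice,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """hidden: (L, D) hidden states entering a layer; attn: (H, L, L)
    attention weights from the previous layer; visual_slice: positions
    of the visual tokens. Returns the sequence with pruned visuals."""
    vis = hidden[visual_slice]                            # (V, D)
    # Attention each visual token receives, averaged over heads and queries.
    received = attn[:, :, visual_slice].mean(dim=(0, 1))  # (V,)
    k = max(1, int(keep_ratio * vis.shape[0]))
    # Keep the k most-attended visual tokens, preserving their order.
    keep = received.topk(k).indices.sort().values
    # Reassemble: tokens before the visual span, kept visuals, tokens after.
    return torch.cat([hidden[: visual_slice.start],
                      vis[keep],
                      hidden[visual_slice.stop :]], dim=0)

# Example: a 700-token sequence whose positions 35..610 are visual tokens.
hidden = torch.randn(700, 4096)
attn = torch.softmax(torch.randn(32, 700, 700), dim=-1)
out = prune_visual_tokens(hidden, attn, slice(35, 611))
print(out.shape)  # torch.Size([412, 4096]): 576 visual tokens -> 288 kept
```

In a real model this would run inside the decoder's forward pass, and the pruned positions would also need to be removed from the KV cache and position ids; the papers above differ mainly in how the importance score is computed and at which layers pruning happens.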