💫 Awesome-Token-Merge-for-MLLMs

Welcome to Awesome-Token-Merge-for-MLLMs.

If you know of related papers that aren't included in this list, please let me know in Issues!

If this repository has been helpful to you, please consider giving it a ⭐️ to show your support; it helps the list reach more researchers and keeps this resource growing. Thank you!

📜 Introduction

We summarize awesome token merge / reduce / resample methods in vision models for multi-modal large language models (MLLMs).

The list of token merge, reduce, drop, and resample methods is organized in chronological order and is continuously updated.
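
Before the paper list, here is a minimal, hypothetical sketch of the "merge" operation many of these methods build on: score visual tokens by pairwise similarity and fuse the most redundant pairs before they reach the LLM. It is written in the spirit of bipartite soft matching; the function name `merge_tokens`, the alternating A/B split, and the unweighted averaging are illustrative assumptions, and each paper below scores, matches, and fuses tokens differently.

```python
# Minimal, illustrative sketch (not any listed paper's exact algorithm):
# similarity-based token merging in the spirit of bipartite soft matching.
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Reduce (N, D) tokens to (N - r, D) by merging the r most similar pairs."""
    # Split tokens into two alternating sets A and B.
    a, b = x[0::2], x[1::2]
    # Cosine similarity between every A-token and every B-token.
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # (|A|, |B|)
    # Each A-token's best match in B, and how strong that match is.
    best_sim, best_idx = sim.max(dim=-1)
    # The r most redundant A-tokens get merged into their B matches.
    merged = best_sim.topk(r).indices
    keep = torch.ones(a.shape[0], dtype=torch.bool)
    keep[merged] = False
    b = b.clone()
    for i in merged.tolist():
        j = best_idx[i].item()
        # Unweighted mean here; real methods often track token "size"
        # and use a weighted mean instead.
        b[j] = (b[j] + a[i]) / 2
    # Surviving sequence: unmerged A-tokens plus the (updated) B set.
    return torch.cat([a[keep], b], dim=0)

# e.g., halve 576 ViT patch tokens (a 24x24 grid) before they enter the LLM:
tokens = torch.randn(576, 1024)
reduced = merge_tokens(tokens, r=288)  # -> (288, 1024)
```

Each merge removes one token, so `r` directly sets the compression ratio; per-layer schedules (merging a few tokens at every block) are common in the papers below.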

📖 Related Papers

Baseline

  • Visual Instruction Tuning arXiv
    Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
    NIPS'2023 (oral) [Paper] [Code]

    LLaVA Framework

  • Honeybee: Locality-enhanced Projector for Multimodal LLM arXiv
    Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
    CVPR'2024 [Paper] [Code]

    Honeybee Framework

  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models arXiv
    Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
    ICML'2023 [Paper] [Code]

    BLIP-2 Framework
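
The Baseline entries above illustrate the "resample" family: BLIP-2's Q-Former compresses the visual sequence by letting a small set of learnable queries cross-attend to all visual tokens, so the LLM only ever sees the queries. Below is a minimal, hypothetical single-layer version of that pattern; the class name `Resampler` and all sizes are assumptions, not BLIP-2's actual architecture.

```python
# Hypothetical sketch of query-based visual token resampling
# (Q-Former / Perceiver-style): M learnable queries cross-attend to
# N visual tokens, so the LLM sees M << N tokens.
import torch
import torch.nn as nn

class Resampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, dim) -> (B, num_queries, dim)
        B = visual_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.norm(out + q)

# 576 patch tokens compressed to 32 query tokens:
feats = torch.randn(2, 576, 1024)
print(Resampler()(feats).shape)  # torch.Size([2, 32, 1024])
```

The appeal of this design is that the output length is fixed by `num_queries` regardless of input resolution, which is why resamplers pair naturally with high-resolution inputs.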

2024.3

  • MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer arXiv
    Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
    CVPR'2024 [Paper] [Code]

    MADTP Framework

  • Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models arXiv
    Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
    arXiv'2024 [Paper]

    LLaVA-HR Framework

  • TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document arXiv
    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
    arXiv'2024 [Paper] [Code]

    TextMonkey Framework

  • An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models arXiv
    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
    ECCV'2024 (oral) [Paper] [Code]

    FastV Framework
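
FastV's title points at the "drop" family: rank visual tokens by how much attention they receive inside the LLM and discard the low-ranked ones in later layers. A hypothetical ranking-and-dropping sketch follows; the helper `drop_tokens` and the mean-attention importance score are illustrative, and FastV's exact layer choice and criterion differ.

```python
# Hypothetical sketch of attention-guided token dropping: keep the
# visual tokens that receive the most attention, discard the rest.
import torch

def drop_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float):
    """tokens: (N, D); attn: (N, N) attention map over the same tokens."""
    # Importance of each token = mean attention it receives from others.
    importance = attn.mean(dim=0)                       # (N,)
    k = max(1, int(tokens.shape[0] * keep_ratio))
    keep = importance.topk(k).indices.sort().values     # preserve token order
    return tokens[keep]

tokens = torch.randn(576, 1024)
attn = torch.softmax(torch.randn(576, 576), dim=-1)
print(drop_tokens(tokens, attn, keep_ratio=0.5).shape)  # (288, 1024)
```

Because dropping only indexes into the sequence, it needs no retraining, which is why many training-free, plug-and-play entries below adopt some variant of it.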

  • Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring arXiv
    Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
    arXiv'2024 [Paper] [Code]

    Griffon V2 Framework

  • LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models arXiv
    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
    arXiv'2024 [Paper] [Code]

    LLaVA-PruMerge Framework

2024.5

  • DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models arXiv
    Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
    arXiv'2024 [Paper] [Code]

    DeCo Framework

2024.6

  • Efficient Large Multi-modal Models via Visual Context Compression arXiv
    Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
    NIPS'2024 [Paper] [Code]

    LLaVolta Framework

  • VoCo-LLaMA: Towards Vision Compression with Large Language Models arXiv
    Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
    arXiv'2024 [Paper] [Code]

    VoCo-LLaMA Framework

2024.7

  • TokenPacker: Efficient Visual Projector for Multimodal LLM arXiv
    Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang
    arXiv'2024 [Paper] [Code]

    TokenPacker Framework

  • Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding arXiv
    Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie
    arXiv'2024 [Paper] [Code]

    Token-level Framework

2024.8

  • HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments arXiv
    Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
    arXiv'2024 [Paper]

    HiRED Framework

  • MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model arXiv
    Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
    arXiv'2024 [Paper]

    MaVEn Framework

2024.9

  • Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information arXiv
    Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu
    arXiv'2024 [Paper]

    Recoverable Compression Framework

  • TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings arXiv
    Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
    arXiv'2024 [Paper]

    TG-LLaVA Framework

  • Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs arXiv
    Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
    arXiv'2024 [Paper] [Code]

    TRIM Framework

2024.10

  • AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity arXiv
    Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
    arXiv'2024 [Paper] [Code]

    AVG-LLaVA Framework

  • Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See arXiv
    Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
    arXiv'2024 [Paper]

    YOPO Framework

  • Retrieval Replace Reduction: An effective visual token reduction method via semantic match arXiv
    Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li
    arXiv'2024 [Paper]

    TRSM Framework

  • Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers arXiv
    Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi
    arXiv'2024 [Paper]

    Victor Framework

  • PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction arXiv
    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
    arXiv'2024 [Paper] [Code]

    PyramidDrop Framework

2024.11

  • Inference Optimal VLMs Need Only One Visual Token but Larger Models arXiv
    Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
    arXiv'2024 [Paper] [Code]

    QuCC Framework

  • Don't Look Twice: Faster Video Transformers with Run-Length Tokenization arXiv
    Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni
    NIPS'2024 (Spotlight) [Paper] [Code]

    RLT Framework

  • Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model arXiv
    Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang
    arXiv'2024 [Paper] [Code]

    MustDrop Framework

  • FoPru: Focal Pruning for Efficient Large Vision-Language Models arXiv
    Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu
    arXiv'2024 [Paper]

    FoPru Framework

  • FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression arXiv
    Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
    arXiv'2024 [Paper]

    FocusLLaVA Framework

  • LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval arXiv
    Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia
    arXiv'2024 [Paper]

    LLaVA-MR Framework

  • DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models arXiv
    Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
    arXiv'2024 [Paper] [Code]

    DyCoke Framework

  • freePruner: A Training-free Approach for Large Multimodal Model Acceleration arXiv
    Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan
    arXiv'2024 [Paper]

    freePruner Framework

  • Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration arXiv
    Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
    arXiv'2024 [Paper]

    FiCoCo Framework

2024.12

  • ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models arXiv
    Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang
    arXiv'2024 [Paper] [Code]

    ATP-LLaVA Framework

  • Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction arXiv
    Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
    arXiv'2024 [Paper]

    Framework

  • Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification arXiv
    Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
    arXiv'2024 [Paper] [Code]

    Dynamic-LLaVA Framework

  • Negative Token Merging: Image-based Adversarial Feature Guidance arXiv
    Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
    arXiv'2024 [Paper] [Code]

    NegToMe Framework

  • [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster arXiv
    Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
    arXiv'2024 [Paper] [Code]

    FasterVLM Framework

  • AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning arXiv
    Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
    arXiv'2024 [Paper] [Code]

    AIM Framework

  • VisionZip: Longer is Better but Not Necessary in Vision Language Models arXiv
    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
    arXiv'2024 [Paper] [Code]

    VisionZip Framework

  • [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs arXiv
    Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
    arXiv'2024 [Paper] [Code]

    VTC-CLS Framework

  • iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models arXiv
    Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng
    arXiv'2024 [Paper] [Code]

    iLLaVA Framework

  • Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models arXiv
    Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
    arXiv'2024 [Paper]

    PAR Framework

  • DocVLM: Make Your VLM an Efficient Reader arXiv
    Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman
    arXiv'2024 [Paper]

    DocVLM Framework

  • LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information arXiv
    Ke Wang, Hong Xuan
    arXiv'2024 [Paper]

    LLaVA-Zip Framework

  • Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM arXiv
    Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
    arXiv'2024 [Paper]

    Dynamic-VLM Framework

  • PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models arXiv
    Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
    arXiv'2024 [Paper] [Code]

    PVC Framework

  • Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration arXiv
    Mark Endo, Xiaohan Wang, Serena Yeung-Levy
    arXiv'2024 [Paper] [Code]

    FEATHER Framework

  • FastVLM: Efficient Vision Encoding for Vision Language Models arXiv
    Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
    arXiv'2024 [Paper]

    FastVLM Framework

  • PruneVid: Visual Token Pruning for Efficient Video Large Language Models arXiv
    Xiaohu Huang, Hao Zhou, Kai Han
    arXiv'2024 [Paper] [Code]

    PruneVid Framework

  • ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding arXiv
    Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
    arXiv'2024 [Paper] [Code]

    ReTaKe Framework

2025.1

  • FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models arXiv
    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
    arXiv'2024 [Paper] [Code]

    FrameFusion Framework

  • What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph arXiv
    Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
    arXiv'2025 [Paper] [Code]

    G-Prune Framework

  • LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token arXiv
    Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
    arXiv'2025 [Paper] [Code]

    LLaVA-Mini Framework

  • Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration arXiv
    Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
    arXiv'2025 [Paper] [Code]

    GlobalCom2 Framework

(back to top)
