Awesome-multimod-llms

Survey

  • MM-LLMs: Recent Advances in MultiModal Large Language Models, arXiv, 2401.13601, arxiv, pdf, cication: -1

    Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu

  • A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv, 2401.08092, arxiv, pdf, cication: -1

    Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang

Vision

Video

  • Distilling Vision-Language Models on Millions of Videos, arXiv, 2401.06129, arxiv, pdf, cication: -1

    Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong

  • LEGO: Language Enhanced Multi-modal Grounding Model, arXiv, 2401.06071, arxiv, pdf, cication: -1

    Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu · (LEGO - lzw-lzw) Star

  • COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training, arXiv, 2401.00849, arxiv, pdf, cication: -1

    Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

  • General Object Foundation Model for Images and Videos at Scale, arXiv, 2312.09158, arxiv, pdf, cication: -1

    Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai

    · (glee-vision.github)

  • Video Understanding with Large Language Models: A Survey, arXiv, 2312.17432, arxiv, pdf, cication: -1

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu · (Awesome-LLMs-for-Video-Understanding - yunlong10) Star

  • OneLLM: One Framework to Align All Modalities with Language, arXiv, 2312.03700, arxiv, pdf, cication: -1

    Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

  • CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation, arXiv, 2311.18775, arxiv, pdf, cication: -1

    Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal · (i-Code - microsoft) Star · (codi-2.github)

  • LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models, arXiv, 2311.17043, arxiv, pdf, cication: -1

    Yanwei Li, Chengyao Wang, Jiaya Jia · (llama-vid - dvlab-research) Star

  • Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding, arXiv, 2311.08046, arxiv, pdf, cication: -1

    Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan · (Chat-UniVi - PKU-YuanGroup) Star · (huggingface) · (qbitai)

  • Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities, arXiv, 2311.05698, arxiv, pdf, cication: -1

    AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

    · (jiqizhixin)

  • UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition, arXiv, 2311.15599, arxiv, pdf, cication: -1

    Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

  • PG-Video-LLaVA: Pixel Grounding Large Video-Language Models, arXiv, 2311.13435, arxiv, pdf, cication: -1

    Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan

  • VideoCon: Robust Video-Language Alignment via Contrast Captions, arXiv, 2311.10111, arxiv, pdf, cication: -1

    Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover

  • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, arXiv, 2311.10122, arxiv, pdf, cication: -1

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, Li Yuan

  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, arXiv, 2306.02858, arxiv, pdf, cication: 39

    Hang Zhang, Xin Li, Lidong Bing · [Video-LLaMA - DAMO-NLP-SG] Star

Image

  • OmniLMM - OpenBMB Star

    Large Multi-modal Models for Strong Performance and Efficient Deployment · (mp.weixin.qq)

  • MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer, arXiv, 2401.10208, arxiv, pdf, cication: -1

    Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou · (jiqizhixin) · (MM-Interleaved - OpenGVLab) Star

  • Octavius: Mitigating Task Interference in MLLMs via MoE, arXiv, 2311.02684, arxiv, pdf, cication: 3

    Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao · (openlamm.github)

  • Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study, arXiv, 2401.17981, arxiv, pdf, cication: -1

    Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen

  • Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization, arXiv, 2401.15914, arxiv, pdf, cication: -1

    Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

  • MouSi: Poly-Visual-Expert Vision-Language Models, arXiv, 2401.17221, arxiv, pdf, cication: -1

    Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng

  • LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model, arXiv, 2401.02330, arxiv, pdf, cication: -1

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang · (llava-phi - zhuyiche) Star

  • InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model, arXiv, 2401.16420, arxiv, pdf, cication: -1

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao · (InternLM-XComposer - InternLM) Star

  • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, arXiv, 2401.15947, arxiv, pdf, cication: -1

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan · (MoE-LLaVA - PKU-YuanGroup) Star

  • Small Language Model Meets with Reinforced Vision Vocabulary, arXiv, 2401.12503, arxiv, pdf, cication: -1

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang · (Vary-toy - Ucas-HaoranWei) Star · (qbitai)

  • ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, arXiv, 2311.12793, arxiv, pdf, cication: 6

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin · (sharegpt4v.github) · (InternLM-XComposer - InternLM) Star · (huggingface)

  • Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities, arXiv, 2401.14405, arxiv, pdf, cication: -1

    Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue · (M2PT - AILab-CVC) Star

  • ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models, arXiv, 2401.13311, arxiv, pdf, cication: -1

    Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng · (con-textual.github)

  • SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, arXiv, 2401.12168, arxiv, pdf, cication: -1

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia · (spatial-vlm.github)

  • Improving fine-grained understanding in image-text pre-training, arXiv, 2401.09865, arxiv, pdf, cication: -1

    Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu

  • Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, arXiv, 2401.06209, arxiv, pdf, cication: -1

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie · (tsb0601.github)

    · (MMVP - tsb0601) Star

  • GATS: Gather-Attend-Scatter, arXiv, 2401.08525, arxiv, pdf, cication: -1

    Konrad Zolna, Serkan Cabi, Yutian Chen, Eric Lau, Claudio Fantacci, Jurgis Pasukonis, Jost Tobias Springenberg, Sergio Gomez Colmenarejo

  • A Vision Check-up for Language Models, arXiv, 2401.01862, arxiv, pdf, cication: -1

    Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba

  • DocLLM: A layout-aware generative language model for multimodal document understanding, arXiv, 2401.00908, arxiv, pdf, cication: -1

    Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu

  • V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs, arXiv, 2312.14135, arxiv, pdf, cication: -1

    Penghao Wu, Saining Xie · (vstar - penghao-wu) Star

  • Learning Vision from Models Rivals Learning Vision from Data, arXiv, 2312.17742, arxiv, pdf, cication: -1

    Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola

  • Parrot Captions Teach CLIP to Spot Text, arXiv, 2312.14232, arxiv, pdf, cication: -1

    Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou

  • Harnessing Diffusion Models for Visual Perception with Meta Prompts, arXiv, 2312.14733, arxiv, pdf, cication: -1

    Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang · (meta-prompts - fudan-zvg) Star · (mp.weixin.qq)

  • VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv, 2312.14233, arxiv, pdf, cication: -1

    Jitesh Jain, Jianwei Yang, Humphrey Shi

    · (VCoder - SHI-Labs) Star · (praeclarumjj3.github)

  • InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks, arXiv, 2312.14238, arxiv, pdf, cication: -1

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu

    · (internvl - opengvlab) Star · (internvl.opengvlab)

  • Generative Multimodal Models are In-Context Learners, arXiv, 2312.13286, arxiv, pdf, cication: -1

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang · (Emu - baaivision) Star · (huggingface)

  • Osprey: Pixel Understanding with Visual Instruction Tuning, arXiv, 2312.10032, arxiv, pdf, cication: -1

    Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu · (osprey - circleradon) Star · (huggingface)

  • Gemini: A Family of Highly Capable Multimodal Models, arXiv, 2312.11805, arxiv, pdf, cication: -1

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth

  • Silkie: Preference Distillation for Large Visual Language Models, arXiv, 2312.10665, arxiv, pdf, cication: -1

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong

  • G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model, arXiv, 2312.11370, arxiv, pdf, cication: -1

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li

  • Merlin: Empowering Multimodal LLMs with Foresight Minds, arXiv, 2312.00589, arxiv, pdf, cication: -1

    En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang · (qbitai)

  • VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation, arXiv, 2312.09251, arxiv, pdf, cication: -1

    Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

  • Honeybee: Locality-enhanced Projector for Multimodal LLM, arXiv, 2312.06742, arxiv, pdf, cication: -1

    Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh

    · (honeybee - kakaobrain) Star

  • Interfacing Foundation Models' Embeddings, arXiv, 2312.07532, arxiv, pdf, cication: -1

    Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan · (FIND - UX-Decoder) Star

  • VILA: On Pre-training for Visual Language Models, arXiv, 2312.07533, arxiv, pdf, cication: -1

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

  • Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models, arXiv, 2312.06109, arxiv, pdf, cication: -1

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

  • Generating Illustrated Instructions, arXiv, 2312.04552, arxiv, pdf, cication: -1

    Sachit Menon, Ishan Misra, Rohit Girdhar

  • Alpha-CLIP: A CLIP Model Focusing on Wherever You Want, arXiv, 2312.03818, arxiv, pdf, cication: -1

    Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang · (aleafy.github)

  • LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, arXiv, 2312.02949, arxiv, pdf, cication: -1

    Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li · (LLaVA-Grounding - UX-Decoder) Star

  • Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models, arXiv, 2311.06783, arxiv, pdf, cication: -1

    Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai · (Q-Instruct - Q-Future) Star · (q-future.github)

  • Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, arXiv, 2311.16922, arxiv, pdf, cication: -1

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing

    · (VCD - DAMO-NLP-SG) Star

  • ChartLlama: A Multimodal LLM for Chart Understanding and Generation, arXiv, 2311.16483, arxiv, pdf, cication: -1

    Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang · (tingxueronghua.github) · (jiqizhixin)

  • LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge, arXiv, 2311.11860, arxiv, pdf, cication: -1

    Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie · (JiuTian - rshaojimmy) Star · (rshaojimmy.github) · (mp.weixin.qq)

  • GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?, arXiv, 2311.15732, arxiv, pdf, cication: -1

    Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang · (GPT4Vis - whwu95) Star

  • DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding, arXiv, 2311.11810, arxiv, pdf, cication: -1

    Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, Can Huang · (qbitai)

  • UniIR: Training and Benchmarking Universal Multimodal Information Retrievers, arXiv, 2311.17136, arxiv, pdf, cication: -1

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen · (UniIR - TIGER-AI-Lab) Star · (tiger-ai-lab.github)

  • UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework, arXiv, 2311.10125, arxiv, pdf, cication: -1

    Chris Kelly, Luhui Hu, Cindy Yang, Yu Tian, Deshun Yang, Bang Yang, Zaoshan Huang, Zihao Li, Yuexian Zou · (SA-Segment-Anything - LHBuilder) Star

  • Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models, arXiv, 2311.06607, arxiv, pdf, cication: -1

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai · (monkey - yuliang-liu) Star

  • SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models, arXiv, 2311.07575, arxiv, pdf, cication: -1

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen · (LLaMA2-Accessory - Alpha-VLLM) Star

  • To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning, arXiv, 2311.07574, arxiv, pdf, cication: -1

    Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang · (LVIS-INSTRUCT4V - X2FD) Star

  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, arXiv, 2311.06242, arxiv, pdf, cication: -1

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan

  • SILC: Improving Vision Language Pretraining with Self-Distillation, arXiv, 2310.13355, arxiv, pdf, cication: -1

    Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari

  • u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model, arXiv, 2311.05348, arxiv, pdf, cication: -1

    Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li

  • LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, arXiv, 2311.05437, arxiv, pdf, cication: -1

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu · (LLaVA-Plus-Codebase - LLaVA-VL) Star · (huggingface)

  • NExT-Chat: An LMM for Chat, Detection and Segmentation, arXiv, 2311.04498, arxiv, pdf, cication: -1

    Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, Tat-Seng Chua

    · (NExT-Chat - NExT-ChatV) Star · [20d12192149d0748e2.gradio]

  • LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment, arXiv, 2310.01852, arxiv, pdf, cication: -1

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li · [LanguageBind - PKU-YuanGroup] Star · [jiqizhixin]

  • OtterHD: A High-Resolution Multi-modality Model, arXiv, 2311.04219, arxiv, pdf, cication: -1

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu

  • mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration, arXiv, 2311.04257, arxiv, pdf, cication: -1

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou · [mPLUG-Owl - X-PLUG] Star

  • GLaMM: Pixel Grounding Large Multimodal Model, arXiv, 2311.03356, arxiv, pdf, cication: -1

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

  • CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding, arXiv, 2311.03354, arxiv, pdf, cication: -1

    Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan

  • Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?, arXiv, 2311.00047, arxiv, pdf, cication: -1

    Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai

  • De-Diffusion Makes Text a Strong Cross-Modal Interface, arXiv, 2311.00618, arxiv, pdf, cication: -1

    Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu

  • LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing, arXiv, 2311.00571, arxiv, pdf, cication: -1

    Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li

  • Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V, arXiv, 2310.19061, arxiv, pdf, cication: -1

    Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun

  • MM-VID: Advancing Video Understanding with GPT-4V(ision), arXiv, 2310.19773, arxiv, pdf, cication: 1

    Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu

  • Woodpecker: Hallucination Correction for Multimodal Large Language Models, arXiv, 2310.16045, arxiv, pdf, cication: 1

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen · [woodpecker - bradyfu] Star · [qbitai]

  • PaLI-3 Vision Language Models: Smaller, Faster, Stronger, arXiv, 2310.09199, arxiv, pdf, cication: 1

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski · [mp.weixin.qq] · [jiqizhixin]

  • Large Language Models are Visual Reasoning Coordinators, arXiv, 2310.15166, arxiv, pdf, cication: -1

    Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu

  • Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, arXiv, 2310.11441, arxiv, pdf, cication: 2

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao · [SoM - microsoft] Star · [jiqizhixin]

  • From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models, arXiv, 2310.08825, arxiv, pdf, cication: -1

    Dongsheng Jiang, Yuchen Liu, Songlin Liu, Xiaopeng Zhang, Jin Li, Hongkai Xiong, Qi Tian · [comm - yuchenliu98] Star

  • MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning, arXiv, 2310.09478, arxiv, pdf, cication: -1

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

  • Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation, arXiv, 2310.08541, arxiv, pdf, cication: 1

    Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

  • Kosmos-G: Generating Images in Context with Multimodal Large Language Models, arXiv, 2310.02992, arxiv, pdf, cication: 1

    Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei

  • CogVLM: Visual Expert for Pretrained Language Models, arXiv, 2311.03079, arxiv, pdf, cication: -1

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song · [CogVLM - THUDM] Star · [qbitai]

  • Making LLaMA SEE and Draw with SEED Tokenizer, arXiv, 2310.01218, arxiv, pdf, cication: 1

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan · [SEED - AILab-CVC] Star

  • Improved Baselines with Visual Instruction Tuning, arXiv, 2310.03744, arxiv, pdf, cication: 11

    Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee · [mp.weixin.qq] · [llava-vl.github]

  • Investigating the Catastrophic Forgetting in Multimodal Large Language Models, arXiv, 2309.10313, arxiv, pdf, cication: 4

    Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma · [mp.weixin.qq]

  • MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens, arXiv, 2310.02239, arxiv, pdf, cication: 3

    Kaizhi Zheng, Xuehai He, Xin Eric Wang · [minigpt-5 - eric-ai-lab] Star

  • Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency, arXiv, 2310.03734, arxiv, pdf, cication: -1

    Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan

  • AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models, arXiv, 2309.16414, arxiv, pdf, cication: -1

    Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi

  • InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition, arXiv, 2309.15112, arxiv, pdf, cication: 2

    Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan · [mp.weixin.qq] · [internlm-xcomposer - internlm] Star

  • AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model, arXiv, 2309.16058, arxiv, pdf, cication: 4

    Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu · [jiqizhixin]

  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), arXiv, 2309.17421, arxiv, pdf, cication: 20

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang · [qbitai]

  • DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, arXiv, 2309.14327, arxiv, pdf, cication: -1

    Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He

  • Kosmos-2.5: A Multimodal Literate Model, arXiv, 2309.11419, arxiv, pdf, cication: 1

    Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo

  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants, arXiv, 2309.10020, arxiv, pdf, cication: 14

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao

  • TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild, arXiv, 2309.08637, arxiv, pdf, cication: -1

    Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi

  • An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models, arXiv, 2309.09958, arxiv, pdf, cication: 2

    Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen

  • NExT-GPT: Any-to-Any Multimodal LLM, arXiv, 2309.05519, arxiv, pdf, cication: 12

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua

  • DreamLLM: Synergistic Multimodal Comprehension and Creation, arXiv, 2309.11499, arxiv, pdf, cication: 4

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei

    · [DreamLLM - RunpeiDong] Star

  • ImageBind-LLM: Multi-modality Instruction Tuning, arXiv, 2309.03905, arxiv, pdf, cication: 4

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo

  • Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, arXiv, 2309.02591, arxiv, pdf, cication: 12

    Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin

  • Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior, arXiv, 2309.00359, arxiv, pdf, cication: -1

    Ashmit Khandelwal, Aditya Agrawal, Aanisha Bhattacharyya, Yaman K Singla, Somesh Singh, Uttaran Bhattacharya, Ishita Dasgupta, Stefano Petrangeli, Rajiv Ratn Shah, Changyou Chen

  • Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following, arXiv, 2309.00615, arxiv, pdf, cication: 4

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li

  • PointLLM: Empowering Large Language Models to Understand Point Clouds, arXiv, 2308.16911, arxiv, pdf, cication: 3

    Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin · [pointllm - openrobotlab] Star

  • InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4, arXiv, 2308.12067, arxiv, pdf, cication: 5

    Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun · [minigpt-v2.github] · [jiqizhixin]

  • BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions, arXiv, 2308.09936, arxiv, pdf, cication: 3

    Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu · [bliva - mlpc-ucsd] Star

  • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, arXiv, 2308.12966, arxiv, pdf, cication: 17

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou · [Qwen-VL - QwenLM] Star · [jiqizhixin]

  • Generating Images with Multimodal Language Models, arXiv, 2305.17216, arxiv, pdf, cication: 27

    Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov · [mp.weixin.qq]

  • Link-Context Learning for Multimodal LLMs, arXiv, 2308.07891, arxiv, pdf, cication: -1

    Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, Ziwei Liu · [Link-Context-Learning - isekai-portal] Star

  • Visual Instruction Tuning, arXiv, 2304.08485, arxiv, pdf, cication: 301

    Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee · [llava-vl.github]

  • Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions, arXiv, 2308.04152, arxiv, pdf, cication: 2

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang · [cheetah - dcdmllm] Star

  • OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models, arXiv, 2308.01390, arxiv, pdf, cication: 31

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa · [open_flamingo - mlfoundations] Star

  • UniVTG: Towards Unified Video-Language Temporal Grounding, ICCV, 2023, arxiv, pdf, cication: -1

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou · [UniVTG - showlab] Star

  • Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection, arXiv, 2307.16888, arxiv, pdf, cication: 2

    Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin

  • Unified Model for Image, Video, Audio and Language Tasks, arXiv, 2307.16184, arxiv, pdf, cication: 6

    Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord

  • 3D-LLM: Injecting the 3D World into Large Language Models, arXiv, 2307.12981, arxiv, pdf, cication: 10

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan

  • Meta-Transformer: A Unified Framework for Multimodal Learning, arXiv, 2307.10802, arxiv, pdf, cication: 23

    Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

  • Augmenting CLIP with Improved Visio-Linguistic Reasoning, arXiv, 2307.09233, arxiv, pdf, cication: -1

    Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi

  • Planting a SEED of Vision in Large Language Model, arXiv, 2307.08041, arxiv, pdf, cication: 7

    Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan

  • BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs, arXiv, 2307.08581, arxiv, pdf, cication: 11

    Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang · [bubogpt - magic-research] Star

  • Vision-Language Models for Vision Tasks: A Survey, arXiv, 2304.00685, arxiv, pdf, cication: 27

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu · [vlm_survey - jingyi0000] Star

  • What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?, arXiv, 2307.02469, arxiv, pdf, cication: 10

    Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong · [jiqizhixin] · [lynx-llm - bytedance] Star

  • Generative Pretraining in Multimodality, arXiv, 2307.05222, arxiv, pdf, cication: 19

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

  • EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone, ICCV, 2023, arxiv, pdf, cication: 2

    Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

  • SVIT: Scaling up Visual Instruction Tuning, arXiv, 2307.04087, arxiv, pdf, cication: 9

    Bo Zhao, Boya Wu, Tiejun Huang

  • GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest, arXiv, 2307.03601, arxiv, pdf, cication: 15

    Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo · [GPT4RoI - jshilong] Star

  • mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding, arXiv, 2307.02499, arxiv, pdf, cication: 5

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian

  • LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding, arXiv, 2306.17107, arxiv, pdf, cication: 12

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

  • Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic, arXiv, 2306.15195, arxiv, pdf, cication: 26

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao · [shikra - shikras] Star

  • Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language, arXiv, 2306.16410, arxiv, pdf, cication: 11

    William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh

  • PandaGPT: One Model To Instruction-Follow Them All, arXiv, 2305.16355, arxiv, pdf, cication: 42

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai · [panda-gpt.github] · [mp.weixin.qq]

  • Otter: A Multi-Modal Model with In-Context Instruction Tuning, arXiv, 2305.03726, arxiv, pdf, cication: 69

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu · [Otter - Luodian] Star · [mp.weixin.qq]

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, arXiv, 2306.05424, arxiv, pdf, cication: 30

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan · [video-chatgpt - mbzuai-oryx] Star

  • Revisiting the Role of Language Priors in Vision-Language Models, arXiv, 2306.01879, arxiv, pdf, cication: 4

    Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

  • Language Models are General-Purpose Interfaces, arXiv, 2206.06336, arxiv, pdf, cication: 51

    Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei

  • Flamingo: a Visual Language Model for Few-Shot Learning, NeurIPS, 2022, arxiv, pdf, cication: 989

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds

  • Kosmos-2: Grounding Multimodal Large Language Models to the World, arXiv, 2306.14824, arxiv, pdf, cication: 52

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei · [huggingface] · [unilm - microsoft] Star · [thegenerality]

  • Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration, arXiv, 2306.09093, arxiv, pdf, cication: 9

    Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu · [macaw-llm - lyuchenyang] Star

  • ImageBind: One Embedding Space To Bind Them All, CVPR, 2023, arxiv, pdf, cication: 36

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra · [facebookresearch.github] · [ImageBind - facebookresearch] Star

  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, arXiv, 2305.06500, arxiv, pdf, cication: 301

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi · [LAVIS - salesforce] Star

  • Language Is Not All You Need: Aligning Perception with Language Models, arXiv, 2302.14045, arxiv, pdf, cication: 133

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra

Audio

  • SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation, arXiv, 2401.13527, arxiv, pdf, cication: -1

    Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu · (speechgpt - 0nutation) Star

  • On the Audio Hallucinations in Large Audio-Video Language Models, arXiv, 2401.09774, arxiv, pdf, cication: -1

    Taichi Nishimura, Shota Nakada, Masayoshi Kondo

  • E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models, arXiv, 2401.00475, arxiv, pdf, cication: -1

    Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie

  • Boosting Large Language Model for Speech Synthesis: An Empirical Study, arXiv, 2401.00246, arxiv, pdf, cication: -1

    Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei

  • Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models, arXiv, 2312.03632, arxiv, pdf, cication: -1

    Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

  • WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words, arXiv, 2312.02931, arxiv, pdf, cication: -1

    Lukas Wolf, Greta Tuckute, Klemen Kotar, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt

  • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models, arXiv, 2311.07919, arxiv, pdf, cication: -1

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

    · (Qwen-Audio - QwenLM) Star

  • Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data, arXiv, 2311.06753, arxiv, pdf, cication: -1

    Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

  • Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency, arXiv, 2311.02772, arxiv, pdf, cication: -1

    Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

  • SALMONN: Towards Generic Hearing Abilities for Large Language Models, arXiv, 2310.13289, arxiv, pdf, cication: 1

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang · [SALMONN - bytedance] Star

  • LLaSM: Large Language and Speech Model, arXiv, 2308.15930, arxiv, pdf, cication: 2

    Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi · [llasm - linksoul-ai] Star

  • Prompting Large Language Models with Speech Recognition Abilities, arXiv, 2307.11795, arxiv, pdf, cication: 11

    Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli

  • Any-to-Any Generation via Composable Diffusion, arXiv, 2305.11846, arxiv, pdf, cication: -1

    Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal · (codi-gen.github) · (i-Code - microsoft) Star

Efficient

  • Small Language Model Meets with Reinforced Vision Vocabulary, arXiv, 2401.12503, arxiv, pdf, cication: -1

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang

Extra Modalities

  • Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action, arXiv, 2312.17172, arxiv, pdf, cication: -1

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

  • AI-Generated Content (AIGC) for Various Data Modalities: A Survey, arXiv, 2308.14177, arxiv, pdf, cication: -1

    Lin Geng Foo, Hossein Rahmani, Jun Liu

  • M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts, arXiv, 2312.10763, arxiv, pdf, cication: -1

    Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen

Projects

Benchmarks

  • Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences, arXiv, 2401.10529, arxiv, pdf, cication: -1

    Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal · (mementos-bench.github)

  • Leaderboards | LAMM

  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, arXiv, 2306.06687, arxiv, pdf, cication: -1

    Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang · (LAMM - OpenGVLab) Star

  • BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models, arXiv, 2312.02896, arxiv, pdf, cication: -1

    Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot · (aifeg.github) · (BenchLMM - AIFEG) Star

  • MVBench_Leaderboard - OpenGVLab 🤗

  • MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, arXiv, 2311.16502, arxiv, pdf, cication: -1

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun · (mmmu-benchmark.github) · (MMMU - MMMU-Benchmark) Star · (huggingface)

  • Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models, arXiv, 2311.16103, arxiv, pdf, cication: -1

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan · (video-bench - pku-yuangroup) Star

  • SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, arXiv, 2307.16125, arxiv, pdf, cication: 12

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

  • HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models, arXiv, 2310.14566, arxiv, pdf, cication: 3

    Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

  • VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use, arXiv, 2308.06595, arxiv, pdf, cication: 3

    Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt

  • On the Hidden Mystery of OCR in Large Multimodal Models, arXiv, 2305.07895, arxiv, pdf, cication: 29

    Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai · (MultimodalOCR - Yuliang-Liu) Star · (mp.weixin.qq)

Datasets

  • CapsFusion: Rethinking Image-Text Data at Scale, arXiv, 2310.20550, arxiv, pdf, cication: -1

    Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu

  • OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents, NeurIPS, 2023, arxiv, pdf, cication: 8

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela

  • DataComp: In search of the next generation of multimodal datasets, arXiv, 2304.14108, arxiv, pdf, cication: 71

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang · (datacomp)

Other

Reference

  • Awesome-LLM-3D - ActiveVisionLab Star

    Awesome-LLM-3D: a curated list of Multi-modal Large Language Model resources for the 3D world

  • awesome-foundation-and-multimodal-models - SkalskiP Star

    👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code]

  • Awesome-Multimodal-Assistant - zjr2000 Star

    Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience.

  • Awesome-Multimodal-Large-Language-Models - BradyFU Star

    ✨✨Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.