-
MM-LLMs: Recent Advances in MultiModal Large Language Models,
arXiv, 2401.13601
, arxiv, pdf, cication: -1Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu
-
A Survey of Resource-efficient LLM and Multimodal Foundation Models,
arXiv, 2401.08092
, arxiv, pdf, cication: -1Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang
-
Distilling Vision-Language Models on Millions of Videos,
arXiv, 2401.06129
, arxiv, pdf, cication: -1Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong
-
LEGO: Language Enhanced Multi-modal Grounding Model,
arXiv, 2401.06071
, arxiv, pdf, cication: -1Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu · (LEGO - lzw-lzw)
-
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training,
arXiv, 2401.00849
, arxiv, pdf, cication: -1Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
-
General Object Foundation Model for Images and Videos at Scale,
arXiv, 2312.09158
, arxiv, pdf, cication: -1Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
-
Video Understanding with Large Language Models: A Survey,
arXiv, 2312.17432
, arxiv, pdf, cication: -1Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu · (Awesome-LLMs-for-Video-Understanding - yunlong10)
-
OneLLM: One Framework to Align All Modalities with Language,
arXiv, 2312.03700
, arxiv, pdf, cication: -1Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
-
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation,
arXiv, 2311.18775
, arxiv, pdf, cication: -1Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal · (i-Code - microsoft)
· (codi-2.github)
-
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models,
arXiv, 2311.17043
, arxiv, pdf, cication: -1Yanwei Li, Chengyao Wang, Jiaya Jia · (llama-vid - dvlab-research)
-
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding,
arXiv, 2311.08046
, arxiv, pdf, cication: -1Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan · (Chat-UniVi - PKU-YuanGroup)
· (huggingface) · (qbitai)
-
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities,
arXiv, 2311.05698
, arxiv, pdf, cication: -1AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
· (jiqizhixin)
-
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition,
arXiv, 2311.15599
, arxiv, pdf, cication: -1Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
-
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models,
arXiv, 2311.13435
, arxiv, pdf, cication: -1Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
-
VideoCon: Robust Video-Language Alignment via Contrast Captions,
arXiv, 2311.10111
, arxiv, pdf, cication: -1Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection,
arXiv, 2311.10122
, arxiv, pdf, cication: -1Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, Li Yuan
-
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding,
arXiv, 2306.02858
, arxiv, pdf, cication: 39Hang Zhang, Xin Li, Lidong Bing · [Video-LLaMA - DAMO-NLP-SG]
-
OmniLMM - OpenBMB
Large Multi-modal Models for Strong Performance and Efficient Deployment · (mp.weixin.qq)
-
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer,
arXiv, 2401.10208
, arxiv, pdf, cication: -1Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou · (jiqizhixin) · (MM-Interleaved - OpenGVLab)
-
Octavius: Mitigating Task Interference in MLLMs via MoE,
arXiv, 2311.02684
, arxiv, pdf, cication: 3Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao · (openlamm.github)
-
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study,
arXiv, 2401.17981
, arxiv, pdf, cication: -1Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
-
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization,
arXiv, 2401.15914
, arxiv, pdf, cication: -1Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang
-
MouSi: Poly-Visual-Expert Vision-Language Models,
arXiv, 2401.17221
, arxiv, pdf, cication: -1Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng
-
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model,
arXiv, 2401.02330
, arxiv, pdf, cication: -1Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang · (llava-phi - zhuyiche)
-
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model,
arXiv, 2401.16420
, arxiv, pdf, cication: -1Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao · (InternLM-XComposer - InternLM)
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models,
arXiv, 2401.15947
, arxiv, pdf, cication: -1Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan · (MoE-LLaVA - PKU-YuanGroup)
-
Small Language Model Meets with Reinforced Vision Vocabulary,
arXiv, 2401.12503
, arxiv, pdf, cication: -1Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang · (Vary-toy - Ucas-HaoranWei)
· (qbitai)
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions,
arXiv, 2311.12793
, arxiv, pdf, cication: 6Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin · (sharegpt4v.github) · (InternLM-XComposer - InternLM)
· (huggingface)
-
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities,
arXiv, 2401.14405
, arxiv, pdf, cication: -1Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue · (M2PT - AILab-CVC)
-
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models,
arXiv, 2401.13311
, arxiv, pdf, cication: -1Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng · (con-textual.github)
-
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities,
arXiv, 2401.12168
, arxiv, pdf, cication: -1Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia · (spatial-vlm.github)
-
Improving fine-grained understanding in image-text pre-training,
arXiv, 2401.09865
, arxiv, pdf, cication: -1Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu
-
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,
arXiv, 2401.06209
, arxiv, pdf, cication: -1Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie · (tsb0601.github)
· (MMVP - tsb0601)
-
GATS: Gather-Attend-Scatter,
arXiv, 2401.08525
, arxiv, pdf, cication: -1Konrad Zolna, Serkan Cabi, Yutian Chen, Eric Lau, Claudio Fantacci, Jurgis Pasukonis, Jost Tobias Springenberg, Sergio Gomez Colmenarejo
-
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model,
arXiv, 2401.02330
, arxiv, pdf, cication: -1Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang · (llava-phi - zhuyiche)
-
A Vision Check-up for Language Models,
arXiv, 2401.01862
, arxiv, pdf, cication: -1Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
-
DocLLM: A layout-aware generative language model for multimodal document understanding,
arXiv, 2401.00908
, arxiv, pdf, cication: -1Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu
-
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,
arXiv, 2312.14135
, arxiv, pdf, cication: -1Penghao Wu, Saining Xie · (vstar - penghao-wu)
-
Learning Vision from Models Rivals Learning Vision from Data,
arXiv, 2312.17742
, arxiv, pdf, cication: -1Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola
-
Parrot Captions Teach CLIP to Spot Text,
arXiv, 2312.14232
, arxiv, pdf, cication: -1Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
-
Harnessing Diffusion Models for Visual Perception with Meta Prompts,
arXiv, 2312.14733
, arxiv, pdf, cication: -1Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang · (meta-prompts - fudan-zvg)
· (mp.weixin.qq)
-
VCoder: Versatile Vision Encoders for Multimodal Large Language Models,
arXiv, 2312.14233
, arxiv, pdf, cication: -1Jitesh Jain, Jianwei Yang, Humphrey Shi
· (VCoder - SHI-Labs)
· (praeclarumjj3.github)
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,
arXiv, 2312.14238
, arxiv, pdf, cication: -1Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu
· (internvl - opengvlab)
· (internvl.opengvlab)
-
Generative Multimodal Models are In-Context Learners,
arXiv, 2312.13286
, arxiv, pdf, cication: -1Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang · (Emu - baaivision)
· (huggingface)
-
Osprey: Pixel Understanding with Visual Instruction Tuning,
arXiv, 2312.10032
, arxiv, pdf, cication: -1Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu · (osprey - circleradon)
· (huggingface)
-
Gemini: A Family of Highly Capable Multimodal Models,
arXiv, 2312.11805
, arxiv, pdf, cication: -1Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth
-
Silkie: Preference Distillation for Large Visual Language Models,
arXiv, 2312.10665
, arxiv, pdf, cication: -1Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong
-
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model,
arXiv, 2312.11370
, arxiv, pdf, cication: -1Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li
-
Merlin: Empowering Multimodal LLMs with Foresight Minds,
arXiv, 2312.00589
, arxiv, pdf, cication: -1En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang · (qbitai)
-
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation,
arXiv, 2312.09251
, arxiv, pdf, cication: -1Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
-
Honeybee: Locality-enhanced Projector for Multimodal LLM,
arXiv, 2312.06742
, arxiv, pdf, cication: -1Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
· (honeybee - kakaobrain)
-
Interfacing Foundation Models' Embeddings,
arXiv, 2312.07532
, arxiv, pdf, cication: -1Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan · (FIND - UX-Decoder)
-
VILA: On Pre-training for Visual Language Models,
arXiv, 2312.07533
, arxiv, pdf, cication: -1
-
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models,
arXiv, 2312.06109
, arxiv, pdf, cication: -1Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang
-
Generating Illustrated Instructions,
arXiv, 2312.04552
, arxiv, pdf, cication: -1Sachit Menon, Ishan Misra, Rohit Girdhar
-
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want,
arXiv, 2312.03818
, arxiv, pdf, cication: -1Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang · (aleafy.github)
-
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models,
arXiv, 2312.02949
, arxiv, pdf, cication: -1Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li · (LLaVA-Grounding - UX-Decoder)
-
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models,
arXiv, 2311.06783
, arxiv, pdf, cication: -1Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai · (Q-Instruct - Q-Future)
-
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding,
arXiv, 2311.16922
, arxiv, pdf, cication: -1Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
· (VCD - DAMO-NLP-SG)
-
ChartLlama: A Multimodal LLM for Chart Understanding and Generation,
arXiv, 2311.16483
, arxiv, pdf, cication: -1Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang · (tingxueronghua.github) · (jiqizhixin)
-
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,
arXiv, 2311.11860
, arxiv, pdf, cication: -1Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie · (JiuTian - rshaojimmy)
· (rshaojimmy.github) · (mp.weixin.qq)
-
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?,
arXiv, 2311.15732
, arxiv, pdf, cication: -1Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang · (GPT4Vis - whwu95)
-
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding,
arXiv, 2311.11810
, arxiv, pdf, cication: -1Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, Can Huang · (qbitai)
-
Merlin: Empowering Multimodal LLMs with Foresight Minds,
arXiv, 2312.00589
, arxiv, pdf, cication: -1En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang
-
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers,
arXiv, 2311.17136
, arxiv, pdf, cication: -1Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen · (UniIR - TIGER-AI-Lab)
· (tiger-ai-lab.github)
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions,
arXiv, 2311.12793
, arxiv, pdf, cication: -1Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin · (sharegpt4v.github) · (InternLM-XComposer - InternLM)
· (huggingface)
-
UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework,
arXiv, 2311.10125
, arxiv, pdf, cication: -1Chris Kelly, Luhui Hu, Cindy Yang, Yu Tian, Deshun Yang, Bang Yang, Zaoshan Huang, Zihao Li, Yuexian Zou · (SA-Segment-Anything - LHBuilder)
-
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models,
arXiv, 2311.06607
, arxiv, pdf, cication: -1Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai · (monkey - yuliang-liu)
-
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models,
arXiv, 2311.07575
, arxiv, pdf, cication: -1Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen · (LLaMA2-Accessory - Alpha-VLLM)
-
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning,
arXiv, 2311.07574
, arxiv, pdf, cication: -1Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang · (LVIS-INSTRUCT4V - X2FD)
-
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models,
arXiv, 2311.06783
, arxiv, pdf, cication: -1Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai · (q-future.github)
-
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks,
arXiv, 2311.06242
, arxiv, pdf, cication: -1Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
-
SILC: Improving Vision Language Pretraining with Self-Distillation,
arXiv, 2310.13355
, arxiv, pdf, cication: -1Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari
-
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model,
arXiv, 2311.05348
, arxiv, pdf, cication: -1Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
-
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents,
arXiv, 2311.05437
, arxiv, pdf, cication: -1Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu · (LLaVA-Plus-Codebase - LLaVA-VL)
· (huggingface)
-
NExT-Chat: An LMM for Chat, Detection and Segmentation,
arXiv, 2311.04498
, arxiv, pdf, cication: -1Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, Tat-Seng Chua
· (NExT-Chat - NExT-ChatV)
· [20d12192149d0748e2.gradio]
-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment,
arXiv, 2310.01852
, arxiv, pdf, cication: -1Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li · [LanguageBind - PKU-YuanGroup]
· [jiqizhixin]
-
OtterHD: A High-Resolution Multi-modality Model,
arXiv, 2311.04219
, arxiv, pdf, cication: -1Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu
-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,
arXiv, 2311.04257
, arxiv, pdf, cication: -1Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou · [mPLUG-Owl - X-PLUG]
-
GLaMM: Pixel Grounding Large Multimodal Model,
arXiv, 2311.03356
, arxiv, pdf, cication: -1Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
-
CogVLM: Visual Expert for Pretrained Language Models,
arXiv, 2311.03079
, arxiv, pdf, cication: -1Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song
-
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding,
arXiv, 2311.03354
, arxiv, pdf, cication: -1Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
-
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?,
arXiv, 2311.00047
, arxiv, pdf, cication: -1Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai
-
De-Diffusion Makes Text a Strong Cross-Modal Interface,
arXiv, 2311.00618
, arxiv, pdf, cication: -1Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
-
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing,
arXiv, 2311.00571
, arxiv, pdf, cication: -1Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li
-
Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V,
arXiv, 2310.19061
, arxiv, pdf, cication: -1Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun
-
MM-VID: Advancing Video Understanding with GPT-4V(ision),
arXiv, 2310.19773
, arxiv, pdf, cication: 1Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu
-
Woodpecker: Hallucination Correction for Multimodal Large Language Models,
arXiv, 2310.16045
, arxiv, pdf, cication: 1Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen · [woodpecker - bradyfu]
· [qbitai]
-
PaLI-3 Vision Language Models: Smaller, Faster, Stronger,
arXiv, 2310.09199
, arxiv, pdf, cication: 1Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski · [mp.weixin.qq]
-
Large Language Models are Visual Reasoning Coordinators,
arXiv, 2310.15166
, arxiv, pdf, cication: -1Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V,
arXiv, 2310.11441
, arxiv, pdf, cication: 2Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao · [SoM - microsoft]
· [jiqizhixin]
-
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models,
arXiv, 2310.08825
, arxiv, pdf, cication: -1Dongsheng Jiang, Yuchen Liu, Songlin Liu, Xiaopeng Zhang, Jin Li, Hongkai Xiong, Qi Tian · [comm - yuchenliu98]
-
PaLI-3 Vision Language Models: Smaller, Faster, Stronger,
arXiv, 2310.09199
, arxiv, pdf, cication: 1Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski · [jiqizhixin]
-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning,
arXiv, 2310.09478
, arxiv, pdf, cication: -1Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
-
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation,
arXiv, 2310.08541
, arxiv, pdf, cication: 1Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
-
Kosmos-G: Generating Images in Context with Multimodal Large Language Models,
arXiv, 2310.02992
, arxiv, pdf, cication: 1Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei
-
CogVLM: Visual Expert for Pretrained Language Models,
arXiv, 2311.03079
, arxiv, pdf, cication: -1Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song · [CogVLM - THUDM]
· [qbitai]
-
Making LLaMA SEE and Draw with SEED Tokenizer,
arXiv, 2310.01218
, arxiv, pdf, cication: 1Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan · [SEED - AILab-CVC]
-
Improved Baselines with Visual Instruction Tuning,
arXiv, 2310.03744
, arxiv, pdf, cication: 11Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee · [mp.weixin.qq] · [llava-vl.github]
-
Investigating the Catastrophic Forgetting in Multimodal Large Language Models,
arXiv, 2309.10313
, arxiv, pdf, cication: 4Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma · [mp.weixin.qq]
-
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens,
arXiv, 2310.02239
, arxiv, pdf, cication: 3Kaizhi Zheng, Xuehai He, Xin Eric Wang · [minigpt-5 - eric-ai-lab]
-
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency,
arXiv, 2310.03734
, arxiv, pdf, cication: -1Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan
-
AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models,
arXiv, 2309.16414
, arxiv, pdf, cication: -1Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi
-
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition,
arXiv, 2309.15112
, arxiv, pdf, cication: 2Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan · [mp.weixin.qq] · [internlm-xcomposer - internlm]
-
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model,
arXiv, 2309.16058
, arxiv, pdf, cication: 4Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu · [jiqizhixin]
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision),
arXiv, 2309.17421
, arxiv, pdf, cication: 20Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang · [qbitai]
-
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention,
arXiv, 2309.14327
, arxiv, pdf, cication: -1Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
-
Kosmos-2.5: A Multimodal Literate Model,
arXiv, 2309.11419
, arxiv, pdf, cication: 1Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo
-
DreamLLM: Synergistic Multimodal Comprehension and Creation,
arXiv, 2309.11499
, arxiv, pdf, cication: 4Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei
-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants,
arXiv, 2309.10020
, arxiv, pdf, cication: 14Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
-
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild,
arXiv, 2309.08637
, arxiv, pdf, cication: -1Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi
-
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models,
arXiv, 2309.09958
, arxiv, pdf, cication: 2Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
-
NExT-GPT: Any-to-Any Multimodal LLM,
arXiv, 2309.05519
, arxiv, pdf, cication: 12Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
-
DreamLLM: Synergistic Multimodal Comprehension and Creation,
arXiv, 2309.11499
, arxiv, pdf, cication: 4Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei
· [DreamLLM - RunpeiDong]
-
ImageBind-LLM: Multi-modality Instruction Tuning,
arXiv, 2309.03905
, arxiv, pdf, cication: 4Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo
-
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning,
arXiv, 2309.02591
, arxiv, pdf, cication: 12Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin
-
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior,
arXiv, 2309.00359
, arxiv, pdf, cication: -1Ashmit Khandelwal, Aditya Agrawal, Aanisha Bhattacharyya, Yaman K Singla, Somesh Singh, Uttaran Bhattacharya, Ishita Dasgupta, Stefano Petrangeli, Rajiv Ratn Shah, Changyou Chen
-
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following,
arXiv, 2309.00615
, arxiv, pdf, cication: 4Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li
-
PointLLM: Empowering Large Language Models to Understand Point Clouds,
arXiv, 2308.16911
, arxiv, pdf, cication: 3Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin · [pointllm - openrobotlab]
-
InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4,
arXiv, 2308.12067
, arxiv, pdf, cication: 5Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun · [minigpt-v2.github] · [jiqizhixin]
-
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions,
arXiv, 2308.09936
, arxiv, pdf, cication: 3Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu · [bliva - mlpc-ucsd]
-
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond,
arXiv, 2308.12966
, arxiv, pdf, cication: 17Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou · [Qwen-VL - QwenLM]
· [jiqizhixin]
-
Generating Images with Multimodal Language Models,
arXiv, 2305.17216
, arxiv, pdf, cication: 27Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov · [mp.weixin.qq]
-
Link-Context Learning for Multimodal LLMs,
arXiv, 2308.07891
, arxiv, pdf, cication: -1Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, Ziwei Liu · [Link-Context-Learning - isekai-portal]
-
Visual Instruction Tuning,
arXiv, 2304.08485
, arxiv, pdf, cication: 301Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee · [llava-vl.github]
-
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions,
arXiv, 2308.04152
, arxiv, pdf, cication: 2Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang · [cheetah - dcdmllm]
-
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models,
arXiv, 2308.01390
, arxiv, pdf, cication: 31Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa · [open_flamingo - mlfoundations]
-
UniVTG: Towards Unified Video-Language Temporal Grounding,
ICCV, 2023
, arxiv, pdf, cication: -1Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou · [UniVTG - showlab]
-
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection,
arXiv, 2307.16888
, arxiv, pdf, cication: 2Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin
-
Unified Model for Image, Video, Audio and Language Tasks,
arXiv, 2307.16184
, arxiv, pdf, cication: 6Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord
-
3D-LLM: Injecting the 3D World into Large Language Models,
arXiv, 2307.12981
, arxiv, pdf, cication: 10Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
-
Meta-Transformer: A Unified Framework for Multimodal Learning,
arXiv, 2307.10802
, arxiv, pdf, cication: 23Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue
-
Augmenting CLIP with Improved Visio-Linguistic Reasoning,
arXiv, 2307.09233
, arxiv, pdf, cication: -1Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi
-
Planting a SEED of Vision in Large Language Model,
arXiv, 2307.08041
, arxiv, pdf, cication: 7Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
-
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs,
arXiv, 2307.08581
, arxiv, pdf, cication: 11Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang · [bubogpt - magic-research]
-
Vision-Language Models for Vision Tasks: A Survey,
arXiv, 2304.00685
, arxiv, pdf, cication: 27Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu · [vlm_survey - jingyi0000]
-
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?,
arXiv, 2307.02469
, arxiv, pdf, cication: 10Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong · [jiqizhixin] · [lynx-llm - bytedance]
-
Generative Pretraining in Multimodality,
arXiv, 2307.05222
, arxiv, pdf, cication: 19Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
-
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone,
ICCV, 2023
, arxiv, pdf, cication: 2Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang
-
SVIT: Scaling up Visual Instruction Tuning,
arXiv, 2307.04087
, arxiv, pdf, cication: 9Bo Zhao, Boya Wu, Tiejun Huang
-
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest,
arXiv, 2307.03601
, arxiv, pdf, cication: 15Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo · [GPT4RoI - jshilong]
-
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding,
arXiv, 2307.02499
, arxiv, pdf, cication: 5Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian
-
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding,
arXiv, 2306.17107
, arxiv, pdf, cication: 12Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun
-
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic,
arXiv, 2306.15195
, arxiv, pdf, cication: 26Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao · [shikra - shikras]
-
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language,
arXiv, 2306.16410
, arxiv, pdf, cication: 11William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh
-
PandaGPT: One Model To Instruction-Follow Them All,
arXiv, 2305.16355
, arxiv, pdf, cication: 42Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai · [panda-gpt.github] · [mp.weixin.qq]
-
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration,
arXiv, 2306.09093
, arxiv, pdf, cication: 9Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning,
arXiv, 2305.03726
, arxiv, pdf, cication: 69Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu · [Otter - Luodian]
· [mp.weixin.qq]
-
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models,
arXiv, 2306.05424
, arxiv, pdf, cication: 30Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan · [video-chatgpt - mbzuai-oryx]
-
Revisiting the Role of Language Priors in Vision-Language Models,
arXiv, 2306.01879
, arxiv, pdf, cication: 4Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan
-
Language Models are General-Purpose Interfaces,
arXiv, 2206.06336
, arxiv, pdf, cication: 51Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei
-
Flamingo: a Visual Language Model for Few-Shot Learning,
NeurIPS, 2022
, arxiv, pdf, cication: 989Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds
-
Kosmos-2: Grounding Multimodal Large Language Models to the World,
arXiv, 2306.14824
, arxiv, pdf, cication: 52Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei · [huggingface] · [unilm - microsoft]
· [thegenerality]
-
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration,
arXiv, 2306.09093
, arxiv, pdf, cication: 9Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu · [macaw-llm - lyuchenyang]
-
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding,
arXiv, 2306.02858
, arxiv, pdf, cication: 39Hang Zhang, Xin Li, Lidong Bing · [Video-LLaMA - DAMO-NLP-SG]
-
ImageBind: One Embedding Space To Bind Them All,
CVPR, 2023
, arxiv, pdf, cication: 36Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra · [facebookresearch.github] · [ImageBind - facebookresearch]
-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,
arXiv, 2305.06500
, arxiv, pdf, cication: 301Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi · [LAVIS - salesforce]
-
Language Is Not All You Need: Aligning Perception with Language Models,
arXiv, 2302.14045
, arxiv, pdf, cication: 133Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra
-
SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation,
arXiv, 2401.13527
, arxiv, pdf, cication: -1Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu · (speechgpt - 0nutation)
-
On the Audio Hallucinations in Large Audio-Video Language Models,
arXiv, 2401.09774
, arxiv, pdf, cication: -1Taichi Nishimura, Shota Nakada, Masayoshi Kondo
-
E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models,
arXiv, 2401.00475
, arxiv, pdf, cication: -1Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie
-
Boosting Large Language Model for Speech Synthesis: An Empirical Study,
arXiv, 2401.00246
, arxiv, pdf, cication: -1Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei
-
Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models,
arXiv, 2312.03632
, arxiv, pdf, cication: -1Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
-
WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words,
arXiv, 2312.02931
, arxiv, pdf, cication: -1Lukas Wolf, Greta Tuckute, Klemen Kotar, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt
-
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models,
arXiv, 2311.07919
, arxiv, pdf, cication: -1Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou
· (Qwen-Audio - QwenLM)
-
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data,
arXiv, 2311.06753
, arxiv, pdf, cication: -1Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
-
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency,
arXiv, 2311.02772
, arxiv, pdf, cication: -1Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel
-
SALMONN: Towards Generic Hearing Abilities for Large Language Models,
arXiv, 2310.13289
, arxiv, pdf, cication: 1Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang · [SALMONN - bytedance]
-
LLaSM: Large Language and Speech Model,
arXiv, 2308.15930
, arxiv, pdf, cication: 2Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi · [llasm - linksoul-ai]
-
Prompting Large Language Models with Speech Recognition Abilities,
arXiv, 2307.11795
, arxiv, pdf, cication: 11Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli
-
Any-to-Any Generation via Composable Diffusion,
arXiv, 2305.11846
, arxiv, pdf, cication: -1Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal · (codi-gen.github) · (i-Code - microsoft)
-
Small Language Model Meets with Reinforced Vision Vocabulary,
arXiv, 2401.12503
, arxiv, pdf, cication: -1Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang
-
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action,
arXiv, 2312.17172
, arxiv, pdf, cication: -1Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
-
AI-Generated Content (AIGC) for Various Data Modalities: A Survey,
arXiv, 2308.14177
, arxiv, pdf, cication: -1Lin Geng Foo, Hossein Rahmani, Jun Liu
-
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts,
arXiv, 2312.10763
, arxiv, pdf, cication: -1Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen
-
imp - MILVLG
a family of multimodal small language models · (huggingface) · (xmbot)
-
reading-analog-gauge - Synanthropic 🤗
-
moondream - vikhyat
tiny vision language model · (huggingface)
-
FireLLaVA-13b - fireworks-ai 🤗
· (app.fireworks)
-
Yi-VL-6B - 01-ai 🤗
-
hermes-llava - qnguyen3
· (huggingface)
-
multimodal-maestro - roboflow
Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA. 🔥
-
mic - haozhezhao
MMICL, a state-of-the-art VLM from PKU with in-context learning ability
-
awesome-openai-vision-api-experiments - roboflow
Examples showing how to use the OpenAI vision API to run inference on images, video files and webcam streams
-
fuyu-8b - adept 🤗
· [qbitai]
-
Mini-DALLE3 - Zeqiang-Lai
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
-
idefics-80b-instruct - HuggingFaceM4 🤗
· (huggingface)
-
LLaMA2-Accessory - Alpha-VLLM
An Open-source Toolkit for LLM Development
-
open_flamingo - mlfoundations
An open-source framework for training large multimodal models.
-
BakLLaVA-1 - SkunkworksAI 🤗
-
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences,
arXiv, 2401.10529
, arxiv, pdf, cication: -1Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal · (mementos-bench.github)
-
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark,
arXiv, 2306.06687
, arxiv, pdf, cication: -1Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang · (LAMM - OpenGVLab)
-
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models,
arXiv, 2312.02896
, arxiv, pdf, cication: -1Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot · (aifeg.github) · (BenchLMM - AIFEG)
-
MVBench_Leaderboard - OpenGVLab 🤗
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,
arXiv, 2311.16502
, arxiv, pdf, cication: -1Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun · (mmmu-benchmark.github) · (MMMU - MMMU-Benchmark)
· (huggingface)
-
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models,
arXiv, 2311.16103
, arxiv, pdf, cication: -1Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan · (video-bench - pku-yuangroup)
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension,
arXiv, 2307.16125
, arxiv, pdf, cication: 12Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
-
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models,
arXiv, 2310.14566
, arxiv, pdf, cication: 3Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
-
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use,
arXiv, 2308.06595
, arxiv, pdf, cication: 3Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt
-
On the Hidden Mystery of OCR in Large Multimodal Models,
arXiv, 2305.07895
, arxiv, pdf, cication: 29Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai · (MultimodalOCR - Yuliang-Liu)
· (mp.weixin.qq)
-
CapsFusion: Rethinking Image-Text Data at Scale,
arXiv, 2310.20550
, arxiv, pdf, cication: -1Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu
-
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents,
NeurIPS, 2023
, arxiv, pdf, cication: 8Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela
-
DataComp: In search of the next generation of multimodal datasets,
arXiv, 2304.14108
, arxiv, pdf, cication: 71Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang · (datacomp)
-
Multimodal Neurons in Pretrained Text-Only Transformers,
ICCV, 2023
, arxiv, pdf, cication: -1Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba
-
Awesome-LLM-3D - ActiveVisionLab
Awesome-LLM-3D: a curated list of Multi-modal Large Language Model in 3D world Resources
-
awesome-foundation-and-multimodal-models - SkalskiP
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code]
-
Awesome-Multimodal-Assistant - zjr2000
Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience.
-
Awesome-Multimodal-Large-Language-Models - BradyFU
✨✨Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.