Official repository for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?".
For more details, please refer to the project page with data examples: https://av-odyssey.github.io/.
[Webpage] [Paper] [AV-Odyssey Dataset] [Deaftest Dataset] [Leaderboard]
2024.12.22: AV-Odyssey can be evaluated on lmms-eval, thanks to kennymckormick and Luodian.
2024.11.24: We release AV-Odyssey, the first-ever comprehensive evaluation benchmark to explore whether MLLMs really understand audio-visual information.
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench. This benchmark encompasses 26 different tasks and 4,555 carefully crafted problems, each incorporating text, visual, and audio components. All data are newly collected and annotated by humans, not from any existing audio-visual dataset. AV-Odyssey Bench demonstrates three major features: 1. Comprehensive Audio Attributes; 2. Extensive Domains; 3. Interleaved Text, Audio, and Visual components.
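To make the two DeafTest questions concrete, here is a minimal sketch that synthesizes two sine tones and answers "which is louder?" and "which has the higher pitch?" with a few lines of signal arithmetic. It is an illustration only and is not the procedure used to build DeafTest: the sample rate, tone parameters, and all function names are assumptions made for this example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed value for this sketch)


def sine_tone(freq_hz, amplitude, duration_s=0.5):
    """Generate a pure sine tone as a list of float samples."""
    n = int(SAMPLE_RATE * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n)
    ]


def rms(samples):
    """Root-mean-square level, used here as a simple loudness proxy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples):
    """Rough pitch estimate from the number of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * SAMPLE_RATE / len(samples)


def louder(first, second):
    """Return which of two clips has the higher RMS level."""
    return "first" if rms(first) > rms(second) else "second"


def higher_pitch(first, second):
    """Return which of two clips has the higher estimated pitch."""
    if estimate_pitch_hz(first) > estimate_pitch_hz(second):
        return "first"
    return "second"


# Hypothetical clips: a loud low tone and a quiet high tone.
tone_a = sine_tone(freq_hz=220, amplitude=0.8)
tone_b = sine_tone(freq_hz=440, amplitude=0.3)

if __name__ == "__main__":
    print("Louder clip:", louder(tone_a, tone_b))
    print("Higher-pitched clip:", higher_pitch(tone_a, tone_b))
```

The zero-crossing estimate only works for clean, single-frequency tones; it is used here because it keeps the sketch dependency-free, not because it is a robust pitch tracker.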
Please refer to our project page, https://av-odyssey.github.io/, to explore more examples.
License:
AV-Odyssey may be used for academic research only. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.
If there is any infringement in AV-Odyssey, please email [email protected] and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify AV-Odyssey in whole or in part.
You must strictly comply with the above restrictions.
Please send an email to [email protected].
We provide example code for evaluating the Video-Llama model. You can put model-related code under the avlm_model folder.
- Download the AV-Odyssey data from [AV-Odyssey Dataset] and put it into a folder of your choice. In our code, we download the AV-Odyssey data into data.
- Download the pre-trained weights of the evaluated model. In our code, we download the Video-Llama weights into avlm_model_weight. You also need to install all packages required by the evaluated model.
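Before launching the evaluation, it can help to confirm that the folders named in the steps above are in place. The following is a minimal sketch, assuming that data, avlm_model, and avlm_model_weight all sit directly under the repository root; the helper name `missing_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the setup steps; their location relative to the
# repository root is an assumption. Adjust if you used different paths.
REQUIRED_DIRS = ("data", "avlm_model", "avlm_model_weight")


def missing_dirs(repo_root="."):
    """Return the required folder names that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```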
Then, run
python evaluation.py --model videollama
We specify the model in evaluate.py.
The results are written to avlm_results.
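Once results are available, a per-task breakdown is often more informative than a single score, since the benchmark spans 26 tasks. The sketch below shows one way to compute it. The format of the files in avlm_results is not described here, so the record layout (dictionaries with `task`, `prediction`, and `answer` keys) and both function names are hypothetical; adapt them to whatever the evaluation script actually writes.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy per task.

    `records` is an iterable of dicts with the hypothetical keys
    'task', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(records):
    """Compute accuracy over all records, or 0.0 if there are none."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        1
        for record in records
        if record["prediction"].strip().upper() == record["answer"].strip().upper()
    )
    return hits / len(records)


# Made-up records for illustration only; these are not benchmark results.
example_records = [
    {"task": "task_a", "prediction": "A", "answer": "A"},
    {"task": "task_a", "prediction": "b", "answer": "C"},
    {"task": "task_b", "prediction": " d ", "answer": "D"},
]

if __name__ == "__main__":
    print(per_task_accuracy(example_records))
    print(overall_accuracy(example_records))
```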
The leaderboard for AV-Odyssey is continuously updated, and we welcome submissions of your MLLMs!
If you find our work helpful for your research, please consider citing our work.
@misc{gong2024avodysseybenchmultimodalllms,
title={AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?},
author={Kaixiong Gong and Kaituo Feng and Bohao Li and Yibing Wang and Mofan Cheng and Shijia Yang and Jiaming Han and Benyou Wang and Yutong Bai and Zhuoran Yang and Xiangyu Yue},
year={2024},
eprint={2412.02611},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.02611},
}