
AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?


Official repository for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?".

🌟 For more details, please refer to the project page with data examples: https://av-odyssey.github.io/.

[🌐 Webpage] [📖 Paper] [🤗 AV-Odyssey Dataset] [🤗 Deaftest Dataset] [🏆 Leaderboard]


🔥 News

  • 2024.12.22 🌟 AV-Odyssey can be evaluated on lmms-eval, thanks to kennymckormick and Luodian.
  • 2024.11.24 🌟 We release AV-Odyssey, the first-ever comprehensive evaluation benchmark to explore whether MLLMs really understand audio-visual information.

👀 About AV-Odyssey

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench. This benchmark encompasses 26 different tasks and 4,555 carefully crafted problems, each incorporating text, visual, and audio components. All data are newly collected and annotated by humans, not from any existing audio-visual dataset. AV-Odyssey Bench demonstrates three major features: 1. Comprehensive Audio Attributes; 2. Extensive Domains; 3. Interleaved Text, Audio, and Visual components.
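
To make the two DeafTest tasks concrete, here is a minimal, hypothetical sketch (not code from this repository or dataset) of how a loudness-comparison item could be synthesized: two tones at the same pitch, one with a clearly larger amplitude.

```python
# Hypothetical DeafTest-style loudness-comparison item (illustration only,
# not the actual AV-Odyssey data-generation code).
import wave
import numpy as np

SAMPLE_RATE = 16000  # Hz, assumed

def write_tone(path, freq_hz, amplitude, duration_s=2.0):
    """Write a mono 16-bit PCM sine tone to a WAV file."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    signal = amplitude * np.sin(2 * np.pi * freq_hz * t)
    pcm = (np.clip(signal, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)      # 16-bit samples
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm.tobytes())

# Same pitch (440 Hz), different loudness: sound A is clearly louder.
write_tone("sound_a.wav", freq_hz=440, amplitude=0.8)
write_tone("sound_b.wav", freq_hz=440, amplitude=0.2)
question = "Which of the two sounds is louder? (A) sound_a (B) sound_b"
```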

📐 Data Examples

Please refer to our project page https://av-odyssey.github.io/ to explore more examples.

๐Ÿ“AV-Odyssey Bench

🔍 Dataset

License:

  • AV-Odyssey may be used only for academic research. Commercial use in any form is prohibited.
  • The copyright of all videos belongs to the video owners.
  • If there is any infringement in AV-Odyssey, please email [email protected] and we will remove it immediately.
  • Without prior approval, you may not distribute, publish, copy, disseminate, or modify AV-Odyssey in whole or in part.
  • You must strictly comply with the above restrictions.

Please send an email to [email protected]. 🌟

🔮 Evaluation Pipeline

Run Evaluation on AV-Odyssey

We provide example code for evaluating the Video-LLaMA model. You can put model-related code under the avlm_model folder; a rough sketch of such a wrapper is given below.
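
The following is only an illustration of what a wrapper under avlm_model might look like, assuming a multiple-choice interface over interleaved text, image, and audio inputs; the class and method names are hypothetical, not the repository's actual API.

```python
# Hypothetical model wrapper for avlm_model/ (names are illustrative,
# not the actual AV-Odyssey code).
from typing import List

class VideoLlamaWrapper:
    def __init__(self, weight_dir: str = "avlm_model_weight"):
        # Load the pre-trained Video-LLaMA checkpoint from weight_dir here.
        self.weight_dir = weight_dir

    def generate(self, question: str, image_paths: List[str], audio_paths: List[str]) -> str:
        """Return the predicted option (e.g. 'A'/'B'/'C'/'D') for one
        interleaved text/image/audio multiple-choice question."""
        raise NotImplementedError("Plug in the actual Video-LLaMA inference call here.")
```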

  1. Download the AV-Odyssey data from [🤗 AV-Odyssey Dataset] and put it into your specified folder. In our code, we download the AV-Odyssey data into the data folder.

  2. Download the pre-trained weights of the evaluated model. In our code, we download the Video-LLaMA weights into avlm_model_weight. You also need to install all the packages required by the evaluated model.

Then, run

python evaluation.py --model videollama

The model is specified via the --model argument in evaluation.py.

The results will be collected in the avlm_results folder.
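
For orientation, a driver along the following lines could loop over the benchmark questions, query such a wrapper, and write predictions plus accuracy into avlm_results; the question fields and output file name here are assumptions, not the repository's actual evaluation.py.

```python
# Hypothetical evaluation loop; question fields and the output layout are
# assumptions for illustration, not the repository's actual evaluation.py.
import json
import os

def run_eval(model, questions, out_dir="avlm_results"):
    os.makedirs(out_dir, exist_ok=True)
    records = []
    for q in questions:  # each q is assumed to carry an id, text, media paths, and the gold answer
        pred = model.generate(q["question"], q["images"], q["audios"])
        records.append({"id": q["id"], "prediction": pred, "answer": q["answer"]})
    accuracy = sum(r["prediction"] == r["answer"] for r in records) / max(len(records), 1)
    with open(os.path.join(out_dir, "videollama_results.json"), "w") as f:
        json.dump({"accuracy": accuracy, "records": records}, f, indent=2)
    return accuracy
```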

๐Ÿ† Leaderboard

Contributing to the AV-Odyssey Leaderboard

🚨 The AV-Odyssey leaderboard is continuously updated, and we welcome contributions from your excellent MLLMs!

✒️ Citation

If you find our work helpful for your research, please consider citing:

@misc{gong2024avodysseybenchmultimodalllms,
      title={AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?}, 
      author={Kaixiong Gong and Kaituo Feng and Bohao Li and Yibing Wang and Mofan Cheng and Shijia Yang and Jiaming Han and Benyou Wang and Yutong Bai and Zhuoran Yang and Xiangyu Yue},
      year={2024},
      eprint={2412.02611},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02611}, 
}
