MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

[2025/04/07] 🔥 We are proud to open-source MME-Unify, a comprehensive evaluation framework designed to systematically assess unified multimodal large language models (U-MLLMs). Our benchmark includes:

  • Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks and 30 subtasks, to ensure consistent and fair comparisons across studies.
  • Unified Task Assessment. We introduce five novel tasks that test mixed-modality reasoning, including image editing, commonsense QA with image generation, and geometric reasoning.
  • Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini-2-Flash-exp, alongside specialized understanding models (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3/2).

Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively.

Dataset Examples

Evaluation Pipeline

Prompt

The common prompts used in our evaluation for the different tasks can be found in:

MME-Unify/Prompt.txt

Dataset

You can download the images from our Hugging Face repository; the final directory structure should look like this:

MME-Unify
├── CommonSense_Questions
├── Conditional_Image_to_Video_Generation
├── Fine-Grained_Image_Reconstruction
├── Math_Reasoning
├── Multiple_Images_and_Text_Interlaced
├── Single_Image_Perception_and_Understanding
├── Spot_Diff
├── Text-Image_Editing
├── Text-Image_Generation
├── Text-to-Video_Generation
├── Video_Perception_and_Understanding
└── Visual_CoT
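
If it helps, the snippet below is a minimal sketch of downloading the images with the huggingface_hub library; the repo_id shown is a placeholder and should be replaced with the actual MME-Unify dataset repository on Hugging Face.

# Minimal sketch: download the MME-Unify images with huggingface_hub.
# NOTE: the repo_id below is a placeholder -- replace it with the actual
# MME-Unify dataset repository ID on Hugging Face.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="MME-Benchmarks/MME-Unify",  # placeholder repo ID
    repo_type="dataset",
    local_dir="MME-Unify",
)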

You can find the QA pairs in:

MME-Unify/Unify_Dataset

and the structure should look like this:

Unify_Dataset
├── Understanding
├── Generation
└── Unify_Capability
    ├── Auxiliary_Lines
    ├── Common_Sense_Question
    ├── Image_Editing_and_Explaning
    ├── SpotDiff
    └── Visual_CoT
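
As a rough illustration, the sketch below walks Unify_Dataset and collects the QA files, assuming they are stored as JSON; adjust the pattern to the actual file layout.

# Rough sketch: gather QA files under Unify_Dataset.
# Assumes the QA pairs are stored as JSON files; the actual layout may differ.
import json
from pathlib import Path

qa_root = Path("MME-Unify/Unify_Dataset")
qa_pairs = []
for json_file in sorted(qa_root.rglob("*.json")):
    with open(json_file, "r", encoding="utf-8") as f:
        qa_pairs.append(json.load(f))

print(f"Loaded {len(qa_pairs)} QA files")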

Evaluate

To extract answers and calculate scores, add the model responses to a JSON file; we provide an example template, output_test_template.json. Once you have prepared the model responses in this format (a sketch follows below), refer to the evaluation scripts in:

MME-Unify/evaluate
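
The sketch below is a hypothetical example of filling in model responses before running the evaluation scripts; the field names are illustrative only and should follow the structure defined in output_test_template.json.

# Hypothetical sketch: add model responses to the template JSON.
# The keys "question" and "response" are illustrative -- use the keys
# defined in output_test_template.json.
import json

def run_model(question):
    # Stand-in for your model's inference call.
    return "model answer here"

with open("output_test_template.json", "r", encoding="utf-8") as f:
    records = json.load(f)  # assumed here to be a list of question records

for record in records:
    record["response"] = run_model(record.get("question", ""))

with open("model_outputs.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)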

Dataset License

License:

MME-Unify is to be used for academic research only. Commercial use in any form is prohibited.
The copyright of all images belongs to the image owners.
If there is any infringement in MME-Unify, please email [email protected] and we will remove it immediately.
Without prior approval, you may not distribute, publish, copy, disseminate, or modify MME-Unify in whole or in part.
You must strictly comply with the above restrictions.

If you have any questions, please send an email to [email protected]. 🌟

Citation

If you find MME-Unify useful for your research or applications, please cite it using this BibTeX:

@article{xie2025mme,
  title={MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models},
  author={Xie, Wulin and Zhang, Yi-Fan and Fu, Chaoyou and Shi, Yang and Nie, Bingyan and Chen, Hongkai and Zhang, Zhang and Wang, Liang and Tan, Tieniu},
  journal={arXiv preprint arXiv:2504.03641},
  year={2025}
}

Related Works

Explore our related research:
