[2025/04/07] 🔥 We are proud to open-source MME-Unify, a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes:
- A Standardized Traditional Task Evaluation: We sample from 12 datasets, covering 10 tasks with 30 subtasks, to ensure consistent and fair comparisons across studies.
- A Unified Task Assessment: We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning.
- A Comprehensive Model Benchmarking: We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini-2-Flash-exp, alongside specialized understanding models (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3/2).
Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively.
The common prompts used for the different tasks in our evaluation can be found in:
MME-Unify/Prompt.txt
You can download the images from our Hugging Face repository; the final directory structure should look like this:
MME-Unify
├── CommonSense_Questions
├── Conditional_Image_to_Video_Generation
├── Fine-Grained_Image_Reconstruction
├── Math_Reasoning
├── Multiple_Images_and_Text_Interlaced
├── Single_Image_Perception_and_Understanding
├── Spot_Diff
├── Text-Image_Editing
├── Text-Image_Generation
├── Text-to-Video_Generation
├── Video_Perception_and_Understanding
└── Visual_CoT
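If you prefer to script the download, the sketch below uses `huggingface_hub.snapshot_download`; the repo id is a placeholder, so replace it with the dataset id shown on our Hugging Face page.

```python
# Minimal download sketch: pulls the image directories into ./MME-Unify so the
# layout matches the tree above. The repo id below is a placeholder, not the
# confirmed dataset id -- copy the real one from the Hugging Face page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org-or-user>/MME-Unify",  # placeholder
    repo_type="dataset",
    local_dir="MME-Unify",
)
```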
You can find the QA pairs in:
MME-Unify/Unify_Dataset
and the structure should look like this:
Unify_Dataset
├── Understanding
├── Generation
├── Unify_Capability
│   ├── Auxiliary_Lines
│   ├── Common_Sense_Question
│   ├── Image_Editing_and_Explaning
│   ├── SpotDiff
│   └── Visual_CoT
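As a rough sketch, the snippet below walks Unify_Dataset and loads every JSON file it finds; it assumes the QA pairs are stored as .json files and makes no assumption about their field names.

```python
import json
from pathlib import Path

# Walk every sub-directory of Unify_Dataset and load any JSON file found.
# Assumption: QA pairs are stored as .json files; adjust the glob pattern
# if your local copy is organized differently.
dataset_root = Path("MME-Unify/Unify_Dataset")

for qa_file in sorted(dataset_root.rglob("*.json")):
    with qa_file.open(encoding="utf-8") as f:
        qa_data = json.load(f)
    print(qa_file.relative_to(dataset_root), type(qa_data).__name__)
```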
To extract the answers and calculate the scores, add your model's responses to a JSON file. We provide an example template, output_test_template.json. Once you have prepared the model responses in this format, please refer to the evaluation scripts in:
MME-Unify/evaluate
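A minimal sketch of that workflow is shown below. It assumes the template is a JSON list of per-question records and writes the model answer into a "response" field; both the field name and the my_model_answer helper are illustrative, so check output_test_template.json for the actual keys.

```python
import json

def my_model_answer(record):
    # Placeholder: replace with a call to your own model.
    # The name and signature are illustrative only.
    return ""

# Load the provided template, attach your model's responses, and save the
# result before running the scripts in MME-Unify/evaluate.
# Assumption: the template is a list of dicts and the answer key is "response";
# use whatever keys output_test_template.json actually defines.
with open("output_test_template.json", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    record["response"] = my_model_answer(record)

with open("my_model_output.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```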
License:
MME-Unify is only used for academic research. Commercial use in any form is prohibited.
The copyright of all images belongs to the image owners.
If there is any infringement in MME-Unify, please email [email protected] and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify MME-Unify in whole or in part.
You must strictly comply with the above restrictions.
If you have any questions, please send an email to [email protected]. 🌟
If you find MME-Unify useful for your research and applications, please cite our paper using this BibTeX:
@article{xie2025mme,
  title={MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models},
  author={Xie, Wulin and Zhang, Yi-Fan and Fu, Chaoyou and Shi, Yang and Nie, Bingyan and Chen, Hongkai and Zhang, Zhang and Wang, Liang and Tan, Tieniu},
  journal={arXiv preprint arXiv:2504.03641},
  year={2025}
}
Explore our related research:
- [SliME] Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
- [VITA] VITA: Towards Open-Source Interactive Omni Multimodal LLM
- [Long-VITA] Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [Video-MME] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- [MME-RealWorld] Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- [MM-RLHF] MM-RLHF: The Next Step Forward in Multimodal LLM Alignment