How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
About the project 🧮
- This project started in June 2024. It began as a multi-image benchmark, but we found that format oversimplified the task and failed to enable the flexible interaction that human players have in a real escape game. Since August 2024, we have been designing an interactable 3D environment together with the Legent team.
About the team 👩🏻🎓🧑🏻🎓🧑🏻🎓🧑🏻🎓🧑🏻🎓🧑🏻🎓🧑🏻🏫🧑🏻🏫
- We are students from THUMT & THUNLP (Tsinghua University) and Fudan University, working part-time on this project. (This is why it took so long to release. 😣)
- As experienced escape game players, we are curious about how MLLMs would perform in such an environment.
- We are currently planning a second version. If you are interested in our project, feel free to contact us. (✉️email)
- We live to enjoy life, not just to work.
- Install the required packages of EscapeCraft as follows:

```shell
git clone https://github.com/THUNLP-MT/EscapeCraft.git
cd EscapeCraft
conda create -n mm-escape python=3.11
conda activate mm-escape
pip install -r requirements.txt
```
- Download the Legent client and environment

For detailed instructions on installing Legent, please follow Hugging Face or Tsinghua Cloud. After downloading the client and environment, unzip the files to create the following file structure:

```
src/
└── .legent/
    └── env/
        ├── client
        │   └── LEGENT-<platform>-<version>
        └── env_data/
            └── env_data-<version>
```

Please refer to LEGENT if you encounter any issues.
Our EscapeCraft is extensible and can be customized by modifying the configs in `src/config.py` according to your requirements. Please try our pre-defined settings, or customize your own settings by following the instructions below:
For direct usage:
- The MM-Escape benchmark used in our paper is provided in the `levels/` dir.
- Users can directly play with our pre-defined settings.
For customization:
- Please prepare two types of files: the level file and the scene file. Users can refer to the structure of our json files (in the `levels/` dir) to configure their own data.
- For the level file, users should define the key props and the way to get out (e.g., unlocking the door with a key, or unlocking the door with a password).
- For the scene file, users should specify the object models used in the scene. If an object is not included in our repo, please download the required object model and place it under the `prefabs/` dir.
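As a rough illustration of what a level file encodes, the sketch below builds a minimal level definition in Python and dumps it as JSON. The field names (`props`, `exit`, `unlock_with`, etc.) are our assumptions for illustration only, not the repo's actual schema — always copy the real structure from the json files in `levels/`.

```python
import json

# Hypothetical level definition: one key prop and a key-locked exit.
# All field names here are illustrative assumptions, NOT the real schema.
level = {
    "props": [
        {"name": "key", "location": "drawer"},  # key prop needed to escape
    ],
    "exit": {
        "type": "door",
        "unlock_with": "key",  # alternatively, a password-based exit
    },
}

print(json.dumps(level, indent=2))
```

A password-based variant would swap the `exit` entry accordingly; the point is simply that the level file ties the key props to the escape condition.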
```shell
cd src/scripts
python generate_scene.py --setting_path path/to/levels
```

The scene will then be saved automatically in `levels/level_name/`.
To load a generated scene:

```shell
cd src/scripts
python load_scene.py --scene_path path/to/levels
```
The options for evaluation are listed as follows:
```
usage: main.py [-h] [--level LEVEL] [--model MODEL] [--room_id ROOM_ID] [--record_path RECORD_PATH] [--history_type HISTORY_TYPE] [--hint]
               [--max_history MAX_HISTORY] [--max_retry MAX_RETRY]

options:
  -h, --help            show this help message and exit
  --level LEVEL         level name
  --model MODEL         model name
  --room_id ROOM_ID     generated room_id of level "LEVEL"
  --record_path RECORD_PATH
                        record path to load
  --history_type HISTORY_TYPE
                        history type, asserted in full, key, max
  --hint                whether to use hint
  --max_history MAX_HISTORY
                        max history length (you need to *set history_type to "max"* to enable this max history length setting)
  --max_retry MAX_RETRY
                        max retry times
```
For example, you can load the third scene generated for level3 (aka "Difficulty-3" in our paper) and evaluate the model gpt-4o with the history type `full`:

```shell
cd src
python main.py --level level3 --room_id 3 --model gpt-4o --history_type full
```
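To sweep several models or levels, the command above can be scripted. The sketch below only assembles the `main.py` invocation from the documented flags; the model names in the loop are placeholders, not a claim about which models the repo supports.

```python
# Build the evaluation command for a level/model pair using only the
# flags documented in the usage text above.
def build_command(level, room_id, model, history_type="full"):
    return [
        "python", "main.py",
        "--level", level,
        "--room_id", str(room_id),
        "--model", model,
        "--history_type", history_type,
    ]

if __name__ == "__main__":
    for model in ["gpt-4o", "gpt-4o-mini"]:  # placeholder model names
        cmd = build_command("level3", 3, model)
        print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, cwd="src", check=True)` to launch the evaluation from the repo root.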
To load a recorded history, run this command:

```shell
cd src
python main.py --level level3 --room_id 3 --model record --history_type full --record_path path/to/record
```

This is for visualizing a complete escape history, or for restoring an unfinished game (to continue running it).
Coming soon!
If you find this repository useful, please cite our paper:
```bibtex
@misc{wang2025multimodallargelanguagemodels,
      title={How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game},
      author={Ziyue Wang and Yurui Dong and Fuwen Luo and Minyuan Ruan and Zhili Cheng and Chi Chen and Peng Li and Yang Liu},
      year={2025},
      eprint={2503.10042},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10042},
}
```