
DreamEngine is a unified framework that integrates multimodal encoders such as QwenVL with diffusion models through a two-stage training approach. It enables text-image interleaved control and achieves state-of-the-art performance when generating images from complex, concept-merged inputs.
Updates:
- 2025-03-03: Release checkpoint and a demo for text-guided object fusion.

Quick start:
bash setup.sh
# set the paths in demo.py before running it
python src/scripts/eval/demo.py
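
Because the demo depends on paths that must be edited by hand, it can help to confirm they exist before launching it. The snippet below is a minimal sketch of such a check, not part of the repository. The function name `find_missing_paths` and the entries `checkpoint` and `output_dir` are hypothetical placeholders, since the variables that demo.py actually defines are not listed here.

```python
from pathlib import Path


def find_missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names of entries whose path does not exist on disk."""
    return [name for name, location in paths.items() if not Path(location).exists()]


if __name__ == "__main__":
    # Hypothetical placeholders: substitute the path variables that demo.py really uses.
    configured = {
        "checkpoint": "/path/to/checkpoint",
        "output_dir": "/path/to/output",
    }
    missing = find_missing_paths(configured)
    if missing:
        print("Missing paths:", ", ".join(missing))
    else:
        print("All configured paths exist.")
```

Running this before `python src/scripts/eval/demo.py` would surface a mistyped path early instead of partway through model loading.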

If you find this work helpful, please cite:
@misc{chen2025multimodalrepresentationalignmentimage,
  title={Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think},
  author={Liang Chen and Shuai Bai and Wenhao Chai and Weichu Xie and Haozhe Zhao and Leon Vinci and Junyang Lin and Baobao Chang},
  year={2025},
  eprint={2502.20172},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20172},
}