Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think!

chenllliang/DreamEngine

DreamEngine

DreamEngine is a unified framework that connects multimodal encoders such as QwenVL to diffusion models through a two-stage training approach. It enables text-image interleaved control and achieves state-of-the-art performance when generating images from complex, concept-merged inputs.
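
As a rough illustration of what connecting a multimodal encoder to a diffusion model can involve, the sketch below projects a sequence of interleaved text and image token embeddings into the dimension a diffusion model might expect for conditioning. This is not DreamEngine's actual code: the dimensions, the single linear projection, and the names `make_projection` and `project_tokens` are hypothetical, and random vectors stand in for real encoder outputs.

```python
import random

# Hypothetical sizes, chosen small for readability (not the real model dimensions).
ENCODER_DIM = 8      # width of the multimodal encoder's hidden states (assumed)
CONDITION_DIM = 4    # width the diffusion model expects for conditioning (assumed)


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised weight matrix of shape (out_dim, in_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project_tokens(hidden_states, weight):
    """Map each encoder token vector (in_dim) to a conditioning vector (out_dim)."""
    return [[sum(w * h for w, h in zip(row, token)) for row in weight]
            for token in hidden_states]


# Stand-in for encoder outputs over an interleaved sequence,
# e.g. [text, image, image, text, text] tokens.
rng = random.Random(1)
interleaved = [[rng.random() for _ in range(ENCODER_DIM)] for _ in range(5)]

weight = make_projection(ENCODER_DIM, CONDITION_DIM)
conditioning = project_tokens(interleaved, weight)

# One conditioning vector per input token, regardless of its modality.
print(len(conditioning), len(conditioning[0]))
```

The point of the sketch is only that text and image tokens share one representation space after the encoder, so a single mapping can feed both to the generator. How DreamEngine actually builds and trains this connection is described in the paper cited below.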

demo.mp4

Updates:

  • 2025-03-03: Released the checkpoint and a demo for text-guided object fusion.

Run the Demo locally

bash setup.sh

# set up the paths in demo.py before running it
python src/scripts/eval/demo.py

Model Structure

[Figure: screenshot of the model structure]

Training

[Figure: screenshot illustrating training]

Demos

[Figures: three demo screenshots]

Citation

If you find this work helpful, please cite:

@misc{chen2025multimodalrepresentationalignmentimage,
      title={Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think}, 
      author={Liang Chen and Shuai Bai and Wenhao Chai and Weichu Xie and Haozhe Zhao and Leon Vinci and Junyang Lin and Baobao Chang},
      year={2025},
      eprint={2502.20172},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20172}, 
}
