OpenEnvision/Awesome-Multimodal-Modeling
Awesome Multimodal Modeling

A Comprehensive Survey & Curated List of Multimodal Modeling
From Traditional Fusion to Native & Unified Architectures

License: CC0 1.0

Overview · Traditional · MLLMs · UMMs · NMMs · Closed Source Models · Resources

📢 News

  • [2026-04-13] ⭐ The repository has already gained over 100 stars in just one day! Thank you all for the incredible support. We will keep updating this list with more cutting-edge models and resources. Your continued stars and PRs are warmly welcomed!
  • [2026-04-12] 🎉 We are excited to launch Awesome Multimodal Modeling — a curated reading list organized by architectural paradigms. A comprehensive survey paper is coming soon! Stay tuned.

Table of Contents

Browse the list

About This List

In this section: At a Glance · Curation Principles

This repository provides a structured, community-maintained survey of multimodal models, covering the full evolutionary arc from early fusion methods to today's natively-trained omni-models. We emphasize precise architectural definitions and classification, especially for the often-conflated categories of Unified Multimodal Models (UMMs) and Native Multimodal Models (NMMs).

Scope: Primary focus on image + text modalities; audio/video/3D are annotated where present. Omni/any-to-any models are marked with Omni.

At a Glance

| Dimension | Coverage |
|---|---|
| Primary scope | Image + text multimodal models, with explicit annotations for video, audio, and omni extensions |
| Core taxonomy | Traditional multimodal models, MLLMs, UMMs, and strict NMMs |
| Key distinction | U+G unification for UMMs vs. joint training from scratch for NMMs |
| What makes this repo different | Architecture-first categorization, fusion-aware definitions, and curated links to adjacent awesome lists |
| Intended audience | Researchers, students, and engineers building or surveying multimodal systems |

Curation Principles

| Principle | Rule |
|---|---|
| Source quality | Prefer official conference proceedings, OpenReview, ACL Anthology, CVF Open Access, arXiv, and official project pages |
| Classification policy | Category assignment is based on this repository's architecture-first definitions, which may differ from authors' own branding |
| Venue policy | If a peer-reviewed venue is known, we list that venue; otherwise we keep the entry as arXiv |
| Scope discipline | Models, benchmarks, datasets, and analysis papers are tracked separately to avoid mixing artifacts |
| Inclusion bar | We prioritize landmark papers, broadly adopted benchmarks, open implementations, or papers that clarify important taxonomy boundaries |

Classification note: for ambiguous models sitting between MLLM, UMM, and strict NMM, this list records the category that best matches the training recipe and architectural coupling, not just the paper title.

Back to Top


1. Introduction & Definitions

In this section: 1.1 Multimodal Model Evolution Stages · 1.2 Scope & Taxonomy · 1.3 Architecture Diagrams

1.1 Multimodal Model Evolution Stages

Subtopics: Traditional Multimodal Models · Multimodal Large Language Models (MLLMs) · Unified Multimodal Models (UMMs) · Native Multimodal Models (NMMs)

We use the following precise, architecture-first definitions throughout this list. Understanding these distinctions is critical for correctly classifying modern models.

Traditional Multimodal Models

Traditional category Alignment and Fusion

Pre-2023 mainstream era

Independent per-modality processing followed by simple fusion (early, late, or hybrid). No large-scale language model backbone. Focuses on representation alignment, cross-modal retrieval, and captioning. Examples: CLIP, ALIGN, ViLBERT, BLIP.

Multimodal Large Language Models (MLLMs)

MLLM category Modular late fusion

Pretrained-backbone multimodal language models

Combine a pretrained visual backbone or visual abstractor (e.g., ViT/CLIP/SigLIP, Q-Former, cross-attention adapter) with a pretrained LLM through a connector. The defining property is inheritance from strong pretrained unimodal backbones rather than joint multimodal pretraining from scratch. These models are primarily text-output understanding/reasoning systems, even when auxiliary generators are attached externally.

Key characteristics:

  • ✅ Pretrained visual encoder / abstractor
  • ✅ Pretrained LLM backbone
  • ✅ Connector layer or cross-attention bridge
  • ❌ No end-to-end multimodal pretraining from scratch
  • ❌ No native image generation inside the same backbone

Examples: LLaVA, Qwen-VL, InternVL, MiniCPM-V, CogVLM
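The connector recipe above can be sketched at the shape level. The following numpy sketch uses made-up dimensions (768-d vision features, a 4096-d LLM embedding space, 16 patches, 8 text tokens) and random arrays as stand-ins for frozen pretrained components; it illustrates the general pattern, not any specific listed model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not tied to any particular model.
D_VIS, D_LLM, N_PATCH, N_TXT = 768, 4096, 16, 8

# Stand-ins for frozen pretrained components: patch features from a
# vision encoder, and token embeddings from an LLM's embedding table.
vision_feats = rng.standard_normal((N_PATCH, D_VIS))   # frozen ViT/CLIP output
text_embeds  = rng.standard_normal((N_TXT, D_LLM))     # frozen LLM embeddings

# The connector: a trainable linear projection (the simplest form of an
# MLP projector) mapping vision features into the LLM embedding space.
W_proj = rng.standard_normal((D_VIS, D_LLM)) * 0.02
vision_tokens = vision_feats @ W_proj                  # (N_PATCH, D_LLM)

# Projected vision tokens are inserted into the LLM input sequence
# alongside text tokens; the pretrained LLM processes them unchanged.
llm_input = np.concatenate([vision_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (24, 4096)
```

Only `W_proj` (plus, in practice, an extra MLP layer or two) is new and trainable; everything else is inherited, which is exactly what makes these models "modular late fusion" rather than native.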

Unified Multimodal Models (UMMs)

UMM category Understanding and generation

Single framework for Understanding + Generation (U+G)

A single framework that handles both multimodal understanding and visual generation. UMMs may reuse pretrained components or modular tokenizers; the defining feature is U+G unification, not whether the model is trained from scratch.

Key characteristics:

  • ✅ Unified understanding + generation
  • ✅ Shared model interface or shared backbone for U+G
  • ⚠️ May use pretrained components
  • ⚠️ May use decoupled encoders / modular tokenizers
  • ⚠️ If a model is also natively trained from scratch, its architectural details belong primarily in NMMs (§5)

Examples: Show-o, Janus, OpenUni, BAGEL, BLIP3-o
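At its simplest, U+G unification means one shared trunk feeding two output heads: a text-vocabulary head for understanding and a discrete image-token head for generation. The numpy sketch below uses hypothetical sizes (512-d hidden states, a 32K text vocabulary, an 8K image codebook) and random hidden states; real UMMs differ substantially in tokenizers and decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden width, text vocabulary, image-token codebook.
D, TEXT_VOCAB, IMG_CODEBOOK, SEQ = 512, 32000, 8192, 10

# Stand-in for the shared backbone's final hidden states over a mixed sequence.
hidden = rng.standard_normal((SEQ, D))

# U+G unification in its minimal form: one shared trunk, two output heads.
W_text = rng.standard_normal((D, TEXT_VOCAB)) * 0.02    # understanding head
W_img  = rng.standard_normal((D, IMG_CODEBOOK)) * 0.02  # generation head

text_logits  = hidden @ W_text   # next text-token logits,  (SEQ, 32000)
image_logits = hidden @ W_img    # next image-token logits, (SEQ, 8192)

# At decode time, each position is routed to one head -- e.g. depending on
# whether it falls inside an image span of the sequence.
print(text_logits.shape, image_logits.shape)  # (10, 32000) (10, 8192)
```

Note that nothing in this sketch requires training from scratch; that is why U+G unification alone makes a model a UMM, while scratch training is the additional bar for NMMs (§5).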

Native Multimodal Models (NMMs)

NMM category Trained from scratch

Jointly trained from scratch — no pretrained backbone

The strictest category. NMMs are trained jointly from scratch on all modalities — they do not rely on any pretrained LLM or pretrained vision encoder as initialization. All parameters are learned end-to-end from raw multimodal data.

Key characteristics:

  • ✅ No pretrained LLM backbone
  • ✅ No pretrained vision encoder
  • ✅ All components jointly trained from scratch
  • ✅ Input: text tokens + image patches/tokens
  • ✅ Output: text (understanding focus; generation optional)

NMMs are further divided by fusion architecture:

NMM — Early Fusion

Multimodal interaction begins from the first layer. A single Transformer decoder processes tokenized text and continuous/discrete image patches together, with minimal modality-specific parameters (only a linear patchify layer for images). No separate image encoder is maintained.

  • Single unified Transformer (decoder-only)
  • Continuous image patches or minimal discrete tokenization
  • Modality interaction from layer 1
  • Near-zero modality-specific parameters (excluding linear patch embed)
  • Examples: Emu3 (if trained from scratch)
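The early-fusion property — a single decoder whose only modality-specific machinery is a linear patchify layer — can be shown at the tensor level. This numpy sketch uses toy sizes (a 32×32 image, 16×16 patches, a 256-d decoder width); all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: image height/width, channels, patch size, decoder width.
H = W = 32
C, P, D, N_TXT = 3, 16, 256, 5

image = rng.standard_normal((H, W, C))

# The only modality-specific parameters: a linear patch embedding.
n_patches = (H // P) * (W // P)                       # 4 patches
patches = image.reshape(H // P, P, W // P, P, C)      # split into P x P tiles
patches = patches.transpose(0, 2, 1, 3, 4)            # group tiles together
patches = patches.reshape(n_patches, P * P * C)       # flatten: (4, 768)
W_patch = rng.standard_normal((P * P * C, D)) * 0.02
image_tokens = patches @ W_patch                      # (4, 256)

# Text embeddings come from the same jointly-trained model -- no pretrained
# LLM backbone; random values stand in for a scratch-trained embedding table.
text_tokens = rng.standard_normal((N_TXT, D))

# Both streams enter the single decoder together, so modality interaction
# starts at layer 1.
sequence = np.concatenate([text_tokens, image_tokens], axis=0)
print(sequence.shape)  # (9, 256)
```

Contrast this with the MLLM sketch in §1.1: there the vision features come from a frozen pretrained encoder, while here the patchify projection and all downstream parameters are learned jointly from scratch.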
NMM — Late Fusion

Each modality is first processed by a dedicated unimodal component (e.g., a vision tower or image encoder), but these components are jointly trained from scratch (not pretrained). Cross-modal interaction occurs at deeper layers.

  • Separate unimodal processing stages (trained from scratch)
  • Cross-modal interaction at deeper layers
  • More modality-specific parameters
  • Examples: Models with jointly-trained vision encoders → decoder interaction

1.2 Scope & Taxonomy

Multimodal Models
├── 2. Traditional Multimodal Models
│   ├── 2.1 Multimodal Representations & Alignment
│   │   ├── Multimodal Representations
│   │   ├── Multimodal Fusion
│   │   └── Multimodal Alignment
│   └── 2.2 Multimodal Pretraining
├── 3. Multimodal Large Language Models (MLLMs)
│   ├── 3.1 Foundation MLLMs
│   └── 3.2 Omni MLLMs
├── 4. Unified Multimodal Models (UMMs)
│   ├── 4.1 Taxonomy by Generation Paradigm
│   │   ├── Diffusion-Based UMMs
│   │   ├── Autoregressive (AR) UMMs
│   │   │   ├── Pixel Encoding
│   │   │   ├── Semantic Encoding
│   │   │   ├── Learnable Query Encoding
│   │   │   ├── Hybrid Encoding (Pseudo)
│   │   │   └── Hybrid Encoding (Joint)
│   │   └── Hybrid (AR + Diffusion) UMMs
│   │       ├── Pixel Encoding
│   │       └── Hybrid Encoding
│   └── 4.2 Any-to-Any / Omni UMMs
└── 5. Native Multimodal Models (NMMs)
    ├── 5.1 Design Analyses & Scaling Laws
    ├── 5.2 Early Fusion NMMs
    ├── 5.3 Late Fusion NMMs
    └── 5.4 Any-to-Any / Omni NMMs

1.3 Architecture Diagrams

┌─────────────────────────────────────────────────────────────────┐
│           TRADITIONAL MULTIMODAL MODEL                          │
│                                                                 │
│  [Image] ──► [CNN/ViT Encoder] ──┐                              │
│                                  ├──► [Fusion] ──► [Output]     │
│  [Text]  ──► [LSTM/BERT]       ──┘                              │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│           MLLM — MODULAR LATE FUSION                            │
│                                                                 │
│  [Image] ──► [Pretrained ViT/CLIP] ──► [Projector/Q-Former]     │
│                                                │                │
│                                                ▼                │
│  [Text]  ──────────────────────────► [Pretrained LLM] ──► [Text]│
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│           UMM — UNIFIED UNDERSTANDING + GENERATION              │
│                                                                 │
│  [Image/Text Input] ──► [Shared/Modular Tokenizer]              │
│                                │                                │
│                                ▼                                │
│                   [Unified Transformer]                         │
│                       │           │                             │
│                       ▼           ▼                             │
│             [Text Output]      [Image Output]                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│           NMM — EARLY FUSION (Trained from Scratch)             │
│                                                                 │
│  [Text tokens] ──────────────┐                                  │
│                              ├──► [Single Decoder Transformer]  │
│  [Image] ─► [Linear Patchify]┘                  │               │
│                                                 ▼               │
│                                              [Text]             │
│   (raw pixels, minimal preprocessing)                           │
│   Multimodal interaction from Layer 1                           │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│           NMM — LATE FUSION (Trained from Scratch)              │
│                                                                 │
│  [Image] ──► [Jointly-Trained Vision Component]                 │
│                         │                                       │
│                         ▼  (deep layers)                        │
│  [Text]  ──────────► [Cross-Modal Interaction] ──► [Text]       │
│           (All components trained jointly from scratch)         │
└─────────────────────────────────────────────────────────────────┘

Back to Top


2. Traditional Multimodal Models

In this section: 2.1 Multimodal Representations & Alignment · 2.2 Multimodal Pretraining

Pre-chat-MLLM and non-native multimodal systems that established the basic vocabulary of alignment, fusion, retrieval, captioning, and multimodal pretraining.

2.1 Multimodal Representations & Alignment

Subtopics: Multimodal Representations · Multimodal Fusion · Multimodal Alignment

Multimodal Representations

| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Identifiability Results for Multimodal Contrastive Learning | ICLR 2023 | Paper | Theoretical identifiability analysis of contrastive multimodal learning | representation learning |
| Unpaired Vision-Language Pre-training via Cross-Modal CutMix | ICML 2022 | Paper | Introduces CutMix-style augmentation for unpaired VLP | vision-language pretraining |
| Balanced Multimodal Learning via On-the-fly Gradient Modulation | CVPR 2022 | Paper | Balances modality learning via dynamic gradient reweighting | multimodal optimization |
| Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast | IJCAI 2021 | Paper | Cross-modal prototype contrast for voice-face alignment | audio-visual representation learning |
| Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text | arXiv 2021 | Paper | Early unified transformer for unpaired multimodal pretraining | unified multimodal pretraining |
| FLAVA: A Foundational Language And Vision Alignment Model | arXiv 2021 | Paper | Unified foundation model covering vision-only, language-only, and cross-modal tasks | foundation multimodal model |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | arXiv 2021 | Paper | Single transformer for multiple multimodal tasks | multimodal multitask learning |
| MultiBench: Multiscale Benchmarks for Multimodal Representation Learning | NeurIPS 2021 | Paper | Benchmark suite for multimodal learning evaluation | benchmarking |
| Perceiver: General Perception with Iterative Attention | ICML 2021 | Paper | General-purpose architecture for high-dimensional multimodal inputs | general multimodal architecture |
| Learning Transferable Visual Models From Natural Language Supervision | arXiv 2021 | Paper | Contrastive vision-language pretraining at scale (CLIP) | vision-language contrastive learning |
| VinVL: Revisiting Visual Representations in Vision-Language Models | arXiv 2021 | Paper | Improved visual features for VL tasks | vision-language representation improvement |
| 12-in-1: Multi-Task Vision and Language Representation Learning | CVPR 2020 | Paper | Unified multi-task learning across 12 VL tasks | multi-task learning |
| Watching the World Go By: Representation Learning from Unlabeled Videos | arXiv 2020 | Paper | Self-supervised video representation learning | video representation learning |
| Learning Video Representations using Contrastive Bidirectional Transformer | arXiv 2019 | Paper | Contrastive transformer for video representation learning | video contrastive learning |
| Visual Concept-Metaconcept Learning | NeurIPS 2019 | Paper | Hierarchical concept learning from visual data | concept learning |
| OmniNet: A Unified Architecture for Multi-modal Multi-task Learning | arXiv 2019 | Paper | Unified encoder-decoder for multimodal tasks | unified multimodal architecture |
| Learning Representations by Maximizing Mutual Information Across Views | arXiv 2019 | Paper | InfoMax principle for cross-view representation learning | self-supervised learning |
| ViCo: Word Embeddings from Visual Co-occurrences | ICCV 2019 | Paper | Learning word embeddings from visual context | vision-language embeddings |
| Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations | CVPR 2019 | Paper | Structured embedding space for vision-language alignment | embedding learning |
| Multi-Task Learning of Hierarchical Vision-Language Representation | CVPR 2019 | Paper | Hierarchical representation learning across VL tasks | multi-task learning |
| Learning Factorized Multimodal Representations | ICLR 2019 | Paper | Factorized latent space for multimodal data | representation disentanglement |
| A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks | ICML 2018 | Paper | Probabilistic modeling of multi-view correspondence | multi-view learning |
| Do Neural Network Cross-Modal Mappings Really Bridge Modalities? | ACL 2018 | Paper | Analyzes limitations of cross-modal mapping | theoretical analysis |
| Learning Robust Visual-Semantic Embeddings | ICCV 2017 | Paper | Improved robustness in vision-language embeddings | embedding learning |
| Deep Multimodal Representation Learning from Temporal Data | CVPR 2017 | Paper | Temporal multimodal representation learning | multimodal temporal learning |
| Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations | COLING 2016 | Paper | Analyzes semantic gap between vision and language | representation analysis |
| Combining Language and Vision with a Multimodal Skip-gram Model | NAACL 2015 | Paper | Extends skip-gram with visual context | multimodal embeddings |
| Deep Fragment Embeddings for Bidirectional Image Sentence Mapping | NeurIPS 2014 | Paper | Fragment-level image-sentence alignment | vision-language alignment |
| Multimodal Learning with Deep Boltzmann Machines | JMLR 2014 | Paper | Probabilistic generative multimodal model | generative multimodal learning |
| Learning Grounded Meaning Representations with Autoencoders | ACL 2014 | Paper | Autoencoder-based grounded semantics | representation learning |
| DeViSE: A Deep Visual-Semantic Embedding Model | NeurIPS 2013 | Paper | Early deep vision-to-language embedding model | vision-language embedding |
| Multimodal Deep Learning | ICML 2011 | Paper | Foundational multimodal deep learning framework | multimodal deep learning |

Multimodal Fusion

| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Robust Contrastive Learning against Noisy Views | arXiv 2022 | Paper | Robust contrastive learning under noisy multi-view inputs | contrastive learning |
| Cooperative Learning for Multi-view Analysis | arXiv 2022 | Paper | Cooperative optimization across multiple views for representation learning | multi-view learning |
| What Makes Multi-modal Learning Better than Single (Provably) | NeurIPS 2021 | Paper | Theoretical guarantees showing when multimodal learning improves over unimodal | theoretical analysis |
| Efficient Multi-Modal Fusion with Diversity Analysis | ACMMM 2021 | Paper | Fusion method emphasizing diversity-aware multimodal integration | multimodal fusion |
| Attention Bottlenecks for Multimodal Fusion | NeurIPS 2021 | Paper | Introduces bottleneck attention mechanism for efficient multimodal fusion | multimodal fusion |
| VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization | AAAI 2021 | Paper | Variational multimodal fusion for camera localization tasks | multimodal localization |
| Trusted Multi-View Classification | ICLR 2021 | Paper | Confidence-aware weighting for multi-view classification | multi-view classification |
| Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis | ICDM 2020 | Paper | Higher-order sequence fusion for multimodal sentiment analysis | multimodal sentiment analysis |
| Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies | NeurIPS 2020 | Paper | Entropy-based regularization to reduce modality bias | multimodal fairness/robustness |
| Deep Multimodal Fusion by Channel Exchanging | NeurIPS 2020 | Paper | Channel exchange mechanism for cross-modal feature interaction | multimodal fusion |
| What Makes Training Multi-Modal Classification Networks Hard? | CVPR 2020 | Paper | Analyzes optimization challenges in multimodal classification | theoretical/empirical analysis |
| Dynamic Fusion for Multimodal Data | arXiv 2019 | Paper | Adaptive fusion strategy depending on input modality quality | multimodal fusion |
| DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis | IJCAI 2019 | Paper | Separates shared and private latent representations for fusion | multimodal sentiment analysis |
| Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling | NeurIPS 2019 | Paper | High-order tensor/polynomial fusion for multimodal features | multimodal fusion |
| XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification | IEEE TNNLS 2019 | Paper | Cross-modal feature exchange network for audio-visual tasks | audio-visual classification |
| MFAS: Multimodal Fusion Architecture Search | CVPR 2019 | Paper | Neural architecture search for optimal multimodal fusion design | architecture search |
| The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision | ICLR 2019 | Paper | Neuro-symbolic model combining perception and reasoning | neuro-symbolic learning |
| Unifying and merging well-trained deep neural networks for inference stage | IJCAI 2018 | Paper | Model merging strategy for inference-time multimodal integration | model fusion |
| Efficient Low-rank Multimodal Fusion with Modality-Specific Factors | ACL 2018 | Paper | Low-rank factorization for efficient multimodal fusion | efficient fusion |
| Memory Fusion Network for Multi-view Sequential Learning | AAAI 2018 | Paper | Memory-based fusion across temporal multimodal sequences | sequential multimodal learning |
| Tensor Fusion Network for Multimodal Sentiment Analysis | EMNLP 2017 | Paper | Tensor-based full interaction modeling across modalities | multimodal sentiment analysis |
| Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework | AAAI 2015 | Paper | Joint modeling of video and compositional language | vision-language modeling |
| A co-regularized approach to semi-supervised learning with multiple views | ICML 2005 | Paper | Early multi-view co-regularization framework | multi-view semi-supervised learning |

Multimodal Alignment

| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| CLIP | arXiv 2021 | Paper | 400M+ pairs; dual-encoder (Vision Transformer + Text Transformer); contrastive alignment at embedding level; classic late-fusion foundation | vision-language alignment |
| Reconsidering Representation Alignment for Multi-view Clustering | CVPR 2021 | Paper | Revisits representation alignment objectives for multi-view clustering | multimodal alignment |
| CoMIR: Contrastive Multimodal Image Representation for Registration | NeurIPS 2020 | Paper | Contrastive learning for multimodal image registration alignment | multimodal alignment |
| Multimodal Transformer for Unaligned Multimodal Language Sequences | ACL 2019 | Paper | Transformer-based alignment for unaligned multimodal sequences | sequence alignment |
| Temporal Cycle-Consistency Learning | CVPR 2019 | Paper | Uses cycle-consistency for temporal cross-modal alignment | temporal alignment |
| See, Hear, and Read: Deep Aligned Representations | arXiv 2017 | Paper | Learns aligned representations across vision, audio, and text | multimodal alignment |
| On Deep Multi-View Representation Learning | ICML 2015 | Paper | Theoretical and empirical study of multi-view representation alignment | multi-view learning |
| Unsupervised Alignment of Natural Language Instructions with Video Segments | AAAI 2014 | Paper | Aligns language instructions with video segments without supervision | language-video alignment |
| Multimodal Alignment of Videos | ACM MM 2014 | Paper | Early multimodal alignment framework for video modalities | video alignment |
| Deep Canonical Correlation Analysis | ICML 2013 | Paper | Deep learning extension of CCA for cross-view representation alignment | representation alignment |
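Several entries above — CLIP most prominently — rest on the same symmetric contrastive objective: score every image-text pairing in a batch and treat the matched pairs (the diagonal) as positives. Below is a minimal numpy sketch of this InfoNCE-style loss with random embeddings and an assumed temperature of 0.07; it is a didactic illustration of the principle, not any specific paper's recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    """Normalize embeddings to unit length so dot products are cosines."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for a batch of paired image/text embeddings from two encoders.
B, D = 4, 64
img = l2norm(rng.standard_normal((B, D)))
txt = l2norm(rng.standard_normal((B, D)))

# All pairwise similarities: matched pairs on the diagonal are positives,
# every other pairing in the batch serves as a negative.
tau = 0.07                      # assumed temperature
logits = img @ txt.T / tau      # (B, B) similarity matrix

def cross_entropy(logits, targets):
    """Mean cross-entropy of rows against integer targets (stable softmax)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Symmetric loss: image-to-text retrieval plus text-to-image retrieval.
labels = np.arange(B)
loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
print(float(loss))
```

Minimizing this loss pulls matched image/text embeddings together and pushes mismatched ones apart, which is the "contrastive alignment at embedding level" noted for CLIP in the table above.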

2.2 Multimodal Pretraining

| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | NeurIPS 2021 Spotlight | Paper | Momentum distillation for aligning vision-language representations before fusion | vision-language pretraining |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | CVPR 2021 | Paper | Sparse frame sampling for efficient video-language pretraining | video-language pretraining |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | arXiv 2021 | Paper | Unified transformer for multitask multimodal learning | unified multimodal pretraining |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | NeurIPS 2020 | Paper | Adversarial training improves robustness of vision-language representations | robust multimodal pretraining |
| Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision | EMNLP 2020 | Paper | Grounds language tokens in visual context via voken supervision | vision-grounded language modeling |
| Integrating Multimodal Information in Large Pretrained Transformers | ACL 2020 | Paper | Injects multimodal signals into large pretrained transformer architectures | multimodal transformer pretraining |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | arXiv 2019 | Paper | Joint vision-language BERT-style pretraining | vision-language pretraining |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | arXiv 2019 | Paper | Early unified transformer for vision-language understanding | vision-language pretraining |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | NeurIPS 2019 | Paper | Two-stream transformer for cross-modal vision-language learning | vision-language pretraining |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | arXiv 2019 | Paper | Cross-modal encoder for universal vision-language representations | vision-language pretraining |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | EMNLP 2019 | Paper | Cross-modality transformer encoder for vision-language reasoning | vision-language pretraining |
| VideoBERT: A Joint Model for Video and Language Representation Learning | ICCV 2019 | Paper | Joint discrete token modeling for video and language | video-language pretraining |

Back to Top


3. Multimodal Large Language Models (MLLMs)

In this section: 3.1 Taxonomy Based on Vision Adapter · 3.2 Omni MLLMs

Models that connect a pretrained visual encoder / abstractor to a pretrained LLM. Primarily text-output understanding and reasoning systems, defined by inherited pretrained unimodal backbones rather than multimodal pretraining from scratch.

3.1 Taxonomy Based on Vision Adapter

MLP/Others Projector

| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders | arXiv 2026 | Paper | LLM-initialized vision encoder (non-CLIP); text-to-vision weight reuse, generative-aligned visual features, optimized for dense perception | visual understanding |
| Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision | arXiv 2026 | Paper | Tri-modal (V+A+L) unified framework; parameter-efficient tuning, seamless cross-modal reasoning for mobile/IoT deployment | visual understanding |
| STEP3-VL-10B Technical Report | arXiv 2026 | Paper | 10B-scale multimodal foundation model; unified unfrozen pre-training + PaCoRe test-time scaling, frontier-level reasoning with compact footprint | visual understanding |
| GLM-OCR | arXiv 2026 | Paper | Efficient 0.9B-parameter compact multimodal model for real-world document understanding | OCR, structured extraction |
| Kimi K2.5 | arXiv 2026 | Paper | Joint text-vision pretraining, Agent Swarm framework; coding, vision, reasoning, agentic tasks; reduces latency by up to 4.5x | visual agentic intelligence, reasoning |
| Kwai Keye-VL 1.5 Technical Report | arXiv 2025 | Paper | Adaptive Slow-Fast encoding; 8B parameters with 128K long context; SOTA video reasoning, human-preference aligned | visual understanding |
| olmOCR / olmOCR-2 | arXiv 2025 | Paper | Efficient low-VRAM OCR model fine-tuned from Qwen2.5-VL; excels at preserving semantic structure and markdown output | OCR, structured extraction |
| PaddleOCR-VL | arXiv 2025 | HF / Official | Lightweight (0.9B+) multimodal OCR supporting 109 languages; excellent chart-to-HTML/Markdown conversion and high throughput | OCR, multilingual document |
| DeepSeek-OCR | arXiv 2025 | Paper, HF | Lightweight ~3B MoE vision model optimized for high-volume OCR, document digitization, charts and formulas; efficient inference | OCR, document |
| Kimi-VL | arXiv 2025 | Paper, HF | Projector + MoE backbone; long video/PDF/GUI, agentic capabilities, chain-of-thought vision reasoning | visual understanding, agentic, video |
| Seed1.5-VL Technical Report | arXiv 2025 | Paper | 20B MoE + 532M ViT; native-resolution vision-language foundation model; efficient asymmetric architecture | visual understanding |
| Qwen3-VL | arXiv 2025 | Paper, HF | Frontier-grade vision/OCR (32+ languages), video analysis, agentic capabilities, strong multimodal reasoning; includes large MoE variants (e.g., 235B) | visual understanding, video, omni |
| SmolVLM | arXiv 2025 | HF | Ultra-lightweight (256M–2.2B) projector-based series; efficient on-device video and image understanding | visual understanding, efficiency |
| LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning | arXiv 2025 | Paper | Diffusion LLM as the language backbone; SigLIP vision encoder | visual understanding |
| jina-vlm | arXiv 2025 | Paper, HF | SigLIP2 + Qwen backbone with custom projector; optimized for semantic VQA, diagrams, scans, and document semantics | visual understanding, VQA, document |
| Phi-4-Multimodal | arXiv 2025 | Paper, HF | Small-parameter (LoRA + projectors) multimodal model; vision + speech support, efficient on-device deployment | visual understanding, on-device |
| Molmo / PixMo | CVPR 2025 | Paper, Code | Strong open-data/open-weight VLM pipeline | visual understanding |
| FastVLM: Efficient Vision Encoding for Vision Language Models | CVPR 2025 | Paper | Efficient multimodal visual encoding for on-device deployment | visual understanding, on-device |
| NVILA: Efficient Frontier Visual Language Models | CVPR 2025 | Paper | Efficient general-purpose multimodal LLM; spatial and temporal "scale then compress" design; SigLIP vision encoder | visual understanding |
| Qwen2.5-VL: Technical Report | arXiv 2025 | Paper, HF | Stronger document, grounding, and video capabilities | visual understanding |
| General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model | arXiv 2024 | Paper, HF/Code | Specialized end-to-end OCR model with grounding (boxes + points); strong on scientific papers, slides, and mixed visual-text docs | OCR, grounding |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv 2024 | Paper, Code | Single model for image, multi-image, and video transfer | visual understanding |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv 2024 | Paper, Code | On-device efficient MLLM | visual understanding |
| xGen-MM (BLIP-3) | arXiv 2024 | Paper | Open training recipe, datasets, and safety-tuned variants | visual understanding |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models | arXiv 2024 | Paper, Code | MoE VLM with dynamic tiling and efficient inference | visual understanding |
| Pixtral | arXiv 2024 | Paper, HF | 12B open-weight model with strong instruction following and image+text understanding; competitive with larger open VLMs | visual understanding |
| Qwen2-VL | arXiv 2024 | Paper, HF | Dynamic resolution; native video | visual understanding |
| Cambrian-1: A Fully Open, Vision-Centric Exploration | NeurIPS 2024 | Paper, Code | Spatial Vision Aggregator | visual understanding |
| PaliGemma: A Versatile 3B VLM for Transfer | arXiv 2024 | Paper, HF | SigLIP encoder + Gemma backbone; strong transfer model | visual understanding |
| InternLM-XComposer2 | arXiv 2024 | Paper, Code | Compositional visual grounding | visual understanding |
| Phi-3-Vision | arXiv 2024 | Paper, HF | Small but capable | visual understanding |
| LLaVA-HR: High Resolution MLLMs | CVPR 2024 | Paper | Mixture-of-Resolution Adaptation | visual understanding |
| InternVL2 | Model release 2024 | HF | Instruction-tuned InternVL family release with strong multilingual and OCR capabilities | visual understanding |
| InternVL: Scaling up Vision Foundation Models | CVPR 2024 | Paper, Code | Progressively aligned ViT + LLM | visual understanding |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv 2024 | Paper | Large-scale proprietary recipe study for multimodal LLM pretraining | visual understanding |
| LLaVA | arXiv 2023 | Paper, Code | 7B/13B; CLIP vision encoder (frozen/pretrained) + linear projection into the LLM (Vicuna/LLaMA); vision tokens inserted into the LLM input; common late-fusion baseline | visual understanding |

Q-Former

Paper Venue Links Notes Task
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data arXiv 2026 Paper Q-Former based (inherits from MiniGPT-4 / BLIP-2) vision-language understanding
Video Q-Former: Multimodal Large Language Model with Spatio-Temporal Querying Transformer Openreview Paper Spatio-temporal Q-Former (learnable queries for video spatial-temporal feature extraction) video understanding
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding arXiv 2025 Paper Hierarchical Q-Former (multi-level learnable queries with memory bank for long video) long video understanding
Towards Efficient Visual-Language Alignment of the Q-Former arXiv 2024 Paper PEFT-tuned Q-Former (parameter-efficient fine-tuning on InstructBLIP-style Q-Former) visual reasoning
Matryoshka Query Transformer (MQT) for Large Vision-Language Models NeurIPS 2024 Paper Matryoshka Query Transformer (elastic learnable queries, variable token count) vision-language understanding
Semantically Grounded QFormer for Efficient Vision Language Understanding arXiv 2023 Paper Improved Grounded QFormer (direct latent conditioning, bypass input projection) vision-language understanding
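Common to the Q-Former variants above is a small set of learnable queries that cross-attend to image features, compressing a variable number of patches into a fixed token budget. A toy single-head sketch (illustrative sizes; real Q-Formers add learned projections, self-attention, and multiple layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_patches, n_queries, d = 257, 32, 768             # illustrative sizes

image_feats = rng.standard_normal((n_patches, d))  # frozen ViT features (keys/values)
queries = rng.standard_normal((n_queries, d))      # learnable query tokens

# Single-head cross-attention: each query attends over all image patches
attn = softmax(queries @ image_feats.T / np.sqrt(d))
compressed = attn @ image_feats                    # fixed-length visual summary

print(compressed.shape)  # (32, 768) regardless of n_patches
```

The fixed query count is the design lever the papers above vary: Matryoshka-style methods make it elastic, hierarchical variants stack query sets at multiple granularities.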

Cross-Attention

Paper Venue Links Notes Task
CASA: Cross-Attention over Self-Attention arXiv 2025 Paper Efficient cross-attention via self-attention reformulation; competitive with token insertion on image benchmarks, strong for long video efficient vision-language fusion, video captioning
LLaMA 3.2 Vision arXiv 2024 Paper HF Adapter-based vision addition to Llama 3.2; strong OCR, document VQA, 128K context visual understanding, document
Idefics2 arXiv 2024 Paper HF Flamingo-style with Perceiver Resampler + gated cross-attention; improved efficiency on Mistral backbone open multimodal understanding
CogVLM: Visual Expert for Pretrained Language Models arXiv 2023 Paper Code Deep fusion with visual expert modules inside a pretrained LLM visual understanding
Qwen-VL: A Versatile Vision-Language Model arXiv 2023 Paper HF High-res, multi-lang, bounding box visual understanding
IDEFICS Hugging Face 80B Flamingo-inspired; late fusion with vision encoder and LLM
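The Flamingo-style designs above (IDEFICS, Idefics2) interleave gated cross-attention layers inside a pretrained LLM, with a tanh gate initialized at zero so the LLM is untouched at the start of training. A toy sketch of one such layer (illustrative dimensions, single head, no learned Q/K/V projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_text, n_vis, d = 16, 64, 512              # illustrative sizes

text_h = rng.standard_normal((n_text, d))   # hidden states inside the LLM
vis_kv = rng.standard_normal((n_vis, d))    # visual features (keys/values)
alpha = 0.0                                 # tanh-gate parameter, initialized to 0

attn_out = softmax(text_h @ vis_kv.T / np.sqrt(d)) @ vis_kv
fused = text_h + np.tanh(alpha) * attn_out  # gate = 0 -> identity at init

assert np.allclose(fused, text_h)           # pretrained LLM behavior preserved at init
```

Because `tanh(0) = 0`, the new layers contribute nothing until the gate is learned, which stabilizes training on top of a frozen or lightly tuned backbone.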

Hybrid Adaptor

Paper Venue Links Notes Task
DeepSeek-OCR-2 arXiv 2026 Paper HF Optimized for high-volume OCR, document digitization, charts and formulas; efficient inference OCR, document
Ovis2.5 arXiv 2025 Paper Following VET architecture; excellent document understanding and fine-grained quantization visual understanding, document
Ovis2 arXiv 2025 HF Embedding table / projector architecture; excellent document understanding and fine-grained quantization visual understanding, document
MiniMax-01: Scaling Foundation Models with Lightning Attention arXiv 2025 Paper Hybrid Lightning-Softmax Attention; MoE-based (45.9B active) multimodal; 4M long-context with near-zero prefill latency. visual understanding
mPLUG-Owl3 arXiv 2024 Paper Code Long visual sequences visual understanding
Idefics3 arXiv 2024 Paper HF Open-data recipe with strong document understanding visual understanding
NVLM 1.0: Open Frontier-Class Multimodal LLMs arXiv 2024 Paper HF Hybrid multimodal design with strong OCR and reasoning visual understanding
Idefics2 arXiv 2024 Paper HF Fully open; built on Mistral visual understanding
mPLUG-DocOwl 1.5 / 2: Unified Structure Learning for OCR-free Document Understanding arXiv 2024 Paper Code OCR-free document understanding with unified structure learning; excels at long documents and complex layouts document understanding, OCR

3.2 Omni MLLMs

Paper Venue Links Notes Task Adaptor
OmniGAIA: Towards Native Omni-Modal AI Agents arXiv 2026 Paper Code Comprehensive benchmark for omni-modal agents with complex multi-hop queries across video, audio, and image; includes OmniAtlas agent with tool-integrated reasoning omni-modal understanding & reasoning Native
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding arXiv 2026 Paper Training-free framework that lifts textual reasoning to omni-modal scenarios using LRM guidance and stepwise contrastive scaling omni-modal reasoning Hybrid
OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention arXiv 2026 Paper Reinforced audio-visual reasoning framework with query intention grounding and modality attention fusion audio-visual reasoning Hybrid
ChronusOmni: Improving Time Awareness of Omni Large Language Models arXiv 2025 Paper Code Enhances temporal awareness in omni-modal LLMs time-aware omni-modal understanding Hybrid
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data arXiv 2025 Paper Code MoE-based scaling for omnimodal understanding and generation omni-modal understanding & generation MLP Projector
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models arXiv 2025 Paper Code Unified audio-visual speech recognition using LLMs audio-visual speech recognition Hybrid
LongCat-Flash-Omni Technical Report arXiv 2025 Paper Code Long-context omni-modal model supporting text and audio generation long-context omni-modal Hybrid
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM arXiv 2025 Paper Code Architecture and data enhancements for omni-modal understanding omni-modal understanding Hybrid
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue arXiv 2025 Paper Code Unified model for audio-visual multi-turn dialogue audio-visual dialogue Hybrid
OneLLM: One Framework to Align All Modalities with Language CVPR 2024 Paper Unified encoder with a universal projection module aligning eight modalities to language all-in-one LLM Hybrid
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition NeurIPS 2025 Paper Mixture of Matryoshka experts for efficient audio-visual speech recognition audio-visual speech recognition Hybrid
Qwen3-Omni Technical Report arXiv 2025 Paper Code Omni-modal model with text and audio capabilities (Alibaba/Qwen series) omni-modal Native
Qwen2.5-Omni Technical Report arXiv 2025 Paper Code Omni-modal technical report with text and audio support (Alibaba/Qwen series) omni-modal Hybrid
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone 2025 Paper Code On-device GPT-4o level MLLM for vision, speech and multimodal live streaming (OpenBMB) on-device multimodal live streaming Hybrid
Baichuan-Omni Technical Report arXiv 2024 Paper Code Technical report for Baichuan-Omni (Baichuan Inc.) omni-modal Hybrid
Baichuan-Omni-1.5 Technical Report arXiv 2025 Paper Code Technical report for Baichuan-Omni 1.5 (Baichuan Inc.) omni-modal Hybrid
VITA: Towards Open-Source Interactive Omni Multimodal LLM arXiv 2024 Paper Code Open-source interactive omni multimodal LLM interactive omni multimodal Hybrid
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction arXiv 2024 Paper Code Real-time vision and speech interaction model real-time multimodal interaction Hybrid
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities NeurIPS 2024 Paper Code Open-source GPT-4o style model with vision, speech and duplex capabilities vision-speech duplex Hybrid
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment arXiv 2025 Paper Code Progressive modality alignment for omni-modal language model omni-modal alignment MLP Projector
MIO: A Foundation Model on Multimodal Tokens arXiv 2024 Paper Code Foundation model based on multimodal tokens multimodal tokens Native
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions CVPR 2024 Paper Code Multimodal model supporting seeing, hearing and emotional speech emotional multimodal Hybrid
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model arXiv 2025 Paper Code Simultaneous multimodal interactions with language-vision-speech model simultaneous multimodal Hybrid
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding arXiv 2025 Paper Code Native multimodal LLM focused on 3D generation and understanding 3D multimodal Native

Back to Top


4. Unified Multimodal Models (UMMs)

In this section: 4.1 Taxonomy by Generation Paradigm · 4.2 Any-to-Any / Omni UMMs

Models that unify multimodal understanding and visual generation within one framework. The defining property is U+G unification, not necessarily training from scratch.

Boundary with NMMs: if a unified model's central contribution is native end-to-end multimodal pretraining from scratch, we document its architectural details primarily in §5 NMMs and keep §4 focused on the unified U+G perspective.


Overview of representative paradigms and architectures of Unified Multimodal Models (UMMs). Source: https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models

4.1 Taxonomy by Generation Paradigm

Subtopics: Diffusion-Based UMMs · Autoregressive (AR) UMMs · Hybrid (AR + Diffusion) UMMs

Unified models are categorized according to their core generation mechanism for visual output (while supporting strong multimodal understanding). This taxonomy highlights trade-offs in fidelity, reasoning, efficiency, and training stability.

Diffusion-Based UMMs

Model Venue Links Paradigm Notes Task
Dual Diffusion arXiv 2025 Paper Code Dual Diffusion Unified image generation + understanding via bidirectional diffusion visual understanding, visual generation
UniDisc arXiv 2025 Paper Code Unified Discrete Diffusion Discrete diffusion for multimodal U+G visual understanding, visual generation
MMaDA arXiv 2025 Paper Code Multimodal Large Diffusion LM Diffusion LM for unified understanding/generation visual understanding, visual generation
FUDOKI arXiv 2025 Paper Discrete Flow-based Unified Kinetic-optimal velocities for U+G visual understanding, visual generation
Muddit arXiv 2025 Paper Code Unified Discrete Diffusion Liberating generation beyond T2I visual understanding, visual generation
Lavida-O arXiv 2025 Paper Code Elastic Large Masked Diffusion Elastic masked diffusion for U+G visual understanding, visual generation
UniModel arXiv 2025 Paper Visual-Only MMDiT Framework Visual-only unified multimodal U+G visual understanding, visual generation

Autoregressive (AR) UMMs

Pixel Encoding
Model Venue Links Modalities Notes Task
LWM arXiv 2024 Paper video + language World model on million-length video and language with blockwise ring attention visual understanding, visual generation
Chameleon arXiv 2024 Paper Code image + text Mixed-modal early-fusion foundation models; token-by-token generation visual understanding, visual generation
ANOLE arXiv 2024 Paper Code image + text Open autoregressive native LMM for interleaved image-text generation visual understanding, visual generation
Emu3 arXiv 2024 Paper Code image + text Next-token prediction is all you need; single next-token model visual understanding, visual generation
MMAR arXiv 2024 Paper image + text Lossless multi-modal auto-regressive probabilistic modeling visual understanding, visual generation
Orthus arXiv 2024 Paper Code image + text Autoregressive interleaved image-text generation with modality-specific heads visual understanding, visual generation
SynerGen-VL arXiv 2024 Paper image + text Synergistic image understanding and generation with vision experts and token folding visual understanding, visual generation
Liquid arXiv 2024 Paper Code image + text Language models are scalable and unified multi-modal generators visual understanding, visual generation
UGen arXiv 2025 Paper image + text Unified autoregressive multimodal model with progressive vocabulary learning visual understanding, visual generation
Harmon arXiv 2025 Paper Code image + text Shared MAR encoder for semantic + fine-grained harmony; SOTA GenEval visual understanding, visual generation
TokLIP arXiv 2025 Paper Code image + text Marry visual tokens to CLIP for U+G visual understanding, visual generation
Selftok arXiv 2025 Paper Code image + text Discrete visual tokens for AR / Diffusion / Reasoning visual understanding, visual generation
OneCat arXiv 2025 Paper Code image + text Pure decoder-only unified U+G visual understanding, visual generation
Uni-X arXiv 2025 Paper Code image + text Two-end-separated architecture mitigating modality conflict visual understanding, visual generation
Emu3.5 Nature 2026 Paper Code image + text Native multimodal world learner; next-token only visual understanding, visual generation
Semantic Encoding
Title Venue Links Focus Task
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer arXiv 2025 Paper Code Unified continuous tokenizer for joint understanding and generation visual understanding, visual generation
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents arXiv 2025 Paper Code Bridging MLLMs and diffusion models via patch-level CLIP latents visual understanding, visual generation
Qwen-Image Technical Report arXiv 2025 Paper Code High-quality image generation with strong text rendering visual generation
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again arXiv 2025 Paper Code RL-enhanced discrete autoregressive unified modeling visual understanding, visual generation
Ovis-U1 Technical Report arXiv 2025 Paper Code 3B unified model for understanding, text-to-image and editing visual understanding, visual generation
UniCode²: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation arXiv 2025 Paper Cascaded large-scale codebooks for unified modeling visual understanding, visual generation
OmniGen2: Exploration to Advanced Multimodal Generation arXiv 2025 Paper Code Versatile open-source unified generation model visual generation
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations arXiv 2025 Paper Code Text-aligned discrete semantic representations visual understanding, visual generation
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation arXiv 2025 Paper Code Y-shaped architecture for modality alignment visual understanding, visual generation
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation arXiv 2025 Paper Code High-resolution semantic encoders visual understanding, visual generation
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation arXiv 2025 Paper Auto-regressive foundation model visual understanding, visual generation
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies arXiv 2025 Paper Dual visual vocabularies visual understanding, visual generation
UniTok: A Unified Tokenizer for Visual Generation and Understanding arXiv 2025 Paper Code Unified tokenizer visual understanding, visual generation
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation arXiv 2025 Paper Code Text-aligned visual tokenization visual understanding, visual generation
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning arXiv 2024 Paper Instruction tuning for unified multimodal visual understanding, visual generation
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance arXiv 2024 Paper Self-enhancing unified see-and-draw visual understanding, visual generation
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation arXiv 2024 Paper Code Multi-granular visual generation visual understanding, visual generation
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ICLR 2025 Paper Code Unified foundation model visual understanding, visual generation
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models arXiv 2024 Paper Code Multi-modality potential mining visual understanding, visual generation
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer arXiv 2024 Paper Code Interleaved image-text generative modeling visual understanding, visual generation
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation arXiv 2023 Paper Generative pre-trained transformer visual understanding, visual generation
Generative Multimodal Models are In-Context Learners CVPR 2024 Paper In-context learning generative multimodal visual understanding, visual generation
DreamLLM: Synergistic Multimodal Comprehension and Creation ICLR 2024 Paper Synergistic multimodal comprehension and creation visual understanding, visual generation
LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization ICLR 2024 Paper Code Dynamic discrete visual tokenization visual understanding, visual generation
Emu: Generative Pretraining in Multimodality ICLR 2024 Paper Generative pretraining in multimodality visual understanding, visual generation
Learnable Query Encoding
Title Venue Links Focus Task
Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model arXiv 2025 Paper Kontext model with online RL and MetaQuery connector for unified multimodal framework visual understanding, visual generation, editing
TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning arXiv 2025 Paper Ladder-side diffusion tuning integrating MLLM and DiT via layer-wise alignment visual understanding, visual generation
UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing arXiv 2025 Paper Adapting CLIP with unified continuous tokenizer for reconstruction, generation and editing visual understanding, visual generation, editing
OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation arXiv 2025 Paper Code Simple baseline with learnable queries and lightweight connector bridging MLLM and diffusion visual understanding, visual generation
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset arXiv 2025 Paper Fully open unified multimodal models with complete architecture, training recipe and datasets visual understanding, visual generation
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction arXiv 2025 Paper Unified visual generator and native multimodal autoregressive model for natural interaction visual understanding, visual generation
Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing arXiv 2025 Paper Code Prefilled autoregression in shared embedding space unifying understanding, generation and editing visual understanding, visual generation, editing
Transfer between Modalities with MetaQueries arXiv 2025 Paper Code Learnable MetaQueries as efficient interface between autoregressive MLLMs and diffusion models visual understanding, visual generation
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation arXiv 2024 Paper Code Unified multi-granularity visual semantics for arbitrary-size comprehension and generation visual understanding, visual generation
Making LLaMA SEE and Draw with SEED Tokenizer ICLR 2024 Paper Code SEED tokenizer enabling LLaMA for scalable multimodal autoregression (see and draw) visual understanding, visual generation
Planting a SEED of Vision in Large Language Model arXiv 2023 Paper Code SEED image tokenizer with 1D causal dependency and high-level semantics for LLM vision visual understanding, visual generation
Hybrid Encoding (Pseudo)
Title Venue Links Focus Task
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation arXiv 2025 Paper Code Unified autoregressive modeling with decoupled encoding for image understanding, generation and editing visual understanding, visual generation
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO arXiv 2025 Paper Code Unified VLM with reasoning generation via Reinforcement Learning (RGPO) multimodal understanding, reasoning generation
UniFluid: Unified Autoregressive Visual Generation and Understanding with Continuous Tokens arXiv 2025 Paper Unified autoregressive framework using continuous visual tokens visual understanding, visual generation
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models arXiv 2025 Paper Code Efficient linear-time unified multimodal model based on Mamba (state space models) multimodal understanding, visual generation
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling arXiv 2025 Paper Code Scaled-up version of Janus with improved training strategy, more data and larger model size multimodal understanding, visual generation
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation arXiv 2024 Paper Code Decoupling visual encoding to enable unified understanding and generation in an autoregressive framework multimodal understanding, visual generation
Hybrid Encoding (Joint)
Title Venue Links Focus Task
AToken: A Unified Tokenizer for Vision arXiv 2025 Paper Code AToken unified visual tokenizer achieving high-fidelity reconstruction and semantic understanding for images, videos and 3D visual understanding, visual generation
UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^128 for Unified Multimodal Large Language Model arXiv 2026 Paper UniWeTok unified binary tokenizer with 2^128 codebook, pre-post distillation and generative-aware prior for MLLMs visual understanding, visual generation
Towards Scalable Pre-training of Visual Tokenizers for Generation arXiv 2025 Paper Code VTP unified visual tokenizer pre-training framework with joint image-text contrastive, self-supervised and reconstruction losses visual understanding, visual generation
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding arXiv 2025 Paper Code Prism Hypothesis and unified autoencoding (UAE) harmonizing semantic and pixel representations across modalities visual understanding, visual generation
Show-o2: Improved Native Unified Multimodal Models arXiv 2025 Paper Code Improved native unified multimodal models with autoregressive modeling and flow matching for understanding and generation multimodal understanding and generation
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding CVPRW 2025 Paper Code Unified visual encoding combining discrete and continuous representations for autoregressive multimodal models multimodal understanding and generation
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning arXiv 2025 Paper Code Enhanced visual autoregressive unified model with iterative instruction tuning and DPO reinforcement learning visual understanding, generation and editing
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement arXiv 2025 Paper Code Dual visual tokenization and diffusion refinement for unified multimodal large language model multimodal understanding and generation
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation arXiv 2025 Paper Semantic-guided hierarchical codebook for unified image tokenization supporting understanding and generation multimodal understanding and generation
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model arXiv 2025 Paper Code Visual autoregressive framework unifying understanding and generation in a single MLLM visual understanding and generation
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation CVPR 2025 Paper Code Unified image tokenizer with dual-codebook architecture bridging understanding and generation multimodal understanding and generation
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding arXiv 2024 Paper Semantic discrete encoding for unified vision-language model enabling efficient multimodal understanding and generation multimodal understanding and generation

Hybrid (AR + Diffusion) UMMs

Pixel Encoding
Paper Venue Links Notes Task
Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models arXiv 2025 Paper Code Native unified multimodal model with cascaded VAE + representation encoder for unified continuous visual representations multimodal understanding and generation
LMFusion: Adapting Pretrained Language Models for Multimodal Generation arXiv 2024 Paper Adapting pretrained LLMs (Llama) for multimodal generation by adding parallel diffusion modules while keeping autoregressive text modeling multimodal understanding and generation
MonoFormer: One Transformer for Both Diffusion and Autoregression arXiv 2024 Paper Code Single shared transformer backbone that handles both autoregressive modeling and diffusion for unified multimodal tasks visual understanding and generation
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation ICLR 2025 Paper Code Unified transformer combining autoregressive and discrete diffusion modeling to flexibly handle mixed-modality inputs/outputs multimodal understanding and generation
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model ICLR 2025 Paper Joint training of next-token prediction (AR) and diffusion in one transformer over mixed discrete/continuous multimodal sequences visual understanding, visual generation
Hybrid Encoding
Paper Venue Links Notes Task
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture arXiv 2025 Paper Code Efficient unified architecture with autoencoders, channel-wise concatenation, shared-decoupled networks and MoE for understanding, generation and editing multimodal understanding, generation and editing
HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation arXiv 2025 Paper Asymmetric H-shaped architecture bridging heterogeneous experts with symmetric dense mid-layer connections for unified multimodal modeling multimodal understanding and generation
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation arXiv 2025 Paper Code Light-weighted double fusion framework that efficiently integrates pretrained vision-language and diffusion models multimodal understanding and generation
BAGEL: Emerging Properties in Unified Multimodal Pretraining arXiv 2025 Paper Code Open-source foundational decoder-only model pretrained on trillions of interleaved multimodal tokens supporting native understanding and generation multimodal understanding and generation
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation arXiv 2025 Paper Causal interleaved multi-modal generation framework with deep-fusion, dual vision encoders and multi-modal classifier-free guidance interleaved multimodal generation
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation arXiv 2024 Paper Code Minimalist framework harmonizing autoregressive LLMs with rectified flow for efficient unified understanding and generation multimodal understanding and generation

4.2 Any-to-Any / Omni UMMs

Models that extend unified understanding + generation beyond text and image to support any-to-any modality conversion (audio, video, speech, etc.). These often build on the paradigms above but emphasize native omni-modal tokenization, long-context handling, and cross-modal generation.

Model Paper Links Notes Task
LongCat-Flash-Omni arXiv 2025 Paper Code Efficient omni model with flash-style acceleration and real-time audio-visual interaction (560B parameters) any-to-any multimodal generation and understanding
Ming-Flash-Omni arXiv 2025 Paper Code Sparse unified MoE architecture (100B total, 6.1B active) for efficient multimodal perception and generation any-to-any multimodal perception and generation
Qwen3-Omni arXiv 2025 Paper Code Next-gen Qwen omni model with unified modality space, maintaining SOTA across text/image/audio/video any-to-any multimodal understanding and generation
Ming-Omni arXiv 2025 Paper Code Unified multimodal architecture for perception + generation (images, text, audio, video) any-to-any multimodal tasks
M2-Omni arXiv 2025 Paper Extends Omni-MLLM with broader modality support and competitive performance to GPT-4o any-to-any multimodal modeling
Spider arXiv 2024 Paper Code Any-to-many multimodal LLM with flexible output heads for arbitrary modality combinations multimodal understanding and generation
MIO arXiv 2024 Paper Token-level unified multimodal foundation model on discrete multimodal tokens any-to-any multimodal token modeling
X-VILA arXiv 2024 Paper Cross-modality alignment for LLM-based multimodal systems (image/video/audio) multimodal understanding
AnyGPT arXiv 2024 Paper Code Discrete token modeling for unified multimodal generation any-to-any multimodal generation
OmniFlow CVPR 2025 Paper Uses multi-modal rectified flows for any-to-any generation across modalities any-to-any generation across modalities
Video-LaVIT ICML 2024 Paper Code Decoupled visual-motion tokenization for video-language modeling video understanding and generation
Unified-IO 2 CVPR 2024 Paper Code Scales autoregressive multimodal models across modalities any-to-any multimodal tasks (vision, language, audio, action)
NExT-GPT arXiv 2023 Paper Code Any-to-any; encoder+LLM+diffusion decoders visual understanding, visual generation, omni

Back to Top


5. Native Multimodal Models (NMMs)

In this section: 5.1 Design Analyses & Scaling Laws · 5.2 Early Fusion NMMs · 5.3 Late Fusion NMMs · 5.4 Any-to-Any / Omni NMMs

The most restrictive category. NMMs are trained completely from scratch on multimodal data — no pretrained LLM or vision encoder is used as initialization. All weights are jointly learned end-to-end.

What recent arXiv work emphasizes: native multimodality is increasingly defined by end-to-end multimodal pretraining, tokenizer/representation co-design, and scaling strategies that explicitly address the asymmetry between vision and language.

5.1 Design Analyses & Scaling Laws

Recent arXiv papers sharpen the definition of NMMs and identify the main bottlenecks in native multimodal pretraining.

Paper Venue Links Insights
Beyond Language Modeling: An Exploration of Multimodal Pretraining arXiv 2026 Paper Highlights representation autoencoders, vision-language data synergy, and MoE for native pretraining
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints arXiv 2025 Paper Code End-to-end native MLLM scaling shows positive correlation between visual encoder and LLM size under data constraints; optimal meta-architecture balances cost and performance
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training arXiv 2025 Paper Reveals that LLMs develop latent visual priors during text-only pre-training: reasoning-centric data (code and math) builds transferable visual reasoning skills, while broad corpora foster perception, enabling models to 'see' before ever processing an image
Scaling Laws for Native Multimodal Models arXiv 2025 Paper Early-fusion NMMs match or outperform late-fusion at low compute; early-fusion needs fewer params; MoE with modality-agnostic routing boosts sparse NMM scaling
The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models arXiv 2024 Paper Native models often funnel image-to-text communication through a single post-image token

5.2 Early Fusion NMMs

Single Transformer decoder processes tokenized text and image inputs from layer 1, with minimal modality-specific parameters (only a linear patch embedding for images). No separate image encoder component.

Recent scaling-law evidence suggests early-fusion NMMs are often stronger at lower parameter counts and simpler to deploy when paired with sufficiently strong visual representations.
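The "minimal modality-specific parameters" property described above can be sketched concretely: the image is patchified, each patch is linearly embedded (the only vision-specific weights), and the resulting tokens join the text token stream before layer 1 of a single decoder. A NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, patch = 512, 16                     # illustrative sizes

image = rng.standard_normal((64, 64, 3))     # one RGB image
# Patchify: (64/16)^2 = 16 patches, each flattened to 16*16*3 = 768 values
patches = image.reshape(4, patch, 4, patch, 3).transpose(0, 2, 1, 3, 4).reshape(16, -1)

W_patch = rng.standard_normal((patches.shape[1], d_model)) * 0.01  # the ONLY vision-specific weights
img_tokens = patches @ W_patch

text_tokens = rng.standard_normal((8, d_model))  # ordinary token embeddings
sequence = np.concatenate([text_tokens, img_tokens, text_tokens], axis=0)

# From here a single Transformer decoder processes the mixed sequence from layer 1
print(sequence.shape)  # (32, 512)
```

Contrast this with the late-fusion designs in §5.3, where a full (jointly trained) vision tower produces features before any interaction with the language stream.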

Model Paper Links Training Scale Notes Task
NEO arXiv 2025 Paper Cornerstone for scalable, powerful native VLM development, with a rich set of reusable components that foster a cost-effective and extensible ecosystem vision-language understanding
NEO-Unify - Blog Cornerstone for scalable, powerful native VLM development, with a rich set of reusable components that foster a cost-effective and extensible ecosystem vision-language understanding
Emu3.5 Nature 2026 Paper Code Large-scale (trillion+ tokens) Native world model; next-state prediction on interleaved video/text; Discrete Diffusion Adaptation for efficiency interleaved generation, world modeling, any-to-image

5.3 Late Fusion NMMs

Models where separate unimodal components are jointly trained from scratch (not pretrained), with cross-modal interaction occurring at deeper layers. Distinct from MLLMs where vision encoders are pretrained.

Model Paper Links Training Scale Notes Task
Llama4 arXiv 2026 Paper Blog Scout/Maverick: 17B active / ~109B–400B total; Behemoth: ~2T total Native multimodal, MoE architecture with early fusion and vision encoder vision-language understanding
LongCat-Next arXiv 2026 Paper Discrete Native Any-resolution Visual Transformer vision-language understanding
VL-JEPA arXiv 2025 Paper 1.6B Vision-Language Joint Native Model vision-language understanding
InternVL3 arXiv 2025 Paper A pre-trained InternViT encoder coupled with a cross-attention visual expert, employing a deep but late-fusion strategy to ensure seamless multimodal alignment while strictly preserving native LLM reasoning and linguistic proficiency. vision-language understanding
InternVL3.5 arXiv 2025 Paper A pre-trained ViT encoder with a visual expert that uses cross-attention for deep but late-style fusion to the LLM, preserving its capabilities. vision-language understanding
Qwen3.5 - Blog Discrete Native Any-resolution Visual Transformer vision-language understanding
Gemma4 - Blog A pre-trained ViT encoder with a visual expert that uses cross-attention for deep but late-style fusion to the LLM, preserving its capabilities. vision-language understanding
Emu3 arXiv 2024 Paper Code 8B Next-token prediction over VQ image tokens; native multimodal decoder-only; minimal modality-specific params visual understanding, visual generation

5.4 Any-to-Any / Omni NMMs

Recent native multimodal papers on arXiv increasingly blur the boundaries between omni understanding, any-to-any generation, world modeling, and RL-enhanced post-training.

Model Paper Links Notes Task
Qwen3.5-Omni (Late Fusion) Blog Discrete Native Any-resolution Visual Transformer
ERNIE 5.0 Technical Report arXiv 2026 (Late fusion) Paper A natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio

Back to Top


6. Closed-Source Multimodal Models

Year 2026

Model Venue Links Notes Task
Claude 4.6 Family (Opus 4.6 / Sonnet 4.6) Anthropic Blog Anthropic Claude Updates Released ~February 2026. Further multimodal refinements (vision + tool/computer use). Proprietary. Multimodal + Agentic/Coding/Computer-Use
Gemini 3.x (Pro / Flash / 3.1 Pro) Google DeepMind Blog Gemini 3 Announcements Released late 2025–early 2026 (Flash Dec 2025, Pro variants Feb 2026). State-of-the-art multimodal with massive context and Deep Think modes. Proprietary. Frontier Multimodal (text/image/audio/video + reasoning)
GPT-5.4 (and Pro/Codex variants) OpenAI Blog OpenAI GPT-5 Updates Released ~March 2026. Enhanced efficiency, multimodal, and professional/agentic features. Proprietary. Omni-Modal + Professional/Agentic Workflows
Grok 4.x updates (e.g., Grok 4.1) xAI Announcement xAI Blog Continued 2025–2026 iterations with improved vision and real-time capabilities. Proprietary via X platform. Multimodal + Real-Time/Data-Integrated Reasoning

Year 2025

Model Venue Links Notes Task
Gemini 2.0 / 2.5 (Pro / Flash) Google DeepMind Blog Gemini 2.x Announcements Released early–mid 2025 (Flash ~Jan/Feb, Pro variants through March–June). Native multimodal with improved agentic and long-context capabilities. Proprietary. Advanced Native Multimodal + Agentic (text/image/audio/video)
Claude 4 Family (Opus 4 / Sonnet 4 / Haiku 4) Anthropic Blog Claude 4 Announcement Released ~May 2025. Enhanced vision, reasoning, and early agentic features. Proprietary. Vision + Advanced Reasoning/Agentic Workflows
Grok 3 / Grok 4 (including vision/speech) xAI Announcement xAI Blog Major updates throughout 2025 (Grok 3 ~early 2025, Grok 4 ~mid-late 2025). Multimodal input (text/image/speech). Proprietary. Multimodal Reasoning + Real-Time Integration
GPT-4.5 / GPT-5 (and variants like GPT-5 Codex) OpenAI Blog OpenAI Announcements GPT-4.5 ~early 2025; full GPT-5 ~August 2025. Unified multimodal with strong reasoning and tool use. Proprietary. Omni-Modal + Advanced Reasoning/Agentic
Mistral Large / Medium Multimodal variants Mistral AI Mistral Platform Proprietary multimodal offerings (e.g., Medium 3.1 ~2025). Text + vision capabilities via API. General Multimodal Tasks

Year 2024

Model Venue Links Notes Task
Gemini 1.5 (Pro / Flash) Google DeepMind Blog Gemini 1.5 Announcement Released February 2024. Massive context (>1M tokens), strong long-context multimodal (video, audio, images). Proprietary. Long-Context Multimodal (video/audio/image/text)
Claude 3 Family (Opus / Sonnet / Haiku) Anthropic Blog Claude 3 Family Released March 2024. Strong native vision for images, charts, diagrams, and documents. Proprietary API + Claude.ai. Vision-Language + Reasoning
Grok-1.5V / Grok-2 Vision xAI Announcement Grok Vision Updates Vision capabilities added ~April 2024 (Grok-1.5V), expanded in Grok-2 (August 2024). Image understanding with real-world and diagram reasoning. Proprietary via X/Grok API. Vision-Language (real-world visuals, diagrams)
GPT-4o (Omni) OpenAI Blog GPT-4o Announcement Released May 2024. Full real-time omni-modal: text + image + audio (voice) input/output. Proprietary. Real-Time Omni-Modal (text/vision/audio)
Amazon Nova (Pro / Lite) Amazon Announcement AWS Bedrock docs Released late 2024. Multimodal (text + image + video). Proprietary via Amazon Bedrock API. Multimodal Understanding (text/image/video)

Year 2023

Model Venue Links Notes Task
GPT-4V (Vision) OpenAI Announcement GPT-4V System Card Released September 2023. First widely available multimodal GPT-4 variant. Image + text input, text output. API/ChatGPT access only. Vision-Language (image understanding, VQA, OCR, document analysis, captioning)
Gemini 1.0 (Ultra / Pro / Nano) Google DeepMind Blog Gemini Announcement Released December 2023. Native multimodal from training (text + image + audio + video). Proprietary API + Gemini chatbot. Native Multimodal Understanding (text/image/audio/video)

7. Resources

In this section: 7.1 Related Awesome Lists · 7.2 Slides & Survey Papers · 7.3 Code Repositories & Tools

7.1 Related Awesome Lists

Repository Focus Author
awesome-multimodal-ml General multimodal ML pliang279
Awesome-Multimodal-Large-Language-Models MLLMs + evaluation BradyFU
Awesome-Multimodal-Research Broad multimodal research Eurus-Holmes
Awesome-Unified-Multimodal-Models UMMs ShowLab
Awesome-Multimodal-Large-Language-Models MLLMs yfzhang114
awesome-foundation-and-multimodal-models Foundation + multimodal SkalskiP
Awesome-Multimodality General multimodality Yutong-Zhou-cv
Awesome-Unified-Multimodal Unified models Purshow
Awesome-Unified-Multimodal Unified models AIDC-AI

7.2 Slides & Survey Papers

Type Resource Notes
Slides Native LMM Slides Ziwei Liu (NTU); concise framing for native multimodal models
Survey A Survey on Multimodal Large Language Models Broad survey of MLLM architectures, data, and evaluation
Report The Dawn of LMMs: Preliminary Explorations with GPT-4V Early capability analysis around GPT-4V
Survey Multimodal Foundation Models: From Specialists to General-Purpose Assistants Broader foundation-model view across multimodal systems

7.3 Code Repositories & Tools

Tool Description Link
LMMs-Eval Unified evaluation harness for multimodal models Code
LAVIS Library for Language-Vision Intelligence (Salesforce) Code
OpenFlamingo Open reproduction of DeepMind Flamingo Code
xtuner Efficient fine-tuning for multimodal LLMs Code
LLaMA-Factory Multimodal instruction tuning framework Code
MMEngine Foundational training engine for OpenMMLab projects Code
DeepSpeed-VisualChat Scalable multimodal chat training Code

Back to Top


How to Contribute

In this section: Validation Rules · Entry Format

We welcome contributions! Please follow these guidelines:

Validation Rules

For NMM submissions (strict):

  • Confirm the model does NOT use any pretrained LLM backbone
  • Confirm the model does NOT use any pretrained vision encoder (CLIP, ViT, etc.)
  • All weights are jointly trained from scratch on multimodal data
  • Classify as Early Fusion or Late Fusion (both must be "from scratch")

For UMM submissions:

  • Confirm the model handles both image understanding AND image generation
  • Note whether pretrained components are used (annotate accordingly)

For MLLM submissions:

  • Note which vision encoder is used (must be a pretrained encoder)
  • Note which LLM backbone is used (must be a pretrained LLM)

Entry Format

| **Model Name** | [Paper](arxiv_link) [Code](github_link) [HF](huggingface_link) BADGES | Scale | Key contribution / notes |

Submit a PR with:

  1. The paper/model entry in the correct section
  2. A one-line justification for the chosen category
  3. Links to paper, code, and/or weights

Back to Top


Citation

If this list is useful in your research, please consider citing:

@misc{awesome-multimodal-modeling-2026,
  title     = {Awesome Multimodal Modeling: From Traditional to Native \& Unified},
  author    = {OpenEnvision-Lab},
  year      = {2026},
  url       = {https://github.com/OpenEnvision-Lab/Awesome-Multimodal-Model-Traditional-Advanced},
  note      = {GitHub repository}
}

Back to Top


License

CC0

This list is released under the CC0 1.0 Universal license.

Star this repo

Maintained by the community for the multimodal research community.

Back to Top

About

Awesome Multimodal Modeling [Covers MLLM, UMM, and NMM]
