A Comprehensive Survey & Curated List of Multimodal Modeling
From Traditional Fusion to Native & Unified Architectures
Overview · Traditional · MLLMs · UMMs · NMMs · Closed Source Models · Resources
- [2026-04-13] ⭐ The repository has already gained over 100 stars in just one day! Thank you all for the incredible support. We will keep updating this list with more cutting-edge models and resources. Your continued stars and PRs are warmly welcomed!
- [2026-04-12] 🎉 We are excited to launch Awesome Multimodal Modeling — a curated reading list organized by architectural paradigms. A comprehensive survey paper is coming soon! Stay tuned.
Browse the list
- Awesome Multimodal Modeling
In this section: At a Glance · Curation Principles
This repository provides a structured, community-maintained survey of multimodal models, covering the full evolutionary arc from early fusion methods to today's natively-trained omni-models. We emphasize precise architectural definitions and classification, especially for the often-conflated categories of Unified Multimodal Models (UMMs) and Native Multimodal Models (NMMs).
Scope: Primary focus on image + text modalities; audio/video/3D are annotated where present. Omni/any-to-any models are marked with Omni.
| Dimension | Coverage |
|---|---|
| Primary scope | Image + text multimodal models, with explicit annotations for video, audio, and omni extensions |
| Core taxonomy | Traditional multimodal models, MLLMs, UMMs, and strict NMMs |
| Key distinction | U+G unification for UMMs vs. joint training from scratch for NMMs |
| What makes this repo different | Architecture-first categorization, fusion-aware definitions, and curated links to adjacent awesome lists |
| Intended audience | Researchers, students, and engineers building or surveying multimodal systems |
| Principle | Rule |
|---|---|
| Source quality | Prefer official conference proceedings, OpenReview, ACL Anthology, CVF Open Access, arXiv, and official project pages |
| Classification policy | Category assignment is based on this repository's architecture-first definitions, which may differ from authors' own branding |
| Venue policy | If a peer-reviewed venue is known, we list that venue; otherwise we keep the entry as arXiv |
| Scope discipline | Models, benchmarks, datasets, and analysis papers are tracked separately to avoid mixing artifacts |
| Inclusion bar | We prioritize landmark papers, broadly adopted benchmarks, open implementations, or papers that clarify important taxonomy boundaries |
Classification note: for ambiguous models sitting between MLLM, UMM, and strict NMM, this list records the category that best matches the training recipe and architectural coupling, not just the paper title.
In this section: 1.1 Multimodal Model Evolution Stages · 1.2 Scope & Taxonomy · 1.3 Architecture Diagrams
Subtopics: Traditional Multimodal Models · Multimodal Large Language Models (MLLMs) · Unified Multimodal Models (UMMs) · Native Multimodal Models (NMMs)
We use the following precise, architecture-first definitions throughout this list. Understanding these distinctions is critical for correctly classifying modern models.
Pre-2023 mainstream era
Independent per-modality processing followed by simple fusion (early, late, or hybrid). No large-scale language model backbone. Focuses on representation alignment, cross-modal retrieval, and captioning. Examples: CLIP, ALIGN, ViLBERT, BLIP.
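The early/late fusion distinction above can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions, not the recipe of any listed model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality features for one sample.
img_feat = rng.standard_normal(512)   # e.g. from a CNN/ViT encoder
txt_feat = rng.standard_normal(300)   # e.g. from an LSTM/BERT encoder

# Early fusion: concatenate features first, then classify jointly.
W_early = rng.standard_normal((10, 512 + 300)) * 0.01
logits_early = W_early @ np.concatenate([img_feat, txt_feat])

# Late fusion: score each modality separately, then combine the scores.
W_img = rng.standard_normal((10, 512)) * 0.01
W_txt = rng.standard_normal((10, 300)) * 0.01
logits_late = 0.5 * (W_img @ img_feat) + 0.5 * (W_txt @ txt_feat)

print(logits_early.shape, logits_late.shape)  # (10,) (10,)
```

Hybrid fusion mixes the two, e.g. combining intermediate features in some layers and output scores in others.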
Pretrained-backbone multimodal language models
Combine a pretrained visual backbone or visual abstractor (e.g., ViT/CLIP/SigLIP, Q-Former, cross-attention adapter) with a pretrained LLM through a connector. The defining property is inheritance from strong pretrained unimodal backbones rather than joint multimodal pretraining from scratch. These models are primarily text-output understanding/reasoning systems, even when auxiliary generators are attached externally.
Key characteristics:
- ✅ Pretrained visual encoder / abstractor
- ✅ Pretrained LLM backbone
- ✅ Connector layer or cross-attention bridge
- ❌ No end-to-end multimodal pretraining from scratch
- ❌ No native image generation inside the same backbone
Examples: LLaVA, Qwen-VL, InternVL, MiniCPM-V, CogVLM
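The connector pattern that defines this category can be sketched as follows (a LLaVA-style linear projector; all dimensions and the random stand-ins for pretrained features are illustrative assumptions, not values from any specific checkpoint):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 1024, 4096      # illustrative hidden sizes
n_patches, n_text = 576, 32       # image tokens from the ViT, text tokens

# Stand-in for frozen pretrained vision-encoder output (CLIP/SigLIP-style).
vision_feats = rng.standard_normal((n_patches, d_vision))

# Trainable connector: a linear projection into the LLM embedding space.
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.01
vision_tokens = vision_feats @ W_proj            # (576, 4096)

# Stand-in for text embeddings from the pretrained LLM's embedding table.
text_tokens = rng.standard_normal((n_text, d_llm))

# Projected vision tokens are inserted directly into the LLM input sequence;
# neither backbone is pretrained jointly with the other.
llm_input = np.concatenate([vision_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (608, 4096)
```

Q-Former and cross-attention variants replace the linear projection with a heavier bridge, but the inherited-backbone structure is the same.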
Single framework for Understanding + Generation (U+G)
A single framework that handles both multimodal understanding and visual generation. UMMs may reuse pretrained components or modular tokenizers; the defining feature is U+G unification, not whether the model is trained from scratch.
Key characteristics:
- ✅ Unified understanding + generation
- ✅ Shared model interface or shared backbone for U+G
- ⚠️ May use pretrained components
- ⚠️ May use decoupled encoders / modular tokenizers
- ⚠️ If a model is also natively trained from scratch, its architectural details belong primarily in NMMs (§5)
Examples: Show-o, Janus, OpenUni, BAGEL, BLIP3-o
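One common discrete-token recipe behind U+G unification maps image VQ codes into an extended vocabulary so a single decoder can emit either modality. The sketch below is hypothetical (the vocabulary sizes and helper names are illustrative, not taken from any listed model):

```python
# Hypothetical shared vocabulary layout: text ids occupy [0, TEXT_VOCAB),
# image codebook ids occupy [TEXT_VOCAB, TEXT_VOCAB + CODEBOOK_SIZE).
TEXT_VOCAB = 32000
CODEBOOK_SIZE = 8192

def image_code_to_token(code: int) -> int:
    """Offset a VQ codebook index into the shared vocabulary."""
    assert 0 <= code < CODEBOOK_SIZE
    return TEXT_VOCAB + code

def token_to_modality(token: int) -> str:
    """Route a sampled token back to its modality at decode time."""
    return "text" if token < TEXT_VOCAB else "image"

print(image_code_to_token(5))     # 32005
print(token_to_modality(32005))   # image
print(token_to_modality(17))      # text
```

Diffusion-based and hybrid UMMs instead attach a continuous generation head, but the shared-interface idea is the same: one backbone serves both understanding and generation.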
Jointly trained from scratch — no pretrained backbone
The strictest category. NMMs are trained jointly from scratch on all modalities — they do not rely on any pretrained LLM or pretrained vision encoder as initialization. All parameters are learned end-to-end from raw multimodal data.
Key characteristics:
- ✅ No pretrained LLM backbone
- ✅ No pretrained vision encoder
- ✅ All components jointly trained from scratch
- ✅ Input: text tokens + image patches/tokens
- ✅ Output: text (understanding focus; generation optional)
NMMs are further divided by fusion architecture:
Multimodal interaction begins from the first layer. A single Transformer decoder processes tokenized text and continuous/discrete image patches together, with minimal modality-specific parameters (only a linear patchify layer for images). No separate image encoder is maintained.
- Single unified Transformer (decoder-only)
- Continuous image patches or minimal discrete tokenization
- Modality interaction from layer 1
- Near-zero modality-specific parameters (excluding linear patch embed)
- Examples: Emu3 (if trained from scratch)
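The "linear patchify only" property can be sketched as below (patch size and widths are illustrative; random arrays stand in for real pixels and embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, patch = 768, 16

# Raw image -> non-overlapping patches -> one linear embed. No ViT, no CLIP.
image = rng.standard_normal((224, 224, 3))
patches = image.reshape(14, patch, 14, patch, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(14 * 14, patch * patch * 3)        # (196, 768)

W_patch = rng.standard_normal((patch * patch * 3, d_model)) * 0.01
img_tokens = patches @ W_patch                               # (196, 768)

# Stand-in for the decoder's text token embeddings.
txt_tokens = rng.standard_normal((32, d_model))

# Both streams enter the SAME decoder at layer 1; the linear patchify
# projection above is the only modality-specific parameter set.
sequence = np.concatenate([txt_tokens, img_tokens], axis=0)
print(sequence.shape)  # (228, 768)
```

Contrast with the MLLM sketch earlier: here `W_patch` is learned jointly with the decoder from scratch, rather than inherited from a pretrained vision tower.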
Each modality is first processed by a dedicated unimodal component (e.g., a vision tower or image encoder), but these components are jointly trained from scratch (not pretrained). Cross-modal interaction occurs at deeper layers.
- Separate unimodal processing stages (trained from scratch)
- Cross-modal interaction at deeper layers
- More modality-specific parameters
- Examples: Models with jointly-trained vision encoders → decoder interaction
Multimodal Models
├── 2. Traditional Multimodal Models
│ ├── 2.1 Multimodal Representations & Alignment
│ │ ├── Multimodal Representations
│ │ ├── Multimodal Fusion
│ │ └── Multimodal Alignment
│ └── 2.2 Multimodal Pretraining
├── 3. Multimodal Large Language Models (MLLMs)
│ ├── 3.1 Foundation MLLMs
│ └── 3.2 Omni MLLMs
├── 4. Unified Multimodal Models (UMMs)
│ ├── 4.1 Taxonomy by Generation Paradigm
│ │ ├── Diffusion-Based UMMs
│ │ ├── Autoregressive (AR) UMMs
│ │ │ ├── Pixel Encoding
│ │ │ ├── Semantic Encoding
│ │ │ ├── Learnable Query Encoding
│ │ │ ├── Hybrid Encoding (Pseudo)
│ │ │ └── Hybrid Encoding (Joint)
│ │ └── Hybrid (AR + Diffusion) UMMs
│ │ ├── Pixel Encoding
│ │ └── Hybrid Encoding
│ └── 4.2 Any-to-Any / Omni UMMs
└── 5. Native Multimodal Models (NMMs)
├── 5.1 Design Analyses & Scaling Laws
├── 5.2 Early Fusion NMMs
├── 5.3 Late Fusion NMMs
└── 5.4 Any-to-Any / Omni NMMs
┌─────────────────────────────────────────────────────────────────┐
│ TRADITIONAL MULTIMODAL MODEL │
│ │
│ [Image] ──► [CNN/ViT Encoder] ──┐ │
│ ├──► [Fusion] ──► [Output] │
│ [Text] ──► [LSTM/BERT] ──┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ MLLM — MODULAR LATE FUSION │
│ │
│ [Image] ──► [Pretrained ViT/CLIP] ──► [Projector/Q-Former] │
│ │ │
│ ▼ │
│ [Text] ──────────────────────────► [Pretrained LLM] ──► [Text]│
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ UMM — UNIFIED UNDERSTANDING + GENERATION │
│ │
│ [Image/Text Input] ──► [Shared/Modular Tokenizer] │
│ │ │
│ ▼ │
│ [Unified Transformer] │
│ │ │ │
│ ▼ ▼ │
│ [Text Output] [Image Output] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NMM — EARLY FUSION (Trained from Scratch) │
│ │
│ [Text tokens] ──┐ │
│ └──► [Single Decoder Transformer] ──► [Text] │
│ [Image patches ──► Linear Patchify] ──┘ │
│ (raw pixels, minimal preprocessing) │
│ Multimodal interaction from Layer 1 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NMM — LATE FUSION (Trained from Scratch) │
│ │
│ [Image] ──► [Jointly-Trained Vision Component] │
│ │ │
│ ▼ (deep layers) │
│ [Text] ──────────► [Cross-Modal Interaction] ──► [Text] │
│ (All components trained jointly from scratch) │
└─────────────────────────────────────────────────────────────────┘
In this section: 2.1 Multimodal Representations & Alignment · 2.2 Multimodal Pretraining
Pre-chat-MLLM and non-native multimodal systems that established the basic vocabulary of alignment, fusion, retrieval, captioning, and multimodal pretraining.
Subtopics: Multimodal Representations · Multimodal Fusion · Multimodal Alignment
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Identifiability Results for Multimodal Contrastive Learning | ICLR 2023 | Paper | Theoretical identifiability analysis of contrastive multimodal learning | representation learning |
| Unpaired Vision-Language Pre-training via Cross-Modal CutMix | ICML 2022 | Paper | Introduces CutMix-style augmentation for unpaired VLP | vision-language pretraining |
| Balanced Multimodal Learning via On-the-fly Gradient Modulation | CVPR 2022 | Paper | Balances modality learning via dynamic gradient reweighting | multimodal optimization |
| Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast | IJCAI 2021 | Paper | Cross-modal prototype contrast for voice-face alignment | audio-visual representation learning |
| Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text | arXiv 2021 | Paper | Early unified transformer for unpaired multimodal pretraining | unified multimodal pretraining |
| FLAVA: A Foundational Language And Vision Alignment Model | arXiv 2021 | Paper | Unified foundation model covering unimodal vision, unimodal language, and cross-modal tasks | foundation multimodal model |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | arXiv 2021 | Paper | Single transformer for multiple multimodal tasks | multimodal multitask learning |
| MultiBench: Multiscale Benchmarks for Multimodal Representation Learning | NeurIPS 2021 | Paper | Benchmark suite for multimodal learning evaluation | benchmarking |
| Perceiver: General Perception with Iterative Attention | ICML 2021 | Paper | General-purpose architecture for high-dimensional multimodal inputs | general multimodal architecture |
| Learning Transferable Visual Models From Natural Language Supervision | arXiv 2021 | Paper | Contrastive vision-language pretraining at scale | vision-language contrastive learning |
| VinVL: Revisiting Visual Representations in Vision-Language Models | arXiv 2021 | Paper | Improved visual features for VL tasks | vision-language representation improvement |
| 12-in-1: Multi-Task Vision and Language Representation Learning | CVPR 2020 | Paper | Unified multi-task learning across 12 VL tasks | multi-task learning |
| Watching the World Go By: Representation Learning from Unlabeled Videos | arXiv 2020 | Paper | Self-supervised video representation learning | video representation learning |
| Learning Video Representations using Contrastive Bidirectional Transformer | arXiv 2019 | Paper | Contrastive transformer for video representation learning | video contrastive learning |
| Visual Concept-Metaconcept Learning | NeurIPS 2019 | Paper | Hierarchical concept learning from visual data | concept learning |
| OmniNet: A Unified Architecture for Multi-modal Multi-task Learning | arXiv 2019 | Paper | Unified encoder-decoder for multimodal tasks | unified multimodal architecture |
| Learning Representations by Maximizing Mutual Information Across Views | arXiv 2019 | Paper | InfoMax principle for cross-view representation learning | self-supervised learning |
| ViCo: Word Embeddings from Visual Co-occurrences | ICCV 2019 | Paper | Learning word embeddings from visual context | vision-language embeddings |
| Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations | CVPR 2019 | Paper | Structured embedding space for vision-language alignment | embedding learning |
| Multi-Task Learning of Hierarchical Vision-Language Representation | CVPR 2019 | Paper | Hierarchical representation learning across VL tasks | multi-task learning |
| Learning Factorized Multimodal Representations | ICLR 2019 | Paper | Factorized latent space for multimodal data | representation disentanglement |
| A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks | ICML 2018 | Paper | Probabilistic modeling of multi-view correspondence | multi-view learning |
| Do Neural Network Cross-Modal Mappings Really Bridge Modalities? | ACL 2018 | Paper | Analyzes limitations of cross-modal mapping | theoretical analysis |
| Learning Robust Visual-Semantic Embeddings | ICCV 2017 | Paper | Improved robustness in vision-language embeddings | embedding learning |
| Deep Multimodal Representation Learning from Temporal Data | CVPR 2017 | Paper | Temporal multimodal representation learning | multimodal temporal learning |
| Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations | COLING 2016 | Paper | Analyzes semantic gap between vision and language | representation analysis |
| Combining Language and Vision with a Multimodal Skip-gram Model | NAACL 2015 | Paper | Extends skip-gram with visual context | multimodal embeddings |
| Deep Fragment Embeddings for Bidirectional Image Sentence Mapping | NeurIPS 2014 | Paper | Fragment-level image-sentence alignment | vision-language alignment |
| Multimodal Learning with Deep Boltzmann Machines | JMLR 2014 | Paper | Probabilistic generative multimodal model | generative multimodal learning |
| Learning Grounded Meaning Representations with Autoencoders | ACL 2014 | Paper | Autoencoder-based grounded semantics | representation learning |
| DeViSE: A Deep Visual-Semantic Embedding Model | NeurIPS 2013 | Paper | Early deep vision-to-language embedding model | vision-language embedding |
| Multimodal Deep Learning | ICML 2011 | Paper | Foundational multimodal deep learning framework | multimodal deep learning |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Robust Contrastive Learning against Noisy Views | arXiv 2022 | Paper | Robust contrastive learning under noisy multi-view inputs | contrastive learning |
| Cooperative Learning for Multi-view Analysis | arXiv 2022 | Paper | Cooperative optimization across multiple views for representation learning | multi-view learning |
| What Makes Multi-modal Learning Better than Single (Provably) | NeurIPS 2021 | Paper | Theoretical guarantees showing when multimodal learning improves over unimodal | theoretical analysis |
| Efficient Multi-Modal Fusion with Diversity Analysis | ACMMM 2021 | Paper | Fusion method emphasizing diversity-aware multimodal integration | multimodal fusion |
| Attention Bottlenecks for Multimodal Fusion | NeurIPS 2021 | Paper | Introduces bottleneck attention mechanism for efficient multimodal fusion | multimodal fusion |
| VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization | AAAI 2021 | Paper | Variational multimodal fusion for camera localization tasks | multimodal localization |
| Trusted Multi-View Classification | ICLR 2021 | Paper | Confidence-aware weighting for multi-view classification | multi-view classification |
| Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis | ICDM 2020 | Paper | Higher-order sequence fusion for multimodal sentiment analysis | multimodal sentiment analysis |
| Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies | NeurIPS 2020 | Paper | Entropy-based regularization to reduce modality bias | multimodal fairness/robustness |
| Deep Multimodal Fusion by Channel Exchanging | NeurIPS 2020 | Paper | Channel exchange mechanism for cross-modal feature interaction | multimodal fusion |
| What Makes Training Multi-Modal Classification Networks Hard? | CVPR 2020 | Paper | Analyzes optimization challenges in multimodal classification | theoretical/empirical analysis |
| Dynamic Fusion for Multimodal Data | arXiv 2019 | Paper | Adaptive fusion strategy depending on input modality quality | multimodal fusion |
| DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis | IJCAI 2019 | Paper | Separates shared and private latent representations for fusion | multimodal sentiment analysis |
| Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling | NeurIPS 2019 | Paper | High-order tensor/polynomial fusion for multimodal features | multimodal fusion |
| XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification | IEEE TNNLS 2019 | Paper | Cross-modal feature exchange network for audio-visual tasks | audio-visual classification |
| MFAS: Multimodal Fusion Architecture Search | CVPR 2019 | Paper | Neural architecture search for optimal multimodal fusion design | architecture search |
| The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision | ICLR 2019 | Paper | Neuro-symbolic model combining perception and reasoning | neuro-symbolic learning |
| Unifying and merging well-trained deep neural networks for inference stage | IJCAI 2018 | Paper | Model merging strategy for inference-time multimodal integration | model fusion |
| Efficient Low-rank Multimodal Fusion with Modality-Specific Factors | ACL 2018 | Paper | Low-rank factorization for efficient multimodal fusion | efficient fusion |
| Memory Fusion Network for Multi-view Sequential Learning | AAAI 2018 | Paper | Memory-based fusion across temporal multimodal sequences | sequential multimodal learning |
| Tensor Fusion Network for Multimodal Sentiment Analysis | EMNLP 2017 | Paper | Tensor-based full interaction modeling across modalities | multimodal sentiment analysis |
| Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework | AAAI 2015 | Paper | Joint modeling of video and compositional language | vision-language modeling |
| A co-regularized approach to semi-supervised learning with multiple views | ICML 2005 | Paper | Early multi-view co-regularization framework | multi-view semi-supervised learning |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| CLIP | arXiv 2021 | Paper | Dual-encoder (Vision Transformer + Text Transformer) trained on 400M+ image-text pairs; contrastive alignment at embedding level; classic late-fusion foundation | multimodal alignment |
| Reconsidering Representation Alignment for Multi-view Clustering | CVPR 2021 | Paper | Revisits representation alignment objectives for multi-view clustering | multimodal alignment |
| CoMIR: Contrastive Multimodal Image Representation for Registration | NeurIPS 2020 | Paper | Contrastive learning for multimodal image registration alignment | multimodal alignment |
| Multimodal Transformer for Unaligned Multimodal Language Sequences | ACL 2019 | Paper | Transformer-based alignment for unaligned multimodal sequences | sequence alignment |
| Temporal Cycle-Consistency Learning | CVPR 2019 | Paper | Uses cycle-consistency for temporal cross-modal alignment | temporal alignment |
| See, Hear, and Read: Deep Aligned Representations | arXiv 2017 | Paper | Learns aligned representations across vision, audio, and text | multimodal alignment |
| On Deep Multi-View Representation Learning | ICML 2015 | Paper | Theoretical and empirical study of multi-view representation alignment | multi-view learning |
| Unsupervised Alignment of Natural Language Instructions with Video Segments | AAAI 2014 | Paper | Aligns language instructions with video segments without supervision | language-video alignment |
| Multimodal Alignment of Videos | ACM MM 2014 | Paper | Early multimodal alignment framework for video modalities | video alignment |
| Deep Canonical Correlation Analysis | ICML 2013 | Paper | Deep learning extension of CCA for cross-view representation alignment | representation alignment |
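The embedding-level contrastive alignment used by CLIP-style methods in the table above can be written as a symmetric InfoNCE over a batch of paired embeddings. A toy sketch with random features (the temperature and dimensions are illustrative):

```python
import numpy as np

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: matched (i, i) pairs are positives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(img))

    def xent(l):
        # Numerically stable cross-entropy picking the diagonal as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average over image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = clip_style_loss(rng.standard_normal((8, 64)),
                       rng.standard_normal((8, 64)))
print(float(loss))
```

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart in the shared embedding space.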
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | NeurIPS 2021 Spotlight | Paper | Momentum distillation for aligning vision-language representations before fusion | vision-language pretraining |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | CVPR 2021 | Paper | Sparse frame sampling for efficient video-language pretraining | video-language pretraining |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | arXiv 2021 | Paper | Unified transformer for multitask multimodal learning | unified multimodal pretraining |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | NeurIPS 2020 | Paper | Adversarial training improves robustness of vision-language representations | robust multimodal pretraining |
| Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision | EMNLP 2020 | Paper | Grounds language tokens in visual context via voken supervision | vision-grounded language modeling |
| Integrating Multimodal Information in Large Pretrained Transformers | ACL 2020 | Paper | Injects multimodal signals into large pretrained transformer architectures | multimodal transformer pretraining |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | arXiv 2019 | Paper | Joint vision-language BERT-style pretraining | vision-language pretraining |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | arXiv 2019 | Paper | Early unified transformer for vision-language understanding | vision-language pretraining |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | NeurIPS 2019 | Paper | Two-stream transformer for cross-modal vision-language learning | vision-language pretraining |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | arXiv 2019 | Paper | Cross-modal encoder for universal vision-language representations | vision-language pretraining |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | EMNLP 2019 | Paper | Cross-modality transformer encoder for vision-language reasoning | vision-language pretraining |
| VideoBERT: A Joint Model for Video and Language Representation Learning | ICCV 2019 | Paper | Joint discrete token modeling for video and language | video-language pretraining |
In this section: 3.1 Foundation MLLMs · 3.2 Omni MLLMs
Models that connect a pretrained visual encoder / abstractor to a pretrained LLM. Primarily text-output understanding and reasoning systems, defined by inherited pretrained unimodal backbones rather than multimodal pretraining from scratch.
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders | arXiv 2026 | Paper | LLM-initialized vision encoder (non-CLIP); text-to-vision weight reuse, generative-aligned visual features, optimized for dense perception. | visual understanding |
| Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision | arXiv 2026 | Paper | Tri-modal (V+A+L) unified framework; parameter-efficient tuning, seamless cross-modal reasoning for mobile/IoT deployment. | visual understanding |
| STEP3-VL-10B Technical Report | arXiv 2026 | Paper | 10B-scale foundation multimodal; unified unfrozen pre-training + PaCoRe test-time scaling, frontier-level reasoning with compact footprint. | visual understanding |
| GLM-OCR | arXiv 2026 | Paper | Compact, efficient 0.9B-parameter multimodal model for real-world document understanding | OCR, structured extraction |
| Kimi K2.5 | arXiv 2026 | Paper | Joint text-vision pretraining, Agent Swarm framework; coding, vision, reasoning, agentic tasks; reduces latency by up to 4.5x | visual agentic intelligence, agentic, reasoning |
| Kwai Keye-VL 1.5 Technical Report | arXiv 2025 | Paper | Adaptive Slow-Fast encoding; 8B parameter scale with 128K long-context; SOTA video reasoning & human-preference aligned. | visual understanding |
| olmOCR / olmOCR-2 | arXiv 2025 | Paper | Efficient low-VRAM OCR model fine-tuned from Qwen2.5-VL; excels at preserving semantic structure and Markdown output | OCR, structured extraction |
| PaddleOCR-VL | arXiv 2025 | HF / Official | Lightweight (0.9B+) multimodal OCR with 109 languages support; excellent chart-to-HTML/Markdown conversion and high-throughput | OCR, multilingual document |
| DeepSeek-OCR | arXiv 2025 | Paper HF | Lightweight ~3B MoE vision model optimized for high-volume OCR, document digitization, charts and formulas; efficient inference | OCR, document |
| Kimi-VL | arXiv 2025 | Paper HF | Projector + MoE backbone; long video/PDF/GUI, agentic capabilities, chain-of-thought vision reasoning | visual understanding, agentic, video |
| Seed1.5-VL Technical Report | arXiv 2025 | Paper | 20B MoE + 532M ViT; native-resolution vision-language foundation model; efficient asymmetric architecture. | visual understanding |
| Qwen3-VL | arXiv 2025 | Paper HF | Frontier-grade vision/OCR (32+ languages), video analysis, agentic capabilities, strong multimodal reasoning; includes large MoE variants (e.g., 235B) | visual understanding, video, omni |
| SmolVLM | arXiv 2025 | HF | Ultra-lightweight (256M–2.2B) projector-based series; efficient on-device video and image understanding | visual understanding, efficiency |
| LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning | arXiv 2025 | Paper | Diffusion LLM as language backbone; vision encoder: SigLIP | visual understanding |
| jina-vlm | arXiv 2025 | Paper HF | SigLIP2 + Qwen backbone with custom projector; optimized for semantic VQA, diagrams, scans and document semantics | visual understanding, VQA, document |
| Phi-4-Multimodal | arXiv 2025 | Paper HF | Small-parameter (LoRA + projectors) multimodal; vision + speech support, efficient on-device deployment | visual understanding, on-device |
| Molmo / PixMo | CVPR 2025 | Paper Code | Strong open-data/open-weight VLM pipeline | visual understanding |
| FastVLM: Efficient Vision Encoding for Vision Language Models | CVPR 2025 | Paper | efficient multimodal visual encoding for on-device deployment | visual understanding, on-device |
| Qwen2.5-VL: Technical Report | arXiv 2025 | Paper HF | Stronger document, grounding, and video capabilities | visual understanding |
| General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model | arXiv 2024 | Paper HF/Code | Specialized end-to-end OCR model with grounding (boxes + points); strong on scientific papers, slides, and mixed visual-text docs | OCR, grounding |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv 2024 | Paper Code | Single model for image, multi-image, and video transfer | visual understanding |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv 2024 | Paper Code | On-device efficient MLLM | visual understanding |
| NVILA: Efficient Frontier Visual Language Models | CVPR 2025 | Paper | Efficient general-purpose multimodal LLM; spatial and temporal "scale then compress" design; vision encoder: SigLIP | visual understanding |
| xGen-MM (BLIP-3) | arXiv 2024 | Paper | Open training recipe, datasets, and safety-tuned variants | visual understanding |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models | arXiv 2024 | Paper Code | MoE VLM with dynamic tiling and efficient inference | visual understanding |
| Pixtral | arXiv 2024 | Paper HF | 12B open-weight model with strong instruction following, image+text understanding; competitive with larger open VLMs | visual understanding |
| Qwen2-VL | arXiv 2024 | Paper HF | Dynamic resolution; native video | visual understanding |
| Cambrian-1: A Fully Open, Vision-Centric Exploration | NeurIPS 2024 | Paper Code | Spatial Vision Aggregator | visual understanding |
| PaliGemma: A Versatile 3B VLM for Transfer | arXiv 2024 | Paper HF | SigLIP encoder + Gemma backbone; strong transfer model | visual understanding |
| InternLM-XComposer2 | arXiv 2024 | Paper Code | Compositional visual grounding | visual understanding |
| Phi-3-Vision | arXiv 2024 | Paper HF | Small but capable | visual understanding |
| LLaVA-HR: High Resolution MLLMs | CVPR 2024 | Paper | Mixture-of-Resolution Adaptation | visual understanding |
| InternVL2 | Model release 2024 | HF | Instruction-tuned InternVL family release with strong multilingual and OCR capabilities | visual understanding |
| InternVL: Scaling up Vision Foundation Models | CVPR 2024 | Paper Code | Progressively aligned ViT + LLM | visual understanding |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv 2024 | Paper | Large-scale proprietary recipe study for multimodal LLM pretraining | visual understanding |
| LLaVA | arXiv 2023 | Paper Code | 7B/13B+; CLIP vision encoder (frozen/pretrained) + linear projection to LLM (Vicuna/LLaMA); vision tokens inserted into LLM input; common late-fusion baseline | visual understanding |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| M-MiniGPT4: Multilingual VLLM Alignment via Translated Data | arXiv 2026 | Paper | Q-Former based (inherits from MiniGPT-4 / BLIP-2) | vision-language understanding |
| Video Q-Former: Multimodal Large Language Model with Spatio-Temporal Querying Transformer | OpenReview | Paper | Spatio-temporal Q-Former (learnable queries for video spatial-temporal feature extraction) | video understanding |
| HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding | arXiv 2025 | Paper | Hierarchical Q-Former (multi-level learnable queries with memory bank for long video) | long video understanding |
| Towards Efficient Visual-Language Alignment of the Q-Former | arXiv 2024 | Paper | PEFT-tuned Q-Former (parameter-efficient fine-tuning on InstructBLIP-style Q-Former) | visual reasoning |
| Matryoshka Query Transformer (MQT) for Large Vision-Language Models | NeurIPS 2024 | Paper | Matryoshka Query Transformer (elastic learnable queries, variable token count) | vision-language understanding |
| Semantically Grounded QFormer for Efficient Vision Language Understanding | arXiv 2023 | Paper | Improved Grounded QFormer (direct latent conditioning, bypass input projection) | vision-language understanding |
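The learnable-query mechanism shared by the papers above compresses a variable number of image features into a fixed number of tokens via cross-attention. A single-head toy sketch (dimensions and the random stand-ins are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 768
n_image_feats = 257        # e.g. ViT patch features (varies per input)
n_queries = 32             # fixed, learnable queries (the Q-Former idea)

queries = rng.standard_normal((n_queries, d)) * 0.02   # learned parameters
image_feats = rng.standard_normal((n_image_feats, d))  # frozen encoder output

# Cross-attention: queries attend over the image features.
attn = softmax(queries @ image_feats.T / np.sqrt(d))   # (32, 257)
out = attn @ image_feats                               # (32, 768)

# The output is always n_queries tokens, however many features came in,
# which keeps the LLM's visual context length constant.
print(out.shape)  # (32, 768)
```

Variants in the table change how many queries there are (Matryoshka/elastic), how they are structured (hierarchical, spatio-temporal), or how they are tuned (PEFT), but all keep this fixed-query bottleneck.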
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| CASA: Cross-Attention over Self-Attention | arXiv 2025 | Paper | Efficient cross-attention via self-attention reformulation; competitive with token insertion on image benchmarks, strong for long video | efficient vision-language fusion, video captioning |
| LLaMA 3.2 Vision | arXiv 2024 | Paper HF | Adapter-based vision addition to Llama 3.2; strong OCR, document VQA, 128K context | visual understanding, document |
| Idefics2 | arXiv 2024 | Paper HF | Flamingo-style with Perceiver Resampler + gated cross-attention; improved efficiency on Mistral backbone | open multimodal understanding |
| CogVLM: Visual Expert for Pretrained Language Models | arXiv 2023 | Paper Code | Deep fusion with visual expert modules inside a pretrained LLM | visual understanding |
| Qwen-VL: A Versatile Vision-Language Model | arXiv 2023 | Paper HF | High-res, multi-lang, bounding box | visual understanding |
| IDEFICS | — | Hugging Face | Flamingo-inspired 80B model; late fusion of a vision encoder with the LLM via gated cross-attention | open multimodal understanding |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| DeepSeek-OCR-2 | arXiv 2026 | Paper HF | Optimized for high-volume OCR, document digitization, charts and formulas; efficient inference | OCR, document |
| Ovis2.5 | arXiv 2025 | Paper | Following VET architecture; excellent document understanding and fine-grained quantization | visual understanding, document |
| Ovis2 | arXiv 2025 | HF | Embedding table / projector architecture; excellent document understanding and fine-grained quantization | visual understanding, document |
| MiniMax-01: Scaling Foundation Models with Lightning Attention | arXiv 2025 | Paper | Hybrid Lightning-Softmax Attention; MoE-based (45.9B active) multimodal; 4M long-context with near-zero prefill latency. | visual understanding |
| mPLUG-Owl3 | arXiv 2024 | Paper Code | Long visual sequences | visual understanding |
| Idefics3 | arXiv 2024 | Paper HF | Open-data recipe with strong document understanding | visual understanding |
| NVLM 1.0: Open Frontier-Class Multimodal LLMs | arXiv 2024 | Paper HF | Hybrid multimodal design with strong OCR and reasoning | visual understanding |
| Idefics2 | arXiv 2024 | Paper HF | Fully open; built on Mistral | visual understanding |
| mPLUG-DocOwl 1.5 / 2: Unified Structure Learning for OCR-free Document Understanding | arXiv 2024 | Paper Code | OCR-free document understanding with unified structure learning; excels at long documents and complex layouts | document understanding, OCR |
| Paper | Venue | Links | Notes | Task | Adaptor |
|---|---|---|---|---|---|
| OmniGAIA: Towards Native Omni-Modal AI Agents | arXiv 2026 | Paper Code | Comprehensive benchmark for omni-modal agents with complex multi-hop queries across video, audio, and image; includes OmniAtlas agent with tool-integrated reasoning | omni-modal understanding & reasoning | Native |
| ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding | arXiv 2026 | Paper | Training-free framework that lifts textual reasoning to omni-modal scenarios using LRM guidance and stepwise contrastive scaling | omni-modal reasoning | Hybrid |
| OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention | arXiv 2026 | Paper | Reinforced audio-visual reasoning framework with query intention grounding and modality attention fusion | audio-visual reasoning | Hybrid |
| ChronusOmni: Improving Time Awareness of Omni Large Language Models | arXiv 2025 | Paper Code | Enhances temporal awareness in omni-modal LLMs | time-aware omni-modal understanding | Hybrid |
| Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | arXiv 2025 | Paper Code | MoE-based scaling for omnimodal understanding and generation | omni-modal understanding & generation | MLP Projector |
| Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | arXiv 2025 | Paper Code | Unified audio-visual speech recognition using LLMs | audio-visual speech recognition | Hybrid |
| LongCat-Flash-Omni Technical Report | arXiv 2025 | Paper Code | Long-context omni-modal model supporting text and audio generation | long-context omni-modal | Hybrid |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | arXiv 2025 | Paper Code | Architecture and data enhancements for omni-modal understanding | omni-modal understanding | Hybrid |
| InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | arXiv 2025 | Paper Code | Unified model for audio-visual multi-turn dialogue | audio-visual dialogue | Hybrid |
| OneLLM: One Framework to Align All Modalities with Language | CVPR 2024 | Paper | Aligns eight modalities to language through a unified encoder and a universal projection module | all-in-one LLM | Hybrid |
| MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition | NeurIPS 2025 | Paper | Mixture of Matryoshka experts for efficient audio-visual speech recognition | audio-visual speech recognition | Hybrid |
| Qwen3-Omni Technical Report | arXiv 2025 | Paper Code | Omni-modal model with text and audio capabilities (Alibaba/Qwen series) | omni-modal | Native |
| Qwen2.5-Omni Technical Report | arXiv 2025 | Paper Code | Omni-modal technical report with text and audio support (Alibaba/Qwen series) | omni-modal | Hybrid |
| MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone | 2025 | Paper Code | On-device GPT-4o level MLLM for vision, speech and multimodal live streaming (OpenBMB) | on-device multimodal live streaming | Hybrid |
| Baichuan-Omni Technical Report | arXiv 2024 | Paper Code | Technical report for Baichuan-Omni (Baichuan Inc.) | omni-modal | Hybrid |
| Baichuan-Omni-1.5 Technical Report | arXiv 2025 | Paper Code | Technical report for Baichuan-Omni 1.5 (Baichuan Inc.) | omni-modal | Hybrid |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv 2024 | Paper Code | Open-source interactive omni multimodal LLM | interactive omni multimodal | Hybrid |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv 2024 | Paper Code | Real-time vision and speech interaction model | real-time multimodal interaction | Hybrid |
| Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | NeurIPS 2024 | Paper Code | Open-source GPT-4o style model with vision, speech and duplex capabilities | vision-speech duplex | Hybrid |
| Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment | arXiv 2025 | Paper Code | Progressive modality alignment for omni-modal language model | omni-modal alignment | MLP Projector |
| MIO: A Foundation Model on Multimodal Tokens | arXiv 2024 | Paper Code | Foundation model based on multimodal tokens | multimodal tokens | Native |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | CVPR 2025 | Paper Code | Multimodal model supporting seeing, hearing and emotional speech | emotional multimodal | Hybrid |
| Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model | arXiv 2025 | Paper Code | Simultaneous multimodal interactions with language-vision-speech model | simultaneous multimodal | Hybrid |
| ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding | arXiv 2025 | Paper Code | Native multimodal LLM focused on 3D generation and understanding | 3D multimodal | Native |
In this section: 4.1 Taxonomy by Generation Paradigm · 4.2 Any-to-Any / Omni UMMs
Models that unify multimodal understanding and visual generation within one framework. The defining property is U+G unification, not necessarily training from scratch.
Boundary with NMMs: if a unified model's central contribution is native end-to-end multimodal pretraining from scratch, we document its architectural details primarily in §5 NMMs and keep §4 focused on the unified U+G perspective.
Overview of representative paradigms and architectures of Unified Multimodal Models (UMMs). Source: https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models
Subtopics: Diffusion-Based UMMs · Autoregressive (AR) UMMs · Hybrid (AR + Diffusion) UMMs
Unified models are categorized according to their core generation mechanism for visual output (while supporting strong multimodal understanding). This taxonomy highlights trade-offs in fidelity, reasoning, efficiency, and training stability.
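The paradigm split above is, at bottom, a difference in training objective over the same discrete token stream. A toy sketch (hypothetical vocabulary/sequence sizes and random stand-in logits, not any listed model's code) contrasting the autoregressive and masked-discrete-diffusion objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ = 16, 8
tokens = rng.integers(0, VOCAB, size=SEQ)   # a mixed text/image token sequence
logits = rng.normal(size=(SEQ, VOCAB))      # stand-in for model outputs

def xent(logits_row, target):
    # numerically stable -log softmax(logits)[target]
    z = logits_row - logits_row.max()
    return -(z[target] - np.log(np.exp(z).sum()))

# Autoregressive: token t is predicted from tokens < t (left-to-right factorization)
ar_loss = np.mean([xent(logits[t], tokens[t]) for t in range(1, SEQ)])

# Masked discrete diffusion: mask a random subset, then predict all masked
# positions in parallel from the unmasked context
mask = rng.random(SEQ) < 0.5
if not mask.any():
    mask[0] = True                          # ensure at least one masked position
diff_loss = np.mean([xent(logits[t], tokens[t]) for t in range(SEQ) if mask[t]])
```

AR pays per-token sequential decoding at inference; masked diffusion trades that for parallel denoising steps, which is the fidelity/efficiency tension the tables below repeatedly surface.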
| Model | Venue | Links | Paradigm | Notes | Task |
|---|---|---|---|---|---|
| Dual Diffusion | arXiv 2025 | Paper Code | Dual Diffusion | Unified image generation + understanding via bidirectional diffusion | visual understanding, visual generation |
| UniDisc | arXiv 2025 | Paper Code | Unified Discrete Diffusion | Discrete diffusion for multimodal U+G | visual understanding, visual generation |
| MMaDA | arXiv 2025 | Paper Code | Multimodal Large Diffusion LM | Diffusion LM for unified understanding/generation | visual understanding, visual generation |
| FUDOKI | arXiv 2025 | Paper | Discrete Flow-based Unified | Kinetic-optimal velocities for U+G | visual understanding, visual generation |
| Muddit | arXiv 2025 | Paper Code | Unified Discrete Diffusion | Liberating generation beyond T2I | visual understanding, visual generation |
| Lavida-O | arXiv 2025 | Paper Code | Elastic Large Masked Diffusion | Elastic masked diffusion for U+G | visual understanding, visual generation |
| UniModel | arXiv 2025 | Paper | Visual-Only MMDiT Framework | Visual-only unified multimodal U+G | visual understanding, visual generation |
| Model | Venue | Links | Modalities | Notes | Task |
|---|---|---|---|---|---|
| LWM | arXiv 2024 | Paper | video + language | World model on million-length video and language with blockwise ring attention | visual understanding, visual generation |
| Chameleon | arXiv 2024 | Paper Code | image + text | Mixed-modal early-fusion foundation models; token-by-token generation | visual understanding, visual generation |
| ANOLE | arXiv 2024 | Paper Code | image + text | Open autoregressive native LMM for interleaved image-text generation | visual understanding, visual generation |
| Emu3 | arXiv 2024 | Paper Code | image + text | Next-token prediction is all you need; single next-token model | visual understanding, visual generation |
| MMAR | arXiv 2024 | Paper | image + text | Lossless multi-modal auto-regressive probabilistic modeling | visual understanding, visual generation |
| Orthus | arXiv 2024 | Paper Code | image + text | Autoregressive interleaved image-text generation with modality-specific heads | visual understanding, visual generation |
| SynerGen-VL | arXiv 2024 | Paper | image + text | Synergistic image understanding and generation with vision experts and token folding | visual understanding, visual generation |
| Liquid | arXiv 2024 | Paper Code | image + text | Language models are scalable and unified multi-modal generators | visual understanding, visual generation |
| UGen | arXiv 2025 | Paper | image + text | Unified autoregressive multimodal model with progressive vocabulary learning | visual understanding, visual generation |
| Harmon | arXiv 2025 | Paper Code | image + text | Shared MAR encoder for semantic + fine-grained harmony; SOTA GenEval | visual understanding, visual generation |
| TokLIP | arXiv 2025 | Paper Code | image + text | Marry visual tokens to CLIP for U+G | visual understanding, visual generation |
| Selftok | arXiv 2025 | Paper Code | image + text | Discrete visual tokens for AR / Diffusion / Reasoning | visual understanding, visual generation |
| OneCat | arXiv 2025 | Paper Code | image + text | Pure decoder-only unified U+G | visual understanding, visual generation |
| Uni-X | arXiv 2025 | Paper Code | image + text | Two-end-separated architecture mitigating modality conflict | visual understanding, visual generation |
| Emu3.5 | Nature 2026 | Paper Code | image + text | Native multimodal world learner; next-token only | visual understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer | arXiv 2025 | Paper Code | Unified continuous tokenizer for joint understanding and generation | visual understanding, visual generation |
| Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | arXiv 2025 | Paper Code | Bridging MLLMs and diffusion models via patch-level CLIP latents | visual understanding, visual generation |
| Qwen-Image Technical Report | arXiv 2025 | Paper Code | High-quality image generation with strong text rendering | visual generation |
| X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again | arXiv 2025 | Paper Code | RL-enhanced discrete autoregressive unified modeling | visual understanding, visual generation |
| Ovis-U1 Technical Report | arXiv 2025 | Paper Code | 3B unified model for understanding, text-to-image and editing | visual understanding, visual generation |
| UniCode²: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper | Cascaded large-scale codebooks for unified modeling | visual understanding, visual generation |
| OmniGen2: Exploration to Advanced Multimodal Generation | arXiv 2025 | Paper Code | Versatile open-source unified generation model | visual generation |
| Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations | arXiv 2025 | Paper Code | Text-aligned discrete semantic representations | visual understanding, visual generation |
| UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Y-shaped architecture for modality alignment | visual understanding, visual generation |
| UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | arXiv 2025 | Paper Code | High-resolution semantic encoders | visual understanding, visual generation |
| Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation | arXiv 2025 | Paper | Auto-regressive foundation model | visual understanding, visual generation |
| DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv 2025 | Paper | Dual visual vocabularies | visual understanding, visual generation |
| UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv 2025 | Paper Code | Unified tokenizer | visual understanding, visual generation |
| QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Text-aligned visual tokenization | visual understanding, visual generation |
| MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv 2024 | Paper | Instruction tuning for unified multimodal | visual understanding, visual generation |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv 2024 | Paper | Self-enhancing unified see-and-draw | visual understanding, visual generation |
| PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | arXiv 2024 | Paper Code | Multi-granular visual generation | visual understanding, visual generation |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR 2025 | Paper Code | Unified foundation model | visual understanding, visual generation |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv 2024 | Paper Code | Multi-modality potential mining | visual understanding, visual generation |
| MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv 2024 | Paper Code | Interleaved image-text generative modeling | visual understanding, visual generation |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv 2023 | Paper | Generative pre-trained transformer | visual understanding, visual generation |
| Generative Multimodal Models are In-Context Learners | CVPR 2024 | Paper | In-context learning with generative multimodal models (Emu2) | visual understanding, visual generation |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR 2024 | Paper | Synergistic multimodal comprehension and creation | visual understanding, visual generation |
| LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR 2024 | Paper Code | Dynamic discrete visual tokenization | visual understanding, visual generation |
| Emu: Generative Pretraining in Multimodality | ICLR 2024 | Paper | Generative pretraining in multimodality | visual understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model | arXiv 2025 | Paper | Kontext model with online RL and MetaQuery connector for unified multimodal framework | visual understanding, visual generation, editing |
| TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning | arXiv 2025 | Paper | Ladder-side diffusion tuning integrating MLLM and DiT via layer-wise alignment | visual understanding, visual generation |
| UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing | arXiv 2025 | Paper | Adapting CLIP with unified continuous tokenizer for reconstruction, generation and editing | visual understanding, visual generation, editing |
| OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Simple baseline with learnable queries and lightweight connector bridging MLLM and diffusion | visual understanding, visual generation |
| BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv 2025 | Paper | Fully open unified multimodal models with complete architecture, training recipe and datasets | visual understanding, visual generation |
| Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv 2025 | Paper | Unified visual generator and native multimodal autoregressive model for natural interaction | visual understanding, visual generation |
| Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing | arXiv 2025 | Paper Code | Prefilled autoregression in shared embedding space unifying understanding, generation and editing | visual understanding, visual generation, editing |
| Transfer between Modalities with MetaQueries | arXiv 2025 | Paper Code | Learnable MetaQueries as efficient interface between autoregressive MLLMs and diffusion models | visual understanding, visual generation |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv 2024 | Paper Code | Unified multi-granularity visual semantics for arbitrary-size comprehension and generation | visual understanding, visual generation |
| Making LLaMA SEE and Draw with SEED Tokenizer | ICLR 2024 | Paper Code | SEED tokenizer enabling LLaMA for scalable multimodal autoregression (see and draw) | visual understanding, visual generation |
| Planting a SEED of Vision in Large Language Model | arXiv 2023 | Paper Code | SEED image tokenizer with 1D causal dependency and high-level semantics for LLM vision | visual understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation | arXiv 2025 | Paper Code | Unified autoregressive modeling with decoupled encoding for image understanding, generation and editing | visual understanding, visual generation |
| MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | arXiv 2025 | Paper Code | Unified VLM with reasoning generation via Reinforcement Learning (RGPO) | multimodal understanding, reasoning generation |
| UniFluid: Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv 2025 | Paper | Unified autoregressive framework using continuous visual tokens | visual understanding, visual generation |
| OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv 2025 | Paper Code | Efficient linear-time unified multimodal model based on Mamba (state space models) | multimodal understanding, visual generation |
| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv 2025 | Paper Code | Scaled-up version of Janus with improved training strategy, more data and larger model size | multimodal understanding, visual generation |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv 2024 | Paper Code | Decoupling visual encoding to enable unified understanding and generation in an autoregressive framework | multimodal understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| AToken: A Unified Tokenizer for Vision | arXiv 2025 | Paper Code | AToken unified visual tokenizer achieving high-fidelity reconstruction and semantic understanding for images, videos and 3D | visual understanding, visual generation |
| UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^128 for Unified Multimodal Large Language Model | arXiv 2026 | Paper | UniWeTok unified binary tokenizer with 2^{128} codebook, pre-post distillation and generative-aware prior for MLLMs | visual understanding, visual generation |
| Towards Scalable Pre-training of Visual Tokenizers for Generation | arXiv 2025 | Paper Code | VTP unified visual tokenizer pre-training framework with joint image-text contrastive, self-supervised and reconstruction losses | visual understanding, visual generation |
| The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding | arXiv 2025 | Paper Code | Prism Hypothesis and unified autoencoding (UAE) harmonizing semantic and pixel representations across modalities | visual understanding, visual generation |
| Show-o2: Improved Native Unified Multimodal Models | arXiv 2025 | Paper Code | Improved native unified multimodal models with autoregressive modeling and flow matching for understanding and generation | multimodal understanding and generation |
| UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | CVPRW 2025 | Paper Code | Unified visual encoding combining discrete and continuous representations for autoregressive multimodal models | multimodal understanding and generation |
| VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv 2025 | Paper Code | Enhanced visual autoregressive unified model with iterative instruction tuning and DPO reinforcement learning | visual understanding, generation and editing |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv 2025 | Paper Code | Dual visual tokenization and diffusion refinement for unified multimodal large language model | multimodal understanding and generation |
| SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | arXiv 2025 | Paper | Semantic-guided hierarchical codebook for unified image tokenization supporting understanding and generation | multimodal understanding and generation |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv 2025 | Paper Code | Visual autoregressive framework unifying understanding and generation in a single MLLM | visual understanding and generation |
| TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR 2025 | Paper Code | Unified image tokenizer with dual-codebook architecture bridging understanding and generation | multimodal understanding and generation |
| MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv 2024 | Paper | Semantic discrete encoding for unified vision-language model enabling efficient multimodal understanding and generation | multimodal understanding and generation |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models | arXiv 2025 | Paper Code | Native unified multimodal model with cascaded VAE + representation encoder for unified continuous visual representations | multimodal understanding and generation |
| LMFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv 2024 | Paper | Adapting pretrained LLMs (Llama) for multimodal generation by adding parallel diffusion modules while keeping autoregressive text modeling | multimodal understanding and generation |
| MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv 2024 | Paper Code | Single shared transformer backbone that handles both autoregressive modeling and diffusion for unified multimodal tasks | visual understanding and generation |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR 2025 | Paper Code | Unified transformer combining autoregressive and discrete diffusion modeling to flexibly handle mixed-modality inputs/outputs | multimodal understanding and generation |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR 2025 | Paper | Joint training of next-token prediction (AR) and diffusion in one transformer over mixed discrete/continuous multimodal sequences | visual understanding, visual generation |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | arXiv 2025 | Paper Code | Efficient unified architecture with autoencoders, channel-wise concatenation, shared-decoupled networks and MoE for understanding, generation and editing | multimodal understanding, generation and editing |
| HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper | Asymmetric H-shaped architecture bridging heterogeneous experts with symmetric dense mid-layer connections for unified multimodal modeling | multimodal understanding and generation |
| LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Light-weighted double fusion framework that efficiently integrates pretrained vision-language and diffusion models | multimodal understanding and generation |
| BAGEL: Emerging Properties in Unified Multimodal Pretraining | arXiv 2025 | Paper Code | Open-source foundational decoder-only model pretrained on trillions of interleaved multimodal tokens supporting native understanding and generation | multimodal understanding and generation |
| Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv 2025 | Paper | Causal interleaved multi-modal generation framework with deep-fusion, dual vision encoders and multi-modal classifier-free guidance | interleaved multimodal generation |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv 2024 | Paper Code | Minimalist framework harmonizing autoregressive LLMs with rectified flow for efficient unified understanding and generation | multimodal understanding and generation |
Models that extend unified understanding + generation beyond text and image to support any-to-any modality conversion (audio, video, speech, etc.). These often build on the paradigms above but emphasize native omni-modal tokenization, long-context handling, and cross-modal generation.
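A common mechanism behind these any-to-any systems is flattening every modality into one shared discrete vocabulary: each modality's tokenizer output is offset into a disjoint ID range and wrapped in boundary tokens. A hypothetical sketch (vocabulary sizes and token names are illustrative, not taken from any listed model):

```python
# Illustrative vocabulary sizes for text, image, and audio tokenizers
TEXT_VOCAB, IMG_VOCAB, AUD_VOCAB = 32000, 8192, 4096
# Special boundary tokens placed after all modality ranges
BOS_IMG, EOS_IMG, BOS_AUD, EOS_AUD = range(
    TEXT_VOCAB + IMG_VOCAB + AUD_VOCAB,
    TEXT_VOCAB + IMG_VOCAB + AUD_VOCAB + 4)

def pack(text_ids, image_ids, audio_ids):
    """Interleave modalities into one token stream over the shared vocabulary."""
    seq = list(text_ids)
    seq += [BOS_IMG] + [TEXT_VOCAB + t for t in image_ids] + [EOS_IMG]
    seq += [BOS_AUD] + [TEXT_VOCAB + IMG_VOCAB + t for t in audio_ids] + [EOS_AUD]
    return seq

seq = pack([5, 17], [0, 8191], [3])
# image tokens land in [32000, 40192), audio tokens in [40192, 44288)
```

Once everything is a token in one vocabulary, a single next-token decoder can both read and emit any modality, which is what makes "any-to-any" tractable in this family.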
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| LongCat-Flash-Omni | arXiv 2025 | Paper Code | Efficient omni model with flash-style acceleration and real-time audio-visual interaction (560B parameters) | any-to-any multimodal generation and understanding |
| Ming-Flash-Omni | arXiv 2025 | Paper Code | Sparse unified MoE architecture (100B total, 6.1B active) for efficient multimodal perception and generation | any-to-any multimodal perception and generation |
| Qwen3-Omni | arXiv 2025 | Paper Code | Next-gen Qwen omni model with unified modality space, maintaining SOTA across text/image/audio/video | any-to-any multimodal understanding and generation |
| Ming-Omni | arXiv 2025 | Paper Code | Unified multimodal architecture for perception + generation (images, text, audio, video) | any-to-any multimodal tasks |
| M2-Omni | arXiv 2025 | Paper | Extends Omni-MLLM with broader modality support and competitive performance to GPT-4o | any-to-any multimodal modeling |
| Spider | arXiv 2024 | Paper Code | Any-to-many multimodal LLM with flexible output heads for arbitrary modality combinations | multimodal understanding and generation |
| MIO | arXiv 2024 | Paper | Token-level unified multimodal foundation model on discrete multimodal tokens | any-to-any multimodal token modeling |
| X-VILA | arXiv 2024 | Paper | Cross-modality alignment for LLM-based multimodal systems (image/video/audio) | multimodal understanding |
| AnyGPT | arXiv 2024 | Paper Code | Discrete token modeling for unified multimodal generation | any-to-any multimodal generation |
| OmniFlow | CVPR 2025 | Paper | Uses multi-modal rectified flows for any-to-any generation across modalities | any-to-any generation across modalities |
| Video-LaVIT | ICML 2024 | Paper Code | Decoupled visual-motion tokenization for video-language modeling | video understanding and generation |
| Unified-IO 2 | CVPR 2024 | Paper Code | Scales autoregressive multimodal models across modalities | any-to-any multimodal tasks (vision, language, audio, action) |
| NExT-GPT | arXiv 2023 | Paper Code | Any-to-any; encoder+LLM+diffusion decoders | visual understanding, visual generation, omni |
In this section: 5.1 Design Analyses & Scaling Laws · 5.2 Early Fusion NMMs · 5.3 Late Fusion NMMs · 5.4 Any-to-Any / Omni NMMs
The most restrictive category. NMMs are trained completely from scratch on multimodal data — no pretrained LLM or vision encoder is used as initialization. All weights are jointly learned end-to-end.
What recent arXiv work emphasizes: native multimodality is increasingly defined by end-to-end multimodal pretraining, tokenizer/representation co-design, and scaling strategies that explicitly address the asymmetry between vision and language.
Recent arXiv papers sharpen the definition of NMMs and identify the main bottlenecks in native multimodal pretraining.
| Paper | Venue | Links | Insights |
|---|---|---|---|
| Beyond Language Modeling: An Exploration of Multimodal Pretraining | arXiv 2026 | Paper | Highlights representation autoencoders, vision-language data synergy, and MoE for native pretraining |
| NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints | arXiv 2025 | Paper Code | End-to-end native MLLM scaling shows positive correlation between visual encoder and LLM size under data constraints; optimal meta-architecture balances cost and performance |
| Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training | arXiv 2025 | Paper | Reveals that LLMs develop latent visual priors during text-only pre-training: reasoning-centric data (code and math) builds transferable visual reasoning skills, while broad corpora foster perception, enabling models to "see" before ever processing an image |
| Scaling Laws for Native Multimodal Models | arXiv 2025 | Paper | Early-fusion NMMs match or outperform late-fusion at low compute; early-fusion needs fewer params; MoE with modality-agnostic routing boosts sparse NMM scaling |
| The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models | arXiv 2024 | Paper | Native models often funnel image-to-text communication through a single post-image token |
A single Transformer decoder processes tokenized text and image inputs from layer 1, with minimal modality-specific parameters (typically only a linear patch embedding for images) and no separate image-encoder component.
Recent scaling-law evidence suggests early-fusion NMMs are often stronger at lower parameter counts and simpler to deploy when paired with sufficiently strong visual representations.
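The "only a linear patch embedding" claim can be made concrete with a minimal sketch (hypothetical hidden size, patch size, and weight shapes; not any listed model's code): the sole vision-specific parameter is one projection matrix, and image patches join text embeddings in a single sequence from the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, C = 64, 16, 3                                    # hidden size, patch size, channels
W_patch = rng.normal(scale=0.02, size=(P * P * C, D))  # the ONLY vision-specific weights
E_text  = rng.normal(scale=0.02, size=(1000, D))       # text token embedding table

def embed_image(img):
    """Split an HxWxC image into PxP patches and linearly project each to D dims."""
    H, W, _ = img.shape
    patches = (img.reshape(H // P, P, W // P, P, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, P * P * C))
    return patches @ W_patch                           # (num_patches, D)

text_ids = np.array([1, 42, 7])
img = rng.random((32, 48, 3))                          # yields 2 x 3 = 6 patches
x = np.concatenate([embed_image(img), E_text[text_ids]])
# x: one mixed sequence of shape (6 + 3, D), fed to a single shared decoder
```

Everything downstream of `x` is shared: no per-modality encoder tower, which is exactly what distinguishes this family from the adapter- and Q-Former-based MLLMs earlier in the list.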
| Model | Venue | Links | Training Scale | Notes | Task |
|---|---|---|---|---|---|
| NEO | arXiv 2025 | Paper | — | Positions NEO as a cornerstone for scalable native VLM development, paired with reusable components that foster a cost-effective, extensible ecosystem | vision-language understanding |
| NEO-Unify | — | Blog | — | Positions NEO as a cornerstone for scalable native VLM development, paired with reusable components that foster a cost-effective, extensible ecosystem | vision-language understanding |
| Emu3.5 | Nature 2026 | Paper Code | Large-scale (trillion+ tokens) | Native world model; next-state prediction on interleaved video/text; Discrete Diffusion Adaptation for efficiency | interleaved generation, world modeling, any-to-image |
Models where separate unimodal components are jointly trained from scratch (not pretrained), with cross-modal interaction occurring at deeper layers. Distinct from MLLMs where vision encoders are pretrained.
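The late-fusion-from-scratch distinction can be pictured as a fusion depth: each modality gets its own randomly initialized lower stack, and the streams only mix in shared upper layers. The sketch below uses toy linear "layers" as stand-ins for transformer blocks, purely to show where fusion happens; none of the names come from the models listed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

def layer():
    """A toy 'layer': one random linear map (stand-in for a transformer block)."""
    return rng.normal(0, 0.1, (d, d))

# Late fusion from scratch: per-modality lower stacks + a shared upper stack.
# All weights are randomly initialized and jointly trained; nothing is pretrained.
vision_stack = [layer() for _ in range(4)]   # vision-only early layers
text_stack   = [layer() for _ in range(4)]   # text-only early layers
shared_stack = [layer() for _ in range(8)]   # cross-modal deeper layers

def forward(vision_tokens, text_tokens):
    for W in vision_stack:
        vision_tokens = np.tanh(vision_tokens @ W)
    for W in text_stack:
        text_tokens = np.tanh(text_tokens @ W)
    # Fusion point: the two streams only meet at depth 4.
    x = np.concatenate([vision_tokens, text_tokens], axis=0)
    for W in shared_stack:
        x = np.tanh(x @ W)
    return x

out = forward(rng.normal(size=(16, d)), rng.normal(size=(8, d)))
print(out.shape)  # (24, 64)
```

In an MLLM, `vision_stack` would be a frozen or pretrained encoder; here every stack starts from random initialization, which is the defining property of this category.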
| Model | Paper | Links | Training Scale | Notes | Task |
|---|---|---|---|---|---|
| Llama4 | arXiv 2026 | Paper Blog | Scout/Maverick: 17B active / ~109B–400B total; Behemoth: ~2T total | Native multimodal, MoE architecture with early fusion and vision encoder | vision-language understanding |
| LongCat-Next | arXiv 2026 | Paper | — | Discrete Native Any-resolution Visual Transformer | vision-language understanding |
| VL-JEPA | arXiv 2025 | Paper | 1.6B | Vision-Language Joint Native Model | vision-language understanding |
| InternVL3 | arXiv 2025 | Paper | — | A pre-trained InternViT encoder coupled with a cross-attention visual expert; its deep but late-style fusion preserves the native LLM's reasoning and linguistic proficiency | vision-language understanding |
| InternVL3.5 | arXiv 2025 | Paper | — | A pre-trained ViT encoder with a visual expert that uses cross-attention for deep but late-style fusion to the LLM, preserving its capabilities. | vision-language understanding |
| Qwen3.5 | - | Blog | — | Discrete Native Any-resolution Visual Transformer | vision-language understanding |
| Gemma4 | - | Blog | — | A pre-trained ViT encoder with a visual expert that uses cross-attention for deep but late-style fusion to the LLM, preserving its capabilities. | vision-language understanding |
| Emu3 | arXiv 2024 | Paper Code | 8B | Next-token prediction over VQ image tokens; native multimodal decoder-only; minimal modality-specific params | visual understanding, visual generation |
Recent native multimodal papers on arXiv increasingly blur the boundaries between omni understanding, any-to-any generation, world modeling, and RL-enhanced post-training.
| Model | Paper | Links | Notes | Task |
|---|---|---|---|---|
| Qwen3.5-Omni | — | Blog | Late fusion; Discrete Native Any-resolution Visual Transformer | — |
| ERNIE 5.0 Technical Report | arXiv 2026 | Paper | Late fusion; a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio | unified understanding & generation |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Claude 4.6 Family (Opus 4.6 / Sonnet 4.6) | Anthropic Blog | Anthropic Claude Updates | Released ~February 2026. Further multimodal refinements (vision + tool/computer use). Proprietary. | Multimodal + Agentic/Coding/Computer-Use |
| Gemini 3.x (Pro / Flash / 3.1 Pro) | Google DeepMind Blog | Gemini 3 Announcements | Released late 2025–early 2026 (Flash Dec 2025, Pro variants Feb 2026). State-of-the-art multimodal with massive context and Deep Think modes. Proprietary. | Frontier Multimodal (text/image/audio/video + reasoning) |
| GPT-5.4 (and Pro/Codex variants) | OpenAI Blog | OpenAI GPT-5 Updates | Released ~March 2026. Enhanced efficiency, multimodal, and professional/agentic features. Proprietary. | Omni-Modal + Professional/Agentic Workflows |
| Grok 4.x updates (e.g., Grok 4.1) | xAI Announcement | xAI Blog | Continued 2025–2026 iterations with improved vision and real-time capabilities. Proprietary via X platform. | Multimodal + Real-Time/Data-Integrated Reasoning |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Gemini 2.0 / 2.5 (Pro / Flash) | Google DeepMind Blog | Gemini 2.x Announcements | Released early–mid 2025 (Flash ~Jan/Feb, Pro variants through March–June). Native multimodal with improved agentic and long-context capabilities. Proprietary. | Advanced Native Multimodal + Agentic (text/image/audio/video) |
| Claude 4 Family (Opus 4 / Sonnet 4 / Haiku 4) | Anthropic Blog | Claude 4 Announcement | Released ~May 2025. Enhanced vision, reasoning, and early agentic features. Proprietary. | Vision + Advanced Reasoning/Agentic Workflows |
| Grok 3 / Grok 4 (including vision/speech) | xAI Announcement | xAI Blog | Major updates throughout 2025 (Grok 3 ~early 2025, Grok 4 ~mid-late 2025). Multimodal input (text/image/speech). Proprietary. | Multimodal Reasoning + Real-Time Integration |
| GPT-4.5 / GPT-5 (and variants like GPT-5 Codex) | OpenAI Blog | OpenAI Announcements | GPT-4.5 ~early 2025; full GPT-5 ~August 2025. Unified multimodal with strong reasoning and tool use. Proprietary. | Omni-Modal + Advanced Reasoning/Agentic |
| Mistral Large / Medium Multimodal variants | Mistral AI | Mistral Platform | Proprietary multimodal offerings (e.g., Medium 3.1 ~2025). Text + vision capabilities via API. | General Multimodal Tasks |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Gemini 1.5 (Pro / Flash) | Google DeepMind Blog | Gemini 1.5 Announcement | Released February 2024. Massive context (>1M tokens), strong long-context multimodal (video, audio, images). Proprietary. | Long-Context Multimodal (video/audio/image/text) |
| Claude 3 Family (Opus / Sonnet / Haiku) | Anthropic Blog | Claude 3 Family | Released March 2024. Strong native vision for images, charts, diagrams, and documents. Proprietary API + Claude.ai. | Vision-Language + Reasoning |
| Grok-1.5V / Grok-2 Vision | xAI Announcement | Grok Vision Updates | Vision capabilities added ~April 2024 (Grok-1.5V), expanded in Grok-2 (August 2024). Image understanding with real-world and diagram reasoning. Proprietary via X/Grok API. | Vision-Language (real-world visuals, diagrams) |
| GPT-4o (Omni) | OpenAI Blog | GPT-4o Announcement | Released May 2024. Full real-time omni-modal: text + image + audio (voice) input/output. Proprietary. | Real-Time Omni-Modal (text/vision/audio) |
| Amazon Nova (Pro / Lite) | Amazon Announcement | AWS Bedrock docs | Released late 2024. Multimodal (text + image + video). Proprietary via Amazon Bedrock API. | Multimodal Understanding (text/image/video) |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| GPT-4V (Vision) | OpenAI Announcement | GPT-4V System Card | Released September 2023. First widely available multimodal GPT-4 variant. Image + text input, text output. API/ChatGPT access only. | Vision-Language (image understanding, VQA, OCR, document analysis, captioning) |
| Gemini 1.0 (Ultra / Pro / Nano) | Google DeepMind Blog | Gemini Announcement | Released December 2023. Native multimodal from training (text + image + audio + video). Proprietary API + Gemini chatbot. | Native Multimodal Understanding (text/image/audio/video) |
In this section: 7.1 Related Awesome Lists · 7.2 Slides & Survey Papers · 7.3 Code Repositories & Tools
| Repository | Focus | Author |
|---|---|---|
| awesome-multimodal-ml | General multimodal ML | pliang279 |
| Awesome-Multimodal-Large-Language-Models | MLLMs + evaluation | BradyFU |
| Awesome-Multimodal-Research | Broad multimodal research | Eurus-Holmes |
| Awesome-Unified-Multimodal-Models | UMMs | ShowLab |
| Awesome-Multimodal-Large-Language-Models | MLLMs | yfzhang114 |
| awesome-foundation-and-multimodal-models | Foundation + multimodal | SkalskiP |
| Awesome-Multimodality | General multimodality | Yutong-Zhou-cv |
| Awesome-Unified-Multimodal | Unified models | Purshow |
| Awesome-Unified-Multimodal | Unified models | AIDC-AI |
| Type | Resource | Notes |
|---|---|---|
| Slides | Native LMM Slides | Ziwei Liu (NTU); concise framing for native multimodal models |
| Survey | A Survey on Multimodal Large Language Models | Broad survey of MLLM architectures, data, and evaluation |
| Report | The Dawn of LMMs: Preliminary Explorations with GPT-4V | Early capability analysis around GPT-4V |
| Survey | Multimodal Foundation Models: From Specialists to General-Purpose Assistants | Broader foundation-model view across multimodal systems |
| Tool | Description | Link |
|---|---|---|
| LMMs-Eval | Unified evaluation harness for multimodal models | Code |
| LAVIS | Library for Language-Vision Intelligence (Salesforce) | Code |
| OpenFlamingo | Open reproduction of DeepMind Flamingo | Code |
| xtuner | Efficient fine-tuning for multimodal LLMs | Code |
| LLaMA-Factory | Multimodal instruction tuning framework | Code |
| MMEngine | Foundational training engine for OpenMMLab projects | Code |
| DeepSpeed-VisualChat | Scalable multimodal chat training | Code |
In this section: Validation Rules · Entry Format
We welcome contributions! Please follow these guidelines:
For NMM submissions (strict):
- Confirm the model does NOT use any pretrained LLM backbone
- Confirm the model does NOT use any pretrained vision encoder (CLIP, ViT, etc.)
- All weights are jointly trained from scratch on multimodal data
- Classify as Early Fusion or Late Fusion (both must be "from scratch")
For UMM submissions:
- Confirm the model handles both image understanding AND image generation
- Note whether pretrained components are used (annotate accordingly)
For MLLM submissions:
- Note which vision encoder is used (must be a pretrained encoder)
- Note which LLM backbone is used (must be a pretrained LLM)
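The rules above are mechanical enough to sketch as a check. The entry fields below (`uses_pretrained_llm`, etc.) are hypothetical names invented for this illustration; the repository has no formal schema.

```python
# Hypothetical entry schema for illustration only; the repo defines no formal schema.
def validate_nmm(entry: dict) -> list[str]:
    """Return a list of strict-NMM rule violations (empty list = valid)."""
    problems = []
    if entry.get("uses_pretrained_llm"):
        problems.append("NMMs must not use a pretrained LLM backbone")
    if entry.get("uses_pretrained_vision_encoder"):
        problems.append("NMMs must not use a pretrained vision encoder (CLIP, ViT, etc.)")
    if not entry.get("trained_from_scratch"):
        problems.append("All weights must be jointly trained from scratch")
    if entry.get("fusion") not in {"early", "late"}:
        problems.append("Fusion must be classified as 'early' or 'late'")
    return problems

entry = {
    "name": "Emu3",
    "uses_pretrained_llm": False,
    "uses_pretrained_vision_encoder": False,
    "trained_from_scratch": True,
    "fusion": "early",
}
print(validate_nmm(entry))  # []
```

A UMM or MLLM entry would fail the first two checks by design; that is exactly the boundary this repository's taxonomy draws.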
| **Model Name** | [Paper](arxiv_link) [Code](github_link) [HF](huggingface_link) BADGES | Scale | Key contribution / notes |

Submit a PR with:
- The paper/model entry in the correct section
- A one-line justification for the chosen category
- Links to paper, code, and/or weights
If this list is useful in your research, please consider citing:
@misc{awesome-multimodal-modeling-2026,
title = {Awesome Multimodal Modeling: From Traditional to Native & Unified},
author = {OpenEnvision-Lab},
year = {2026},
url = {https://github.com/OpenEnvision-Lab/Awesome-Multimodal-Model-Traditional-Advanced},
note = {GitHub repository}
}

This list is released under the CC0 1.0 Universal license.
