A Comprehensive Survey & Curated List of Multimodal Modeling
From Traditional Fusion to Native & Unified Architectures
Overview · Traditional · MLLMs · UMMs · NMMs · Closed Source Models · Resources
- [2026-04-13] ⭐ The repository has already gained over 100 stars in just one day! Thank you all for the incredible support. We will keep updating this list with more cutting-edge models and resources. Your continued stars and PRs are warmly welcomed!
- [2026-04-12] 🎉 We are excited to launch Awesome Multimodal Modeling — a curated reading list organized by architectural paradigms. A comprehensive survey paper is coming soon! Stay tuned.
Browse the list
- Awesome Multimodal Modeling
In this section: At a Glance · Curation Principles
This repository provides a structured, community-maintained survey of multimodal models, covering the full evolutionary arc from early fusion methods to today's natively-trained omni-models. We emphasize precise architectural definitions and classification, especially for the often-conflated categories of Unified Multimodal Models (UMMs) and Native Multimodal Models (NMMs).
Scope: Primary focus on image + text modalities; audio/video/3D are annotated where present. Omni/any-to-any models are marked with Omni.
| Dimension | Coverage |
|---|---|
| Primary scope | Image + text multimodal models, with explicit annotations for video, audio, and omni extensions |
| Core taxonomy | Traditional multimodal models, MLLMs, UMMs, and strict NMMs |
| Key distinction | U+G unification for UMMs vs. joint training from scratch for NMMs |
| What makes this repo different | Architecture-first categorization, fusion-aware definitions, and curated links to adjacent awesome lists |
| Intended audience | Researchers, students, and engineers building or surveying multimodal systems |
| Principle | Rule |
|---|---|
| Source quality | Prefer official conference proceedings, OpenReview, ACL Anthology, CVF Open Access, arXiv, and official project pages |
| Classification policy | Category assignment is based on this repository's architecture-first definitions, which may differ from authors' own branding |
| Venue policy | If a peer-reviewed venue is known, we list that venue; otherwise we keep the entry as arXiv |
| Scope discipline | Models, benchmarks, datasets, and analysis papers are tracked separately to avoid mixing artifacts |
| Inclusion bar | We prioritize landmark papers, broadly adopted benchmarks, open implementations, or papers that clarify important taxonomy boundaries |
Classification note: for ambiguous models sitting between MLLM, UMM, and strict NMM, this list records the category that best matches the training recipe and architectural coupling, not just the paper title.
In this section: 1.1 Multimodal Model Evolution Stages · 1.2 Scope & Taxonomy · 1.3 Architecture Diagrams
Subtopics: Traditional Multimodal Models · Multimodal Large Language Models (MLLMs) · Unified Multimodal Models (UMMs) · Native Multimodal Models (NMMs)
We use the following precise, architecture-first definitions throughout this list. Understanding these distinctions is critical for correctly classifying modern models.
Pre-2023 mainstream era
Independent per-modality processing followed by simple fusion (early, late, or hybrid). No large-scale language model backbone. Focuses on representation alignment, cross-modal retrieval, and captioning. Examples: CLIP, ALIGN, ViLBERT, BLIP.
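The early/late fusion distinction above can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions, not the recipe of any listed model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality features for one sample.
img_feat = rng.standard_normal(512)   # e.g. from a CNN/ViT encoder
txt_feat = rng.standard_normal(300)   # e.g. from an LSTM/BERT encoder

# Early fusion: concatenate features first, then classify jointly.
W_early = rng.standard_normal((10, 512 + 300)) * 0.01
logits_early = W_early @ np.concatenate([img_feat, txt_feat])

# Late fusion: score each modality separately, then combine the scores.
W_img = rng.standard_normal((10, 512)) * 0.01
W_txt = rng.standard_normal((10, 300)) * 0.01
logits_late = 0.5 * (W_img @ img_feat) + 0.5 * (W_txt @ txt_feat)

print(logits_early.shape, logits_late.shape)  # (10,) (10,)
```

Hybrid fusion mixes the two, e.g. combining intermediate features in some layers and output scores in others.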
Pretrained-backbone multimodal language models
Combine a pretrained visual backbone or visual abstractor (e.g., ViT/CLIP/SigLIP, Q-Former, cross-attention adapter) with a pretrained LLM through a connector. The defining property is inheritance from strong pretrained unimodal backbones rather than joint multimodal pretraining from scratch. These models are primarily text-output understanding/reasoning systems, even when auxiliary generators are attached externally.
Key characteristics:
- ✅ Pretrained visual encoder / abstractor
- ✅ Pretrained LLM backbone
- ✅ Connector layer or cross-attention bridge
- ❌ No end-to-end multimodal pretraining from scratch
- ❌ No native image generation inside the same backbone
Examples: LLaVA, Qwen-VL, InternVL, MiniCPM-V, CogVLM
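The connector pattern that defines this category can be sketched as follows (a LLaVA-style linear projector; all dimensions and the random stand-ins for pretrained features are illustrative assumptions, not values from any specific checkpoint):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 1024, 4096      # illustrative hidden sizes
n_patches, n_text = 576, 32       # image tokens from the ViT, text tokens

# Stand-in for frozen pretrained vision-encoder output (CLIP/SigLIP-style).
vision_feats = rng.standard_normal((n_patches, d_vision))

# Trainable connector: a linear projection into the LLM embedding space.
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.01
vision_tokens = vision_feats @ W_proj            # (576, 4096)

# Stand-in for text embeddings from the pretrained LLM's embedding table.
text_tokens = rng.standard_normal((n_text, d_llm))

# Projected vision tokens are inserted directly into the LLM input sequence;
# neither backbone is pretrained jointly with the other.
llm_input = np.concatenate([vision_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (608, 4096)
```

Q-Former and cross-attention variants replace the linear projection with a heavier bridge, but the inherited-backbone structure is the same.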
Single framework for Understanding + Generation (U+G)
A single framework that handles both multimodal understanding and visual generation. UMMs may reuse pretrained components or modular tokenizers; the defining feature is U+G unification, not whether the model is trained from scratch.
Key characteristics:
- ✅ Unified understanding + generation
- ✅ Shared model interface or shared backbone for U+G
- ⚠️ May use pretrained components
- ⚠️ May use decoupled encoders / modular tokenizers
- ⚠️ If a model is also natively trained from scratch, its architectural details belong primarily in NMMs (§5)
Examples: Show-o, Janus, OpenUni, BAGEL, BLIP3-o
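One common discrete-token recipe behind U+G unification maps image VQ codes into an extended vocabulary so a single decoder can emit either modality. The sketch below is hypothetical (the vocabulary sizes and helper names are illustrative, not taken from any listed model):

```python
# Hypothetical shared vocabulary layout: text ids occupy [0, TEXT_VOCAB),
# image codebook ids occupy [TEXT_VOCAB, TEXT_VOCAB + CODEBOOK_SIZE).
TEXT_VOCAB = 32000
CODEBOOK_SIZE = 8192

def image_code_to_token(code: int) -> int:
    """Offset a VQ codebook index into the shared vocabulary."""
    assert 0 <= code < CODEBOOK_SIZE
    return TEXT_VOCAB + code

def token_to_modality(token: int) -> str:
    """Route a sampled token back to its modality at decode time."""
    return "text" if token < TEXT_VOCAB else "image"

print(image_code_to_token(5))     # 32005
print(token_to_modality(32005))   # image
print(token_to_modality(17))      # text
```

Diffusion-based and hybrid UMMs instead attach a continuous generation head, but the shared-interface idea is the same: one backbone serves both understanding and generation.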
Jointly trained from scratch — no pretrained backbone
The strictest category. NMMs are trained jointly from scratch on all modalities — they do not rely on any pretrained LLM or pretrained vision encoder as initialization. All parameters are learned end-to-end from raw multimodal data.
Key characteristics:
- ✅ No pretrained LLM backbone
- ✅ No pretrained vision encoder
- ✅ All components jointly trained from scratch
- ✅ Input: text tokens + image patches/tokens
- ✅ Output: text (understanding focus; generation optional)
NMMs are further divided by fusion architecture:
Multimodal interaction begins from the first layer. A single Transformer decoder processes tokenized text and continuous/discrete image patches together, with minimal modality-specific parameters (only a linear patchify layer for images). No separate image encoder is maintained.
- Single unified Transformer (decoder-only)
- Continuous image patches or minimal discrete tokenization
- Modality interaction from layer 1
- Near-zero modality-specific parameters (excluding linear patch embed)
- Examples: Emu3 (if trained from scratch)
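The "linear patchify only" property can be sketched as below (patch size and widths are illustrative; random arrays stand in for real pixels and embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, patch = 768, 16

# Raw image -> non-overlapping patches -> one linear embed. No ViT, no CLIP.
image = rng.standard_normal((224, 224, 3))
patches = image.reshape(14, patch, 14, patch, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(14 * 14, patch * patch * 3)        # (196, 768)

W_patch = rng.standard_normal((patch * patch * 3, d_model)) * 0.01
img_tokens = patches @ W_patch                               # (196, 768)

# Stand-in for the decoder's text token embeddings.
txt_tokens = rng.standard_normal((32, d_model))

# Both streams enter the SAME decoder at layer 1; the linear patchify
# projection above is the only modality-specific parameter set.
sequence = np.concatenate([txt_tokens, img_tokens], axis=0)
print(sequence.shape)  # (228, 768)
```

Contrast with the MLLM sketch earlier: here `W_patch` is learned jointly with the decoder from scratch, rather than inherited from a pretrained vision tower.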
Each modality is first processed by a dedicated unimodal component (e.g., a vision tower or image encoder), but these components are jointly trained from scratch (not pretrained). Cross-modal interaction occurs at deeper layers.
- Separate unimodal processing stages (trained from scratch)
- Cross-modal interaction at deeper layers
- More modality-specific parameters
- Examples: Models with jointly-trained vision encoders → decoder interaction
Multimodal Models
├── 2. Traditional Multimodal Models
│ ├── 2.1 Multimodal Representations & Alignment
│ │ ├── Multimodal Representations
│ │ ├── Multimodal Fusion
│ │ └── Multimodal Alignment
│ └── 2.2 Multimodal Pretraining
├── 3. Multimodal Large Language Models (MLLMs)
│ ├── 3.1 Foundation MLLMs
│ └── 3.2 Omni MLLMs
├── 4. Unified Multimodal Models (UMMs)
│ ├── 4.1 Taxonomy by Generation Paradigm
│ │ ├── Diffusion-Based UMMs
│ │ ├── Autoregressive (AR) UMMs
│ │ │ ├── Pixel Encoding
│ │ │ ├── Semantic Encoding
│ │ │ ├── Learnable Query Encoding
│ │ │ ├── Hybrid Encoding (Pseudo)
│ │ │ └── Hybrid Encoding (Joint)
│ │ └── Hybrid (AR + Diffusion) UMMs
│ │ ├── Pixel Encoding
│ │ └── Hybrid Encoding
│ └── 4.2 Any-to-Any / Omni UMMs
└── 5. Native Multimodal Models (NMMs)
├── 5.1 Design Analyses & Scaling Laws
├── 5.2 Early Fusion NMMs
├── 5.3 Late Fusion NMMs
└── 5.4 Any-to-Any / Omni NMMs
┌─────────────────────────────────────────────────────────────────┐
│ TRADITIONAL MULTIMODAL MODEL │
│ │
│ [Image] ──► [CNN/ViT Encoder] ──┐ │
│ ├──► [Fusion] ──► [Output] │
│ [Text] ──► [LSTM/BERT] ──┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ MLLM — MODULAR LATE FUSION │
│ │
│ [Image] ──► [Pretrained ViT/CLIP] ──► [Projector/Q-Former] │
│ │ │
│ ▼ │
│ [Text] ──────────────────────────► [Pretrained LLM] ──► [Text]│
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ UMM — UNIFIED UNDERSTANDING + GENERATION │
│ │
│ [Image/Text Input] ──► [Shared/Modular Tokenizer] │
│ │ │
│ ▼ │
│ [Unified Transformer] │
│ │ │ │
│ ▼ ▼ │
│ [Text Output] [Image Output] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NMM — EARLY FUSION (Trained from Scratch) │
│ │
│ [Text tokens] ──┐ │
│ └──► [Single Decoder Transformer] ──► [Text] │
│ [Image patches ──► Linear Patchify] ──┘ │
│ (raw pixels, minimal preprocessing) │
│ Multimodal interaction from Layer 1 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NMM — LATE FUSION (Trained from Scratch) │
│ │
│ [Image] ──► [Jointly-Trained Vision Component] │
│ │ │
│ ▼ (deep layers) │
│ [Text] ──────────► [Cross-Modal Interaction] ──► [Text] │
│ (All components trained jointly from scratch) │
└─────────────────────────────────────────────────────────────────┘
In this section: 2.1 Multimodal Representations & Alignment · 2.2 Multimodal Pretraining
Pre-chat-MLLM and non-native multimodal systems that established the basic vocabulary of alignment, fusion, retrieval, captioning, and multimodal pretraining.
Subtopics: Multimodal Representations · Multimodal Fusion · Multimodal Alignment
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Identifiability Results for Multimodal Contrastive Learning | ICLR 2023 | Paper | Theoretical identifiability analysis of contrastive multimodal learning | representation learning |
| Unpaired Vision-Language Pre-training via Cross-Modal CutMix | ICML 2022 | Paper | Introduces CutMix-style augmentation for unpaired VLP | vision-language pretraining |
| Balanced Multimodal Learning via On-the-fly Gradient Modulation | CVPR 2022 | Paper | Balances modality learning via dynamic gradient reweighting | multimodal optimization |
| Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast | IJCAI 2021 | Paper | Cross-modal prototype contrast for voice-face alignment | audio-visual representation learning |
| Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text | arXiv 2021 | Paper | Early unified transformer for unpaired multimodal pretraining | unified multimodal pretraining |
| FLAVA: A Foundational Language And Vision Alignment Model | arXiv 2021 | Paper | Unified foundation model covering unimodal vision, unimodal language, and cross-modal tasks | foundation multimodal model |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | arXiv 2021 | Paper | Single transformer for multiple multimodal tasks | multimodal multitask learning |
| MultiBench: Multiscale Benchmarks for Multimodal Representation Learning | NeurIPS 2021 | Paper | Benchmark suite for multimodal learning evaluation | benchmarking |
| Perceiver: General Perception with Iterative Attention | ICML 2021 | Paper | General-purpose architecture for high-dimensional multimodal inputs | general multimodal architecture |
| Learning Transferable Visual Models From Natural Language Supervision | arXiv 2021 | Paper | Contrastive vision-language pretraining at scale | vision-language contrastive learning |
| VinVL: Revisiting Visual Representations in Vision-Language Models | arXiv 2021 | Paper | Improved visual features for VL tasks | vision-language representation improvement |
| 12-in-1: Multi-Task Vision and Language Representation Learning | CVPR 2020 | Paper | Unified multi-task learning across 12 VL tasks | multi-task learning |
| Watching the World Go By: Representation Learning from Unlabeled Videos | arXiv 2020 | Paper | Self-supervised video representation learning | video representation learning |
| Learning Video Representations using Contrastive Bidirectional Transformer | arXiv 2019 | Paper | Contrastive transformer for video representation learning | video contrastive learning |
| Visual Concept-Metaconcept Learning | NeurIPS 2019 | Paper | Hierarchical concept learning from visual data | concept learning |
| OmniNet: A Unified Architecture for Multi-modal Multi-task Learning | arXiv 2019 | Paper | Unified encoder-decoder for multimodal tasks | unified multimodal architecture |
| Learning Representations by Maximizing Mutual Information Across Views | arXiv 2019 | Paper | InfoMax principle for cross-view representation learning | self-supervised learning |
| ViCo: Word Embeddings from Visual Co-occurrences | ICCV 2019 | Paper | Learning word embeddings from visual context | vision-language embeddings |
| Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations | CVPR 2019 | Paper | Structured embedding space for vision-language alignment | embedding learning |
| Multi-Task Learning of Hierarchical Vision-Language Representation | CVPR 2019 | Paper | Hierarchical representation learning across VL tasks | multi-task learning |
| Learning Factorized Multimodal Representations | ICLR 2019 | Paper | Factorized latent space for multimodal data | representation disentanglement |
| A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks | ICML 2018 | Paper | Probabilistic modeling of multi-view correspondence | multi-view learning |
| Do Neural Network Cross-Modal Mappings Really Bridge Modalities? | ACL 2018 | Paper | Analyzes limitations of cross-modal mapping | theoretical analysis |
| Learning Robust Visual-Semantic Embeddings | ICCV 2017 | Paper | Improved robustness in vision-language embeddings | embedding learning |
| Deep Multimodal Representation Learning from Temporal Data | CVPR 2017 | Paper | Temporal multimodal representation learning | multimodal temporal learning |
| Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations | COLING 2016 | Paper | Analyzes semantic gap between vision and language | representation analysis |
| Combining Language and Vision with a Multimodal Skip-gram Model | NAACL 2015 | Paper | Extends skip-gram with visual context | multimodal embeddings |
| Deep Fragment Embeddings for Bidirectional Image Sentence Mapping | NeurIPS 2014 | Paper | Fragment-level image-sentence alignment | vision-language alignment |
| Multimodal Learning with Deep Boltzmann Machines | JMLR 2014 | Paper | Probabilistic generative multimodal model | generative multimodal learning |
| Learning Grounded Meaning Representations with Autoencoders | ACL 2014 | Paper | Autoencoder-based grounded semantics | representation learning |
| DeViSE: A Deep Visual-Semantic Embedding Model | NeurIPS 2013 | Paper | Early deep vision-to-language embedding model | vision-language embedding |
| Multimodal Deep Learning | ICML 2011 | Paper | Foundational multimodal deep learning framework | multimodal deep learning |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Robust Contrastive Learning against Noisy Views | arXiv 2022 | Paper | Robust contrastive learning under noisy multi-view inputs | contrastive learning |
| Cooperative Learning for Multi-view Analysis | arXiv 2022 | Paper | Cooperative optimization across multiple views for representation learning | multi-view learning |
| What Makes Multi-modal Learning Better than Single (Provably) | NeurIPS 2021 | Paper | Theoretical guarantees showing when multimodal learning improves over unimodal | theoretical analysis |
| Efficient Multi-Modal Fusion with Diversity Analysis | ACMMM 2021 | Paper | Fusion method emphasizing diversity-aware multimodal integration | multimodal fusion |
| Attention Bottlenecks for Multimodal Fusion | NeurIPS 2021 | Paper | Introduces bottleneck attention mechanism for efficient multimodal fusion | multimodal fusion |
| VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization | AAAI 2021 | Paper | Variational multimodal fusion for camera localization tasks | multimodal localization |
| Trusted Multi-View Classification | ICLR 2021 | Paper | Confidence-aware weighting for multi-view classification | multi-view classification |
| Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis | ICDM 2020 | Paper | Higher-order sequence fusion for multimodal sentiment analysis | multimodal sentiment analysis |
| Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies | NeurIPS 2020 | Paper | Entropy-based regularization to reduce modality bias | multimodal fairness/robustness |
| Deep Multimodal Fusion by Channel Exchanging | NeurIPS 2020 | Paper | Channel exchange mechanism for cross-modal feature interaction | multimodal fusion |
| What Makes Training Multi-Modal Classification Networks Hard? | CVPR 2020 | Paper | Analyzes optimization challenges in multimodal classification | theoretical/empirical analysis |
| Dynamic Fusion for Multimodal Data | arXiv 2019 | Paper | Adaptive fusion strategy depending on input modality quality | multimodal fusion |
| DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis | IJCAI 2019 | Paper | Separates shared and private latent representations for fusion | multimodal sentiment analysis |
| Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling | NeurIPS 2019 | Paper | High-order tensor/polynomial fusion for multimodal features | multimodal fusion |
| XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification | IEEE TNNLS 2019 | Paper | Cross-modal feature exchange network for audio-visual tasks | audio-visual classification |
| MFAS: Multimodal Fusion Architecture Search | CVPR 2019 | Paper | Neural architecture search for optimal multimodal fusion design | architecture search |
| The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision | ICLR 2019 | Paper | Neuro-symbolic model combining perception and reasoning | neuro-symbolic learning |
| Unifying and merging well-trained deep neural networks for inference stage | IJCAI 2018 | Paper | Model merging strategy for inference-time multimodal integration | model fusion |
| Efficient Low-rank Multimodal Fusion with Modality-Specific Factors | ACL 2018 | Paper | Low-rank factorization for efficient multimodal fusion | efficient fusion |
| Memory Fusion Network for Multi-view Sequential Learning | AAAI 2018 | Paper | Memory-based fusion across temporal multimodal sequences | sequential multimodal learning |
| Tensor Fusion Network for Multimodal Sentiment Analysis | EMNLP 2017 | Paper | Tensor-based full interaction modeling across modalities | multimodal sentiment analysis |
| Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework | AAAI 2015 | Paper | Joint modeling of video and compositional language | vision-language modeling |
| A co-regularized approach to semi-supervised learning with multiple views | ICML 2005 | Paper | Early multi-view co-regularization framework | multi-view semi-supervised learning |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| CLIP | arXiv 2021 | Paper | Dual-encoder (Vision Transformer + Text Transformer) trained on 400M+ image-text pairs; contrastive alignment at embedding level; classic late-fusion foundation | multimodal alignment |
| Reconsidering Representation Alignment for Multi-view Clustering | CVPR 2021 | Paper | Revisits representation alignment objectives for multi-view clustering | multimodal alignment |
| CoMIR: Contrastive Multimodal Image Representation for Registration | NeurIPS 2020 | Paper | Contrastive learning for multimodal image registration alignment | multimodal alignment |
| Multimodal Transformer for Unaligned Multimodal Language Sequences | ACL 2019 | Paper | Transformer-based alignment for unaligned multimodal sequences | sequence alignment |
| Temporal Cycle-Consistency Learning | CVPR 2019 | Paper | Uses cycle-consistency for temporal cross-modal alignment | temporal alignment |
| See, Hear, and Read: Deep Aligned Representations | arXiv 2017 | Paper | Learns aligned representations across vision, audio, and text | multimodal alignment |
| On Deep Multi-View Representation Learning | ICML 2015 | Paper | Theoretical and empirical study of multi-view representation alignment | multi-view learning |
| Unsupervised Alignment of Natural Language Instructions with Video Segments | AAAI 2014 | Paper | Aligns language instructions with video segments without supervision | language-video alignment |
| Multimodal Alignment of Videos | ACM MM 2014 | Paper | Early multimodal alignment framework for video modalities | video alignment |
| Deep Canonical Correlation Analysis | ICML 2013 | Paper | Deep learning extension of CCA for cross-view representation alignment | representation alignment |
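The embedding-level contrastive alignment used by CLIP-style methods in the table above can be written as a symmetric InfoNCE over a batch of paired embeddings. A toy sketch with random features (the temperature and dimensions are illustrative):

```python
import numpy as np

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: matched (i, i) pairs are positives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(img))

    def xent(l):
        # Numerically stable cross-entropy picking the diagonal as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average over image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = clip_style_loss(rng.standard_normal((8, 64)),
                       rng.standard_normal((8, 64)))
print(float(loss))
```

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart in the shared embedding space.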
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | NeurIPS 2021 Spotlight | Paper | Momentum distillation for aligning vision-language representations before fusion | vision-language pretraining |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | CVPR 2021 | Paper | Sparse frame sampling for efficient video-language pretraining | video-language pretraining |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | arXiv 2021 | Paper | Unified transformer for multitask multimodal learning | unified multimodal pretraining |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | NeurIPS 2020 | Paper | Adversarial training improves robustness of vision-language representations | robust multimodal pretraining |
| Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision | EMNLP 2020 | Paper | Grounds language tokens in visual context via voken supervision | vision-grounded language modeling |
| Integrating Multimodal Information in Large Pretrained Transformers | ACL 2020 | Paper | Injects multimodal signals into large pretrained transformer architectures | multimodal transformer pretraining |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | arXiv 2019 | Paper | Joint vision-language BERT-style pretraining | vision-language pretraining |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | arXiv 2019 | Paper | Early unified transformer for vision-language understanding | vision-language pretraining |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | NeurIPS 2019 | Paper | Two-stream transformer for cross-modal vision-language learning | vision-language pretraining |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | arXiv 2019 | Paper | Cross-modal encoder for universal vision-language representations | vision-language pretraining |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | EMNLP 2019 | Paper | Cross-modality transformer encoder for vision-language reasoning | vision-language pretraining |
| VideoBERT: A Joint Model for Video and Language Representation Learning | ICCV 2019 | Paper | Joint discrete token modeling for video and language | video-language pretraining |
In this section: 3.1 Foundation MLLMs · 3.2 Omni MLLMs
Models that connect a pretrained visual encoder / abstractor to a pretrained LLM. Primarily text-output understanding and reasoning systems, defined by inherited pretrained unimodal backbones rather than multimodal pretraining from scratch.
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders | arXiv 2026 | Paper | LLM-initialized vision encoder (non-CLIP); text-to-vision weight reuse, generative-aligned visual features, optimized for dense perception. | visual understanding |
| Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision | arXiv 2026 | Paper | Tri-modal (V+A+L) unified framework; parameter-efficient tuning, seamless cross-modal reasoning for mobile/IoT deployment. | visual understanding |
| STEP3-VL-10B Technical Report | arXiv 2026 | Paper | 10B-scale foundation multimodal; unified unfrozen pre-training + PaCoRe test-time scaling, frontier-level reasoning with compact footprint. | visual understanding |
| GLM-OCR | arXiv 2026 | Paper | Compact, efficient 0.9B-parameter multimodal model for real-world document understanding | OCR, structured extraction |
| Kimi K2.5 | arXiv 2026 | Paper | Joint text-vision pretraining, Agent Swarm framework; coding, vision, reasoning, agentic tasks; reduces latency by up to 4.5x | visual agentic intelligence, agentic, reasoning |
| Kwai Keye-VL 1.5 Technical Report | arXiv 2025 | Paper | Adaptive Slow-Fast encoding; 8B parameter scale with 128K long-context; SOTA video reasoning & human-preference aligned. | visual understanding |
| olmOCR / olmOCR-2 | arXiv 2025 | Paper | Efficient low-VRAM OCR model fine-tuned from Qwen2.5-VL; excels at preserving semantic structure and Markdown output | OCR, structured extraction |
| PaddleOCR-VL | arXiv 2025 | HF / Official | Lightweight (0.9B+) multimodal OCR with 109 languages support; excellent chart-to-HTML/Markdown conversion and high-throughput | OCR, multilingual document |
| DeepSeek-OCR | arXiv 2025 | Paper HF | Lightweight ~3B MoE vision model optimized for high-volume OCR, document digitization, charts and formulas; efficient inference | OCR, document |
| Kimi-VL | arXiv 2025 | Paper HF | Projector + MoE backbone; long video/PDF/GUI, agentic capabilities, chain-of-thought vision reasoning | visual understanding, agentic, video |
| Seed1.5-VL Technical Report | arXiv 2025 | Paper | 20B MoE + 532M ViT; native-resolution vision-language foundation model; efficient asymmetric architecture. | visual understanding |
| Qwen3-VL | arXiv 2025 | Paper HF | Frontier-grade vision/OCR (32+ languages), video analysis, agentic capabilities, strong multimodal reasoning; includes large MoE variants (e.g., 235B) | visual understanding, video, omni |
| SmolVLM | arXiv 2025 | HF | Ultra-lightweight (256M–2.2B) projector-based series; efficient on-device video and image understanding | visual understanding, efficiency |
| LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning | arXiv 2025 | Paper | Diffusion LLM as language backbone; vision encoder: SigLIP | visual understanding |
| jina-vlm | arXiv 2025 | Paper HF | SigLIP2 + Qwen backbone with custom projector; optimized for semantic VQA, diagrams, scans and document semantics | visual understanding, VQA, document |
| Phi-4-Multimodal | arXiv 2025 | Paper HF | Small-parameter (LoRA + projectors) multimodal; vision + speech support, efficient on-device deployment | visual understanding, on-device |
| Molmo / PixMo | CVPR 2025 | Paper Code | Strong open-data/open-weight VLM pipeline | visual understanding |
| FastVLM: Efficient Vision Encoding for Vision Language Models | CVPR 2025 | Paper | efficient multimodal visual encoding for on-device deployment | visual understanding, on-device |
| Qwen2.5-VL: Technical Report | arXiv 2025 | Paper HF | Stronger document, grounding, and video capabilities | visual understanding |
| General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model | arXiv 2024 | Paper HF/Code | Specialized end-to-end OCR model with grounding (boxes + points); strong on scientific papers, slides, and mixed visual-text docs | OCR, grounding |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv 2024 | Paper Code | Single model for image, multi-image, and video transfer | visual understanding |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv 2024 | Paper Code | On-device efficient MLLM | visual understanding |
| NVILA: Efficient Frontier Visual Language Models | CVPR 2025 | Paper | Efficient general-purpose multimodal LLM; spatial and temporal "scale then compress" design; vision encoder: SigLIP | visual understanding |
| xGen-MM (BLIP-3) | arXiv 2024 | Paper | Open training recipe, datasets, and safety-tuned variants | visual understanding |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models | arXiv 2024 | Paper Code | MoE VLM with dynamic tiling and efficient inference | visual understanding |
| Pixtral | arXiv 2024 | Paper HF | 12B open-weight model with strong instruction following, image+text understanding; competitive with larger open VLMs | visual understanding |
| Qwen2-VL | arXiv 2024 | Paper HF | Dynamic resolution; native video | visual understanding |
| Cambrian-1: A Fully Open, Vision-Centric Exploration | NeurIPS 2024 | Paper Code | Spatial Vision Aggregator | visual understanding |
| PaliGemma: A Versatile 3B VLM for Transfer | arXiv 2024 | Paper HF | SigLIP encoder + Gemma backbone; strong transfer model | visual understanding |
| InternLM-XComposer2 | arXiv 2024 | Paper Code | Compositional visual grounding | visual understanding |
| Phi-3-Vision | arXiv 2024 | Paper HF | Small but capable | visual understanding |
| LLaVA-HR: High Resolution MLLMs | CVPR 2024 | Paper | Mixture-of-Resolution Adaptation | visual understanding |
| InternVL2 | Model release 2024 | HF | Instruction-tuned InternVL family release with strong multilingual and OCR capabilities | visual understanding |
| InternVL: Scaling up Vision Foundation Models | CVPR 2024 | Paper Code | Progressively aligned ViT + LLM | visual understanding |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv 2024 | Paper | Large-scale proprietary recipe study for multimodal LLM pretraining | visual understanding |
| LLaVA | arXiv 2023 | Paper Code | 7B/13B+; CLIP vision encoder (frozen/pretrained) + linear projection to LLM (Vicuna/LLaMA); vision tokens inserted into LLM input; common late-fusion baseline | visual understanding |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| M-MiniGPT4: Multilingual VLLM Alignment via Translated Data | arXiv 2026 | Paper | Q-Former based (inherits from MiniGPT-4 / BLIP-2) | vision-language understanding |
| Video Q-Former: Multimodal Large Language Model with Spatio-Temporal Querying Transformer | OpenReview | Paper | Spatio-temporal Q-Former (learnable queries for video spatial-temporal feature extraction) | video understanding |
| HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding | arXiv 2025 | Paper | Hierarchical Q-Former (multi-level learnable queries with memory bank for long video) | long video understanding |
| Towards Efficient Visual-Language Alignment of the Q-Former | arXiv 2024 | Paper | PEFT-tuned Q-Former (parameter-efficient fine-tuning on InstructBLIP-style Q-Former) | visual reasoning |
| Matryoshka Query Transformer (MQT) for Large Vision-Language Models | NeurIPS 2024 | Paper | Matryoshka Query Transformer (elastic learnable queries, variable token count) | vision-language understanding |
| Semantically Grounded QFormer for Efficient Vision Language Understanding | arXiv 2023 | Paper | Improved Grounded QFormer (direct latent conditioning, bypass input projection) | vision-language understanding |
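The learnable-query mechanism shared by the papers above compresses a variable number of image features into a fixed number of tokens via cross-attention. A single-head toy sketch (dimensions and the random stand-ins are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 768
n_image_feats = 257        # e.g. ViT patch features (varies per input)
n_queries = 32             # fixed, learnable queries (the Q-Former idea)

queries = rng.standard_normal((n_queries, d)) * 0.02   # learned parameters
image_feats = rng.standard_normal((n_image_feats, d))  # frozen encoder output

# Cross-attention: queries attend over the image features.
attn = softmax(queries @ image_feats.T / np.sqrt(d))   # (32, 257)
out = attn @ image_feats                               # (32, 768)

# The output is always n_queries tokens, however many features came in,
# which keeps the LLM's visual context length constant.
print(out.shape)  # (32, 768)
```

Variants in the table change how many queries there are (Matryoshka/elastic), how they are structured (hierarchical, spatio-temporal), or how they are tuned (PEFT), but all keep this fixed-query bottleneck.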
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| CASA: Cross-Attention over Self-Attention | arXiv 2025 | Paper | Efficient cross-attention via self-attention reformulation; competitive with token insertion on image benchmarks, strong for long video | efficient vision-language fusion, video captioning |
| LLaMA 3.2 Vision | arXiv 2024 | Paper HF | Adapter-based vision addition to Llama 3.2; strong OCR, document VQA, 128K context | visual understanding, document |
| Idefics2 | arXiv 2024 | Paper HF | Flamingo-style with Perceiver Resampler + gated cross-attention; improved efficiency on Mistral backbone | open multimodal understanding |
| CogVLM: Visual Expert for Pretrained Language Models | arXiv 2023 | Paper Code | Deep fusion with visual expert modules inside a pretrained LLM | visual understanding |
| Qwen-VL: A Versatile Vision-Language Model | arXiv 2023 | Paper HF | High-res, multi-lang, bounding box | visual understanding |
| IDEFICS | — | Hugging Face | Flamingo-inspired 80B model; late fusion of a vision encoder with the LLM via gated cross-attention | open multimodal understanding |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| DeepSeek-OCR-2 | arXiv 2026 | Paper HF | Optimized for high-volume OCR, document digitization, charts and formulas; efficient inference | OCR, document |
| Ovis2.5 | arXiv 2025 | Paper | Following VET architecture; excellent document understanding and fine-grained quantization | visual understanding, document |
| Ovis2 | arXiv 2025 | HF | Embedding table / projector architecture; excellent document understanding and fine-grained quantization | visual understanding, document |
| MiniMax-01: Scaling Foundation Models with Lightning Attention | arXiv 2025 | Paper | Hybrid Lightning-Softmax Attention; MoE-based (45.9B active) multimodal; 4M long-context with near-zero prefill latency. | visual understanding |
| mPLUG-Owl3 | arXiv 2024 | Paper Code | Long visual sequences | visual understanding |
| Idefics3 | arXiv 2024 | Paper HF | Open-data recipe with strong document understanding | visual understanding |
| NVLM 1.0: Open Frontier-Class Multimodal LLMs | arXiv 2024 | Paper HF | Hybrid multimodal design with strong OCR and reasoning | visual understanding |
| Idefics2 | arXiv 2024 | Paper HF | Fully open; built on Mistral | visual understanding |
| mPLUG-DocOwl 1.5 / 2: Unified Structure Learning for OCR-free Document Understanding | arXiv 2024 | Paper Code | OCR-free document understanding with unified structure learning; excels at long documents and complex layouts | document understanding, OCR |
| Paper | Venue | Links | Notes | Task | Adaptor |
|---|---|---|---|---|---|
| OmniGAIA: Towards Native Omni-Modal AI Agents | arXiv 2026 | Paper Code | Comprehensive benchmark for omni-modal agents with complex multi-hop queries across video, audio, and image; includes OmniAtlas agent with tool-integrated reasoning | omni-modal understanding & reasoning | Native |
| ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding | arXiv 2026 | Paper | Training-free framework that lifts textual reasoning to omni-modal scenarios using LRM guidance and stepwise contrastive scaling | omni-modal reasoning | Hybrid |
| OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention | arXiv 2026 | Paper | Reinforced audio-visual reasoning framework with query intention grounding and modality attention fusion | audio-visual reasoning | Hybrid |
| ChronusOmni: Improving Time Awareness of Omni Large Language Models | arXiv 2025 | Paper Code | Enhances temporal awareness in omni-modal LLMs | time-aware omni-modal understanding | Hybrid |
| Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | arXiv 2025 | Paper Code | MoE-based scaling for omnimodal understanding and generation | omni-modal understanding & generation | MLP Projector |
| Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | arXiv 2025 | Paper Code | Unified audio-visual speech recognition using LLMs | audio-visual speech recognition | Hybrid |
| LongCat-Flash-Omni Technical Report | arXiv 2025 | Paper Code | Long-context omni-modal model supporting text and audio generation | long-context omni-modal | Hybrid |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | arXiv 2025 | Paper Code | Architecture and data enhancements for omni-modal understanding | omni-modal understanding | Hybrid |
| InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | arXiv 2025 | Paper Code | Unified model for audio-visual multi-turn dialogue | audio-visual dialogue | Hybrid |
| OneLLM: One Framework to Align All Modalities with Language | CVPR 2024 | Paper | Aligns eight modalities to language through a unified encoder and a universal projection module | all-in-one LLM | Hybrid |
| MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition | NeurIPS 2025 | Paper | Mixture of Matryoshka experts for efficient audio-visual speech recognition | audio-visual speech recognition | Hybrid |
| Qwen3-Omni Technical Report | arXiv 2025 | Paper Code | Omni-modal model with text and audio capabilities (Alibaba/Qwen series) | omni-modal | Native |
| Qwen2.5-Omni Technical Report | arXiv 2025 | Paper Code | Omni-modal technical report with text and audio support (Alibaba/Qwen series) | omni-modal | Hybrid |
| MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone | 2025 | Paper Code | On-device GPT-4o level MLLM for vision, speech and multimodal live streaming (OpenBMB) | on-device multimodal live streaming | Hybrid |
| Baichuan-Omni Technical Report | arXiv 2024 | Paper Code | Technical report for Baichuan-Omni (Baichuan Inc.) | omni-modal | Hybrid |
| Baichuan-Omni-1.5 Technical Report | arXiv 2025 | Paper Code | Technical report for Baichuan-Omni 1.5 (Baichuan Inc.) | omni-modal | Hybrid |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv 2024 | Paper Code | Open-source interactive omni multimodal LLM | interactive omni multimodal | Hybrid |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv 2024 | Paper Code | Real-time vision and speech interaction model | real-time multimodal interaction | Hybrid |
| Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | NeurIPS 2024 | Paper Code | Open-source GPT-4o style model with vision, speech and duplex capabilities | vision-speech duplex | Hybrid |
| Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment | arXiv 2025 | Paper Code | Progressive modality alignment for omni-modal language model | omni-modal alignment | MLP Projector |
| MIO: A Foundation Model on Multimodal Tokens | arXiv 2024 | Paper Code | Foundation model based on multimodal tokens | multimodal tokens | Native |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | CVPR 2025 | Paper Code | Multimodal model supporting seeing, hearing and emotional speech | emotional multimodal | Hybrid |
| Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model | arXiv 2025 | Paper Code | Simultaneous multimodal interactions with language-vision-speech model | simultaneous multimodal | Hybrid |
| ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding | arXiv 2025 | Paper Code | Native multimodal LLM focused on 3D generation and understanding | 3D multimodal | Native |
In this section: 4.1 Taxonomy by Generation Paradigm · 4.2 Any-to-Any / Omni UMMs
Models that unify multimodal understanding and visual generation within one framework. The defining property is U+G unification, not necessarily training from scratch.
Boundary with NMMs: if a unified model's central contribution is native end-to-end multimodal pretraining from scratch, we document its architectural details primarily in §5 NMMs and keep §4 focused on the unified U+G perspective.
Overview of representative paradigms and architectures of Unified Multimodal Models (UMMs). Source: https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models
Subtopics: Diffusion-Based UMMs · Autoregressive (AR) UMMs · Hybrid (AR + Diffusion) UMMs
Unified models are categorized according to their core generation mechanism for visual output (while supporting strong multimodal understanding). This taxonomy highlights trade-offs in fidelity, reasoning, efficiency, and training stability.
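The paradigm split above is, at bottom, a difference in training objective over the same discrete token stream. A toy sketch (hypothetical vocabulary/sequence sizes and random stand-in logits, not any listed model's code) contrasting the autoregressive and masked-discrete-diffusion objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ = 16, 8
tokens = rng.integers(0, VOCAB, size=SEQ)   # a mixed text/image token sequence
logits = rng.normal(size=(SEQ, VOCAB))      # stand-in for model outputs

def xent(logits_row, target):
    # numerically stable -log softmax(logits)[target]
    z = logits_row - logits_row.max()
    return -(z[target] - np.log(np.exp(z).sum()))

# Autoregressive: token t is predicted from tokens < t (left-to-right factorization)
ar_loss = np.mean([xent(logits[t], tokens[t]) for t in range(1, SEQ)])

# Masked discrete diffusion: mask a random subset, then predict all masked
# positions in parallel from the unmasked context
mask = rng.random(SEQ) < 0.5
if not mask.any():
    mask[0] = True                          # ensure at least one masked position
diff_loss = np.mean([xent(logits[t], tokens[t]) for t in range(SEQ) if mask[t]])
```

AR pays per-token sequential decoding at inference; masked diffusion trades that for parallel denoising steps, which is the fidelity/efficiency tension the tables below repeatedly surface.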
| Model | Venue | Links | Paradigm | Notes | Task |
|---|---|---|---|---|---|
| Dual Diffusion | arXiv 2025 | Paper Code | Dual Diffusion | Unified image generation + understanding via bidirectional diffusion | visual understanding, visual generation |
| UniDisc | arXiv 2025 | Paper Code | Unified Discrete Diffusion | Discrete diffusion for multimodal U+G | visual understanding, visual generation |
| MMaDA | arXiv 2025 | Paper Code | Multimodal Large Diffusion LM | Diffusion LM for unified understanding/generation | visual understanding, visual generation |
| FUDOKI | arXiv 2025 | Paper | Discrete Flow-based Unified | Kinetic-optimal velocities for U+G | visual understanding, visual generation |
| Muddit | arXiv 2025 | Paper Code | Unified Discrete Diffusion | Liberating generation beyond T2I | visual understanding, visual generation |
| Lavida-O | arXiv 2025 | Paper Code | Elastic Large Masked Diffusion | Elastic masked diffusion for U+G | visual understanding, visual generation |
| UniModel | arXiv 2025 | Paper | Visual-Only MMDiT Framework | Visual-only unified multimodal U+G | visual understanding, visual generation |
| Model | Venue | Links | Modalities | Notes | Task |
|---|---|---|---|---|---|
| LWM | arXiv 2024 | Paper | video + language | World model on million-length video and language with blockwise ring attention | visual understanding, visual generation |
| Chameleon | arXiv 2024 | Paper Code | image + text | Mixed-modal early-fusion foundation models; token-by-token generation | visual understanding, visual generation |
| ANOLE | arXiv 2024 | Paper Code | image + text | Open autoregressive native LMM for interleaved image-text generation | visual understanding, visual generation |
| Emu3 | arXiv 2024 | Paper Code | image + text | Next-token prediction is all you need; single next-token model | visual understanding, visual generation |
| MMAR | arXiv 2024 | Paper | image + text | Lossless multi-modal auto-regressive probabilistic modeling | visual understanding, visual generation |
| Orthus | arXiv 2024 | Paper Code | image + text | Autoregressive interleaved image-text generation with modality-specific heads | visual understanding, visual generation |
| SynerGen-VL | arXiv 2024 | Paper | image + text | Synergistic image understanding and generation with vision experts and token folding | visual understanding, visual generation |
| Liquid | arXiv 2024 | Paper Code | image + text | Language models are scalable and unified multi-modal generators | visual understanding, visual generation |
| UGen | arXiv 2025 | Paper | image + text | Unified autoregressive multimodal model with progressive vocabulary learning | visual understanding, visual generation |
| Harmon | arXiv 2025 | Paper Code | image + text | Shared MAR encoder for semantic + fine-grained harmony; SOTA GenEval | visual understanding, visual generation |
| TokLIP | arXiv 2025 | Paper Code | image + text | Marry visual tokens to CLIP for U+G | visual understanding, visual generation |
| Selftok | arXiv 2025 | Paper Code | image + text | Discrete visual tokens for AR / Diffusion / Reasoning | visual understanding, visual generation |
| OneCat | arXiv 2025 | Paper Code | image + text | Pure decoder-only unified U+G | visual understanding, visual generation |
| Uni-X | arXiv 2025 | Paper Code | image + text | Two-end-separated architecture mitigating modality conflict | visual understanding, visual generation |
| Emu3.5 | Nature 2026 | Paper Code | image + text | Native multimodal world learner; next-token only | visual understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer | arXiv 2025 | Paper Code | Unified continuous tokenizer for joint understanding and generation | visual understanding, visual generation |
| Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | arXiv 2025 | Paper Code | Bridging MLLMs and diffusion models via patch-level CLIP latents | visual understanding, visual generation |
| Qwen-Image Technical Report | arXiv 2025 | Paper Code | High-quality image generation with strong text rendering | visual generation |
| X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again | arXiv 2025 | Paper Code | RL-enhanced discrete autoregressive unified modeling | visual understanding, visual generation |
| Ovis-U1 Technical Report | arXiv 2025 | Paper Code | 3B unified model for understanding, text-to-image and editing | visual understanding, visual generation |
| UniCode²: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper | Cascaded large-scale codebooks for unified modeling | visual understanding, visual generation |
| OmniGen2: Exploration to Advanced Multimodal Generation | arXiv 2025 | Paper Code | Versatile open-source unified generation model | visual generation |
| Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations | arXiv 2025 | Paper Code | Text-aligned discrete semantic representations | visual understanding, visual generation |
| UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Y-shaped architecture for modality alignment | visual understanding, visual generation |
| UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | arXiv 2025 | Paper Code | High-resolution semantic encoders | visual understanding, visual generation |
| Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation | arXiv 2025 | Paper | Auto-regressive foundation model | visual understanding, visual generation |
| DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv 2025 | Paper | Dual visual vocabularies | visual understanding, visual generation |
| UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv 2025 | Paper Code | Unified tokenizer | visual understanding, visual generation |
| QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Text-aligned visual tokenization | visual understanding, visual generation |
| MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv 2024 | Paper | Instruction tuning for unified multimodal | visual understanding, visual generation |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv 2024 | Paper | Self-enhancing unified see-and-draw | visual understanding, visual generation |
| PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | arXiv 2024 | Paper Code | Multi-granular visual generation | visual understanding, visual generation |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR 2025 | Paper Code | Unified foundation model | visual understanding, visual generation |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv 2024 | Paper Code | Multi-modality potential mining | visual understanding, visual generation |
| MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv 2024 | Paper Code | Interleaved image-text generative modeling | visual understanding, visual generation |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv 2023 | Paper | Generative pre-trained transformer | visual understanding, visual generation |
| Generative Multimodal Models are In-Context Learners | CVPR 2024 | Paper | In-context learning with generative multimodal models (Emu2) | visual understanding, visual generation |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR 2024 | Paper | Synergistic multimodal comprehension and creation | visual understanding, visual generation |
| LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR 2024 | Paper Code | Dynamic discrete visual tokenization | visual understanding, visual generation |
| Emu: Generative Pretraining in Multimodality | ICLR 2024 | Paper | Generative pretraining in multimodality | visual understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model | arXiv 2025 | Paper | Kontext model with online RL and MetaQuery connector for unified multimodal framework | visual understanding, visual generation, editing |
| TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning | arXiv 2025 | Paper | Ladder-side diffusion tuning integrating MLLM and DiT via layer-wise alignment | visual understanding, visual generation |
| UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing | arXiv 2025 | Paper | Adapting CLIP with unified continuous tokenizer for reconstruction, generation and editing | visual understanding, visual generation, editing |
| OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Simple baseline with learnable queries and lightweight connector bridging MLLM and diffusion | visual understanding, visual generation |
| BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv 2025 | Paper | Fully open unified multimodal models with complete architecture, training recipe and datasets | visual understanding, visual generation |
| Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv 2025 | Paper | Unified visual generator and native multimodal autoregressive model for natural interaction | visual understanding, visual generation |
| Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing | arXiv 2025 | Paper Code | Prefilled autoregression in shared embedding space unifying understanding, generation and editing | visual understanding, visual generation, editing |
| Transfer between Modalities with MetaQueries | arXiv 2025 | Paper Code | Learnable MetaQueries as efficient interface between autoregressive MLLMs and diffusion models | visual understanding, visual generation |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv 2024 | Paper Code | Unified multi-granularity visual semantics for arbitrary-size comprehension and generation | visual understanding, visual generation |
| Making LLaMA SEE and Draw with SEED Tokenizer | ICLR 2024 | Paper Code | SEED tokenizer enabling LLaMA for scalable multimodal autoregression (see and draw) | visual understanding, visual generation |
| Planting a SEED of Vision in Large Language Model | arXiv 2023 | Paper Code | SEED image tokenizer with 1D causal dependency and high-level semantics for LLM vision | visual understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation | arXiv 2025 | Paper Code | Unified autoregressive modeling with decoupled encoding for image understanding, generation and editing | visual understanding, visual generation |
| MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | arXiv 2025 | Paper Code | Unified VLM with reasoning generation via Reinforcement Learning (RGPO) | multimodal understanding, reasoning generation |
| UniFluid: Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv 2025 | Paper | Unified autoregressive framework using continuous visual tokens | visual understanding, visual generation |
| OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv 2025 | Paper Code | Efficient linear-time unified multimodal model based on Mamba (state space models) | multimodal understanding, visual generation |
| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv 2025 | Paper Code | Scaled-up version of Janus with improved training strategy, more data and larger model size | multimodal understanding, visual generation |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv 2024 | Paper Code | Decoupling visual encoding to enable unified understanding and generation in an autoregressive framework | multimodal understanding, visual generation |
| Title | Venue | Links | Focus | Task |
|---|---|---|---|---|
| AToken: A Unified Tokenizer for Vision | arXiv 2025 | Paper Code | AToken unified visual tokenizer achieving high-fidelity reconstruction and semantic understanding for images, videos and 3D | visual understanding, visual generation |
| UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^128 for Unified Multimodal Large Language Model | arXiv 2026 | Paper | UniWeTok unified binary tokenizer with 2^{128} codebook, pre-post distillation and generative-aware prior for MLLMs | visual understanding, visual generation |
| Towards Scalable Pre-training of Visual Tokenizers for Generation | arXiv 2025 | Paper Code | VTP unified visual tokenizer pre-training framework with joint image-text contrastive, self-supervised and reconstruction losses | visual understanding, visual generation |
| The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding | arXiv 2025 | Paper Code | Prism Hypothesis and unified autoencoding (UAE) harmonizing semantic and pixel representations across modalities | visual understanding, visual generation |
| Show-o2: Improved Native Unified Multimodal Models | arXiv 2025 | Paper Code | Improved native unified multimodal models with autoregressive modeling and flow matching for understanding and generation | multimodal understanding and generation |
| UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | CVPRW 2025 | Paper Code | Unified visual encoding combining discrete and continuous representations for autoregressive multimodal models | multimodal understanding and generation |
| VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv 2025 | Paper Code | Enhanced visual autoregressive unified model with iterative instruction tuning and DPO reinforcement learning | visual understanding, generation and editing |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv 2025 | Paper Code | Dual visual tokenization and diffusion refinement for unified multimodal large language model | multimodal understanding and generation |
| SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | arXiv 2025 | Paper | Semantic-guided hierarchical codebook for unified image tokenization supporting understanding and generation | multimodal understanding and generation |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv 2025 | Paper Code | Visual autoregressive framework unifying understanding and generation in a single MLLM | visual understanding and generation |
| TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR 2025 | Paper Code | Unified image tokenizer with dual-codebook architecture bridging understanding and generation | multimodal understanding and generation |
| MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv 2024 | Paper | Semantic discrete encoding for unified vision-language model enabling efficient multimodal understanding and generation | multimodal understanding and generation |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models | arXiv 2025 | Paper Code | Native unified multimodal model with cascaded VAE + representation encoder for unified continuous visual representations | multimodal understanding and generation |
| LMFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv 2024 | Paper | Adapting pretrained LLMs (Llama) for multimodal generation by adding parallel diffusion modules while keeping autoregressive text modeling | multimodal understanding and generation |
| MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv 2024 | Paper Code | Single shared transformer backbone that handles both autoregressive modeling and diffusion for unified multimodal tasks | visual understanding and generation |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR 2025 | Paper Code | Unified transformer combining autoregressive and discrete diffusion modeling to flexibly handle mixed-modality inputs/outputs | multimodal understanding and generation |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR 2025 | Paper | Joint training of next-token prediction (AR) and diffusion in one transformer over mixed discrete/continuous multimodal sequences | visual understanding, visual generation |
| Paper | Venue | Links | Notes | Task |
|---|---|---|---|---|
| EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | arXiv 2025 | Paper Code | Efficient unified architecture with autoencoders, channel-wise concatenation, shared-decoupled networks and MoE for understanding, generation and editing | multimodal understanding, generation and editing |
| HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper | Asymmetric H-shaped architecture bridging heterogeneous experts with symmetric dense mid-layer connections for unified multimodal modeling | multimodal understanding and generation |
| LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | arXiv 2025 | Paper Code | Light-weighted double fusion framework that efficiently integrates pretrained vision-language and diffusion models | multimodal understanding and generation |
| BAGEL: Emerging Properties in Unified Multimodal Pretraining | arXiv 2025 | Paper Code | Open-source foundational decoder-only model pretrained on trillions of interleaved multimodal tokens supporting native understanding and generation | multimodal understanding and generation |
| Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv 2025 | Paper | Causal interleaved multi-modal generation framework with deep-fusion, dual vision encoders and multi-modal classifier-free guidance | interleaved multimodal generation |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv 2024 | Paper Code | Minimalist framework harmonizing autoregressive LLMs with rectified flow for efficient unified understanding and generation | multimodal understanding and generation |
Models that extend unified understanding + generation beyond text and image to support any-to-any modality conversion (audio, video, speech, etc.). These often build on the paradigms above but emphasize native omni-modal tokenization, long-context handling, and cross-modal generation.
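A common mechanism behind these any-to-any systems is flattening every modality into one shared discrete vocabulary: each modality's tokenizer output is offset into a disjoint ID range and wrapped in boundary tokens. A hypothetical sketch (vocabulary sizes and token names are illustrative, not taken from any listed model):

```python
# Illustrative vocabulary sizes for text, image, and audio tokenizers
TEXT_VOCAB, IMG_VOCAB, AUD_VOCAB = 32000, 8192, 4096
# Special boundary tokens placed after all modality ranges
BOS_IMG, EOS_IMG, BOS_AUD, EOS_AUD = range(
    TEXT_VOCAB + IMG_VOCAB + AUD_VOCAB,
    TEXT_VOCAB + IMG_VOCAB + AUD_VOCAB + 4)

def pack(text_ids, image_ids, audio_ids):
    """Interleave modalities into one token stream over the shared vocabulary."""
    seq = list(text_ids)
    seq += [BOS_IMG] + [TEXT_VOCAB + t for t in image_ids] + [EOS_IMG]
    seq += [BOS_AUD] + [TEXT_VOCAB + IMG_VOCAB + t for t in audio_ids] + [EOS_AUD]
    return seq

seq = pack([5, 17], [0, 8191], [3])
# image tokens land in [32000, 40192), audio tokens in [40192, 44288)
```

Once everything is a token in one vocabulary, a single next-token decoder can both read and emit any modality, which is what makes "any-to-any" tractable in this family.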
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| LongCat-Flash-Omni | arXiv 2025 | Paper Code | Efficient omni model with flash-style acceleration and real-time audio-visual interaction (560B parameters) | any-to-any multimodal generation and understanding |
| Ming-Flash-Omni | arXiv 2025 | Paper Code | Sparse unified MoE architecture (100B total, 6.1B active) for efficient multimodal perception and generation | any-to-any multimodal perception and generation |
| Qwen3-Omni | arXiv 2025 | Paper Code | Next-gen Qwen omni model with unified modality space, maintaining SOTA across text/image/audio/video | any-to-any multimodal understanding and generation |
| Ming-Omni | arXiv 2025 | Paper Code | Unified multimodal architecture for perception + generation (images, text, audio, video) | any-to-any multimodal tasks |
| M2-Omni | arXiv 2025 | Paper | Extends Omni-MLLM with broader modality support and competitive performance to GPT-4o | any-to-any multimodal modeling |
| Spider | arXiv 2024 | Paper Code | Any-to-many multimodal LLM with flexible output heads for arbitrary modality combinations | multimodal understanding and generation |
| MIO | arXiv 2024 | Paper | Token-level unified multimodal foundation model on discrete multimodal tokens | any-to-any multimodal token modeling |
| X-VILA | arXiv 2024 | Paper | Cross-modality alignment for LLM-based multimodal systems (image/video/audio) | multimodal understanding |
| AnyGPT | arXiv 2024 | Paper Code | Discrete token modeling for unified multimodal generation | any-to-any multimodal generation |
| OmniFlow | CVPR 2025 | Paper | Uses multi-modal rectified flows for any-to-any generation across modalities | any-to-any generation across modalities |
| Video-LaVIT | ICML 2024 | Paper Code | Decoupled visual-motion tokenization for video-language modeling | video understanding and generation |
| Unified-IO 2 | CVPR 2024 | Paper Code | Scales autoregressive multimodal models across modalities | any-to-any multimodal tasks (vision, language, audio, action) |
| NExT-GPT | arXiv 2023 | Paper Code | Any-to-any; encoder+LLM+diffusion decoders | visual understanding, visual generation, omni |
In this section: 5.1 Design Analyses & Scaling Laws · 5.2 Early Fusion NMMs · 5.3 Late Fusion NMMs · 5.4 Any-to-Any / Omni NMMs
The most restrictive category. NMMs are trained completely from scratch on multimodal data — no pretrained LLM or vision encoder is used as initialization. All weights are jointly learned end-to-end.
What recent arXiv work emphasizes: native multimodality is increasingly defined by end-to-end multimodal pretraining, tokenizer/representation co-design, and scaling strategies that explicitly address the asymmetry between vision and language.
Recent arXiv papers sharpen the definition of NMMs and identify the main bottlenecks in native multimodal pretraining.
| Paper | Venue | Links | Insights |
|---|---|---|---|
| Beyond Language Modeling: An Exploration of Multimodal Pretraining | arXiv 2026 | Paper | Highlights representation autoencoders, vision-language data synergy, and MoE for native pretraining |
| NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints | arXiv 2025 | Paper Code | End-to-end native MLLM scaling shows positive correlation between visual encoder and LLM size under data constraints; optimal meta-architecture balances cost and performance |
| Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training | arXiv 2025 | Paper | Reveals that LLMs develop latent visual priors during text-only pre-training: reasoning-centric data (code and math) builds transferable visual reasoning skills, while broad corpora foster perception, enabling models to "see" before ever processing an image |
| Scaling Laws for Native Multimodal Models | arXiv 2025 | Paper | Early-fusion NMMs match or outperform late-fusion at low compute; early-fusion needs fewer params; MoE with modality-agnostic routing boosts sparse NMM scaling |
| The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models | arXiv 2024 | Paper | Native models often funnel image-to-text communication through a single post-image token |
A single Transformer decoder processes tokenized text and image inputs from layer 1, with minimal modality-specific parameters (typically only a linear patch embedding for images) and no separate image-encoder component.
Recent scaling-law evidence suggests early-fusion NMMs are often stronger at lower parameter counts and simpler to deploy when paired with sufficiently strong visual representations.
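The "only a linear patch embedding" claim can be made concrete with a minimal sketch (hypothetical hidden size, patch size, and weight shapes; not any listed model's code): the sole vision-specific parameter is one projection matrix, and image patches join text embeddings in a single sequence from the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, C = 64, 16, 3                                    # hidden size, patch size, channels
W_patch = rng.normal(scale=0.02, size=(P * P * C, D))  # the ONLY vision-specific weights
E_text  = rng.normal(scale=0.02, size=(1000, D))       # text token embedding table

def embed_image(img):
    """Split an HxWxC image into PxP patches and linearly project each to D dims."""
    H, W, _ = img.shape
    patches = (img.reshape(H // P, P, W // P, P, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, P * P * C))
    return patches @ W_patch                           # (num_patches, D)

text_ids = np.array([1, 42, 7])
img = rng.random((32, 48, 3))                          # yields 2 x 3 = 6 patches
x = np.concatenate([embed_image(img), E_text[text_ids]])
# x: one mixed sequence of shape (6 + 3, D), fed to a single shared decoder
```

Everything downstream of `x` is shared: no per-modality encoder tower, which is exactly what distinguishes this family from the adapter- and Q-Former-based MLLMs earlier in the list.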
| Model | Venue | Links | Training Scale | Notes | Task |
|---|---|---|---|---|---|
| NEO | arXiv 2025 | Paper | — | Positions NEO as a cornerstone for scalable native VLM development, paired with reusable components that foster a cost-effective, extensible ecosystem | vision-language understanding |
| NEO-Unify | — | Blog | — | Positions NEO as a cornerstone for scalable native VLM development, paired with reusable components that foster a cost-effective, extensible ecosystem | vision-language understanding |
| Emu3.5 | Nature 2026 | Paper Code | Large-scale (trillion+ tokens) | Native world model; next-state prediction on interleaved video/text; Discrete Diffusion Adaptation for efficiency | interleaved generation, world modeling, any-to-image |
Models where separate unimodal components are jointly trained from scratch (not pretrained), with cross-modal interaction occurring at deeper layers. Distinct from MLLMs where vision encoders are pretrained.
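The late-fusion-from-scratch distinction can be pictured as a fusion depth: each modality gets its own randomly initialized lower stack, and the streams only mix in shared upper layers. The sketch below uses toy linear "layers" as stand-ins for transformer blocks, purely to show where fusion happens; none of the names come from the models listed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

def layer():
    """A toy 'layer': one random linear map (stand-in for a transformer block)."""
    return rng.normal(0, 0.1, (d, d))

# Late fusion from scratch: per-modality lower stacks + a shared upper stack.
# All weights are randomly initialized and jointly trained; nothing is pretrained.
vision_stack = [layer() for _ in range(4)]   # vision-only early layers
text_stack   = [layer() for _ in range(4)]   # text-only early layers
shared_stack = [layer() for _ in range(8)]   # cross-modal deeper layers

def forward(vision_tokens, text_tokens):
    for W in vision_stack:
        vision_tokens = np.tanh(vision_tokens @ W)
    for W in text_stack:
        text_tokens = np.tanh(text_tokens @ W)
    # Fusion point: the two streams only meet at depth 4.
    x = np.concatenate([vision_tokens, text_tokens], axis=0)
    for W in shared_stack:
        x = np.tanh(x @ W)
    return x

out = forward(rng.normal(size=(16, d)), rng.normal(size=(8, d)))
print(out.shape)  # (24, 64)
```

In an MLLM, `vision_stack` would be a frozen or pretrained encoder; here every stack starts from random initialization, which is the defining property of this category.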
| Model | Paper | Links | Training Scale | Notes | Task |
|---|---|---|---|---|---|
| Llama4 | arXiv 2026 | Paper Blog | Scout/Maverick: 17B active / ~109B–400B total; Behemoth: ~2T total | Native multimodal, MoE architecture with early fusion and vision encoder | vision-language understanding |
| LongCat-Next | arXiv 2026 | Paper | — | Discrete Native Any-resolution Visual Transformer | vision-language understanding |
| VL-JEPA | arXiv 2025 | Paper | 1.6B | Vision-Language Joint Native Model | vision-language understanding |
| InternVL3 | arXiv 2025 | Paper | — | A pre-trained InternViT encoder coupled with a cross-attention visual expert; its deep but late-style fusion preserves the native LLM's reasoning and linguistic proficiency | vision-language understanding |
| InternVL3.5 | arXiv 2025 | Paper | — | A pre-trained ViT encoder with a visual expert that uses cross-attention for deep but late-style fusion to the LLM, preserving its capabilities. | vision-language understanding |
| Qwen3.5 | - | Blog | — | Discrete Native Any-resolution Visual Transformer | vision-language understanding |
| Gemma4 | - | Blog | — | A pre-trained ViT encoder with a visual expert that uses cross-attention for deep but late-style fusion to the LLM, preserving its capabilities. | vision-language understanding |
| Emu3 | arXiv 2024 | Paper Code | 8B | Next-token prediction over VQ image tokens; native multimodal decoder-only; minimal modality-specific params | visual understanding, visual generation |
Recent native multimodal papers on arXiv increasingly blur the boundaries between omni understanding, any-to-any generation, world modeling, and RL-enhanced post-training.
| Model | Paper | Links | Notes | Task |
|---|---|---|---|---|
| Qwen3.5-Omni | — | Blog | Late fusion; Discrete Native Any-resolution Visual Transformer | — |
| ERNIE 5.0 Technical Report | arXiv 2026 | Paper | Late fusion; a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio | unified understanding & generation |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Claude 4.6 Family (Opus 4.6 / Sonnet 4.6) | Anthropic Blog | Anthropic Claude Updates | Released ~February 2026. Further multimodal refinements (vision + tool/computer use). Proprietary. | Multimodal + Agentic/Coding/Computer-Use |
| Gemini 3.x (Pro / Flash / 3.1 Pro) | Google DeepMind Blog | Gemini 3 Announcements | Released late 2025–early 2026 (Flash Dec 2025, Pro variants Feb 2026). State-of-the-art multimodal with massive context and Deep Think modes. Proprietary. | Frontier Multimodal (text/image/audio/video + reasoning) |
| GPT-5.4 (and Pro/Codex variants) | OpenAI Blog | OpenAI GPT-5 Updates | Released ~March 2026. Enhanced efficiency, multimodal, and professional/agentic features. Proprietary. | Omni-Modal + Professional/Agentic Workflows |
| Grok 4.x updates (e.g., Grok 4.1) | xAI Announcement | xAI Blog | Continued 2025–2026 iterations with improved vision and real-time capabilities. Proprietary via X platform. | Multimodal + Real-Time/Data-Integrated Reasoning |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Gemini 2.0 / 2.5 (Pro / Flash) | Google DeepMind Blog | Gemini 2.x Announcements | Released early–mid 2025 (Flash ~Jan/Feb, Pro variants through March–June). Native multimodal with improved agentic and long-context capabilities. Proprietary. | Advanced Native Multimodal + Agentic (text/image/audio/video) |
| Claude 4 Family (Opus 4 / Sonnet 4 / Haiku 4) | Anthropic Blog | Claude 4 Announcement | Released ~May 2025. Enhanced vision, reasoning, and early agentic features. Proprietary. | Vision + Advanced Reasoning/Agentic Workflows |
| Grok 3 / Grok 4 (including vision/speech) | xAI Announcement | xAI Blog | Major updates throughout 2025 (Grok 3 ~early 2025, Grok 4 ~mid-late 2025). Multimodal input (text/image/speech). Proprietary. | Multimodal Reasoning + Real-Time Integration |
| GPT-4.5 / GPT-5 (and variants like GPT-5 Codex) | OpenAI Blog | OpenAI Announcements | GPT-4.5 ~early 2025; full GPT-5 ~August 2025. Unified multimodal with strong reasoning and tool use. Proprietary. | Omni-Modal + Advanced Reasoning/Agentic |
| Mistral Large / Medium Multimodal variants | Mistral AI | Mistral Platform | Proprietary multimodal offerings (e.g., Medium 3.1 ~2025). Text + vision capabilities via API. | General Multimodal Tasks |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| Gemini 1.5 (Pro / Flash) | Google DeepMind Blog | Gemini 1.5 Announcement | Released February 2024. Massive context (>1M tokens), strong long-context multimodal (video, audio, images). Proprietary. | Long-Context Multimodal (video/audio/image/text) |
| Claude 3 Family (Opus / Sonnet / Haiku) | Anthropic Blog | Claude 3 Family | Released March 2024. Strong native vision for images, charts, diagrams, and documents. Proprietary API + Claude.ai. | Vision-Language + Reasoning |
| Grok-1.5V / Grok-2 Vision | xAI Announcement | Grok Vision Updates | Vision capabilities added ~April 2024 (Grok-1.5V), expanded in Grok-2 (August 2024). Image understanding with real-world and diagram reasoning. Proprietary via X/Grok API. | Vision-Language (real-world visuals, diagrams) |
| GPT-4o (Omni) | OpenAI Blog | GPT-4o Announcement | Released May 2024. Full real-time omni-modal: text + image + audio (voice) input/output. Proprietary. | Real-Time Omni-Modal (text/vision/audio) |
| Amazon Nova (Pro / Lite) | Amazon Announcement | AWS Bedrock docs | Released late 2024. Multimodal (text + image + video). Proprietary via Amazon Bedrock API. | Multimodal Understanding (text/image/video) |
| Model | Venue | Links | Notes | Task |
|---|---|---|---|---|
| GPT-4V (Vision) | OpenAI Announcement | GPT-4V System Card | Released September 2023. First widely available multimodal GPT-4 variant. Image + text input, text output. API/ChatGPT access only. | Vision-Language (image understanding, VQA, OCR, document analysis, captioning) |
| Gemini 1.0 (Ultra / Pro / Nano) | Google DeepMind Blog | Gemini Announcement | Released December 2023. Native multimodal from training (text + image + audio + video). Proprietary API + Gemini chatbot. | Native Multimodal Understanding (text/image/audio/video) |
In this section: 7.1 Related Awesome Lists · 7.2 Slides & Survey Papers · 7.3 Code Repositories & Tools
| Repository | Focus | Author |
|---|---|---|
| awesome-multimodal-ml | General multimodal ML | pliang279 |
| Awesome-Multimodal-Large-Language-Models | MLLMs + evaluation | BradyFU |
| Awesome-Multimodal-Research | Broad multimodal research | Eurus-Holmes |
| Awesome-Unified-Multimodal-Models | UMMs | ShowLab |
| Awesome-Multimodal-Large-Language-Models | MLLMs | yfzhang114 |
| awesome-foundation-and-multimodal-models | Foundation + multimodal | SkalskiP |
| Awesome-Multimodality | General multimodality | Yutong-Zhou-cv |
| Awesome-Unified-Multimodal | Unified models | Purshow |
| Awesome-Unified-Multimodal | Unified models | AIDC-AI |
| Type | Resource | Notes |
|---|---|---|
| Slides | Native LMM Slides | Ziwei Liu (NTU); concise framing for native multimodal models |
| Survey | A Survey on Multimodal Large Language Models | Broad survey of MLLM architectures, data, and evaluation |
| Report | The Dawn of LMMs: Preliminary Explorations with GPT-4V | Early capability analysis around GPT-4V |
| Survey | Multimodal Foundation Models: From Specialists to General-Purpose Assistants | Broader foundation-model view across multimodal systems |
| Tool | Description | Link |
|---|---|---|
| LMMs-Eval | Unified evaluation harness for multimodal models | Code |
| LAVIS | Library for Language-Vision Intelligence (Salesforce) | Code |
| OpenFlamingo | Open reproduction of DeepMind Flamingo | Code |
| xtuner | Efficient fine-tuning for multimodal LLMs | Code |
| LLaMA-Factory | Multimodal instruction tuning framework | Code |
| MMEngine | Foundational training engine for OpenMMLab projects | Code |
| DeepSpeed-VisualChat | Scalable multimodal chat training | Code |
In this section: Validation Rules · Entry Format
We welcome contributions! Please follow these guidelines:
For NMM submissions (strict):
- Confirm the model does NOT use any pretrained LLM backbone
- Confirm the model does NOT use any pretrained vision encoder (CLIP, ViT, etc.)
- All weights are jointly trained from scratch on multimodal data
- Classify as Early Fusion or Late Fusion (both must be "from scratch")
For UMM submissions:
- Confirm the model handles both image understanding AND image generation
- Note whether pretrained components are used (annotate accordingly)
For MLLM submissions:
- Note which vision encoder is used (must be a pretrained encoder)
- Note which LLM backbone is used (must be a pretrained LLM)
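The rules above are mechanical enough to sketch as a check. The entry fields below (`uses_pretrained_llm`, etc.) are hypothetical names invented for this illustration; the repository has no formal schema.

```python
# Hypothetical entry schema for illustration only; the repo defines no formal schema.
def validate_nmm(entry: dict) -> list[str]:
    """Return a list of strict-NMM rule violations (empty list = valid)."""
    problems = []
    if entry.get("uses_pretrained_llm"):
        problems.append("NMMs must not use a pretrained LLM backbone")
    if entry.get("uses_pretrained_vision_encoder"):
        problems.append("NMMs must not use a pretrained vision encoder (CLIP, ViT, etc.)")
    if not entry.get("trained_from_scratch"):
        problems.append("All weights must be jointly trained from scratch")
    if entry.get("fusion") not in {"early", "late"}:
        problems.append("Fusion must be classified as 'early' or 'late'")
    return problems

entry = {
    "name": "Emu3",
    "uses_pretrained_llm": False,
    "uses_pretrained_vision_encoder": False,
    "trained_from_scratch": True,
    "fusion": "early",
}
print(validate_nmm(entry))  # []
```

A UMM or MLLM entry would fail the first two checks by design; that is exactly the boundary this repository's taxonomy draws.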
| **Model Name** | [Paper](arxiv_link) [Code](github_link) [HF](huggingface_link) BADGES | Scale | Key contribution / notes |

Submit a PR with:
- The paper/model entry in the correct section
- A one-line justification for the chosen category
- Links to paper, code, and/or weights
If this list is useful in your research, please consider citing:
@misc{awesome-multimodal-modeling-2026,
title = {Awesome Multimodal Modeling: From Traditional to Native & Unified},
author = {OpenEnvision-Lab},
year = {2026},
url = {https://github.com/OpenEnvision-Lab/Awesome-Multimodal-Model-Traditional-Advanced},
note = {GitHub repository}
}

This list is released under the CC0 1.0 Universal license.
