VLM-CapCurriculum

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

See first, then think, and treat capability as a new curriculum axis.

🌐 Project page: ucsc-vlaa.github.io/VLM-CapCurriculum


TL;DR

Visual perception, not reasoning length, is the dominant bottleneck for visual reasoning in VLMs. We show that 86.9% of incorrect answers from a state-of-the-art Qwen3-VL-8B trace back to perception errors that no amount of additional thinking can fix: longer thinking cannot rescue a wrong "look."

So we see first, then think. Concretely, we decouple post-training into three stages along a capability axis:

   Stage 1                Stage 2               Stage 3
 Visual Perception  →   Textual Reasoning  →  Visual Reasoning
   (D_perc, RLVR)        (D_text, RLVR)        (D_vis, RLVR)
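The stage order in the diagram can be written down as a small data structure. This is only an illustrative sketch: the `Stage` class and the `staged_plan` / `merged_plan` helpers are hypothetical names, not part of this repository, and the assumption that each stage starts from the previous stage's checkpoint is inferred from the stage1→2→3 ordering rather than taken from the training scripts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str                 # capability trained in this stage
    dataset: str              # dataset label from the diagram
    init_from: Optional[str]  # stage whose checkpoint we start from (assumed)


def staged_plan() -> List[Stage]:
    """Three RLVR stages along the capability axis, perception first."""
    order = [
        ("Visual Perception", "D_perc"),
        ("Textual Reasoning", "D_text"),
        ("Visual Reasoning", "D_vis"),
    ]
    plan: List[Stage] = []
    previous: Optional[str] = None
    for name, dataset in order:
        plan.append(Stage(name, dataset, previous))
        previous = name
    return plan


def merged_plan() -> List[Stage]:
    """The baseline: one stage that pools all three datasets."""
    return [Stage("Merged", "D_perc + D_text + D_vis", None)]


for stage in staged_plan():
    start = stage.init_from or "base model"
    print(f"{stage.name:<18} data={stage.dataset:<7} init={start}")
```

The only difference between the two plans is how the same data is scheduled, which is the comparison the staged-versus-merged results are about.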

Across four backbones (Qwen2.5-VL-7B, Qwen3-VL-8B, InternVL3-8B, InternVL3.5-8B) this staged recipe consistently beats the standard merged baseline that pools all data into one stage. On Qwen3-VL-8B it yields +1.46% accuracy with 20.8% shorter reasoning traces: better perception lets the model think less.

🧭 Conceptually, this staging is best understood as a new curriculum axis: the capability axis (perception → reasoning), orthogonal to the classic difficulty axis (easy → hard). They stack additively: combining both lifts Qwen3-VL-8B from 58.6 → 63.0 average, beating either axis alone by >2 points. See § A new curriculum dimension below.


A new curriculum dimension

Curriculum learning has historically meant ordering training samples by difficulty (easy → hard). Our staged recipe surfaces a second, orthogonal axis that has been largely overlooked in VLM post-training: what capability each batch trains. Sample difficulty and the capability under training are two independent knobs:

                Capability axis (ours)
                  perception → reasoning
                       ▲
                       │
   ┌───────────────────┼───────────────────┐
   │   Capability ✓    │  Capability ✓     │
   │   Difficulty ✗    │  Difficulty ✓     │   ← additive sweet spot
   ├───────────────────┼───────────────────┤
   │   Capability ✗    │  Capability ✗     │
   │   Difficulty ✗    │  Difficulty ✓     │   ← prior curriculum work
   └───────────────────┴───────────────────► Difficulty axis
                  easy → hard

Empirically the two axes stack additively on Qwen3-VL-8B:

| Curriculum | Avg over 7 benchmarks | Δ over Merged |
|---|---|---|
| None (Merged baseline) | 58.56 | - |
| Difficulty only | 60.36 | +1.80 |
| Capability only (ours) | 60.53 | +1.97 |
| Capability + Difficulty | 62.99 | +4.43 |
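The deltas in this table can be re-derived from the averages alone. The short check below uses only the four numbers reported above (the dictionary keys are labels chosen for this sketch) and compares the combined gain with the sum of the two single-axis gains.

```python
# Average scores (7 benchmarks, Qwen3-VL-8B) copied from the table above.
AVG = {
    "merged": 58.56,
    "difficulty_only": 60.36,
    "capability_only": 60.53,
    "capability_plus_difficulty": 62.99,
}


def gain(setting: str) -> float:
    """Improvement over the merged baseline, rounded to two decimals."""
    return round(AVG[setting] - AVG["merged"], 2)


sum_of_single_axis_gains = round(
    gain("difficulty_only") + gain("capability_only"), 2
)
combined_gain = gain("capability_plus_difficulty")

print("difficulty only       :", gain("difficulty_only"))
print("capability only       :", gain("capability_only"))
print("sum of the two        :", sum_of_single_axis_gains)
print("capability+difficulty :", combined_gain)
```

On these numbers the combined gain (+4.43) is at least as large as the sum of the two separate gains (+1.80 + 1.97 = +3.77), which is what "stacking additively" requires.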

This reframes post-training as choosing a trajectory through a 2D space rather than along a single difficulty line, and opens a search space (other capability decompositions, joint schedules, etc.) that prior curriculum work has not touched. See Section 4.5 of the paper for details and training/examples/curriculum/ for the launch scripts.


Headline numbers

Qwen3-VL-8B base vs. our staged recipe

| Setting | Visual Math AVG | Perception AVG | Overall AVG |
|---|---|---|---|
| Qwen3-VL-8B base | 45.17 | 79.21 | 62.19 |
| Qwen3-VL-8B + Merged training | 49.64 | 79.71 | 64.67 |
| Qwen3-VL-8B + Staged (ours) | 51.10 | 80.44 | 65.77 |
| OneThinker-8B (concurrent baseline) | 51.10 | 78.64 | 64.87 |

Visual math = MathVista / MathVision / MathVerse(VI) / WeMath. Perception = A-OKVQA / RealWorldQA / MMStar / POPE.
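In this table, the Overall AVG column appears to be the unweighted mean of the Visual Math AVG and Perception AVG columns. That is an observation from the reported numbers, not something stated in this README, so the check below is a sanity test under that assumption.

```python
# (visual math AVG, perception AVG, overall AVG) copied from the table above.
ROWS = {
    "Qwen3-VL-8B base": (45.17, 79.21, 62.19),
    "Qwen3-VL-8B + Merged training": (49.64, 79.71, 64.67),
    "Qwen3-VL-8B + Staged (ours)": (51.10, 80.44, 65.77),
    "OneThinker-8B (concurrent baseline)": (51.10, 78.64, 64.87),
}


def mean_of_groups(visual_math: float, perception: float) -> float:
    """Unweighted mean of the two group averages (assumed definition)."""
    return (visual_math + perception) / 2


def max_gap() -> float:
    """Largest absolute difference between the recomputed and reported overall."""
    return max(
        abs(mean_of_groups(vm, perc) - overall)
        for vm, perc, overall in ROWS.values()
    )


for name, (vm, perc, overall) in ROWS.items():
    print(f"{name:<38} recomputed={mean_of_groups(vm, perc):.3f} reported={overall}")
```

Every row agrees with the reported value to within two-decimal rounding.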

Staged > Merged across four backbones

| Backbone | Δ Overall AVG (Staged − Merged) |
|---|---|
| Qwen3-VL-8B | +3.37 |
| InternVL3-8B | +3.77 |
| Qwen2.5-VL-7B | +1.62 |
| InternVL3.5-8B | +0.95 |

(see docs/results.md for the full extended-benchmark table from the paper appendix.)

(For the additive Capability × Difficulty result, see § A new curriculum dimension above.)


Repository layout

VLM-CapCurriculum/
├── data_pipeline/      # Stage 1 perception data synthesis (DOCCI/Pixmo → MCQ → filter)
├── training/           # GRPO/RLVR training scripts on top of EasyR1
│   ├── examples/       # one .sh per stage, per backbone
│   ├── reward_functions/
│   └── format_prompts/
├── evaluation/         # VLMEvalKit configs + Claude-Haiku-4.5 judge setup
│   └── perception_error_analysis/   # Sec 4.4 analysis pipeline
├── scripts/
│   └── quickstart_qwen3vl_8b_staged.sh    # one-shot stage1→2→3
├── docs/
│   ├── images/         # paper figures (teaser, pipeline, case study)
│   └── results.md      # full benchmark tables
└── requirements.txt

The repo is paper-specific: it does not vendor EasyR1 or VLMEvalKit. Install them separately and point our scripts at them. See Setup.


Setup

git clone https://github.com/UCSC-VLAA/VLM-CapCurriculum.git
cd VLM-CapCurriculum

# 1) Paper-side deps (data synthesis + Claude judge)
pip install -r requirements.txt

# 2) Training framework: install separately
git clone https://github.com/hiyouga/EasyR1.git ../EasyR1
cd ../EasyR1 && pip install -e . && cd -

# 3) Evaluation framework: install separately
git clone https://github.com/open-compass/VLMEvalKit.git ../VLMEvalKit
cd ../VLMEvalKit && pip install -e . && cd -

Set environment paths used by our scripts:

export EASYR1_HOME=$(realpath ../EasyR1)
export VLMEVALKIT_HOME=$(realpath ../VLMEvalKit)
export VLMCC_HOME=$(pwd)
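Before launching a long run, it can help to confirm that the three variables are set and point at real directories. The helper below is a hypothetical pre-flight check, not a script shipped with this repo; only the variable names come from the exports above.

```python
import os
from pathlib import Path
from typing import Dict, Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = ("EASYR1_HOME", "VLMEVALKIT_HOME", "VLMCC_HOME")


def check_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return {variable: problem} for every variable that is unset or not a directory."""
    problems: Dict[str, str] = {}
    for var in REQUIRED_VARS:
        value = env.get(var)
        if not value:
            problems[var] = "not set"
        elif not Path(value).is_dir():
            problems[var] = f"not a directory: {value}"
    return problems


# Report any problems found in the current shell environment.
for var, issue in check_env().items():
    print(f"{var}: {issue}")
```

An empty result means all three paths resolve; anything printed should be fixed before starting training or evaluation.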

Quickstart: Reproduce Qwen3-VL-8B (Staged)

The end-to-end recipe takes ~24 GPU-hours on 8× H200.

# (a) Build perception data D_perc (or download from HF; see data_pipeline/README.md)
bash data_pipeline/examples/run_full_pipeline.sh

# (b) Run Stage 1 → Stage 2 → Stage 3 training in one shot
bash scripts/quickstart_qwen3vl_8b_staged.sh

# (c) Evaluate on the 8 benchmarks reported in the paper
bash evaluation/run_eval.sh <CHECKPOINT_DIR>

For more granular runs (per-stage, ablations, other backbones) see training/examples/.
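If you prefer to drive steps (a) to (c) from Python, for example to stop at the first failing step, a thin wrapper is enough. This is a sketch under assumptions: `quickstart_commands` and `run_in_order` are hypothetical helpers, the script paths are the ones listed above, and the checkpoint directory is left for the caller to supply because this README does not say where the training script writes it.

```python
import subprocess
from typing import Callable, List, Sequence


def quickstart_commands(checkpoint_dir: str) -> List[List[str]]:
    """The three quickstart steps, in order, as argv lists."""
    return [
        ["bash", "data_pipeline/examples/run_full_pipeline.sh"],
        ["bash", "scripts/quickstart_qwen3vl_8b_staged.sh"],
        ["bash", "evaluation/run_eval.sh", checkpoint_dir],
    ]


def run_in_order(
    commands: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
) -> int:
    """Run each command in turn; return how many completed.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit code, so later steps never run after a failure.
    """
    completed = 0
    for argv in commands:
        runner(list(argv), check=True)
        completed += 1
    return completed
```

Passing a custom `runner` lets you log or dry-run the commands without executing anything, which is useful for checking the order before committing GPU time.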


Detailed reproduction guides

| Topic | Pointer |
|---|---|
| Synthesizing & filtering Stage 1 perception data | data_pipeline/README.md |
| Per-stage training scripts | training/examples/ |
| Reward functions & prompt templates | training/reward_functions/, training/format_prompts/ |
| VLMEvalKit configs & judge setup | evaluation/README.md |
| Sec 4.4 perception-error analysis | evaluation/perception_error_analysis/ |
| Ablations (stage order, encoder freezing, SFT vs RL) | training/examples/ablations/ |
| Difficulty + Capability curriculum (Sec 4.5) | training/examples/curriculum/ |

Citation

@inproceedings{vlmcapcurriculum2026,
  title  = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models},
  author = {Wu, Juncheng and Chen, Hardy and Tu, Haoqin and Tang, Xianfeng and Shi, Freda and Liu, Hui and Lu, Hanqing and Xie, Cihang and Zhou, Yuyin},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year   = {2026},
  eprint = {2605.20177},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url    = {https://arxiv.org/abs/2605.20177}
}

Acknowledgements

This work builds on top of the open-source releases of EasyR1, VLMEvalKit, Qwen2.5-VL / Qwen3-VL, InternVL3, DOCCI, and PixmoCap. We thank the maintainers of these projects.
