From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
See first, then think, and treat capability as a new curriculum axis.
Project page: ucsc-vlaa.github.io/VLM-CapCurriculum
Visual perception, not reasoning length, is the dominant bottleneck for visual reasoning in VLMs. We show that 86.9% of incorrect answers from a state-of-the-art Qwen3-VL-8B trace back to perception errors that no amount of additional thinking can fix: longer thinking cannot rescue a wrong "look."
So we see first, then think. Concretely, we decouple post-training into three stages along a capability axis:
Stage 1: Visual Perception (D_perc, RLVR) → Stage 2: Textual Reasoning (D_text, RLVR) → Stage 3: Visual Reasoning (D_vis, RLVR)
Across four backbones (Qwen2.5-VL-7B, Qwen3-VL-8B, InternVL3-8B, InternVL3.5-8B) this staged recipe consistently beats the standard merged baseline that pools all data into one stage. On Qwen3-VL-8B it yields +1.46% accuracy with 20.8% shorter reasoning traces: better perception literally lets the model think less.
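The contrast between the staged recipe and the merged baseline can be sketched as two training schedules. This is only an illustration: the `Stage` class, the function names, and the checkpoint naming are hypothetical, and the sketch assumes each stage resumes from the previous stage's output. The actual launch scripts live under training/examples/.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str       # capability trained in this stage
    dataset: str    # dataset symbol from the stage overview above
    init_from: str  # checkpoint this stage starts from


def build_staged_schedule(base_model: str) -> List[Stage]:
    """Chain the three stages so each one starts from the previous output."""
    order = [
        ("visual_perception", "D_perc"),
        ("textual_reasoning", "D_text"),
        ("visual_reasoning", "D_vis"),
    ]
    schedule: List[Stage] = []
    init = base_model
    for name, dataset in order:
        schedule.append(Stage(name=name, dataset=dataset, init_from=init))
        # Hypothetical naming: the next stage resumes from this stage's checkpoint.
        init = f"{init}+{name}"
    return schedule


def build_merged_schedule(base_model: str) -> List[Stage]:
    """Merged baseline: all data pooled into a single stage."""
    return [Stage(name="merged", dataset="D_perc+D_text+D_vis", init_from=base_model)]


if __name__ == "__main__":
    for stage in build_staged_schedule("Qwen3-VL-8B"):
        print(f"{stage.name}: train on {stage.dataset}, init from {stage.init_from}")
```

Both schedules see the same data; the only difference in this sketch is whether it arrives in one pooled stage or in capability order.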
Conceptually, this staging is best understood as a new curriculum axis: the capability axis (perception → reasoning), orthogonal to the classic difficulty axis (easy → hard). They stack additively: combining both lifts Qwen3-VL-8B from 58.6 to 63.0 average, beating either axis alone by >2 points. See § A new curriculum dimension below.
A new curriculum dimension
Curriculum learning has historically meant ordering training samples by difficulty (easy → hard). Our staged recipe surfaces a second, orthogonal axis that has been largely overlooked in VLM post-training: what capability each batch trains. Sample difficulty and the capability under training are two independent knobs:
Rows follow the capability axis (ours; perception → reasoning) and columns follow the difficulty axis (easy → hard):
| | No difficulty curriculum | Difficulty curriculum |
|---|---|---|
| Capability curriculum | Capability only | Capability + Difficulty (additive sweet spot) |
| No capability curriculum | Neither | Difficulty only (prior curriculum work) |
Empirically the two axes stack additively on Qwen3-VL-8B:
| Curriculum | Avg over 7 benchmarks | Δ over Merged |
|---|---|---|
| None (Merged baseline) | 58.56 | - |
| Difficulty only | 60.36 | +1.80 |
| Capability only (ours) | 60.53 | +1.97 |
| Capability + Difficulty | 62.99 | +4.43 |
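As a quick arithmetic check on the table, the snippet below recomputes the gains from the reported averages. The dictionary keys are illustrative labels, not identifiers from the repository.

```python
# Averages over 7 benchmarks on Qwen3-VL-8B, copied from the table above.
avg = {
    "merged": 58.56,
    "difficulty_only": 60.36,
    "capability_only": 60.53,
    "capability_plus_difficulty": 62.99,
}


def gain(setting: str) -> float:
    """Improvement of a setting over the merged baseline, in points."""
    return round(avg[setting] - avg["merged"], 2)


sum_single = round(gain("difficulty_only") + gain("capability_only"), 2)
combined = gain("capability_plus_difficulty")
best_single = max(avg["difficulty_only"], avg["capability_only"])
margin = round(avg["capability_plus_difficulty"] - best_single, 2)

print(f"sum of single-axis gains: {sum_single:.2f}")  # 3.77
print(f"combined gain:            {combined:.2f}")    # 4.43
print(f"margin over best single:  {margin:.2f}")      # 2.46
```

On these numbers the combined gain is at least the sum of the two single-axis gains, and the combined curriculum leads the better single axis by 2.46 points.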
The two-axis view reframes post-training as choosing a trajectory through a 2D space rather than along a single difficulty line, and it opens a search space (other capability decompositions, joint schedules, etc.) that prior curriculum work has not explored. See Section 4.5 of the paper for details and training/examples/curriculum/ for the launch scripts.
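One possible trajectory through that 2D space is to order by capability stage first and then by difficulty within each stage. The sketch below is not the logic of the launch scripts. It assumes that a higher `pass_rate` means an easier sample, and the `id` and `stage` fields and the stage labels are hypothetical; only `pass_rate` is named in the released datasets.

```python
from typing import Dict, List

# Capability axis: stage order from the staged recipe (labels are illustrative).
STAGE_ORDER = ["perception", "text_reasoning", "visual_reasoning"]


def order_curriculum(samples: List[Dict]) -> List[Dict]:
    """Sort by capability stage first, then easy -> hard within each stage.

    Assumes a higher pass_rate marks an easier sample.
    """
    stage_rank = {name: i for i, name in enumerate(STAGE_ORDER)}
    return sorted(samples, key=lambda s: (stage_rank[s["stage"]], -s["pass_rate"]))


samples = [
    {"id": "v1", "stage": "visual_reasoning", "pass_rate": 0.25},
    {"id": "p1", "stage": "perception", "pass_rate": 0.50},
    {"id": "t1", "stage": "text_reasoning", "pass_rate": 0.75},
    {"id": "p2", "stage": "perception", "pass_rate": 1.00},
    {"id": "v2", "stage": "visual_reasoning", "pass_rate": 0.75},
]

ordered_ids = [s["id"] for s in order_curriculum(samples)]
print(ordered_ids)  # ['p2', 'p1', 't1', 'v2', 'v1']
```

In the staged recipe each stage is a separate training run, so the flat list here only illustrates the ordering, not how batches are actually fed to the trainer.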
| Setting | Visual Math AVG | Perception AVG | Overall AVG |
|---|---|---|---|
| Qwen3-VL-8B base | 45.17 | 79.21 | 62.19 |
| Qwen3-VL-8B + Merged training | 49.64 | 79.71 | 64.67 |
| Qwen3-VL-8B + Staged (ours) | 51.10 | 80.44 | 65.77 |
| OneThinker-8B (concurrent baseline) | 51.10 | 78.64 | 64.87 |
Visual math = MathVista / MathVision / MathVerse(VI) / WeMath. Perception = A-OKVQA / RealWorldQA / MMStar / POPE.
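The Overall AVG column appears to be the mean of the two group averages, which equals the mean over all 8 benchmarks because each group holds 4 of them. That relationship is inferred from the numbers rather than stated in the table, so the check below is a sketch under that assumption.

```python
# (Visual Math AVG, Perception AVG, reported Overall AVG) from the table above.
rows = {
    "Qwen3-VL-8B base": (45.17, 79.21, 62.19),
    "Qwen3-VL-8B + Merged training": (49.64, 79.71, 64.67),
    "Qwen3-VL-8B + Staged (ours)": (51.10, 80.44, 65.77),
    "OneThinker-8B (concurrent baseline)": (51.10, 78.64, 64.87),
}


def overall_from_groups(visual_math: float, perception: float) -> float:
    """Mean of the two group averages (each group has 4 benchmarks)."""
    return (visual_math + perception) / 2


# Largest disagreement between the recomputed and the reported Overall AVG.
max_gap = max(
    abs(overall_from_groups(vm, perc) - overall) for vm, perc, overall in rows.values()
)

staged = rows["Qwen3-VL-8B + Staged (ours)"]
merged = rows["Qwen3-VL-8B + Merged training"]
deltas = [round(s - m, 2) for s, m in zip(staged, merged)]

print(f"max gap vs reported overall: {max_gap:.3f}")
print(f"staged - merged (visual math, perception, overall): {deltas}")
```

The recomputed values agree with the reported column to within rounding. The staged-over-merged gain on Visual Math AVG comes out at 1.46 points, which matches the +1.46 figure quoted in the introduction.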
| Backbone | Δ Overall AVG (Staged − Merged) |
|---|---|
| Qwen3-VL-8B | +3.37 |
| InternVL3-8B | +3.77 |
| Qwen2.5-VL-7B | +1.62 |
| InternVL3.5-8B | +0.95 |
(See docs/results.md for the full extended-benchmark table from the paper appendix.)
(For the additive Capability × Difficulty result, see § A new curriculum dimension above.)
VLM-CapCurriculum/
├── data_pipeline/                          # Stage 1 perception data synthesis (DOCCI/Pixmo → MCQ → filter)
├── training/                               # GRPO/RLVR training scripts on top of EasyR1
│   ├── examples/                           # one .sh per stage, per backbone
│   ├── reward_functions/
│   └── format_prompts/
├── evaluation/                             # VLMEvalKit configs + Claude-Haiku-4.5 judge setup
│   └── perception_error_analysis/          # Sec 4.4 analysis pipeline
├── scripts/
│   └── quickstart_qwen3vl_8b_staged.sh     # one-shot stage1→2→3
├── docs/
│   ├── images/                             # paper figures (teaser, pipeline, case study)
│   └── results.md                          # full benchmark tables
└── requirements.txt
The repo is paper-specific: it does not vendor EasyR1 or VLMEvalKit. Install them separately and point our scripts at them. See Setup.
- 🤗 Collection (single hub for everything below): UCSC-VLAA / VLM-CapCurriculum
- 🤗 Models:
  - UCSC-VLAA/VLM-CapCurriculum-Qwen3-VL-8B-Staged (primary, ICML headline numbers)
  - UCSC-VLAA/VLM-CapCurriculum-Qwen2.5-VL-7B-Staged
  - UCSC-VLAA/VLM-CapCurriculum-InternVL3-8B-Staged (largest staged-vs-merged delta)
  - UCSC-VLAA/VLM-CapCurriculum-InternVL3.5-8B-Staged
- 🤗 Datasets (each ships with pass_rate for difficulty curricula):
  - UCSC-VLAA/VLM-CapCurriculum-Perception-Data: Stage 1 (synthesized + filtered DOCCI MCQs)
  - UCSC-VLAA/VLM-CapCurriculum-TextReasoning-Data: Stage 2 (ORZ-Math-13k)
  - UCSC-VLAA/VLM-CapCurriculum-VisualReasoning-Data: Stage 3 (CLEVR-Math + GeoQA170K + Math PUMA + ArxivQA)
- Project page: ucsc-vlaa.github.io/VLM-CapCurriculum, with interactive headline tables, case study figures, and downloadable resources.
Setup
git clone https://github.com/UCSC-VLAA/VLM-CapCurriculum.git
cd VLM-CapCurriculum
# 1) Paper-side deps (data synthesis + Claude judge)
pip install -r requirements.txt
# 2) Training framework: install separately
git clone https://github.com/hiyouga/EasyR1.git ../EasyR1
cd ../EasyR1 && pip install -e . && cd -
# 3) Evaluation framework: install separately
git clone https://github.com/open-compass/VLMEvalKit.git ../VLMEvalKit
cd ../VLMEvalKit && pip install -e . && cd -
Set environment paths used by our scripts:
export EASYR1_HOME=$(realpath ../EasyR1)
export VLMEVALKIT_HOME=$(realpath ../VLMEvalKit)
export VLMCC_HOME=$(pwd)
The end-to-end recipe takes ~24 GPU-hours on 8× H200.
# (a) Build perception data D_perc (or download from HF; see data_pipeline/README.md)
bash data_pipeline/examples/run_full_pipeline.sh
# (b) Run Stage 1 → Stage 2 → Stage 3 training in one shot
bash scripts/quickstart_qwen3vl_8b_staged.sh
# (c) Evaluate on the 8 benchmarks reported in the paper
bash evaluation/run_eval.sh <CHECKPOINT_DIR>
For more granular runs (per-stage, ablations, other backbones) see training/examples/.
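Since the scripts rely on the three environment variables exported during setup, a small preflight check before launching a multi-hour run may save time. The helper below is a hypothetical sketch and is not part of the repository; it only confirms that each variable is set and points at an existing directory.

```python
import os
from pathlib import Path
from typing import Dict, Mapping

# Variable names taken from the setup commands above.
REQUIRED_VARS = ("EASYR1_HOME", "VLMEVALKIT_HOME", "VLMCC_HOME")


def check_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return {variable: problem} for each variable that is unset or not a directory."""
    problems: Dict[str, str] = {}
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_dir():
            problems[name] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    issues = check_env(os.environ)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    print("environment OK" if not issues else "fix the variables above before launching")
```

Passing the environment in as a mapping keeps the function easy to test without touching the real shell environment.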
| Topic | Pointer |
|---|---|
| Synthesizing & filtering Stage 1 perception data | data_pipeline/README.md |
| Per-stage training scripts | training/examples/ |
| Reward functions & prompt templates | training/reward_functions/, training/format_prompts/ |
| VLMEvalKit configs & judge setup | evaluation/README.md |
| Sec 4.4 perception-error analysis | evaluation/perception_error_analysis/ |
| Ablations (stage order, encoder freezing, SFT vs RL) | training/examples/ablations/ |
| Difficulty + Capability curriculum (Sec 4.5) | training/examples/curriculum/ |
@inproceedings{vlmcapcurriculum2026,
title = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models},
author = {Wu, Juncheng and Chen, Hardy and Tu, Haoqin and Tang, Xianfeng and Shi, Freda and Liu, Hui and Lu, Hanqing and Xie, Cihang and Zhou, Yuyin},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026},
eprint = {2605.20177},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.20177}
}
This work builds on the open-source releases of EasyR1, VLMEvalKit, Qwen2.5-VL / Qwen3-VL, InternVL3, DOCCI, and PixmoCap. We thank the maintainers of these projects.