Bingyi Kang1, Hu Xu3, Saining Xie1
- 🔥 Cambrian-P is out: the paper, model checkpoints, the pose-annotated Cambrian-P-Data dataset, and the full training/eval code are all released.
Cambrian-P is a pose-grounded video MLLM. Built on top of the Cambrian-S architecture (SigLIP2-SO400m vision encoder + Qwen2.5 LLM + MLP projector), it introduces one learnable camera token per frame (via two learnable query embeddings — one for the first frame, one for the rest) and a lightweight pose head adapted from VGGT. A single forward pass answers spatial video questions and regresses per-frame camera translation, rotation, and field-of-view — enabling both improved spatial video QA and streaming camera pose estimation.
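The camera-token scheme described above can be sketched in plain Python. This is a toy illustration, not the released implementation: the names `FIRST_FRAME_QUERY`, `OTHER_FRAME_QUERY`, `camera_tokens`, and the hidden size are hypothetical, and where the tokens sit in the LLM input sequence is not shown here.

```python
import random

HIDDEN_DIM = 8  # toy size (assumption); the real model uses the LLM hidden width


def make_query(dim, seed):
    """Stand-in for a learnable embedding vector with a small random init."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]


# Two learnable queries: one for the first frame, one shared by all later frames.
FIRST_FRAME_QUERY = make_query(HIDDEN_DIM, seed=0)
OTHER_FRAME_QUERY = make_query(HIDDEN_DIM, seed=1)


def camera_tokens(num_frames):
    """Return one camera token per frame, drawn from the two shared queries."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return [
        FIRST_FRAME_QUERY if i == 0 else OTHER_FRAME_QUERY
        for i in range(num_frames)
    ]


tokens = camera_tokens(4)
print(len(tokens))  # one token per frame
```

Sharing two embeddings across all frames keeps the number of added parameters independent of video length, while still letting the first frame be marked separately from the rest.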
Spatial video understanding on VSI-Bench. Cambrian-P-7B (Qwen2.5-7B + SigLIP2-SO400m) achieves 73.7 average accuracy, the best among 7B-scale spatial-specialist models, with a +4.5% gain over Cambrian-S-7B (its no-pose counterpart) and particularly strong results on Relative Direction, Object Count, Route Plan, and Appearance Order.
Streaming camera pose estimation on ScanNet, TUM-dynamic, and Sintel, following the MonST3R protocol. For ScanNet and TUM-dynamic we sample the first 90 frames at temporal stride 3; for Sintel we exclude static / near-straight sequences. All metrics use Sim(3) alignment.
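As a rough illustration of the ScanNet / TUM-dynamic frame sampling, the helper below picks every third frame and keeps at most 90 of them. The order of operations (stride first, then cap at 90) is an assumption about how "the first 90 frames at temporal stride 3" is applied, and `sample_frame_indices` is a hypothetical name, not a function from the repository.

```python
def sample_frame_indices(num_frames, max_frames=90, stride=3):
    """Return indices 0, stride, 2*stride, ... capped at max_frames entries.

    Assumption: the stride is applied before the 90-frame cap.
    """
    if stride < 1 or max_frames < 1:
        raise ValueError("stride and max_frames must be positive")
    return list(range(0, num_frames, stride))[:max_frames]


indices = sample_frame_indices(1000)
print(len(indices), indices[0], indices[-1])  # 90 0 267
```

Sequences shorter than 268 frames simply yield fewer than 90 indices under this reading.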
Cambrian-P achieves the lowest ATE on ScanNet among all streaming models, competitive with offline pipelines — without a DINOv2 encoder or a bidirectional transformer.
Metric definitions follow evo: ATE = absolute trajectory error RMSE (meters), RPE-t / RPE-r = per-frame relative pose error RMSE (meters / degrees).
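For intuition, the translation ATE can be computed as below once the estimated trajectory has been aligned to the ground truth. This is a minimal sketch, not the evaluation code: it assumes the Sim(3) alignment has already been applied (for example by evo), and `ate_rmse` is a hypothetical helper name.

```python
import math


def ate_rmse(est_xyz, gt_xyz):
    """RMSE of per-frame position error between two aligned trajectories (meters)."""
    if len(est_xyz) != len(gt_xyz) or not est_xyz:
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(est_xyz, gt_xyz)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy trajectories: frame 1 is off by 0.3 m and frame 2 by 0.4 m.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.0, 0.4)]
print(round(ate_rmse(estimated, ground_truth), 4))  # 0.2887
```

RPE-t and RPE-r are computed on relative poses between consecutive frames and need the full rotation as well as the position, so they are not covered by this positions-only sketch.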
We release five Cambrian-P-7B variants. All share the same backbone (Qwen2.5-7B + SigLIP2-SO400m + per-frame camera tokens + VGGT-style pose head) and are fine-tuned from Cambrian-S-7B stage 3.
| Model | Training Data | Hugging Face |
|---|---|---|
| Cambrian-P-7B | VSI | nyu-visionx/Cambrian-P-7B |
| Cambrian-P-7B-32f | VSI | nyu-visionx/Cambrian-P-7B-32f |
| Cambrian-P-7B-Mix-MA | VSI + MapAnything | nyu-visionx/Cambrian-P-7B-Mix-MA |
| Cambrian-P-7B-Mix-3R | VSI + partial VLM-3R | nyu-visionx/Cambrian-P-7B-Mix-3R |
| Cambrian-P-7B-Mix-CamS | VSI + Cambrian-S | nyu-visionx/Cambrian-P-7B-Mix-CamS |
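One way to fetch a checkpoint from the table is through `huggingface_hub.snapshot_download`. The sketch below assumes the listed names are Hub repository IDs and that `huggingface_hub` is installed; the helper names `repo_id_for` and `download_checkpoint` are hypothetical and not part of the repository.

```python
# Variant name -> Hugging Face repository, copied from the table above.
CHECKPOINTS = {
    "Cambrian-P-7B": "nyu-visionx/Cambrian-P-7B",
    "Cambrian-P-7B-32f": "nyu-visionx/Cambrian-P-7B-32f",
    "Cambrian-P-7B-Mix-MA": "nyu-visionx/Cambrian-P-7B-Mix-MA",
    "Cambrian-P-7B-Mix-3R": "nyu-visionx/Cambrian-P-7B-Mix-3R",
    "Cambrian-P-7B-Mix-CamS": "nyu-visionx/Cambrian-P-7B-Mix-CamS",
}


def repo_id_for(variant):
    """Look up the repository ID for a variant, with a readable error."""
    try:
        return CHECKPOINTS[variant]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}")


def download_checkpoint(variant, local_dir):
    """Download all files of one variant into local_dir and return the path."""
    # Imported lazily so the lookup above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(variant), local_dir=local_dir)
```

For example, `download_checkpoint("Cambrian-P-7B", "ckpts/Cambrian-P-7B")` would place the default variant under `ckpts/`.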
git clone https://github.com/cambrian-mllm/cambrian-p.git
conda create -n cambrianp python=3.11 cmake=3.14.0
conda activate cambrianp
cd cambrian-p/vggt && pip install -e .
pip install hydra-core tensorboard iopath wcmatch fvcore
cd .. && pip install --upgrade pip && pip install -e ".[train]"
# PyTorch 2.4.1 + CUDA 12.1 + flash-attn 2.8.3
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 \
--index-url https://download.pytorch.org/whl/cu121
pip install --no-deps https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install accelerate==0.29.3 easydict matplotlib roma evo imageio OpenEXR
See doc/env_install.md for the detailed version.
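After installation, a quick check that the pinned packages resolved to the expected versions can save a failed training launch. The sketch below uses only the standard library; the pins are copied from the commands above, while `version_mismatches` is a hypothetical helper and the distribution name `flash-attn` is an assumption about how the wheel registers itself.

```python
from importlib import metadata

# Versions pinned by the install commands above (local tags like +cu121 omitted).
PINS = {
    "torch": "2.4.1",
    "torchvision": "0.19.1",
    "torchaudio": "2.4.1",
    "flash-attn": "2.8.3",
    "accelerate": "0.29.3",
}


def version_mismatches(pins=PINS, get_version=metadata.version):
    """Return {package: (expected, found)} for missing or mismatched packages."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Compare only the public version, ignoring build tags such as "+cu121".
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in version_mismatches().items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is present at the expected version.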
Cambrian-P is fine-tuned from Cambrian-S-7B stage 3 using three required pieces of data:
| Piece | Source | Size | Used for |
|---|---|---|---|
| 1. VSI-590K (VQA + scene geometry) | nyu-visionx/vsi-590k | ~236 GB | Spatial QA + scene pose supervision |
| 2. Cambrian-S 3M videos | nyu-visionx/Cambrian-S-3M | per-source | Video backbone for the pose-annotated half of training |
| 3. Cambrian-P pseudo pose annotations | nyu-visionx/Cambrian-P-Data | ~850 MiB | Dense pose supervision on the partial Cambrian-S-3M |
See doc/data_preparation.md for the full recipe.
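Before training, it can help to confirm that the three data locations are in place. The sketch below checks the environment variables used by the sample training run (`DATA_DIR`, `VIPE_CAMBRIANS_DATA_ROOT`, `VIPE_CAMBRIANS_RESULTS_ROOT`). The pairing of each variable with a data piece is inferred from the example paths, and `missing_data_dirs` is a hypothetical helper, not part of the repository.

```python
import os

# Environment variable -> data piece it is assumed to point at.
REQUIRED_DIRS = {
    "DATA_DIR": "VSI-590K",
    "VIPE_CAMBRIANS_DATA_ROOT": "Cambrian-S 3M videos",
    "VIPE_CAMBRIANS_RESULTS_ROOT": "Cambrian-P pseudo pose annotations",
}


def missing_data_dirs(environ=None):
    """Return a list of human-readable problems; an empty list means all is set."""
    environ = os.environ if environ is None else environ
    problems = []
    for var, label in REQUIRED_DIRS.items():
        path = environ.get(var)
        if not path:
            problems.append(f"{var} is not set ({label})")
        elif not os.path.isdir(path):
            problems.append(f"{var}={path} is not a directory ({label})")
    return problems


if __name__ == "__main__":
    for problem in missing_data_dirs():
        print(problem)
```

This only verifies that the directories exist; it does not validate their contents against the recipe in doc/data_preparation.md.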
Training scripts are in cambrianp/scripts/.
Sample training run (complete the Data Preparation Quickstart first):
conda activate cambrianp
export WANDB_API_KEY=<your-key>
export DATA_DIR=/path/to/vsi-590k
export VIPE_CAMBRIANS_DATA_ROOT=/path/to/cambrian_s_3m
export VIPE_CAMBRIANS_RESULTS_ROOT=/path/to/cambrian_p_pose
export OUTPUT_DIR=$PWD/ckpts
bash cambrianp/scripts/Cambrian-P-7B.sh
Please refer to doc/evaluation.md.
If you find our work useful for your research, please consider citing:
@article{yang2026cambrianp,
title = {Cambrian-P: Pose-Grounded Video Understanding},
author = {Yang, Jihan and Zhao, Zifan and Pan, Xichen and Yang, Shusheng and Zhang, Junyi and Kang, Bingyi and Xu, Hu and Xie, Saining},
journal = {arXiv preprint arXiv:2605.22819},
year = {2026},
}
See LICENSE.
- Cambrian-1 — Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
- Thinking in Space — Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces (introduces VSI-Bench)
- Cambrian-S — spatial supersensing in video (shared training-data recipe)
- VGGT — reconstruction backbone
- DUSt3R / MonST3R — evaluation protocol
- CUT3R, StreamVGGT, MapAnything — evaluation baselines


