Cambrian-P: Pose-Grounded Video Understanding

Jihan Yang¹*, Zifan Zhao¹*, Xichen Pan¹, Shusheng Yang¹, Junyi Zhang²,
Bingyi Kang¹, Hu Xu³, Saining Xie¹

¹New York University ²UC Berkeley ³Meta FAIR

*Equal contribution. JY led the project; JY and ZZ contributed equally.

Release

🔥 Cambrian-P is out: the paper, model checkpoints, the pose annotations (Cambrian-P-Data), and the full training/eval code are all released.

Cambrian-P Weights

Cambrian-P is a pose-grounded video MLLM. Built on top of the Cambrian-S architecture (SigLIP2-SO400m vision encoder + Qwen2.5 LLM + MLP projector), it introduces one learnable camera token per frame (via two learnable query embeddings — one for the first frame, one for the rest) and a lightweight pose head adapted from VGGT. A single forward pass answers spatial video questions and regresses per-frame camera translation, rotation, and field-of-view — enabling both improved spatial video QA and streaming camera pose estimation.
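
To make the camera-token scheme concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: embeddings are modeled as lists of floats, the random initialization stands in for parameters that would be learned, and `init_query` and `camera_tokens` are hypothetical names introduced only for illustration.

```python
import random


def init_query(dim, seed):
    """Stand-in for one learnable query embedding: a list of `dim` floats."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]


def camera_tokens(num_frames, first_query, rest_query):
    """Expand two query embeddings into one camera token per frame.

    Frame 0 receives a copy of `first_query`; every later frame
    receives a copy of `rest_query`.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")
    tokens = [list(first_query)]
    tokens += [list(rest_query) for _ in range(num_frames - 1)]
    return tokens


# Hypothetical sizes, chosen only to keep the example small.
q_first = init_query(dim=8, seed=0)
q_rest = init_query(dim=8, seed=1)
tokens = camera_tokens(num_frames=4, first_query=q_first, rest_query=q_rest)
print(len(tokens), len(tokens[0]))
```

In the actual model these would be learned tensors processed together with each frame's visual tokens; where they sit in the token sequence is not described in this README, so the sketch leaves that out.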

VSI-Bench Performance

Spatial video understanding on VSI-Bench. Cambrian-P-7B (Qwen2.5-7B + SigLIP2-SO400m) achieves 73.7 average accuracy, the best among 7B-scale spatial-specialist models, with a +4.5% gain over Cambrian-S-7B (its no-pose counterpart) and particularly strong results on Relative Direction, Object Count, Route Plan, and Appearance Order.

Pose Estimation Performance

Streaming camera pose estimation on ScanNet, TUM-dynamic, and Sintel, following the MonST3R protocol. For ScanNet and TUM-dynamic we sample the first 90 frames at temporal stride 3; for Sintel we exclude static / near-straight sequences. All metrics use Sim(3) alignment.

Cambrian-P achieves the lowest ATE on ScanNet among all streaming models, competitive with offline pipelines — without a DINOv2 encoder or a bidirectional transformer.

Metric definitions follow evo: ATE = absolute trajectory error RMSE (meters), RPE-t / RPE-r = per-frame relative pose error RMSE (meters / degrees).
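
As a rough illustration of the ATE definition above, the sketch below computes the RMSE of per-frame position errors. It assumes the estimated trajectory has already been Sim(3)-aligned to ground truth and that both trajectories are equal-length lists of (x, y, z) camera positions in meters. `ate_rmse` is a hypothetical helper, not part of evo's API.

```python
import math


def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error (RMSE) between two aligned trajectories.

    est_xyz, gt_xyz: equal-length sequences of (x, y, z) positions.
    Alignment (e.g. Sim(3)) is assumed to have been applied beforehand.
    """
    if len(est_xyz) != len(gt_xyz) or len(est_xyz) == 0:
        raise ValueError("trajectories must be non-empty and equally long")
    # Squared Euclidean distance between matching frames.
    sq_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(est_xyz, gt_xyz)
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy example: the second frame is off by 2 m along z.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(ate_rmse(est, gt))
```

RPE-t and RPE-r are computed on relative poses between consecutive frames instead of absolute positions, so they additionally need the camera rotations; for reported numbers, use evo itself.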

Model Card

We release five Cambrian-P-7B variants. All share the same backbone (Qwen2.5-7B + SigLIP2-SO400m + per-frame camera tokens + VGGT-style pose head) and are fine-tuned from Cambrian-S-7B stage 3.

Model	Training Data	Hugging Face
Cambrian-P-7B	VSI	nyu-visionx/Cambrian-P-7B
Cambrian-P-7B-32f	VSI	nyu-visionx/Cambrian-P-7B-32f
Cambrian-P-7B-Mix-MA	VSI + MapAnything	nyu-visionx/Cambrian-P-7B-Mix-MA
Cambrian-P-7B-Mix-3R	VSI + partial VLM-3R	nyu-visionx/Cambrian-P-7B-Mix-3R
Cambrian-P-7B-Mix-CamS	VSI + Cambrian-S	nyu-visionx/Cambrian-P-7B-Mix-CamS
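
One way to fetch a variant from the table above is through `huggingface_hub`. This is a sketch, not an official loader: `MODEL_REPOS` and `download_checkpoint` are hypothetical names, and the code assumes `huggingface_hub` is installed and that the repositories are standard Hugging Face model repos.

```python
# Variant name -> Hugging Face repo id, copied from the Model Card table.
MODEL_REPOS = {
    "Cambrian-P-7B": "nyu-visionx/Cambrian-P-7B",
    "Cambrian-P-7B-32f": "nyu-visionx/Cambrian-P-7B-32f",
    "Cambrian-P-7B-Mix-MA": "nyu-visionx/Cambrian-P-7B-Mix-MA",
    "Cambrian-P-7B-Mix-3R": "nyu-visionx/Cambrian-P-7B-Mix-3R",
    "Cambrian-P-7B-Mix-CamS": "nyu-visionx/Cambrian-P-7B-Mix-CamS",
}


def download_checkpoint(variant, local_dir=None):
    """Download one variant and return the local snapshot path."""
    repo_id = MODEL_REPOS[variant]  # raises KeyError for unknown variants
    # Imported lazily so the mapping is usable without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


print(sorted(MODEL_REPOS))
```

Calling `download_checkpoint("Cambrian-P-7B")` would then pull the default variant into the local Hugging Face cache.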

Train

Environment Preparation

git clone https://github.com/cambrian-mllm/cambrian-p.git
conda create -n cambrianp python=3.11 cmake=3.14.0
conda activate cambrianp

cd cambrian-p/vggt && pip install -e .
pip install hydra-core tensorboard iopath wcmatch fvcore

cd .. && pip install --upgrade pip && pip install -e ".[train]"

# PyTorch 2.4.1 + CUDA 12.1 + flash-attn 2.8.3
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 \
    --index-url https://download.pytorch.org/whl/cu121
pip install --no-deps https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install accelerate==0.29.3 easydict matplotlib roma evo imageio OpenEXR

See doc/env_install.md for the detailed version.

Data Preparation

Cambrian-P fine-tunes from Cambrian-S-7B stage 3 on three required pieces:

Piece	Source	Size	Used for
1. VSI-590K (VQA + scene geometry)	`nyu-visionx/vsi-590k`	~236 GB	Spatial QA + scene pose supervision
2. Cambrian-S 3M videos	`nyu-visionx/Cambrian-S-3M`	per-source	Video backbone for the pose-annotated half of training
3. Cambrian-P pseudo pose annotations	`nyu-visionx/Cambrian-P-Data`	~850 MiB	Dense pose supervision on a subset of Cambrian-S-3M

See doc/data_preparation.md for the full recipe.
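
Given the sizes listed above, it can help to check free disk space before starting a download. The sketch below is one possible approach and not the recipe from doc/data_preparation.md. `DATASETS`, `has_room`, and `download_dataset` are hypothetical names; the byte counts are the approximate sizes from the table (treating GB as 10^9 bytes is an assumption), and the download call assumes `huggingface_hub` is installed.

```python
import shutil

# Dataset repo id -> approximate size in bytes (None = not stated in the table).
DATASETS = {
    "nyu-visionx/vsi-590k": 236 * 10**9,         # ~236 GB
    "nyu-visionx/Cambrian-S-3M": None,           # listed as "per-source"
    "nyu-visionx/Cambrian-P-Data": 850 * 2**20,  # ~850 MiB
}


def has_room(repo_id, target_dir, free_bytes=None):
    """Return True if `target_dir` appears to have space for the dataset.

    When the size is unknown (None) the check is skipped and True is returned.
    `free_bytes` can be passed explicitly, e.g. for testing.
    """
    needed = DATASETS[repo_id]
    if needed is None:
        return True
    if free_bytes is None:
        free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= needed


def download_dataset(repo_id, target_dir):
    """Download a dataset repo into `target_dir` after a free-space check."""
    if not has_room(repo_id, target_dir):
        raise RuntimeError(f"not enough free space in {target_dir} for {repo_id}")
    # Imported lazily so the space check works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target_dir)


print(has_room("nyu-visionx/Cambrian-P-Data", ".", free_bytes=10**9))
```

The check is deliberately conservative about nothing: it ignores extraction overhead and cache copies, so leave headroom beyond the listed sizes.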

Training Scripts

Training scripts live in cambrianp/scripts/. Sample run (complete the Data Preparation Quickstart first):

conda activate cambrianp
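export WANDB_API_KEY=<your-key>                              # Weights & Biases key for logging
export DATA_DIR=/path/to/vsi-590k                            # VSI-590K root
export VIPE_CAMBRIANS_DATA_ROOT=/path/to/cambrian_s_3m       # Cambrian-S-3M videos
export VIPE_CAMBRIANS_RESULTS_ROOT=/path/to/cambrian_p_pose  # Cambrian-P-Data pose annotations
export OUTPUT_DIR=$PWD/ckpts                                 # output directory (here ./ckpts)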

bash cambrianp/scripts/Cambrian-P-7B.sh
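
Before launching, a small preflight check can catch unset variables or wrong paths early. The sketch below is optional and not part of the repository. `check_env` is a hypothetical helper, and treating all five variables as required (and the three data paths as existing directories) is an assumption based only on the exports shown above.

```python
import os

# Variables exported in the sample run above (assumed to all be required).
REQUIRED_VARS = (
    "WANDB_API_KEY",
    "DATA_DIR",
    "VIPE_CAMBRIANS_DATA_ROOT",
    "VIPE_CAMBRIANS_RESULTS_ROOT",
    "OUTPUT_DIR",
)
# Variables assumed to point at directories that must already exist.
PATH_VARS = ("DATA_DIR", "VIPE_CAMBRIANS_DATA_ROOT", "VIPE_CAMBRIANS_RESULTS_ROOT")


def check_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in PATH_VARS:
        path = env.get(name)
        if path and not os.path.isdir(path):
            problems.append(f"{name} does not point to a directory: {path}")
    return problems


if __name__ == "__main__":
    for problem in check_env():
        print("preflight:", problem)
```

Running it in the same shell as the exports prints one line per problem and nothing when everything looks consistent.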

Evaluation

Please refer to doc/evaluation.md.

Citation

If you find our work useful for your research, please consider citing:

@article{yang2026cambrianp,
  title   = {Cambrian-P: Pose-Grounded Video Understanding},
  author  = {Yang, Jihan and Zhao, Zifan and Pan, Xichen and Yang, Shusheng and Zhang, Junyi and Kang, Bingyi and Xu, Hu and Xie, Saining},
  journal = {arXiv preprint arXiv:2605.22819},
  year    = {2026},
}

License

See LICENSE.

Related Projects

Cambrian-1 — Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Thinking in Space — Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces (introduces VSI-Bench)
Cambrian-S — spatial supersensing in video (shared training-data recipe)
VGGT — reconstruction backbone
DUSt3R / MonST3R — evaluation protocol
CUT3R, StreamVGGT, MapAnything — evaluation baselines
