cambrian-mllm/cambrian-p

Cambrian-P: Pose-Grounded Video Understanding

arXiv | Website | HF Models: Cambrian-P | HF Data: Cambrian-P-Data
1New York University   2UC Berkeley   3Meta FAIR
*Equal Contribution — JY led the project, JY and ZZ contributed equally.

Cambrian-P Weights

Cambrian-P is a pose-grounded video MLLM. Built on top of the Cambrian-S architecture (SigLIP2-SO400m vision encoder + Qwen2.5 LLM + MLP projector), it introduces one learnable camera token per frame (via two learnable query embeddings — one for the first frame, one for the rest) and a lightweight pose head adapted from VGGT. A single forward pass answers spatial video questions and regresses per-frame camera translation, rotation, and field-of-view — enabling both improved spatial video QA and streaming camera pose estimation.
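
The camera-token scheme described above can be illustrated with a small NumPy sketch. It is not the repository's implementation: the function name `build_camera_tokens` and the hidden size are hypothetical, and the real queries are learned parameters inside the model rather than random vectors. The sketch only shows how two query embeddings yield one token per frame.

```python
import numpy as np


def build_camera_tokens(num_frames, first_query, rest_query):
    """Return one camera token per frame, shape (num_frames, dim).

    Frame 0 uses `first_query`; every later frame shares `rest_query`.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    first_query = np.asarray(first_query, dtype=float)
    rest_query = np.asarray(rest_query, dtype=float)
    # Repeat the shared query for frames 1..num_frames-1.
    rest = np.tile(rest_query, (num_frames - 1, 1))
    return np.concatenate([first_query[None, :], rest], axis=0)


rng = np.random.default_rng(0)
dim = 8  # hypothetical hidden size, kept small for illustration
first_query = rng.normal(size=dim)  # stand-in for the learned first-frame query
rest_query = rng.normal(size=dim)   # stand-in for the learned shared query

tokens = build_camera_tokens(4, first_query, rest_query)
print(tokens.shape)
```

In this sketch the tokens for frames 1 onward start out identical. In the model described above they would be inserted into the frame sequence and differentiated by the surrounding visual context before the pose head reads them.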

VSI-Bench Performance

Spatial video understanding on VSI-Bench. Cambrian-P-7B (Qwen2.5-7B + SigLIP2-SO400m) achieves 73.7 average accuracy, the best among 7B-scale spatial-specialist models, with a +4.5% gain over Cambrian-S-7B (its no-pose counterpart) and particularly strong results on Relative Direction, Object Count, Route Plan, and Appearance Order.

Pose Estimation Performance

Streaming camera pose estimation on ScanNet, TUM-dynamic, and Sintel, following the MonST3R protocol. For ScanNet and TUM-dynamic we sample the first 90 frames at temporal stride 3; for Sintel we exclude static / near-straight sequences. All metrics use Sim(3) alignment.
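
As a rough illustration of the frame sampling above, the sketch below picks frame indices for one sequence. It assumes that "the first 90 frames at temporal stride 3" means 90 sampled frames spaced 3 apart (indices 0, 3, 6, ...); if the protocol instead means striding within the first 90 raw frames, the slicing would differ. The function name `sample_frame_indices` is hypothetical and is not part of the repository.

```python
def sample_frame_indices(total_frames, num_frames=90, stride=3):
    """Indices of up to `num_frames` frames taken every `stride` frames.

    Assumed reading of the protocol: keep the first `num_frames` entries of
    the strided sequence 0, stride, 2 * stride, ...
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return list(range(0, total_frames, stride))[:num_frames]


long_clip = sample_frame_indices(1000)  # enough frames for a full sample
short_clip = sample_frame_indices(100)  # sequence ends before 90 samples
print(len(long_clip), long_clip[-1])
print(len(short_clip), short_clip[-1])
```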

Cambrian-P achieves the lowest ATE on ScanNet among all streaming models, competitive with offline pipelines — without a DINOv2 encoder or a bidirectional transformer.

Pose Estimation Results

Metric definitions follow evo: ATE = absolute trajectory error RMSE (meters), RPE-t / RPE-r = per-frame relative pose error RMSE (meters / degrees).
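
For readers unfamiliar with the metric, here is a minimal NumPy sketch of ATE RMSE under Sim(3) alignment, using the standard Umeyama closed-form solution. It is not evo's code and not the repository's evaluation script. The function names are hypothetical, and it operates on camera positions only.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~= s * R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)  # cross-covariance between the point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # flip one axis so R is a rotation, not a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


def ate_rmse(est_xyz, gt_xyz):
    """RMSE of position error after aligning the estimate to ground truth."""
    est_xyz = np.asarray(est_xyz, dtype=float)
    gt_xyz = np.asarray(gt_xyz, dtype=float)
    s, R, t = umeyama_sim3(est_xyz, gt_xyz)
    aligned = s * (R @ est_xyz.T).T + t
    err = np.linalg.norm(aligned - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))


# Toy check: the ground truth is a scaled, rotated, shifted copy of the estimate,
# so the aligned error should vanish up to floating-point noise.
est = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [2, 1, 3]], dtype=float)
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
gt = 2.5 * (Rz @ est.T).T + np.array([1.0, -2.0, 0.5])
print(ate_rmse(est, gt))
```

Because the alignment solves for scale as well as rotation and translation, a trajectory that has the right shape but the wrong metric scale still scores a low ATE.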

Model Card

We release five Cambrian-P-7B variants. All share the same backbone (Qwen2.5-7B + SigLIP2-SO400m + per-frame camera tokens + VGGT-style pose head) and are fine-tuned from Cambrian-S-7B stage 3.

| Model | Training Data | Hugging Face |
|---|---|---|
| Cambrian-P-7B | VSI | nyu-visionx/Cambrian-P-7B |
| Cambrian-P-7B-32f | VSI | nyu-visionx/Cambrian-P-7B-32f |
| Cambrian-P-7B-Mix-MA | VSI + MapAnything | nyu-visionx/Cambrian-P-7B-Mix-MA |
| Cambrian-P-7B-Mix-3R | VSI + partial VLM-3R | nyu-visionx/Cambrian-P-7B-Mix-3R |
| Cambrian-P-7B-Mix-CamS | VSI + Cambrian-S | nyu-visionx/Cambrian-P-7B-Mix-CamS |
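
One possible way to fetch a checkpoint is through the `huggingface_hub` library's `snapshot_download`. The sketch below assumes that library is installed and that the repository IDs in the table are publicly downloadable. The helper `download_checkpoint` is hypothetical and is not part of this repository.

```python
# Repository IDs copied from the model table above.
MODEL_REPOS = {
    "Cambrian-P-7B": "nyu-visionx/Cambrian-P-7B",
    "Cambrian-P-7B-32f": "nyu-visionx/Cambrian-P-7B-32f",
    "Cambrian-P-7B-Mix-MA": "nyu-visionx/Cambrian-P-7B-Mix-MA",
    "Cambrian-P-7B-Mix-3R": "nyu-visionx/Cambrian-P-7B-Mix-3R",
    "Cambrian-P-7B-Mix-CamS": "nyu-visionx/Cambrian-P-7B-Mix-CamS",
}


def download_checkpoint(name, local_dir=None):
    """Download one variant and return the local snapshot path."""
    if name not in MODEL_REPOS:
        raise KeyError(f"unknown variant {name!r}; choose from {sorted(MODEL_REPOS)}")
    # Imported lazily so the table above can be used without the dependency.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=MODEL_REPOS[name], local_dir=local_dir)


# Example call (needs network access and several GB of disk space):
# path = download_checkpoint("Cambrian-P-7B")
```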

Train

Environment Preparation

git clone https://github.com/cambrian-mllm/cambrian-p.git
conda create -n cambrianp python=3.11 cmake=3.14.0
conda activate cambrianp

cd cambrian-p/vggt && pip install -e .
pip install hydra-core tensorboard iopath wcmatch fvcore

cd .. && pip install --upgrade pip && pip install -e ".[train]"

# PyTorch 2.4.1 + CUDA 12.1 + flash-attn 2.8.3
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 \
    --index-url https://download.pytorch.org/whl/cu121
pip install --no-deps https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install accelerate==0.29.3 easydict matplotlib roma evo imageio OpenEXR

See doc/env_install.md for the detailed version.

Data Preparation

Cambrian-P fine-tunes from Cambrian-S-7B stage 3 on three required pieces:

| Piece | Source | Size | Used for |
|---|---|---|---|
| 1. VSI-590K (VQA + scene geometry) | nyu-visionx/vsi-590k | ~236 GB | Spatial QA + scene pose supervision |
| 2. Cambrian-S 3M videos | nyu-visionx/Cambrian-S-3M | per-source | Video backbone for the pose-annotated half of training |
| 3. Cambrian-P pseudo pose annotations | nyu-visionx/Cambrian-P-Data | ~850 MiB | Dense pose supervision on the partial Cambrian-S-3M |

See doc/data_preparation.md for the full recipe.

Training Scripts

Training scripts are in cambrianp/scripts/. Sample test run (complete the Data Preparation Quickstart first):

conda activate cambrianp
export WANDB_API_KEY=<your-key>         
export DATA_DIR=/path/to/vsi-590k                        
export VIPE_CAMBRIANS_DATA_ROOT=/path/to/cambrian_s_3m  
export VIPE_CAMBRIANS_RESULTS_ROOT=/path/to/cambrian_p_pose  
export OUTPUT_DIR=$PWD/ckpts                            

bash cambrianp/scripts/Cambrian-P-7B.sh
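
Before launching a long run, it can help to confirm that the variables exported above are set and that the input directories exist. The pre-flight check below is a hypothetical helper, not part of the repository. It treats all five exported variables as required, which is an assumption based only on the sample commands.

```python
import os

# Variables exported in the sample run above.
REQUIRED_VARS = (
    "WANDB_API_KEY",
    "DATA_DIR",
    "VIPE_CAMBRIANS_DATA_ROOT",
    "VIPE_CAMBRIANS_RESULTS_ROOT",
    "OUTPUT_DIR",
)
# Variables that point at input data, which should already exist on disk.
INPUT_DIR_VARS = ("DATA_DIR", "VIPE_CAMBRIANS_DATA_ROOT", "VIPE_CAMBRIANS_RESULTS_ROOT")


def check_training_env(env):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in INPUT_DIR_VARS:
        value = env.get(name)
        if value and not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


for problem in check_training_env(os.environ):
    print("warning:", problem)
```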

Evaluation

Please refer to doc/evaluation.md.

Citation

If you find our work useful for your research, please consider citing:

@article{yang2026cambrianp,
  title   = {Cambrian-P: Pose-Grounded Video Understanding},
  author  = {Yang, Jihan and Zhao, Zifan and Pan, Xichen and Yang, Shusheng and Zhang, Junyi and Kang, Bingyi and Xu, Hu and Xie, Saining},
  journal = {arXiv preprint arXiv:2605.22819},
  year    = {2026},
}

License

See LICENSE.

Related Projects

  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  • Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces (introduces VSI-Bench)
  • Cambrian-S — spatial supersensing in video (shared training-data recipe)
  • VGGT — reconstruction backbone
  • DUSt3R / MonST3R — evaluation protocol
  • CUT3R, StreamVGGT, MapAnything — evaluation baselines
