MonST3R processes a dynamic video to produce a time-varying dynamic point cloud, along with per-frame camera poses and intrinsics, in a predominantly feed-forward manner. This representation then enables the efficient computation of downstream tasks, such as video depth estimation and dynamic/static scene segmentation.
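For reference, below is a minimal sketch of how such per-frame outputs could be organized for downstream use; the container, field names, and array shapes are illustrative assumptions, not the repository's actual API.

# Illustrative sketch only: a hypothetical container for MonST3R-style per-frame outputs.
# Field names and shapes are assumptions for exposition, not the repository's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class DynamicSceneOutput:
    pointmaps: np.ndarray    # (T, H, W, 3) per-pixel 3D points in a shared world frame
    confidences: np.ndarray  # (T, H, W)    per-pixel confidence
    poses_c2w: np.ndarray    # (T, 4, 4)    per-frame camera-to-world extrinsics
    intrinsics: np.ndarray   # (T, 3, 3)    per-frame pinhole intrinsics

    def depth(self, t: int) -> np.ndarray:
        """Video depth for frame t: transform world points into camera t and take z."""
        w2c = np.linalg.inv(self.poses_c2w[t])
        pts_h = np.concatenate(
            [self.pointmaps[t], np.ones((*self.pointmaps[t].shape[:2], 1))], axis=-1
        )
        return (pts_h @ w2c.T)[..., 2]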
This repository is the official implementation of the paper:
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
Junyi Zhang, Charles Herrmann+, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun*, Ming-Hsuan Yang*
arXiv, 2024. [Project Page] [Paper] [Interactive Results🔥]
- Release model weights on Google Drive and Hugging Face (10/07)
- Release inference code for global optimization (10/18)
- Release 4D visualization code (10/18)
- Release training code & dataset preparation (10/19)
- Release evaluation code (10/20)
- Memory-efficient optimization v1 (12/25): a non-batchified version of the global optimization; slower, but with lower memory usage
- Real-time reconstruction mode (1/20): a fully feed-forward mode for real-time reconstruction (sample results)
- Gradio Demo
- Clone MonST3R.
git clone --recursive https://github.com/junyi42/monst3r
cd monst3r
## if you have already cloned monst3r:
# git submodule update --init --recursive
- Create the environment; here we show an example using conda.
conda create -n monst3r python=3.11 cmake=3.14.0
conda activate monst3r
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia # use the correct version of cuda for your system
pip install -r requirements.txt
# Optional: you can also install additional packages for:
# - training
# - evaluation on camera pose
# - dataset preparation
pip install -r requirements_optional.txt
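After installation, a quick sanity check (generic, not repo-specific) confirms that PyTorch sees your CUDA device:

# Generic sanity check for the PyTorch + CUDA install above.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))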
- Optional, install the 4D visualization tool, viser.
pip install -e viser
- Optional, compile the CUDA kernels for RoPE (as in CroCo v2).
# DUSt3R relies on RoPE positional embeddings, for which you can compile some CUDA kernels for faster runtime.
cd croco/models/curope/
python setup.py build_ext --inplace
cd ../../../
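To verify the build, you can try importing the extension from inside croco/; the module path below follows CroCo's layout and is an assumption here (if the import fails, the pure-PyTorch RoPE implementation is used instead).

# Hedged check, run from inside the croco/ directory.
# Assumes the compiled extension is exposed as models.curope.cuRoPE2D (CroCo's layout).
try:
    from models.curope import cuRoPE2D  # noqa: F401
    print("CUDA RoPE kernels available")
except ImportError as err:
    print("compiled kernels not found, falling back to the PyTorch RoPE:", err)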
We currently provide fine-tuned model weights for MonST3R, which can be downloaded on Google Drive or via Hugging Face.
To download the weights of MonST3R and optical flow models, run the following commands:
# download the weights
cd data
bash download_ckpt.sh
cd ..
To run the inference code, you can use the following command:
python demo.py # launch GUI, input can be a folder or a video
# use memory efficient optimization: --not_batchify
The results will be saved in the demo_tmp/{Sequence Name} folder (demo_tmp/NULL by default) for later visualization.
You can also run the inference code in a non-interactive mode:
python demo.py --input demo_data/lady-running --output_dir demo_tmp --seq_name lady-running
# use video as input: --input demo_data/lady-running.mp4 --num_frames 65
# (update 12/15) use memory efficient optimization: --not_batchify
# (update 1/20) use real-time mode: --real_time
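To process several sequences without the GUI, a small wrapper script can loop over inputs; this sketch uses only the demo.py flags shown above, and the demo_data layout is an assumption.

# Sketch: batch several sequences through the non-interactive demo.
# Uses only the demo.py flags shown above; adjust the input/output paths to your data.
import subprocess
from pathlib import Path

output_dir = "demo_tmp"
for seq_dir in sorted(Path("demo_data").iterdir()):
    if not seq_dir.is_dir():
        continue
    subprocess.run(
        ["python", "demo.py",
         "--input", str(seq_dir),
         "--output_dir", output_dir,
         "--seq_name", seq_dir.name],
        check=True,
    )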
Currently, it takes about 33GB of VRAM to run the inference code on a 16:9 video of 65 frames. Using fewer frames or disabling the flow_loss can reduce memory usage. We welcome any PRs that improve memory efficiency (one reasonable approach is to implement window-wise optimization in optimizer.py).
Update (12/15): With the non-batchified version of the global optimization, memory usage has been reduced: inference on a 65-frame 16:9 video now requires only ~23GB of VRAM. This mode is slower, and we welcome any PRs that improve the memory-speed trade-off, e.g., by using window-wise instead of edge-wise optimization in the function forward_non_batchify() (a rough sketch of this idea is given below).
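To make that suggestion concrete, here is a hypothetical sketch of window-wise processing; the edge list and the optimize_window callback are placeholders, not code from optimizer.py or forward_non_batchify().

# Hypothetical sketch of window-wise (rather than edge-wise) processing.
# `edges` and `optimize_window` are placeholders, not functions from this repository.
def iter_windows(edges, window_size=16, overlap=4):
    """Yield overlapping chunks of pairwise edges (i, j)."""
    step = window_size - overlap
    for start in range(0, len(edges), step):
        yield edges[start:start + window_size]

def run_window_wise(edges, optimize_window):
    for window in iter_windows(edges):
        # Each window is optimized on its own, so peak memory scales with the
        # window size instead of with the total number of edges.
        optimize_window(window)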
Update (1/20): We have added a fully feed-forward mode to the inference code, which can run in real time. The results are worse than those of the global optimization mode, and it only applies to cases where the camera motion is small. More details will be updated soon.
To visualize the interactive 4D results, you can use the following command:
python viser/visualizer_monst3r.py --data demo_tmp/lady-running
# to remove foreground floaters: --init_conf --fg_conf_thre 1.0 (the threshold can be adjusted)
# (update 1/20) for results generated by the real-time mode, please update viser and use the following command:
python viser/visualizer_monst3r_realtime.py --data demo_tmp/lady-running
We provide here an example of joint dense reconstruction and camera pose estimation on the DAVIS dataset.
First, download the dataset:
cd data; python download_davis.py; cd ..
Then, run the evaluation script:
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=29604 launch.py --mode=eval_pose \
--pretrained="checkpoints/MonST3R_PO-TA-S-W_ViTLarge_BaseDecoder_512_dpt.pth" \
--eval_dataset=davis --output_dir="results/davis_joint"
# To use the ground truth dynamic mask for davis, add: --use_gt_mask
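For intuition about the camera pose metric, below is a standalone sketch of ATE-RMSE after a rigid alignment; the repository's actual pose evaluation follows LEAP-VO, and this function is only an illustration.

# Illustrative only: ATE-RMSE between estimated and ground-truth camera centers
# after a least-squares rigid alignment (scale alignment, common for monocular
# methods, is omitted for brevity). The repository's pose evaluation follows LEAP-VO.
import numpy as np

def ate_rmse(pred_xyz, gt_xyz):
    """pred_xyz, gt_xyz: (N, 3) camera centers for the same N frames."""
    mu_p, mu_g = pred_xyz.mean(0), gt_xyz.mean(0)
    P, G = pred_xyz - mu_p, gt_xyz - mu_g
    U, _, Vt = np.linalg.svd(G.T @ P)                        # SVD of the cross-covariance
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ S @ Vt                                           # rotation aligning pred to gt
    t = mu_g - R @ mu_p
    aligned = pred_xyz @ R.T + t
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(-1).mean()))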
You can then use viser to visualize the results:
python viser/visualizer_monst3r.py --data results/davis_joint/bear
# if the dynamic mask is noisy, you can visualize the per-frame point cloud by adding: --no_mask
For the complete scripts to evaluate camera pose, video depth, and single-frame depth estimation on the Sintel, Bonn, KITTI, NYU-v2, TUM-dynamics, ScanNet, and DAVIS datasets, please refer to evaluation_script.md.
Please refer to prepare_training.md for preparing the pretrained models and the training/testing datasets.
Then, you can train the model using the following command:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29604 launch.py --mode=train \
--train_dataset="10_000 @ PointOdysseyDUSt3R(dset='train', z_far=80, dataset_location='data/point_odyssey', S=2, aug_crop=16, resolution=[(512, 288), (512, 384), (512, 336)], transform=ColorJitter, strides=[1,2,3,4,5,6,7,8,9], dist_type='linear_1_2', aug_focal=0.9)+ 5_000 @ TarTanAirDUSt3R(dset='Hard', z_far=80, dataset_location='data/tartanair', S=2, aug_crop=16, resolution=[(512, 288), (512, 384), (512, 336)], transform=ColorJitter, strides=[1,2,3,4,5,6,7,8,9], dist_type='linear_1_2', aug_focal=0.9)+ 1_000 @ SpringDUSt3R(dset='train', z_far=80, dataset_location='data/spring', S=2, aug_crop=16, resolution=[(512, 288), (512, 384), (512, 336)], transform=ColorJitter, strides=[1,2,3,4,5,6,7,8,9], dist_type='linear_1_2', aug_focal=0.9)+ 4_000 @ Waymo(ROOT='data/waymo_processed', pairs_npz_name='waymo_pairs_video.npz', aug_crop=16, resolution=[(512, 288), (512, 384), (512, 336)], transform=ColorJitter, aug_focal=0.9)" \
--test_dataset="1000 @ PointOdysseyDUSt3R(dset='test', z_far=80, dataset_location='data/point_odyssey', S=2, strides=[1,2,3,4,5,6,7,8,9], resolution=[(512, 288)], seed=777)+ 1000 @ SintelDUSt3R(dset='final', z_far=80, S=2, strides=[1,2,3,4,5,6,7,8,9], resolution=[(512, 224)], seed=777)" \
--train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
--test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
--pretrained="checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth" \
--lr=0.00005 --min_lr=1e-06 --warmup_epochs=3 --epochs=50 --batch_size=4 --accum_iter=4 \
--save_freq=3 --keep_freq=5 --eval_freq=1 \
--output_dir="results/MonST3R_PO-TA-S-W_ViTLarge_BaseDecoder_512_dpt"
If you find our work useful, please cite:
@article{zhang2024monst3r,
author = {Zhang, Junyi and Herrmann, Charles and Hur, Junhwa and Jampani, Varun and Darrell, Trevor and Cole, Forrester and Sun, Deqing and Yang, Ming-Hsuan},
title = {MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion},
journal = {arXiv preprint arXiv:2410.03825},
year = {2024}
}
Our code is based on DUSt3R and CasualSAM, our camera pose estimation evaluation script is based on LEAP-VO, and our visualization code is based on Viser. We thank the authors for their excellent work!