Hanxun Yu1,2*,
Xuan Qu1,2*,
Yuxin Wang2,3,
Jianke Zhu1,4,
Lei Ke2
1Zhejiang University,
2Tencent Hunyuan LLM,
3HKUST,
4Shenzhen Loop Area Institute
DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.
- [2026-05-18] 🔥 We release DepthVLM-Bench on Hugging Face 🤗.
- [2026-05-18] 🔥 We release the DepthVLM-4B checkpoint on Hugging Face 🤗.
- [2026-05-18] 🔥 We release the training and inference code.
- [2026-05-15] 🔥 We release the DepthVLM paper.
git clone https://github.com/hanxunyu/DepthVLM.git
cd DepthVLM
conda create -n depthvlm python=3.10 -y
conda activate depthvlm
pip install -r requirements.txt
pip install flash-attn==2.6.3 --no-build-isolation
- Due to licensing restrictions, we are unable to directly release the curated data. Instead, we provide the full data curation pipeline for reproducibility. Please refer to data_process.md for detailed dataset-specific preparation instructions.
- We provide visualization examples from ScanNet++ in the examples folder.
- We also release the curated annotations of DepthVLM-Bench on Hugging Face 🤗.
We provide the pretrained model DepthVLM-4B on Hugging Face 🤗.
Run our example inference script to generate the predicted depth maps and 3D point clouds.
# visualization examples
bash examples/run_demo.sh
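For readers who want to see how a predicted metric depth map relates to a 3D point cloud, the following is a minimal sketch of standard pinhole back-projection. It is not taken from the DepthVLM code: the function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration, and the demo script may handle cameras differently.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into an (N, 3) camera-frame point cloud.

    Hypothetical helper: assumes a pinhole camera with focal lengths (fx, fy)
    and principal point (cx, cy), all in pixels.
    """
    h, w = depth.shape
    # Pixel grids: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep only pixels with a finite, positive depth.
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid]

# Usage with made-up intrinsics for a 480x640 image.
depth = np.full((480, 640), 2.0)  # every pixel 2 m away
cloud = depth_to_points(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Each row of `cloud` is an `(x, y, z)` point in the camera frame, in the same metric unit as the depth map.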
Specify the annotation and dataset paths in configs/eval_datasets.conf, choose the evaluation protocol with EVAL_MODE="sparse" for sparse-point evaluation or EVAL_MODE="dense" for full-depth-map evaluation, and then run the script on DepthVLM-Bench.
bash eval/eval.sh
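The difference between the two protocols can be illustrated with a short sketch. It assumes the commonly used depth metrics AbsRel and δ1 (the fraction of pixels with max(pred/gt, gt/pred) < 1.25); the metrics actually reported by `eval/eval.sh` may differ, and the helper names `depth_metrics`, `dense_mask`, and `sparse_mask` are hypothetical.

```python
import numpy as np

def depth_metrics(pred, gt, mask):
    """Compute AbsRel and delta_1 over the pixels selected by a boolean mask."""
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return {"abs_rel": abs_rel, "delta1": delta1}

def dense_mask(gt):
    """Dense protocol (assumed): score every pixel with valid ground truth."""
    return np.isfinite(gt) & (gt > 0)

def sparse_mask(gt, points):
    """Sparse protocol (assumed): score only the queried (row, col) points."""
    mask = np.zeros(gt.shape, dtype=bool)
    rows, cols = zip(*points)
    mask[list(rows), list(cols)] = True
    # A queried point still needs valid ground truth to be scored.
    return mask & dense_mask(gt)

# Toy example: ground-truth depth of 0 marks a missing measurement.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
pred = np.array([[1.0, 3.0], [4.0, 5.0]])
dense_scores = depth_metrics(pred, gt, dense_mask(gt))
sparse_scores = depth_metrics(pred, gt, sparse_mask(gt, [(0, 1), (1, 1)]))
```

Both protocols share the same metric code and differ only in which pixels enter the average, so sparse and dense scores on the same prediction are not directly comparable.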
Specify the annotation and dataset paths in configs/train_datasets.conf, then run the following training scripts.
Stage 1: depth-head-only training
# stage-1
bash train/train-stage1.sh
Stage 2: end-to-end fine-tuning
# stage-2
bash train/train-stage2.sh
DepthVLM-4B is trained for two days on 80 NVIDIA H20 GPUs (96 GB each).
We are grateful for the open-source contributions of other projects:
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
@article{yu2026unlocking,
title={Unlocking Dense Metric Depth Estimation in VLMs},
author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
journal={arXiv preprint arXiv:2605.15876},
year={2026}
}







