Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu¹,²*, Xuan Qu¹,²*, Yuxin Wang²,³, Jianke Zhu¹,⁴, Lei Ke²
¹Zhejiang University, ²Tencent Hunyuan LLM, ³HKUST, ⁴Shenzhen Loop Area Institute

🔍 Overview

DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

📰 News

[2026-05-18] 🔥 We release DepthVLM-Bench on Hugging Face 🤗.
[2026-05-18] 🔥 We release the DepthVLM-4B checkpoint on Hugging Face 🤗.
[2026-05-18] 🔥 We release the training and inference code.
[2026-05-15] 🔥 We release the DepthVLM paper.

🛠️ Installation

git clone https://github.com/hanxunyu/DepthVLM.git
cd DepthVLM

conda create -n depthvlm python=3.10 -y
conda activate depthvlm
pip install -r requirements.txt
pip install flash-attn==2.6.3 --no-build-isolation

📊 Data Preparation

Due to licensing restrictions, we are unable to directly release the curated data. Instead, we provide the full data curation pipeline for reproducibility. Please refer to data_process.md for detailed dataset-specific preparation instructions.
We provide visualization examples from ScanNet++ in the examples folder.
We also release the curated annotations of DepthVLM-Bench on Hugging Face 🤗.

📦️ Pretrained Models

We provide the pretrained model DepthVLM-4B on Hugging Face 🤗.

🤖 Inference Examples

Run our example inference script to generate the predicted depth maps and 3D point clouds.

# visualization examples
bash examples/run_demo.sh
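
For readers unfamiliar with how a metric depth map becomes a 3D point cloud, the following is a minimal sketch of standard pinhole back-projection. It is not taken from this repository: the function name `depth_to_point_cloud` and the intrinsics arguments `fx`, `fy`, `cx`, `cy` are assumptions made for illustration, and the demo script may handle this step differently.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map to an (N, 3) point cloud.

    Hypothetical helper, not part of the DepthVLM code base. Assumes a
    pinhole camera with focal lengths (fx, fy) and principal point (cx, cy)
    in pixels. Pixels with non-finite or non-positive depth are dropped.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)


# Toy 2x2 depth map in metres; the zero marks a pixel without valid depth.
toy_depth = np.array([[2.0, 0.0], [2.0, 2.0]])
points = depth_to_point_cloud(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points.shape)  # (3, 3)
```

The points are expressed in the camera frame, in the same unit as the depth map. Because the depth is metric, no additional scale alignment is applied in this sketch.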

Specify the annotation and dataset paths in configs/eval_datasets.conf and choose the evaluation protocol: EVAL_MODE="sparse" for sparse-point evaluation or EVAL_MODE="dense" for full-depth-map evaluation. Then run the evaluation script on DepthVLM-Bench.

bash eval/eval.sh
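
To illustrate the difference between the two protocols, here is a minimal sketch that scores a prediction either at a set of sparse query points or over every valid pixel. It is not the repository's evaluation code. The metrics shown (absolute relative error and the δ < 1.25 accuracy) are common depth-estimation metrics chosen as an assumption, and the function name `depth_metrics` and the `(row, col)` layout of `points` are hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt, points=None):
    """Compute AbsRel and delta_1 between predicted and ground-truth depth.

    Hypothetical helper, not part of the DepthVLM code base.
    pred, gt: (H, W) arrays of metric depth.
    points:   optional (N, 2) integer array of (row, col) pixel coordinates.
              If given, only those pixels are scored (sparse mode);
              otherwise all pixels with valid ground truth are scored (dense mode).
    """
    if points is not None:
        rows, cols = points[:, 0], points[:, 1]
        pred, gt = pred[rows, cols], gt[rows, cols]
    # Ignore pixels without valid ground truth (zero, negative, NaN or inf).
    valid = np.isfinite(gt) & (gt > 0)
    pred, gt = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < 1.25))
    return {"abs_rel": abs_rel, "delta1": delta1}


gt = np.array([[1.0, 2.0], [4.0, 0.0]])    # 0.0 marks missing ground truth
pred = np.array([[1.0, 2.2], [2.0, 5.0]])

dense = depth_metrics(pred, gt)                                  # all valid pixels
sparse = depth_metrics(pred, gt, np.array([[0, 0], [0, 1]]))     # two query points
print(dense, sparse)
```

On this toy input the sparse score looks better than the dense one, because the two query points happen to miss the poorly predicted pixel. Sparse-point and full-map numbers are therefore not directly comparable.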

🚀 Two-Stage Training

Specify the annotation and dataset paths in configs/train_datasets.conf, then run the following training scripts.

Stage 1: depth-head-only training

# stage-1
bash train/train-stage1.sh

Stage 2: end-to-end fine-tuning

# stage-2
bash train/train-stage2.sh

DepthVLM-4B was trained for two days on 80 NVIDIA H20 GPUs (96 GB).

🔬 Experiment Results

Comparison with VLMs (Sparse Points)

Comparison with Pure Vision Models (Sparse Points)

Comparison with Pure Vision Models (Full Depth Map)

Visualization Comparison

👏 Acknowledgements

We are grateful for the open-source contributions of other projects.

📑 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🖊️ Citation

@article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
}
