hanxunyu/DepthVLM

DepthVLM: Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu1,2*, Xuan Qu1,2*, Yuxin Wang2,3, Jianke Zhu1,4, Lei Ke2
1Zhejiang University, 2Tencent Hunyuan LLM, 3HKUST, 4Shenzhen Loop Area Institute


πŸ” Overview


DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

📰 News

  • [2026-05-18] 🔥 We release DepthVLM-Bench on Hugging Face 🤗.
  • [2026-05-18] 🔥 We release the DepthVLM-4B checkpoint on Hugging Face 🤗.
  • [2026-05-18] 🔥 We release the training and inference code.
  • [2026-05-15] 🔥 We release the DepthVLM paper.

πŸ› οΈ Installation

git clone https://github.com/hanxunyu/DepthVLM.git
cd DepthVLM

conda create -n depthvlm python=3.10 -y
conda activate depthvlm
pip install -r requirements.txt
pip install flash-attn==2.6.3 --no-build-isolation
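After installing, a quick sanity check can confirm that the environment is usable before starting a long run. The snippet below is only a sketch: it assumes `torch` is among the packages pulled in by requirements.txt (not confirmed here) and that Python 3.10 or newer is acceptable; `check_environment` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import sys


def check_environment(required=("torch", "flash_attn"), min_python=(3, 10)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # The README creates the conda env with python=3.10; treating that as a minimum is an assumption.
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ expected, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # find_spec returns None when a top-level module cannot be located.
    for name in required:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module '{name}' is not importable")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("environment looks OK" if not issues else "\n".join(issues))
```

Note that `flash_attn` is the import name of the `flash-attn` package installed above.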

📊 Data Preparation

  • Due to licensing restrictions, we are unable to directly release the curated data. Instead, we provide the full data curation pipeline for reproducibility. Please refer to data_process.md for detailed dataset-specific preparation instructions.
  • We provide visualization examples from ScanNet++ in the examples folder.
  • We also release the curated annotations of DepthVLM-Bench on Hugging Face 🤗.

πŸ“¦οΈ Pretrained Models

We provide the pretrained model DepthVLM-4B in Hugging Face πŸ€—.

🤖 Inference Examples

Run our example inference script to generate the predicted depth maps and 3D point clouds.

# visualization examples
bash examples/run_demo.sh
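For intuition on how a metric depth map relates to a 3D point cloud, here is a minimal pinhole back-projection sketch. It is not the code used by examples/run_demo.sh; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, and a real pipeline would typically vectorize this with an array library.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (rows of depths, in meters) to camera-frame 3D points.

    Uses the standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 depth map with one invalid pixel; the intrinsics are made up for illustration.
depth = [[2.0, 2.0], [0.0, 4.0]]
pts = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pts)
```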

Specify the annotation and dataset paths in configs/eval_datasets.conf, choose the evaluation protocol with EVAL_MODE="sparse" for sparse-point evaluation or EVAL_MODE="dense" for full-depth-map evaluation, and then run the script on DepthVLM-Bench.

bash eval/eval.sh
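The difference between the two protocols can be illustrated with a small sketch: sparse evaluation scores only a set of query pixels, while dense evaluation scores every valid pixel. The metrics below (absolute relative error and the δ < 1.25 accuracy) are common choices in depth estimation and are an assumption here; the metrics and sampling actually used by eval/eval.sh may differ, and `depth_metrics` is a hypothetical helper.

```python
def depth_metrics(pred, gt, points=None):
    """Compute AbsRel and delta<1.25 at the given (row, col) points, or over all pixels if points is None.

    Pixels whose ground truth or prediction is non-positive are skipped.
    """
    if points is None:  # dense protocol: visit every pixel
        points = [(r, c) for r in range(len(gt)) for c in range(len(gt[0]))]
    abs_rel_sum, inliers, n = 0.0, 0, 0
    for r, c in points:
        g, p = gt[r][c], pred[r][c]
        if g <= 0 or p <= 0:
            continue
        abs_rel_sum += abs(p - g) / g
        inliers += max(p / g, g / p) < 1.25
        n += 1
    if n == 0:
        raise ValueError("no valid pixels to evaluate")
    return {"abs_rel": abs_rel_sum / n, "delta1": inliers / n, "count": n}


gt = [[2.0, 4.0], [0.0, 8.0]]
pred = [[2.0, 5.0], [1.0, 8.0]]
dense = depth_metrics(pred, gt)                             # all valid pixels
sparse = depth_metrics(pred, gt, points=[(0, 0), (1, 1)])   # two query points only
print(dense, sparse)
```

In this toy case the sparse points happen to miss the one erroneous pixel, which shows why the two protocols can report different numbers for the same prediction.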

🚀 Two-Stage Training

Specify the annotation and dataset paths in configs/train_datasets.conf, then run the following training scripts.

Stage 1: depth head-only training

# stage-1
bash train/train-stage1.sh

Stage 2: end-to-end fine-tuning

# stage-2
bash train/train-stage2.sh

DepthVLM-4B is trained for two days on 80 NVIDIA H20 GPUs (96 GB each), i.e. roughly 3,840 GPU-hours.
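If you prefer to launch both stages from one entry point, a small driver like the following can run them back to back and stop if stage 1 fails. This is a sketch under assumptions: `run_stages` is a hypothetical helper, and it presumes the stage-2 script is already configured to load the stage-1 output, which this README does not spell out.

```python
import subprocess

# Script paths as given in this README.
STAGE_SCRIPTS = ["train/train-stage1.sh", "train/train-stage2.sh"]


def run_stages(scripts=STAGE_SCRIPTS, runner=subprocess.run):
    """Run each training script in order and return the list of scripts that completed.

    With check=True, subprocess.run raises CalledProcessError on a non-zero exit code,
    so a failed stage prevents later stages from starting.
    """
    completed = []
    for script in scripts:
        runner(["bash", script], check=True)
        completed.append(script)
    return completed


if __name__ == "__main__":
    run_stages()
```

The `runner` parameter exists only so the sequencing can be exercised without launching real training.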

🔬 Experiment Results

Comparison with VLMs (Sparse Points)

[Figure]

Comparison with Pure Vision Models (Sparse Points)

[Figure]

Comparison with Pure Vision Models (Full Depth Map)

[Figure]

Visualization Comparison

[Figures: visual comparison; examples 1, 2, and 3]

πŸ‘ Acknowledgements

We are grateful for the open-source contributions of other projects:

📑 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ–ŠοΈ Citation

@article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
}

About

🔥 Official code repository for "Unlocking Dense Metric Depth Estimation in VLMs"
