This repository contains the PyTorch implementation of the paper "The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge".
Create and activate the conda environment:

```bash
conda create -n lvsm python=3.11
conda activate lvsm
pip install -r requirements.txt
```

Recommended: a GPU with compute capability > 8.0. We used 8×A100 GPUs in our experiments.
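If you want to confirm the compute-capability recommendation, a minimal check with PyTorch (assuming the environment above is active):

```python
import torch

# Compute capability is reported as (major, minor), e.g. (8, 0) for an A100.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
```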
Update (26/01/04): we now also provide our preprocessed version of the DL3DV dataset in pixelSplat-style format on Hugging Face!
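One way to fetch a Hugging Face dataset repository programmatically is `huggingface_hub.snapshot_download`; the repo id below is a placeholder for the actual dataset repository:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id; substitute the dataset repository linked above.
local_path = snapshot_download(repo_id="<org>/<dl3dv-preprocessed>", repo_type="dataset")
print(local_path)
```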
We use the RealEstate10K dataset from pixelSplat and follow LVSM's preprocessing.
Download and unzip the RealEstate10K `.torch` chunks. For our scaling experiments, we split the dataset into four sizes, each containing the numbers of chunks and scenes listed below (a sketch of one way to build such splits follows the table):
| Size | Chunks | Scenes |
|---|---|---|
| Little | 76 | 1,202 |
| Medium | 304 | 4,121 |
| Large | 1,216 | 16,449 |
| Full | 4,866 | 66,033 |
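For reference, a minimal sketch that builds these splits by copying the first N chunk files. The directory layout, the `make_split` helper, and the assumption that smaller splits are prefixes of the full chunk list are all ours, not part of the released code:

```python
import shutil
from pathlib import Path

# Assumed layout: datasets/re10k/train holds the downloaded .torch chunks.
CHUNKS = sorted(Path("datasets/re10k/train").glob("*.torch"))
SPLIT_SIZES = {"little": 76, "medium": 304, "large": 1216, "full": 4866}

def make_split(name: str, num_chunks: int) -> None:
    """Copy the first `num_chunks` chunk files into a per-split directory."""
    out_dir = Path(f"datasets/re10k-{name}/train")
    out_dir.mkdir(parents=True, exist_ok=True)
    for chunk in CHUNKS[:num_chunks]:
        shutil.copy(chunk, out_dir / chunk.name)

for name, size in SPLIT_SIZES.items():
    make_split(name, size)
```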
Process the dataset following LVSM:
```bash
# process training split
python process_data.py --base_path datasets/re10k --output_dir datasets/re10k-full_processed --mode train --num_processes 80

# process test split
python process_data.py --base_path datasets/re10k --output_dir datasets/re10k-full_processed --mode test --num_processes 80
```

Download the pre-trained model from Hugging Face.
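If you prefer the command line, `huggingface-cli download` (shipped with `huggingface_hub`) can fetch the checkpoint; the repo id and target directory below are placeholders:

```bash
# Placeholder repo id; replace with the actual model repository.
huggingface-cli download <org>/<uplvsm-checkpoints> --local-dir checkpoints/
```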
Run evaluation:
```bash
# fast inference, compute metrics only
torchrun --nproc_per_node 8 --nnodes 1 --rdzv_id 18640 --rdzv_backend c10d --rdzv_endpoint localhost:29511 -m src.inference_fast --config config/eval/uplvsm_x224.yaml

# complete inference
torchrun --nproc_per_node 8 --nnodes 1 --rdzv_id 18640 --rdzv_backend c10d --rdzv_endpoint localhost:29511 -m src.inference --config config/eval/uplvsm_x224.yaml
```
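The commands above assume 8 GPUs on one node; to evaluate on fewer, lower `--nproc_per_node`. A hedged single-GPU example:

```bash
# Hypothetical single-GPU run of the fast-inference entry point.
torchrun --nproc_per_node 1 --nnodes 1 -m src.inference_fast --config config/eval/uplvsm_x224.yaml
```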
✅ Download the uplvsm model at 518×518 resolution from Hugging Face, and run evaluation:

```bash
# fast inference, compute metrics only
torchrun --nproc_per_node 8 --nnodes 1 --rdzv_id 18640 --rdzv_backend c10d --rdzv_endpoint localhost:29511 -m src.inference_fast --config config/eval/uplvsm_x518.yaml

# complete inference
torchrun --nproc_per_node 8 --nnodes 1 --rdzv_id 18640 --rdzv_backend c10d --rdzv_endpoint localhost:29511 -m src.inference --config config/eval/uplvsm_x518.yaml
```
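For context, image-quality metrics for novel view synthesis typically include PSNR; the sketch below is an illustrative PSNR implementation, not the repository's metric code:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images scaled to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return -10.0 * torch.log10(mse)

# Example with two random 3×224×224 "images".
pred, target = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
print(f"PSNR: {psnr(pred, target):.2f} dB")
```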
Train the model:

```bash
# pretraining on 224×224 resolution
torchrun --nproc_per_node 8 --nnodes 1 --rdzv_id 18640 --rdzv_backend c10d --rdzv_endpoint localhost:29511 -m src.train --config config/uplvsm_x224.yaml

# finetuning on 518×518 resolution
torchrun --nproc_per_node 8 --nnodes 1 --rdzv_id 18640 --rdzv_backend c10d --rdzv_endpoint localhost:29511 -m src.train --config config/uplvsm_x518.yaml
```
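The training commands target a single node; for multi-node training, torchrun's rendezvous flags can point at a common endpoint. A hedged two-node example (run the same command on every node; the host name is a placeholder):

```bash
# Hypothetical 2-node × 8-GPU run; node0.example.com is a placeholder for the first node.
torchrun --nproc_per_node 8 --nnodes 2 --rdzv_id 18640 --rdzv_backend c10d \
  --rdzv_endpoint node0.example.com:29511 -m src.train --config config/uplvsm_x224.yaml
```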
Our implementation builds upon LVSM. We also recommend RayZer, Pensieve, and X-Factor for self-supervised scene reconstruction.

If you find this work useful for your research, please consider citing:
```bibtex
@misc{wang2025less3depend,
      title={The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge},
      author={Haoru Wang and Kai Ye and Yangyan Li and Wenzheng Chen and Baoquan Chen},
      year={2025},
      eprint={2506.09885},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.09885},
}
```