Early Access Paper (IEEE GRSL)
Abstract—Super-resolution (SR) of remote sensing imagery based on generative AI models is vital for practical applications such as urban planning and disaster assessment. However, current approaches suffer from poor trade-offs among three pivotal yet competing objectives: perceptual quality, factual accuracy, and inference speed. To break through this limitation, we propose a novel, high-performing two-stage SR framework for remote sensing imagery based on a generative diffusion model. In Stage 1, factually grounded base images are generated by a guidance-free diffusion process that relies solely on the original low-resolution images, effectively mitigating the risk of semantic hallucination. In Stage 2, these base images are refined to restore the high-frequency details required for SR quality via our customized guidance mechanism, which combines a vision–language model (VLM) with a ControlNet; a dynamic inference acceleration technique is applied to ensure efficiency. Extensive experiments confirm that the proposed framework excels in perceptual quality (achieving the top CLIP-IQA scores) and in structural integrity while maintaining robust overall performance. In particular, it enables reliable, high-fidelity SR for large-scale, real-world remote sensing pipelines by surpassing the conventional fidelity–hallucination trade-off at practical inference speeds.
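A minimal sketch of the two-stage design described in the abstract; this is illustrative pseudocode only, every name below is a hypothetical placeholder, and it does not correspond to the actual implementation in this repository:

```python
def super_resolve(lr_image, upscale, stage1, vlm, controlnet, refiner, steps):
    """Hypothetical two-stage SR flow: guidance-free base, then guided refinement."""
    # Stage 1: diffusion conditioned only on the low-resolution input,
    # keeping the base image factually grounded (no semantic hallucination
    # from external guidance signals).
    base = stage1.sample(cond=lr_image, scale=upscale)

    # Stage 2: a VLM caption plus a ControlNet condition guide the refiner
    # to restore high-frequency detail; `steps` stands in for the dynamic
    # inference-acceleration schedule mentioned in the abstract.
    caption = vlm.describe(base)
    return refiner.sample(init=base, prompt=caption,
                          control=controlnet(base), steps=steps)
```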
| Model | SMS ↓ (RSC11) | SMS ↓ (RSSCN7) | SMS ↓ (WHU-RS19) | CLIP-IQA ↑ (RSC11) | CLIP-IQA ↑ (RSSCN7) | CLIP-IQA ↑ (WHU-RS19) |
|---|---|---|---|---|---|---|
| ESRGAN | 0.2788 | 0.2799 | 0.2822 | 0.3439 | 0.3145 | 0.3486 |
| LWTDN | 0.2819 | 0.2749 | 0.2663 | 0.5313 | 0.4745 | 0.5420 |
| SRDiff | 0.2997 | 0.3051 | 0.2987 | 0.3270 | 0.2995 | 0.3738 |
| SRDDPM | 0.2438 | 0.2352 | 0.2325 | 0.5551 | 0.4971 | 0.5618 |
| SR3 | 0.2428 | 0.2317 | 0.2321 | 0.6035 | 0.5472 | 0.6068 |
| SUPIR | 0.2920 | 0.2714 | 0.2973 | 0.6360 | 0.6200 | 0.6078 |
| Ours | 0.2291 | 0.2337 | 0.2400 | 0.7842 | 0.7778 | 0.7497 |
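For context, CLIP-IQA scores of the kind reported above can be computed with the `torchmetrics` implementation; a minimal sketch (a generic recipe that may differ from the paper's exact evaluation protocol):

```python
import torch
from torchmetrics.multimodal import CLIPImageQualityAssessment

# Batch of RGB images scaled to [0, 1], shape (N, 3, H, W);
# random data here is only a placeholder for real SR outputs.
images = torch.rand(4, 3, 224, 224)

metric = CLIPImageQualityAssessment(data_range=1.0)
print(metric(images))  # one score per image; higher is better
```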
To set up the environment:

```bash
conda create -n myenv python=3.10
conda activate myenv

# GPU stack: PyTorch 2.5.1 + CUDA 12.4
# Check your CUDA version, then install
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
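After installing, you can quickly verify that PyTorch sees the GPU (standard PyTorch calls, nothing specific to this repository):

```python
# Sanity check: should print True and the CUDA version (e.g., 12.4)
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
```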
Repository structure:

```
Remote-Sensing-Vision-Language-Diffusion-Model/
├── CKPT_PTH/
│ ├── Llava-next/
│ ├── v0F.ckpt
│ ├── v0Q.ckpt
│ └── ...
├── README.md
├── infer.py
├── infer_dir.py
└── ...
```
Download `pretrained_model.zip` (Google Drive) and place the extracted checkpoints under `CKPT_PTH/`.
To run inference on a single image:
```bash
python infer.py --input_img path/to/your/image.png --output path/to/save/results --upscale 8
```

| Argument | Description |
|---|---|
| `--input_img` | Path to a low-resolution input image (e.g., `./data/lr/image1.png`) |
| `--output` | Directory where the super-resolved image will be saved (e.g., `./results`) |
| `--upscale` | Upscaling factor (e.g., 2, 4, 8) |
To run inference on a directory of images:

```bash
python infer_dir.py --image_dir path/to/input/image_folder --save_dir path/to/save/results --upscale 8
```

| Argument | Description |
|---|---|
| `--image_dir` | Directory containing low-resolution input images (e.g., `./data/lr`) |
| `--save_dir` | Directory where the super-resolved images will be saved (e.g., `./results`) |
| `--upscale` | Upscaling factor (e.g., 2, 4, 8) |
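`infer_dir.py` already covers whole-directory batches; if you instead need per-file control (say, a custom file list), a minimal Python sketch that drives `infer.py` with the flags documented above (the `./data/lr` layout is only an example):

```python
# Hypothetical wrapper around infer.py using only the documented CLI flags.
import subprocess
from pathlib import Path

lr_dir = Path("./data/lr")    # assumed input layout, adjust to your data
out_dir = Path("./results")
out_dir.mkdir(parents=True, exist_ok=True)

for img in sorted(lr_dir.glob("*.png")):
    subprocess.run(
        [
            "python", "infer.py",
            "--input_img", str(img),
            "--output", str(out_dir),
            "--upscale", "8",
        ],
        check=True,  # stop on the first failure
    )
```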


