
# A Novel VLM-Guided Diffusion Model for Remote Sensing Image Super-Resolution

## Paper

Early Access Paper (IEEE GRSL)

## Abstract

Abstract: Super-resolution (SR) of remote sensing imagery based on generative AI models is vital for practical applications such as urban planning and disaster assessment. However, current approaches suffer from poor trade-offs among pivotal yet competing objectives: perceptual quality, factual accuracy, and inference speed. To break through this limitation, we propose a novel, high-performing two-stage SR framework for remote sensing imagery based on a generative diffusion model. In Stage 1, factually grounded base images are generated by a guidance-free diffusion process that relies solely on the original low-resolution images, effectively mitigating the risk of semantic hallucination. In Stage 2, the generated images are refined so that high-frequency details are restored via our customized guidance mechanism, which combines a vision-language model (VLM) with a ControlNet, while a dynamic inference acceleration technique ensures efficiency. Extensive experiments confirm that our framework excels in perceptual quality (achieving top CLIP-IQA scores) and structural integrity while maintaining robust overall performance. In particular, it enables reliable, high-fidelity SR for large-scale, real-world remote sensing pipelines by surpassing the conventional fidelity-hallucination trade-off at practical inference speed.

## Architecture

*(Figure: overview of the two-stage architecture.)*
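In code terms, the flow in the figure can be sketched roughly as below. This is a minimal illustrative sketch only: every function name is a hypothetical placeholder, not the repository's actual API (see `infer.py` for the real entry point).

```python
# Illustrative sketch of the two-stage pipeline. All functions are
# hypothetical placeholders; the real components live behind infer.py.

def stage1_base_diffusion(lr_image, scale):
    """Stage 1: guidance-free diffusion conditioned only on the LR input,
    producing a factually grounded base image (mitigates hallucination)."""
    raise NotImplementedError

def vlm_caption(image):
    """Describe the base image with a vision-language model (e.g., LLaVA-NeXT)."""
    raise NotImplementedError

def controlnet_refine(base_image, prompt, control, accelerate=True):
    """Stage 2: restore high-frequency detail under VLM-prompt and ControlNet
    guidance, with dynamic inference acceleration for speed."""
    raise NotImplementedError

def super_resolve(lr_image, upscale=8):
    base = stage1_base_diffusion(lr_image, scale=upscale)  # factually grounded base
    prompt = vlm_caption(base)                             # semantic guidance
    return controlnet_refine(base, prompt, control=lr_image)
```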

## Results

### Visual Results: Ours vs. Competing Methods

*(Figure: qualitative comparison of our results against competing methods.)*

### Visual Results Across Diffusion Steps

*(Figure: visual results at successive diffusion steps.)*

### Quantitative Comparison of Our SR Method with Existing Models

| Model | SMS ↓ (RSC11) | SMS ↓ (RSSCN7) | SMS ↓ (WHU-RS19) | CLIP-IQA ↑ (RSC11) | CLIP-IQA ↑ (RSSCN7) | CLIP-IQA ↑ (WHU-RS19) |
| --- | --- | --- | --- | --- | --- | --- |
| ESRGAN | 0.2788 | 0.2799 | 0.2822 | 0.3439 | 0.3145 | 0.3486 |
| LWTDN | 0.2819 | 0.2749 | 0.2663 | 0.5313 | 0.4745 | 0.5420 |
| SRDiff | 0.2997 | 0.3051 | 0.2987 | 0.3270 | 0.2995 | 0.3738 |
| SRDDPM | 0.2438 | 0.2352 | 0.2325 | 0.5551 | 0.4971 | 0.5618 |
| SR3 | 0.2428 | 0.2317 | 0.2321 | 0.6035 | 0.5472 | 0.6068 |
| SUPIR | 0.2920 | 0.2714 | 0.2973 | 0.6360 | 0.6200 | 0.6078 |
| Ours | 0.2291 | 0.2337 | 0.2400 | 0.7842 | 0.7778 | 0.7497 |
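As a hedged aside, CLIP-IQA scores like those above can be reproduced for your own outputs with the open-source `pyiqa` toolbox; we assume its `clipiqa` implementation is comparable to the paper's evaluation setup, which the repository does not confirm.

```python
# Scoring an SR output with pyiqa's CLIP-IQA implementation (assumed
# comparable to the paper's setup; install with `pip install pyiqa`).
import torch
import pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
clipiqa = pyiqa.create_metric("clipiqa", device=device)  # higher is better

# Accepts an image path or an NCHW float tensor in [0, 1].
score = clipiqa("./results/image1.png")
print(f"CLIP-IQA: {score.item():.4f}")
```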

## Installation

```bash
conda create -n myenv python=3.10
conda activate myenv

# GPU stack: PyTorch 2.5.1 + CUDA 12.4
# Check your CUDA version, then install
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
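After installation, a quick sanity check confirms the pinned GPU stack is active:

```python
# Verify the installed GPU stack (run inside the `myenv` environment).
import torch

print(torch.__version__)          # expect 2.5.1
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.is_available())  # should print True on a CUDA machine
```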

## Download and Setup

```
Remote-Sensing-Vision-Language-Diffusion-Model/
├── CKPT_PTH/
│   ├── Llava-next/
│   ├── v0F.ckpt
│   ├── v0Q.ckpt
│   └── ...
├── README.md
├── infer.py
├── infer_dir.py
└── ...
```

Download pretrained_model.zip (Google Drive) and place its contents under `CKPT_PTH/` so the layout matches the tree above.
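Optionally, a short check (paths taken from the tree above) verifies that the checkpoints ended up where the scripts expect them:

```python
# Check the expected checkpoint layout after extracting pretrained_model.zip.
# Paths follow the directory tree above; adjust if yours differ.
from pathlib import Path

ckpt_root = Path("CKPT_PTH")
for name in ("Llava-next", "v0F.ckpt", "v0Q.ckpt"):
    path = ckpt_root / name
    status = "found" if path.exists() else "MISSING"
    print(f"{path}: {status}")
```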

## Inference

### Single Image Inference

To run inference on a single image:

```bash
python infer.py --input_img path/to/your/image.png --output path/to/save/results --upscale 8
```

| Argument | Description |
| --- | --- |
| `--input_img` | Path to a low-resolution input image (e.g., `./data/lr/image1.png`) |
| `--output` | Directory where the super-resolved image will be saved (e.g., `./results`) |
| `--upscale` | Upscaling factor (e.g., 2, 4, 8) |
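If you only have high-resolution imagery, a small helper (illustrative, not part of the repository) can synthesize a low-resolution test input by bicubic downsampling:

```python
# Create a synthetic LR test input by bicubic downsampling an HR image.
# This helper is illustrative and not part of the repository.
from PIL import Image

scale = 8  # match the --upscale factor you plan to use
hr = Image.open("path/to/hr_image.png").convert("RGB")
lr = hr.resize((hr.width // scale, hr.height // scale), Image.Resampling.BICUBIC)
lr.save("./data/lr/image1.png")  # matches the example path above
```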

### Inference on Image Folder

To run inference on a folder of images:

```bash
python infer_dir.py --image_dir path/to/input/image_folder --save_dir path/to/save/results --upscale 8
```

| Argument | Description |
| --- | --- |
| `--image_dir` | Directory containing low-resolution input images (e.g., `./data/lr`) |
| `--save_dir` | Directory where the super-resolved images will be saved (e.g., `./results`) |
| `--upscale` | Upscaling factor (e.g., 2, 4, 8) |
