Early Access Paper (IEEE GRSL)
Abstract—Super-resolution (SR) of remote sensing imagery based on generative AI models is vital for practical applications such as urban planning and disaster assessment. However, current approaches suffer from poor trade-offs among three pivotal yet competing objectives: perceptual quality, factual accuracy, and inference speed. To break through this limitation, we propose a novel, high-performing two-stage SR framework for remote sensing imagery based on a generative diffusion model. In Stage 1, factually grounded base images are generated by a guidance-free diffusion process that relies solely on the original low-resolution images, effectively mitigating the risk of semantic hallucination. In Stage 2, these base images are refined to restore the high-frequency details required for SR quality via our customized guidance mechanism, which combines a vision–language model (VLM) with a ControlNet; a dynamic inference acceleration technique is applied to ensure efficiency. Extensive experiments confirm that the proposed framework excels in perceptual quality (achieving the top CLIP-IQA scores) and in structural integrity while maintaining robust overall performance. In particular, it enables reliable, high-fidelity SR for large-scale, real-world remote sensing pipelines by surpassing the conventional fidelity–hallucination trade-off at practical inference speeds.
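A minimal sketch of the two-stage design described in the abstract; this is illustrative pseudocode only, every name below is a hypothetical placeholder, and it does not correspond to the actual implementation in this repository:

```python
def super_resolve(lr_image, upscale, stage1, vlm, controlnet, refiner, steps):
    """Hypothetical two-stage SR flow: guidance-free base, then guided refinement."""
    # Stage 1: diffusion conditioned only on the low-resolution input,
    # keeping the base image factually grounded (no semantic hallucination
    # from external guidance signals).
    base = stage1.sample(cond=lr_image, scale=upscale)

    # Stage 2: a VLM caption plus a ControlNet condition guide the refiner
    # to restore high-frequency detail; `steps` stands in for the dynamic
    # inference-acceleration schedule mentioned in the abstract.
    caption = vlm.describe(base)
    return refiner.sample(init=base, prompt=caption,
                          control=controlnet(base), steps=steps)
```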
| Model | SMS ↓ (RSC11) | SMS ↓ (RSSCN7) | SMS ↓ (WHU-RS19) | CLIP-IQA ↑ (RSC11) | CLIP-IQA ↑ (RSSCN7) | CLIP-IQA ↑ (WHU-RS19) |
|---|---|---|---|---|---|---|
| ESRGAN | 0.2788 | 0.2799 | 0.2822 | 0.3439 | 0.3145 | 0.3486 |
| LWTDN | 0.2819 | 0.2749 | 0.2663 | 0.5313 | 0.4745 | 0.5420 |
| SRDiff | 0.2997 | 0.3051 | 0.2987 | 0.3270 | 0.2995 | 0.3738 |
| SRDDPM | 0.2438 | 0.2352 | 0.2325 | 0.5551 | 0.4971 | 0.5618 |
| SR3 | 0.2428 | 0.2317 | 0.2321 | 0.6035 | 0.5472 | 0.6068 |
| SUPIR | 0.2920 | 0.2714 | 0.2973 | 0.6360 | 0.6200 | 0.6078 |
| Ours | 0.2291 | 0.2337 | 0.2400 | 0.7842 | 0.7778 | 0.7497 |
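For context, CLIP-IQA scores of the kind reported above can be computed with the `torchmetrics` implementation; a minimal sketch (a generic recipe that may differ from the paper's exact evaluation protocol):

```python
import torch
from torchmetrics.multimodal import CLIPImageQualityAssessment

# Batch of RGB images scaled to [0, 1], shape (N, 3, H, W);
# random data here is only a placeholder for real SR outputs.
images = torch.rand(4, 3, 224, 224)

metric = CLIPImageQualityAssessment(data_range=1.0)
print(metric(images))  # one score per image; higher is better
```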
To set up the environment:

```bash
conda create -n myenv python=3.10
conda activate myenv

# GPU stack: PyTorch 2.5.1 + CUDA 12.4
# Check your CUDA version, then install
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
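After installing, you can quickly verify that PyTorch sees the GPU (standard PyTorch calls, nothing specific to this repository):

```python
# Sanity check: should print True and the CUDA version (e.g., 12.4)
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
```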
Repository structure:

```
Remote-Sensing-Vision-Language-Diffusion-Model/
├── CKPT_PTH/
│ ├── Llava-next/
│ ├── v0F.ckpt
│ ├── v0Q.ckpt
│ └── ...
├── README.md
├── infer.py
├── infer_dir.py
└── ...
```
Download `pretrained_model.zip` (Google Drive) and place the extracted checkpoints under `CKPT_PTH/`.
To run inference on a single image:
```bash
python infer.py --input_img path/to/your/image.png --output path/to/save/results --upscale 8
```

| Argument | Description |
|---|---|
| `--input_img` | Path to a low-resolution input image (e.g., `./data/lr/image1.png`) |
| `--output` | Directory where the super-resolved image will be saved (e.g., `./results`) |
| `--upscale` | Upscaling factor (e.g., 2, 4, 8) |
To run inference on a directory of images:

```bash
python infer_dir.py --image_dir path/to/input/image_folder --save_dir path/to/save/results --upscale 8
```

| Argument | Description |
|---|---|
| `--image_dir` | Directory containing low-resolution input images (e.g., `./data/lr`) |
| `--save_dir` | Directory where the super-resolved images will be saved (e.g., `./results`) |
| `--upscale` | Upscaling factor (e.g., 2, 4, 8) |
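`infer_dir.py` already covers whole-directory batches; if you instead need per-file control (say, a custom file list), a minimal Python sketch that drives `infer.py` with the flags documented above (the `./data/lr` layout is only an example):

```python
# Hypothetical wrapper around infer.py using only the documented CLI flags.
import subprocess
from pathlib import Path

lr_dir = Path("./data/lr")    # assumed input layout, adjust to your data
out_dir = Path("./results")
out_dir.mkdir(parents=True, exist_ok=True)

for img in sorted(lr_dir.glob("*.png")):
    subprocess.run(
        [
            "python", "infer.py",
            "--input_img", str(img),
            "--output", str(out_dir),
            "--upscale", "8",
        ],
        check=True,  # stop on the first failure
    )
```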


