✴️ ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
TL;DR
ScreenSpot-Pro is a new benchmark designed to evaluate GUI grounding models in professional, high-resolution environments. It spans 23 applications across 5 professional categories and 3 operating systems, highlighting the challenges models face when interacting with complex software. Existing models perform poorly (the best reaches only 18.9% accuracy), underscoring the need for further research.
Introduction 🌟
Graphical User Interfaces (GUIs) are integral to modern digital workflows. While Multi-modal Large Language Models (MLLMs) have advanced GUI agents (e.g., Aria-UI and UGround) for general tasks like web browsing and mobile applications, professional environments introduce unique complexities. High-resolution screens, intricate interfaces, and smaller target elements make GUI grounding in professional settings significantly more challenging.
We present ScreenSpot-Pro, a benchmark designed specifically to evaluate GUI grounding models in high-resolution, professional computer-use environments.
Why ScreenSpot-Pro?
- High-Resolution Focus – Professional software often runs at resolutions like 3840x2160, demanding precise detection of small UI components.
- Diverse Application Coverage – Encompasses 23 applications across 5 industries and 3 operating systems, ranging from development tools like VSCode and PyCharm to creative suites such as Photoshop and Blender.
- Expert-Led Annotation – Tasks are curated and annotated by users with over five years of professional experience, ensuring accuracy and real-world relevance.
Key Challenges:
- Complex Interfaces – High-resolution displays lead to smaller, densely packed UI elements, complicating detection and interaction.
- Performance Gaps – The best existing model achieves only 18.9% accuracy in professional GUI grounding, highlighting substantial room for improvement.
- Resolution Trade-Offs – Cropping the screenshot to a smaller search region improves performance, but even the best cropping strategies reach just 40.2% accuracy.
ScreenSpot-Pro aims to push the boundaries of GUI grounding models, driving advancements in professional application usability and performance.
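To make the small-target problem concrete, the sketch below computes how much of a 3840x2160 screen a target occupies and how large it remains if the whole screenshot is uniformly downscaled. The 24x24-pixel icon and the 1280-pixel input limit are illustrative assumptions, not figures from the benchmark, and the function names are hypothetical.

```python
def area_fraction(target_px, screen_size):
    """Fraction of the screen area covered by a target of (width, height) pixels."""
    target_w, target_h = target_px
    screen_w, screen_h = screen_size
    return (target_w * target_h) / (screen_w * screen_h)


def downscaled_size(target_px, screen_size, max_side):
    """Target size after shrinking the screenshot so its longest side is max_side.

    Assumes a hypothetical model that resizes its input uniformly and never upscales.
    """
    longest = max(screen_size)
    if longest <= max_side:
        return float(target_px[0]), float(target_px[1])
    # Multiply before dividing to limit floating-point rounding error.
    return target_px[0] * max_side / longest, target_px[1] * max_side / longest


if __name__ == "__main__":
    screen = (3840, 2160)   # resolution mentioned in the text
    icon = (24, 24)         # assumed toolbar icon size, for illustration only

    print(f"screen area covered: {area_fraction(icon, screen):.6%}")
    print("size after downscaling to 1280 px:", downscaled_size(icon, screen, 1280))
```

Under these assumptions the icon shrinks to a third of its original side length, whereas cropping a region around the target would keep its original pixel size.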
Dataset Breakdown 🤗
ScreenSpot-Pro includes 23 applications across 5 industries and 3 operating systems:
- Development Tools: VSCode, PyCharm, Android Studio, VMware.
- Creative Applications: Photoshop, Premiere, Illustrator, Blender, DaVinci Resolve, FruitLoops.
- CAD/Engineering: AutoCAD, SolidWorks, Inventor, Vivado, Quartus.
- Scientific/Analytical: MATLAB, Stata, EViews.
- Office Software: Word, Excel, PowerPoint.
- Operating Systems: Windows, macOS, Linux.
The benchmark comprises 1,581 tasks, each pairing a natural-language instruction with a high-resolution screenshot. For each task, the model must locate and interact with the specific UI element the instruction refers to.
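One common way to score such tasks is to count a prediction as correct when the predicted click point falls inside the target element's bounding box. The sketch below assumes that point-in-box criterion. The `Task` fields and function names are hypothetical and do not reflect the benchmark's actual data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in screenshot pixels


@dataclass
class Task:
    """Hypothetical task record: an instruction plus the target element's box."""
    instruction: str
    bbox: Box


def is_hit(point: Point, bbox: Box) -> bool:
    """True if the predicted point lies inside (or on the edge of) the box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(tasks: Sequence[Task],
                       predictions: Sequence[Optional[Point]]) -> float:
    """Share of tasks whose predicted point hits the target box.

    A prediction of None (the model produced no usable point) counts as a miss.
    """
    if len(tasks) != len(predictions):
        raise ValueError("tasks and predictions must have the same length")
    if not tasks:
        return 0.0
    hits = sum(
        1 for task, point in zip(tasks, predictions)
        if point is not None and is_hit(point, task.bbox)
    )
    return hits / len(tasks)


if __name__ == "__main__":
    tasks = [
        Task("Open the layers panel", (3700, 80, 3724, 104)),
        Task("Click the run button", (1200, 40, 1230, 70)),
    ]
    predictions = [(3710.0, 90.0), (100.0, 100.0)]  # first hits, second misses
    print(grounding_accuracy(tasks, predictions))
```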
Performance 📊
Despite advancements in MLLMs, current models struggle with ScreenSpot-Pro:
- OS-Atlas-7B leads with 18.9% accuracy.
- ReGround methods raise accuracy to 40.2% but remain far from perfect.
- GPT-4o scores just 0.8%, underscoring the need for specialized professional grounding models.
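The idea of grounding a second time on a cropped region can be sketched as a two-pass procedure: take a coarse prediction on the full screenshot, crop a window around it, predict again inside the crop, and map the result back to full-image coordinates. This is a minimal illustration of that idea, not the authors' ReGround implementation. `ground_fn` is a hypothetical stand-in for a grounding model, and the 1024x1024 crop size is an assumption.

```python
def crop_window(center, image_size, crop_size):
    """Return (left, top, right, bottom) of a crop centered on `center`.

    The window is shifted as needed so it stays fully inside the image.
    """
    cx, cy = center
    img_w, img_h = image_size
    crop_w = min(crop_size[0], img_w)
    crop_h = min(crop_size[1], img_h)
    left = min(max(int(cx - crop_w / 2), 0), img_w - crop_w)
    top = min(max(int(cy - crop_h / 2), 0), img_h - crop_h)
    return left, top, left + crop_w, top + crop_h


def to_full_image(point_in_crop, window):
    """Translate a point from crop coordinates back to full-image coordinates."""
    left, top, _, _ = window
    return point_in_crop[0] + left, point_in_crop[1] + top


def two_pass_ground(ground_fn, image_size, crop_size=(1024, 1024)):
    """Coarse prediction on the full image, then a refined one inside a crop.

    `ground_fn(window)` is a hypothetical callable: given a crop window (or
    None for the full image), it returns an (x, y) point in that view's own
    coordinate system.
    """
    coarse = ground_fn(None)
    window = crop_window(coarse, image_size, crop_size)
    fine = ground_fn(window)
    return to_full_image(fine, window)


if __name__ == "__main__":
    target = (3800, 100)  # illustrative ground-truth location

    def fake_ground_fn(window):
        # Toy model: imprecise on the full image, exact inside a crop.
        if window is None:
            return (3700, 150)
        return (target[0] - window[0], target[1] - window[1])

    print(two_pass_ground(fake_ground_fn, (3840, 2160)))
```

Clamping the window to the image bounds matters for targets near screen edges, where toolbars and status bars often sit.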
Next Steps 🤝
ScreenSpot-Pro lays the foundation for future advancements in GUI agents. Our goal is to inspire:
- New Models designed to handle high-resolution GUI environments.
- Improved Baselines using smarter cropping and zooming techniques.
- Community Collaboration to push the boundaries of professional GUI grounding.
BibTeX 📚
@misc{screenspotpro,
author = {Kaixin Li and Ziyang Meng and Hongzhan Lin and Ziyang Luo and Yuchen Tian and Jing Ma and Zhiyong Huang and Tat-Seng Chua},
title = {ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use},
year = {2025},
note = {Preprint},
url = {https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf},
}