✴️ ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Community Article Published January 3, 2025

TL;DR

ScreenSpot-Pro is a new benchmark designed to evaluate GUI grounding models in professional, high-resolution environments. It spans 23 applications across 5 professional categories and 3 operating systems, highlighting the challenges models face when interacting with complex software. The best existing model reaches only 18.9% accuracy, underscoring the need for further research.

Artifacts 📦

Introduction 🌟

Graphical User Interfaces (GUIs) are integral to modern digital workflows. While Multi-modal Large Language Models (MLLMs) have advanced GUI agents (e.g., Aria-UI and UGround) for general tasks like web browsing and mobile applications, professional environments introduce unique complexities. High-resolution screens, intricate interfaces, and smaller target elements make GUI grounding in professional settings significantly more challenging.

We present ScreenSpot-Pro, a benchmark designed to evaluate GUI grounding models specifically for high-resolution, professional computer-use environments.

Why ScreenSpot-Pro?

High-Resolution Focus – Professional software often runs at resolutions like 3840x2160, demanding precise detection of small UI components.
Diverse Application Coverage – Encompasses 23 applications across 5 industries and 3 operating systems, spanning from development tools like VSCode and PyCharm to creative suites such as Photoshop and Blender.
Expert-Led Annotation – Tasks are curated and annotated by users with over five years of professional experience, ensuring accuracy and real-world relevance.

Key Challenges:

Complex Interfaces – High-resolution displays lead to smaller, densely packed UI elements, complicating detection and interaction.
Performance Gaps – The best existing model reaches only 18.9% accuracy in professional GUI grounding, leaving substantial room for improvement.
Resolution Trade-Offs – Cropping the screenshot to a smaller search area improves performance, but even the best cropping strategy reaches just 40.2% accuracy.

ScreenSpot-Pro aims to push the boundaries of GUI grounding models, driving advancements in professional application usability and performance.
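
To make the resolution challenge concrete, here is a small back-of-the-envelope sketch in Python. Only the 3840x2160 resolution comes from this article; the 24 px icon size and the 1280 px model input width are hypothetical values chosen for illustration, and the helper names are ours.

```python
def downscaled_size(size_px, screen_width, model_width):
    """Size of a UI element after the screenshot is resized to model_width."""
    return size_px * model_width / screen_width


def area_fraction(width, height, screen_width, screen_height):
    """Fraction of the screen area covered by a width x height element."""
    return (width * height) / (screen_width * screen_height)


SCREEN_W, SCREEN_H = 3840, 2160  # resolution cited in the article
ICON = 24                        # hypothetical toolbar icon side, in pixels
MODEL_W = 1280                   # hypothetical model input width, in pixels

icon_after_resize = downscaled_size(ICON, SCREEN_W, MODEL_W)
icon_share = area_fraction(ICON, ICON, SCREEN_W, SCREEN_H)

print(f"icon side after resize: {icon_after_resize:.1f} px")
print(f"share of screen area:   {icon_share:.4%}")
```

Under these assumed numbers, the icon shrinks to 8 px per side once the screenshot is resized, and it covers well under 0.01% of the screen, which is why small targets are easy to miss.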


Dataset Breakdown 🤗

ScreenSpot-Pro includes 23 applications across 5 industries and 3 operating systems:

Development Tools: VSCode, PyCharm, Android Studio, VMware.
Creative Applications: Photoshop, Premiere, Illustrator, Blender, DaVinci Resolve, FL Studio (FruitLoops).
CAD/Engineering: AutoCAD, SolidWorks, Inventor, Vivado, Quartus.
Scientific/Analytical: MATLAB, Stata, EViews.
Office Software: Word, Excel, PowerPoint.
Operating Systems: Windows, macOS, Linux.

The benchmark comprises 1,581 natural language instructions, each paired with a high-resolution screenshot. Each task asks the model to locate the specific UI element that the instruction refers to.
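
A grounding task of this kind can be scored as sketched below. This is a minimal illustration, not the official evaluation script: it assumes each task carries a ground-truth bounding box and that a prediction counts as correct when the predicted click point falls inside that box. The `Task` class, its field names, and the sample instructions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in screenshot pixels


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus the target element's box."""
    instruction: str
    bbox: Box


def is_hit(point: Point, bbox: Box) -> bool:
    """True if the predicted point lies inside the target box (edges included)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy(tasks: Sequence[Task], predictions: Sequence[Optional[Point]]) -> float:
    """Share of tasks whose prediction hits the target; None counts as a miss."""
    if not tasks:
        return 0.0
    hits = sum(
        pred is not None and is_hit(pred, task.bbox)
        for task, pred in zip(tasks, predictions)
    )
    return hits / len(tasks)


# Made-up examples on a 3840x2160 screenshot.
tasks = [
    Task("Click the save icon", (100, 100, 124, 124)),
    Task("Open the layers panel", (3000, 200, 3040, 230)),
]
predictions = [(110, 110), (10, 10)]
print(accuracy(tasks, predictions))
```

With targets only a few dozen pixels wide on a 3840x2160 canvas, a prediction that is off by a small fraction of the screen already counts as a miss under this scoring rule.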

[Figure: Task distribution]

Performance 📊

Despite advancements in MLLMs, current models struggle with ScreenSpot-Pro:

OS-Atlas-7B leads with 18.9% accuracy.
ReGround methods improve accuracy to 40.2% but are far from perfect.
GPT-4o scores just 0.8%, underscoring the need for specialized professional grounding models.
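
The idea of narrowing the search area can be sketched as a two-pass procedure: ground once on the full screenshot, crop around the coarse guess, then ground again inside the crop and map the result back to screen coordinates. The code below is an illustration of that general idea only, not the authors' ReGround implementation. The model interface (`Grounder`), the crop size, and the `fake_model` stand-in are all assumptions made for this example.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels

# Hypothetical model interface: given an instruction and the screen region it
# is shown, return a point in coordinates local to that region.
Grounder = Callable[[str, Box], Point]


def crop_around(point: Point, crop_w: int, crop_h: int,
                screen_w: int, screen_h: int) -> Box:
    """Box of the requested size centered on point, shifted to stay on screen."""
    crop_w, crop_h = min(crop_w, screen_w), min(crop_h, screen_h)
    left = int(round(point[0] - crop_w / 2))
    top = int(round(point[1] - crop_h / 2))
    left = max(0, min(left, screen_w - crop_w))
    top = max(0, min(top, screen_h - crop_h))
    return (left, top, left + crop_w, top + crop_h)


def ground_with_refinement(model: Grounder, instruction: str,
                           screen_w: int, screen_h: int,
                           crop_w: int = 1280, crop_h: int = 720) -> Point:
    """Coarse pass on the full screen, then a second pass on a crop around it."""
    coarse = model(instruction, (0, 0, screen_w, screen_h))
    box = crop_around(coarse, crop_w, crop_h, screen_w, screen_h)
    local = model(instruction, box)
    # Translate the crop-local prediction back into screen coordinates.
    return (box[0] + local[0], box[1] + local[1])


TARGET = (3000, 500)  # made-up true location of the element


def fake_model(instruction: str, region: Box) -> Point:
    """Stand-in for a real model: imprecise on the full screen, exact on a crop."""
    left, top, right, bottom = region
    if (right - left, bottom - top) == (3840, 2160):
        return (TARGET[0] - 40, TARGET[1] + 40)  # coarse guess, 40 px off
    return (TARGET[0] - left, TARGET[1] - top)


result = ground_with_refinement(fake_model, "Click the export button", 3840, 2160)
print(result)
```

The second pass only helps if the coarse guess lands close enough for the target to fall inside the crop, so the crop size is a trade-off between keeping the target in view and enlarging it for the model.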

[Figure: ScreenSpot-Pro leaderboard]

Next Steps 🤝

ScreenSpot-Pro lays the foundation for future advancements in GUI agents. Our goal is to inspire:

New Models designed to handle high-resolution GUI environments.
Improved Baselines using smarter cropping and zooming techniques.
Community Collaboration to push the boundaries of professional GUI grounding.

BibTex 📚

@misc{screenspotpro,
  author    = {Kaixin Li and Ziyang Meng and Hongzhan Lin and Ziyang Luo and Yuchen Tian and Jing Ma and Zhiyong Huang and Tat-Seng Chua},
  title     = {ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use},
  year      = {2025},
  note      = {Preprint},
  url       = {https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf},
}
