Releases: appergb/desktop-agent-ops

desktop-agent-ops v1.4.1

07 Apr 06:03

v1.4.1 (2026-04-07)

Features

  • Adaptive Window-Crop Executor Entry
    • local_agent.py screenshot tool now accepts app and region_label
    • executor now prefers app-window crops before falling back to full-screen captures
    • added explicit image_space metadata so cropped image coordinates can be decoded outside the model
    • added translate_image_point tool so screenshot-driven flows can remap image-space points into screen coordinates deterministically
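The window-crop-first behavior described above can be sketched as follows. This is a minimal illustration, not the project's code: `capture_with_fallback` and the two capture callables are hypothetical names, and the sketch assumes a window capture signals "unavailable" by returning `None`.

```python
from typing import Callable, Optional, Tuple


def capture_with_fallback(
    app: Optional[str],
    capture_window: Callable[[str], Optional[bytes]],
    capture_screen: Callable[[], bytes],
) -> Tuple[bytes, str]:
    """Prefer an app-window crop; fall back to a full-screen capture.

    Both capture callables are hypothetical stand-ins for platform code.
    Returns the image bytes plus a tag naming which source produced them.
    """
    if app is not None:
        shot = capture_window(app)
        if shot is not None:
            return shot, "window"
    # No app was requested, or the window capture was unavailable.
    return capture_screen(), "screen"
```

Returning the source tag alongside the image lets the caller record which coordinate space the image lives in before any point is decoded.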

Fixes

  • Retina / HiDPI coordinate remapping
    • added pixel-size-aware remapping in image_space.py instead of assuming simple offset-only translation
    • updated executor context so models see crop bounds, image pixel size, and mapping rules together
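To illustrate why offset-only translation fails on Retina / HiDPI displays, here is a minimal sketch of pixel-size-aware remapping. The `ImageSpace` fields and the `image_point_to_screen` function are assumptions made for this example; the actual metadata layout in image_space.py is not shown in these notes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ImageSpace:
    """Hypothetical crop metadata; field names are assumptions."""
    crop_x: float       # left edge of the crop, in screen coordinates
    crop_y: float       # top edge of the crop, in screen coordinates
    crop_width: float   # crop width, in screen coordinates
    crop_height: float  # crop height, in screen coordinates
    pixel_width: int    # width of the captured image, in pixels
    pixel_height: int   # height of the captured image, in pixels


def image_point_to_screen(space: ImageSpace, px: float, py: float) -> Tuple[float, float]:
    """Map a point in image pixels to screen coordinates."""
    if not (0 <= px <= space.pixel_width and 0 <= py <= space.pixel_height):
        raise ValueError("point lies outside the captured image")
    # On a 2x display the image has twice as many pixels as the crop has
    # screen units, so each axis needs a scale factor, not just an offset.
    scale_x = space.crop_width / space.pixel_width
    scale_y = space.crop_height / space.pixel_height
    return (space.crop_x + px * scale_x, space.crop_y + py * scale_y)
```

For an 800x600 crop at screen position (100, 50) captured as a 1600x1200 image, the image point (400, 300) maps to (300.0, 200.0). An offset-only translation would have produced (500, 350).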

Documentation

  • documented the window-crop-first rule and fallback behavior in SKILL.md, desktop-agent-ops.md, and references/workflow.md
  • clarified that screenshot-driven flows should decode image coordinates outside the model and only fall back to whole-screen captures when window capture is unavailable

🤖 Generated with Claude Code

v1.2.1 — MCP-First + Three-Layer Smart Targeting

03 Apr 01:09

What's New

Tool Priority Decision Flow (MCP-First)

  • Priority 1: MCP Servers (chrome-devtools, fetch, etc.) — always prefer structured APIs
  • Priority 2: Native CLI / AppleScript — direct control without screen parsing
  • Priority 3: Desktop Agent Ops — screen recognition as last resort only
  • Decision checklist added to SKILL.md entry point
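The priority order above amounts to a short decision function. The sketch below is illustrative only: `ToolTier` and `choose_tier` are hypothetical names, and the two boolean inputs stand in for whatever checks the SKILL.md checklist actually performs.

```python
from enum import IntEnum


class ToolTier(IntEnum):
    """Tiers in priority order; lower values are preferred."""
    MCP_SERVER = 1
    NATIVE_CLI = 2
    DESKTOP_AGENT_OPS = 3


def choose_tier(has_mcp_server: bool, has_native_cli: bool) -> ToolTier:
    """Pick the highest-priority tier that can handle the task."""
    if has_mcp_server:
        return ToolTier.MCP_SERVER
    if has_native_cli:
        return ToolTier.NATIVE_CLI
    # Screen recognition is the last resort.
    return ToolTier.DESKTOP_AGENT_OPS
```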

Three-Layer Smart Targeting

| Layer | Method | Speed | When Used |
|-------|--------|-------|-----------|
| 1 | Accessibility API (AXUIElement) | ~34ms | Native apps (Finder, Safari, Notes) |
| 2 | Vision Framework OCR | ~147ms | Apps hiding UI (WeChat, QQ, Electron) |
| 3 | Tesseract OCR | ~2187ms | Linux/Windows fallback |
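One way to picture the three layers is as an ordered provider chain that degrades to the next layer whenever a provider fails or finds nothing. The sketch below assumes each provider is a callable that takes a label and returns a point or `None`. `resolve_target` and the provider shape are hypothetical and do not reflect the real target_resolver.py interface.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
# (provider name, locate function); both are illustrative assumptions.
Provider = Tuple[str, Callable[[str], Optional[Point]]]


def resolve_target(label: str, providers: Sequence[Provider]) -> Optional[Tuple[str, Point]]:
    """Try providers in order and return the first hit with its source name."""
    for name, locate in providers:
        try:
            point = locate(label)
        except Exception:
            # Provider unavailable on this platform or failed: degrade.
            continue
        if point is not None:
            return name, point
    return None
```

Ordering the list fastest-first (accessibility, then Vision OCR, then Tesseract) means the slow layers only run when the quick ones cannot find the target.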

New Files

  • ax_provider.py — macOS Accessibility API provider
  • vision_ocr.py — macOS Vision Framework OCR (no Tesseract needed)

Key Changes

  • target_resolver.py — accessibility-first provider chain with auto-degradation
  • ocr_text.py — multi-backend (--backend auto|vision|tesseract)
  • first_run_setup.py — macOS installs pyobjc; Tesseract now optional
  • Tesseract removed from mandatory brew install on macOS
  • 50/50 tests passing

v1.2.0 — Three-Layer Smart Targeting

03 Apr 00:57

Three-layer smart targeting: Accessibility API (34ms) → Vision OCR (147ms) → Tesseract (2187ms). macOS no longer requires Tesseract installation. New: ax_provider.py, vision_ocr.py. See CHANGELOG.md for details.

v1.1.0 — Custom Workflows + OCR Ambiguity Fix

02 Apr 11:04

Release Notes — v1.1.0 (2026-04-02)

New Features

  • Custom Workflow System — Define reusable multi-step desktop automations in Markdown + YAML frontmatter

    • workflow_loader.py: Discover and parse workflows from bundled and user directories
    • workflow_runner.py: Execute workflows with parameter substitution, retry logic, and task context
    • preview command for Agent safety review before execution (no hardcoded whitelist)
    • 3 bundled example workflows: send-chat-message, browser-search, open-app-and-click
  • Secret Scanner — Pre-upload security scanning (secret_scanner.py)

    • 13 regex patterns: AWS keys, GitHub tokens, API keys, private keys, connection strings, etc.
    • Shannon entropy detection for unknown secret formats
    • Severity levels: error (blocks upload) / warning (skippable with --force)
  • Workflow Sharing — Contribute workflows to community via GitHub PR (workflow_share.py)

    • Automated preflight: format validation + secret scan + gh auth check
    • One-command fork → branch → commit → PR creation
    • PR body auto-generated with workflow metadata and scan results
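As a rough illustration of how a secret scanner can combine a known-format regex with Shannon entropy detection, consider the sketch below. It is not taken from secret_scanner.py: the single AWS access key ID pattern, the 20-character minimum, the 4.0-bit threshold, and the mapping of pattern hits to `error` and entropy hits to `warning` are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Widely documented AWS access key ID shape: "AKIA" plus 16 uppercase alphanumerics.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def scan_line(line: str, min_len: int = 20, threshold: float = 4.0) -> List[Tuple[str, str]]:
    """Return (severity, rule) findings for one line (assumed severities)."""
    findings = []
    if AWS_KEY_ID.search(line):
        findings.append(("error", "aws-access-key-id"))
    # Long, random-looking tokens may be secrets in an unknown format.
    for token in re.findall(r"[A-Za-z0-9+/=_\-]{%d,}" % min_len, line):
        if shannon_entropy(token) >= threshold:
            findings.append(("warning", "high-entropy-string"))
    return findings
```

The entropy check catches what fixed patterns miss, at the cost of false positives on hashes and encoded data, which is one reason to make it a skippable warning rather than a hard block.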

Fixes

  • OCR ambiguity guard: the Example 3 send-button lookup now uses --region-label primary_action to prevent false positives when the message text itself contains "发送" ("Send")
  • Removed vague "OR" fallback: input field targeting no longer offers "click at bottom center" as an alternative; window_regions.py --label bottom_input is now mandatory
  • Reference doc trigger rules — Changed from "Load as needed" to explicit MUST-read conditions for platform, chat-app, WeChat, validation, and targeting docs
  • Added post-type screenshot verification step in Example 3

Documentation

  • Added skill/references/custom-workflows.md workflow authoring guide
  • Updated SKILL.md with Custom Workflows section and Agent Safety Review Protocol
  • Updated README with workflow system documentation

Install

Download desktop-agent-ops-v1.1.0.zip and follow the setup instructions in SKILL.md.

SHA-256 checksum available in desktop-agent-ops-v1.1.0.sha256.

v1.0.3 — Performance & Reliability

25 Mar 04:57

Summary

Major reliability and performance release. Fixes CJK text input, Enter-to-send, minimized window restoration, and 10+ other bugs. End-to-end WeChat message sending now works reliably and is 7.6x faster (0.59s vs 4.49s).

Highlights

  • Clipboard-first input on all platforms: cliclick silently dropped CJK characters when typing
  • AppleScript key code is now the primary key-press path: WeChat did not recognize cliclick kp:return
  • Minimized window restoration: minimized windows are restored by clicking their Dock icon
  • 7.6x faster end-to-end: focus 0.29s + type 0.17s + send 0.13s = 0.59s total
  • 8 new example cases (Case 12–19): right-click, drag-and-drop, system settings, form fill, dropdown, toggle/slider, cross-app copy-paste, browser tabs
  • 12 bug fixes across scroll, screenshot, pixel-color, window bounds, drag, hotkey, and more
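The clipboard-first and AppleScript key code ideas can be sketched for macOS as follows. This is an assumption-laden illustration rather than the project's implementation: `osascript_argv` and `type_and_send` are hypothetical helpers, the UTF-8 `LANG` override is a common precaution for `pbcopy` rather than something these notes mention, and System Events scripting requires Accessibility permission.

```python
import os
import subprocess
from typing import List

# AppleScript snippets sent through System Events.
PASTE = 'tell application "System Events" to keystroke "v" using command down'
PRESS_RETURN = 'tell application "System Events" to key code 36'  # 36 is Return


def osascript_argv(script: str) -> List[str]:
    """Build the argv that runs one AppleScript statement."""
    return ["osascript", "-e", script]


def type_and_send(text: str) -> None:
    """Paste text from the clipboard, then press Return (macOS only)."""
    # Force a UTF-8 locale so pbcopy does not mangle CJK text (assumed precaution).
    env = dict(os.environ, LANG="en_US.UTF-8")
    subprocess.run(["pbcopy"], input=text.encode("utf-8"), env=env, check=True)
    subprocess.run(osascript_argv(PASTE), check=True)
    subprocess.run(osascript_argv(PRESS_RETURN), check=True)
```

Pasting sidesteps per-character key synthesis entirely, so the text arrives intact regardless of the active input method.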

Install via ClawHub

npx clawhub@latest install desktop-agent-ops

See full details in release-notes-v1.0.3.md and CHANGELOG.md.

v1.0.0 — Desktop Agent Ops

23 Mar 07:19

Desktop Agent Ops v1.0.0

Cross-platform desktop GUI automation skill for AI agents.

Highlights

  • One-command setup: python3 scripts/first_run_setup.py handles everything
  • Window-scoped OCR: Targets only the active app window, never the wrong app
  • Auto DPI scaling: Retina, HiDPI, all resolutions handled automatically
  • Multi-language OCR: Auto-detects system language (中文, 日本語, 한국어, etc.)
  • CJK text input: Reliable Unicode input via clipboard-paste on all platforms
  • 17 desktop commands: screenshot, click, type, scroll, drag, hotkey, focus-app...

Platforms

  • macOS (Retina supported)
  • Windows (HiDPI supported)
  • Linux X11 (HiDPI supported)

Installation

Download desktop-agent-ops-skill-clean.zip and extract it to your skill directory. The agent runs its setup automatically on first use.