"Hey Sai, answer this LeetCode problem." Sai reads the problem from the screen, writes a complete solution in the code editor, and clicks Submit — all from a single voice command.
Every voice assistant today is blind. Siri, Alexa, and Cortana can set timers and play music, but they cannot see the complex UI you're staring at. They have zero awareness of the browser tab you have open, the form you're filling out, or the code editor you're working in.
Meanwhile, traditional UI automation tools are brittle — they rely on DOM selectors, accessibility trees, or hard-coded pixel coordinates that break the moment a website redesigns a button.
There is no system today that can hear what you want, see what's on your screen, and act on it with human-level understanding.
Sai is a voice-native OS co-pilot that combines real-time speech recognition with Amazon Nova's multimodal vision reasoning to operate your entire macOS desktop — any app, any website, any workflow — through natural voice commands.
Sai doesn't parse HTML. It doesn't read the DOM. It looks at your screen the same way a human would, reasons about what it sees, and executes OS-level actions with pixel-perfect accuracy. It works on every application because it operates at the visual layer, not the API layer.
```mermaid
graph TB
subgraph CLIENT["Local macOS Client"]
WW["Picovoice Porcupine<br/>Wake Word Engine"]
MIC["Microphone<br/>PCM 16kHz Stream"]
SC["Screen Capture<br/>+ AppleScript Context"]
EX["PyAutoGUI Executor<br/>Click · Type · Scroll · Hotkey"]
OV["NSPanel Activity Overlay<br/>Animated Border Feedback"]
end
subgraph SERVER["Cloud Server (FastAPI)"]
WS["WebSocket Gateway<br/>Bidirectional Communication"]
subgraph STT["Speech-to-Text"]
EL["ElevenLabs Scribe v2<br/>Realtime Streaming ASR"]
end
subgraph BRAIN["Multi-Model Brain"]
INT["Intent Interpreter<br/>Nova 2 Lite"]
RT["Task Router<br/>Nova 2 Lite"]
SB["Simple Executor<br/>Nova 2 Lite"]
VL["Vision Agent Loop<br/>Nova Pro (Multimodal)"]
end
end
MIC -->|PCM Audio| WS
WS -->|Base64 Audio| EL
EL -->|Transcription| INT
INT -->|Corrected Intent| RT
RT -->|SIMPLE| SB
RT -->|ADVANCED| VL
SB -->|Single Command| WS
VL -->|Step-by-Step Commands| WS
WS -->|Commands| EX
WS -->|capture_screen| SC
SC -->|Screenshot + App Context| WS
WW -->|Wake Trigger| WS
EX --> OV
style CLIENT fill:#1a1a2e,stroke:#e94560,color:#eee
style SERVER fill:#0f3460,stroke:#e94560,color:#eee
style BRAIN fill:#16213e,stroke:#0ea5e9,color:#eee
style STT fill:#16213e,stroke:#0ea5e9,color:#eee
```
```mermaid
sequenceDiagram
participant U as User
participant C as Client (macOS)
participant S as Server (Cloud)
participant E as ElevenLabs STT
participant NL as Nova 2 Lite
participant NP as Nova Pro
U->>C: "Hey Sai..." (wake word)
activate C
C->>S: WebSocket connect + handshake
S->>C: capture_screen
C->>S: Screenshot (1440×900) + App Context
U->>C: "...turn off data sharing on Twitter"
C->>S: PCM audio stream
S->>E: Base64 audio chunks
E->>S: Committed transcript
Note over S: Intent Interpretation
S->>NL: Fix garbled STT → clean command
NL->>S: "Turn off data sharing on Twitter"
Note over S: Task Routing
S->>NL: SIMPLE or ADVANCED?
NL->>S: ADVANCED (needs screen interaction)
Note over S: Vision Agent Loop Begins
loop Up to 25 steps
S->>NP: Screenshot + Task + History
NP->>S: {command: "click", x: 450, y: 320}
S->>C: Execute click
C->>C: PyAutoGUI performs click
S->>C: capture_screen
C->>S: Fresh screenshot
Note over S: Verify action succeeded
end
S->>C: Task complete (done: true)
C->>S: Session close
deactivate C
Note over C: Returns to wake word listening
```
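For concreteness, here is a minimal FastAPI sketch of the gateway's receive loop. The message names mirror the sequence diagram above; the actual schema in `main.py` may differ, and the STT forwarding is reduced to a local buffer.

```python
# Hypothetical sketch of the duplex WebSocket gateway; message names follow
# the sequence diagram, not necessarily the real protocol in main.py.
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/session")
async def session(ws: WebSocket) -> None:
    await ws.accept()
    # Handshake: ask the client for an initial screenshot + app context.
    await ws.send_json({"type": "capture_screen"})
    audio_buffer = bytearray()        # stand-in for the ElevenLabs stream
    latest_screen: str | None = None  # consumed by the agent loop
    while True:
        msg = json.loads(await ws.receive_text())
        if msg["type"] == "audio_chunk":
            # PCM chunks arrive base64-encoded; forward to streaming STT.
            audio_buffer.extend(base64.b64decode(msg["data"]))
        elif msg["type"] == "screenshot":
            latest_screen = msg["image_b64"]
        elif msg["type"] == "close":
            break
    await ws.close()
```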
This is the core innovation. When a task requires interacting with on-screen content, Sai enters a multi-step Plan → Act → Verify agent loop powered by Amazon Nova Pro's multimodal reasoning.
```mermaid
flowchart TD
START([Task Received]) --> PLAN
PLAN["Step 1: Analyze Screenshot<br/>Form high-level plan in explanation field"]
PLAN --> ACT
ACT["Execute ONE Action<br/>click · type · scroll · hotkey"]
ACT --> WAIT["Wait for UI Settle (2s)"]
WAIT --> CAPTURE["Capture Fresh Screenshot"]
CAPTURE --> ANNOTATE["Annotate with Edge Rulers<br/>+ Last-Click Crosshair"]
ANNOTATE --> VERIFY
VERIFY{"Verify: Did the<br/>action succeed?"}
VERIFY -->|"Yes, task complete"| DONE([Signal done=true])
VERIFY -->|"Yes, but more steps needed"| ACT
VERIFY -->|"No, action failed"| RECOVER
RECOVER["Adjust approach<br/>Try different element/method"]
RECOVER --> ACT
subgraph GUARDRAILS["Safety Guardrails"]
G1["Cycle Detection<br/>Detects repeating action patterns"]
G2["Stuck Recovery<br/>Forces approach change after loops"]
G3["Hard Bail<br/>Aborts after 3× cycle repetition"]
G4["Verification Gate<br/>Cannot mark done on same step as action"]
end
VERIFY -.-> GUARDRAILS
style START fill:#10b981,stroke:#059669,color:#fff
style DONE fill:#10b981,stroke:#059669,color:#fff
style GUARDRAILS fill:#7c3aed,stroke:#6d28d9,color:#fff
style PLAN fill:#f59e0b,stroke:#d97706,color:#000
```
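Stripped of prompting and transport detail, the loop's control flow reduces to a sketch like this (all three callables are placeholders for the real capture, executor, and Nova Pro wrappers):

```python
# Condensed control flow of the Plan → Act → Verify loop (illustrative).
import json
import time

MAX_STEPS = 25

def run_agent(task: str, capture, execute, ask_nova_pro) -> bool:
    """`capture` returns an annotated screenshot, `execute` performs one
    OS action on the client, `ask_nova_pro` wraps the multimodal call."""
    history: list[dict] = []
    for _ in range(MAX_STEPS):
        shot = capture()                      # fresh view of the screen
        reply = json.loads(ask_nova_pro(task, shot, history))
        # e.g. {"explanation": "...", "command": "click",
        #       "x": 450, "y": 320, "done": false}
        if reply.get("done"):
            return True                       # visually verified complete
        execute(reply)                        # exactly ONE action per step
        time.sleep(2)                         # let the UI settle
        history.append(reply)
    return False                              # step budget exhausted
```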
Every screenshot sent to Nova Pro is processed through Sai's Annotated Vision Pipeline:
- Native Capture — macOS `screencapture` grabs the full Retina display (e.g., 2560×1600)
- Canvas Normalization — Downsampled to a fixed 1440×900 logical canvas via LANCZOS resampling
- Edge Ruler Annotation — Red tick marks at `[200, 400, 600, 800, 1000]` along the top and left edges provide spatial reference without cluttering the UI
- Last-Action Crosshair — A lime-green crosshair marks where the previous click landed, enabling the agent to self-correct drift
- Normalized Coordinates — Nova Pro reasons in a `[0, 1000] × [0, 1000]` coordinate space, making the system resolution-independent
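A minimal sketch of that annotation pass, assuming Pillow for the drawing (tick length, colors, and the helper name are illustrative):

```python
# Sketch of the annotated vision pipeline; values mirror the list above.
from PIL import Image, ImageDraw

CANVAS = (1440, 900)
TICKS = [200, 400, 600, 800, 1000]   # normalized ruler positions

def annotate(raw: Image.Image, last_click: tuple[int, int] | None) -> Image.Image:
    img = raw.resize(CANVAS, Image.LANCZOS)   # normalize the Retina capture
    draw = ImageDraw.Draw(img)
    for t in TICKS:
        x = t * CANVAS[0] // 1000             # top-edge ruler tick
        y = t * CANVAS[1] // 1000             # left-edge ruler tick
        draw.line([(x, 0), (x, 12)], fill="red", width=2)
        draw.line([(0, y), (12, y)], fill="red", width=2)
    if last_click:                            # crosshair at the previous click
        cx, cy = last_click
        draw.line([(cx - 10, cy), (cx + 10, cy)], fill="#32CD32", width=2)
        draw.line([(cx, cy - 10), (cx, cy + 10)], fill="#32CD32", width=2)
    return img
```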
Not every voice command needs a 25-step vision agent. Sai uses a three-tier model hierarchy to minimize latency:
| Tier | Model | Latency | When Used |
|---|---|---|---|
| Intent Correction | Nova 2 Lite | ~200ms | Every command — fixes STT errors |
| Task Router | Nova 2 Lite | ~200ms | Every command — classifies SIMPLE vs ADVANCED |
| Simple Executor | Nova 2 Lite | ~200ms | App launches, URL opens, hotkeys |
| Vision Agent | Nova Pro | ~2s/step | Multi-step UI interaction |
"Open Chrome" resolves in under 1 second. "Navigate to Privacy Settings and disable tracking" uses the full agent loop.
The agent reasons in a normalized [0, 1000] × [0, 1000] grid. The client maps these coordinates to actual screen pixels at runtime. This means Sai works identically on:
- 13" MacBook Air (2560×1600 native, 1440×900 logical)
- 27" Studio Display (5120×2880 native)
- Any future Apple display
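The mapping itself is one multiplication per axis. A sketch, assuming PyAutoGUI's logical-point coordinate system on macOS (the function name is illustrative):

```python
# Map the model's normalized grid to physical clicks. PyAutoGUI works in
# logical points on macOS, so only the logical resolution matters.
import pyautogui

GRID = 1000  # Nova Pro reasons in [0, 1000] × [0, 1000]

def click_normalized(nx: int, ny: int) -> None:
    width, height = pyautogui.size()   # logical resolution, e.g. 1440×900
    pyautogui.click(nx * width / GRID, ny * height / GRID)

# {"command": "click", "x": 500, "y": 500} lands at the screen center
# whether the display is 1440×900 or 5120×2880.
```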
Vision agents can get trapped in action loops (clicking the same 3-4 elements endlessly). Sai implements a cycle detection algorithm that identifies repeating patterns of any length (1-6 actions), injects corrective prompts to force strategy changes, and hard-bails if the cycle persists after intervention.
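A sketch of how such tail-pattern detection can work (the 1-6 window and the 3× threshold follow the description above; the exact implementation may differ):

```python
# Detect a repeating action pattern at the tail of the history.
def detect_cycle(history: list[str], max_len: int = 6, repeats: int = 3):
    """Return the repeating pattern if the last `repeats` copies of some
    pattern of length 1..max_len end the history, else None."""
    for n in range(1, max_len + 1):
        if len(history) < n * repeats:
            continue
        tail = history[-n * repeats:]
        pattern = tail[-n:]
        if all(tail[i * n:(i + 1) * n] == pattern for i in range(repeats)):
            return pattern
    return None

# e.g. detect_cycle(["click(450,320)", "scroll(-3)"] * 3) returns the
# two-action pattern, triggering a corrective prompt (or a hard bail).
```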
Instead of "do one action per step," Sai's system prompt enforces Plan → Act → Verify discipline. On Step 1, the agent must articulate a numbered high-level plan. Every subsequent step must justify why the action advances the plan. The agent is explicitly instructed that it can read text from the screenshot — no need to click on UI elements just to see their content.
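The exact prompt is not reproduced here, but its contract looks roughly like this paraphrase:

```python
# Illustrative skeleton of the discipline the system prompt enforces.
# This is a paraphrase, not Sai's actual prompt text.
SYSTEM_PROMPT = """\
You control macOS one action at a time. Rules:
1. On step 1, write a numbered high-level plan in `explanation`.
2. On every later step, state which plan item this action advances and why.
3. Output exactly ONE action per step: click, type, scroll, or hotkey.
4. You can READ text directly from the screenshot. Never click an element
   just to see its content.
5. Set done=true only after a fresh screenshot visually confirms the result.
"""
```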
Multi-step agent loops accumulate large image payloads. Sai implements a sliding window over conversation history: the system prompt + initial plan (first exchange) + the 3 most recent exchanges are retained. Older screenshots are pruned to keep the model focused on current state without losing the original strategy.
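A sketch of that pruning policy, assuming history is stored as a flat message list with one (assistant, user-with-screenshot) pair appended per step:

```python
# Keep system prompt + first exchange (contains the plan) + the most
# recent exchanges; older screenshot payloads are dropped.
KEEP_RECENT = 3

def prune_history(messages: list[dict]) -> list[dict]:
    """messages[0] is the system prompt; each agent step appends one
    (assistant reply, user screenshot) pair."""
    system, rest = messages[:1], messages[1:]
    exchanges = [rest[i:i + 2] for i in range(0, len(rest), 2)]
    if len(exchanges) > KEEP_RECENT + 1:
        exchanges = exchanges[:1] + exchanges[-KEEP_RECENT:]  # plan + recent
    return system + [msg for pair in exchanges for msg in pair]
```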
A custom NSPanel overlay renders an animated, color-shifting border around the entire screen while Sai is active. It uses `NSWindowCollectionBehaviorCanJoinAllSpaces` to appear across all Spaces and full-screen apps, `setIgnoresMouseEvents_(True)` to remain non-interactive, and automatically suspends during screenshot capture to avoid appearing in the agent's vision.
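A minimal PyObjC sketch of such a panel (the animated border drawing and the capture-time suspension are omitted, and the `FullScreenAuxiliary` flag is an assumption for covering full-screen apps):

```python
# Click-through overlay panel spanning the main screen (sketch; needs an
# AppKit run loop to actually display).
from AppKit import (NSBackingStoreBuffered, NSColor, NSPanel, NSScreen,
                    NSWindowCollectionBehaviorCanJoinAllSpaces,
                    NSWindowCollectionBehaviorFullScreenAuxiliary,
                    NSWindowStyleMaskBorderless)

def make_overlay() -> NSPanel:
    frame = NSScreen.mainScreen().frame()
    panel = NSPanel.alloc().initWithContentRect_styleMask_backing_defer_(
        frame, NSWindowStyleMaskBorderless, NSBackingStoreBuffered, False)
    panel.setLevel_(1000)                      # float above normal windows
    panel.setOpaque_(False)
    panel.setBackgroundColor_(NSColor.clearColor())
    panel.setIgnoresMouseEvents_(True)         # clicks pass straight through
    panel.setCollectionBehavior_(
        NSWindowCollectionBehaviorCanJoinAllSpaces
        | NSWindowCollectionBehaviorFullScreenAuxiliary)
    panel.orderFrontRegardless()
    return panel
```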
Sai makes deep, multi-layered use of Amazon Nova foundation models:
| Component | Nova Model | Capability Used |
|---|---|---|
| Intent Interpretation | Nova 2 Lite | Text reasoning — corrects garbled speech-to-text using screen context |
| Task Routing | Nova 2 Lite | Text classification — determines if task is simple (single action) or advanced (agent loop) |
| Simple Command Generation | Nova 2 Lite | Structured output — converts natural language to executable JSON commands |
| Vision Agent Loop | Nova Pro | Multimodal reasoning — analyzes screenshots, plans multi-step strategies, outputs precise coordinates for UI interaction |
Nova Pro's multimodal capabilities are the foundation of Sai's intelligence. It receives annotated screenshots and must:
- Identify UI elements (buttons, text fields, menus) by visual appearance alone
- Reason about spatial layout to output precise click coordinates
- Track multi-step progress across sequential screenshots
- Understand when a task is complete by visually confirming the result
| Layer | Technology | Purpose |
|---|---|---|
| Wake Word | Picovoice Porcupine | Offline, on-device keyword detection ("Hey Sai") |
| Speech-to-Text | ElevenLabs Scribe v2 | Realtime streaming ASR with VAD (WebSocket) |
| Intent + Routing | Amazon Nova 2 Lite | Command interpretation and complexity classification |
| Vision Reasoning | Amazon Nova Pro | Multimodal screenshot analysis and action planning |
| Server Framework | FastAPI | Async WebSocket gateway for client-server communication |
| Screen Capture | macOS screencapture + Pillow | Native Retina capture with LANCZOS downsampling |
| OS Execution | PyAutoGUI | Cross-resolution click, type, scroll, and hotkey execution |
| App Context | AppleScript (osascript) | Extracts frontmost app name, browser URL, and tab title |
| Activity Overlay | PyObjC (NSPanel) | Native macOS animated overlay for visual feedback |
| Audio Capture | PyAudio | Low-level PCM microphone streaming at 16kHz |
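For illustration, the app-context queries can be issued through `osascript` like this (the Chrome query is one example; other browsers need their own AppleScript dialects):

```python
# Extract frontmost app name and active browser URL via AppleScript.
import subprocess

def osascript(script: str) -> str:
    return subprocess.run(["osascript", "-e", script],
                          capture_output=True, text=True).stdout.strip()

frontmost_app = osascript(
    'tell application "System Events" to get name of first '
    'application process whose frontmost is true')

chrome_url = osascript(
    'tell application "Google Chrome" to get URL of active tab of front window')
```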
Sai will not function without these three permissions granted to your Terminal or IDE. macOS sandboxing blocks microphone access, screen capture, and UI automation by default.
Go to System Settings → Privacy & Security and enable the following for your terminal application (Terminal, iTerm2, VS Code, etc.):
| Permission | Path | Required For |
|---|---|---|
| Accessibility | Privacy & Security → Accessibility | Mouse clicks, keyboard input, hotkeys (PyAutoGUI) |
| Screen Recording | Privacy & Security → Screen Recording | Screenshot capture (screencapture command) |
| Microphone | Privacy & Security → Microphone | Wake word detection and audio streaming (PyAudio) |
macOS sometimes silently denies permissions without showing a dialog (especially after app reinstalls). To force a fresh permission prompt, reset each permission from Terminal:
```bash
tccutil reset Accessibility
tccutil reset ScreenCapture
tccutil reset Microphone
```

After running these commands, fully quit and relaunch your terminal, then run the app again — macOS will now prompt for each permission.
- Python 3.11+
- macOS (tested on M1 MacBook Pro, macOS Sonoma)
- Picovoice Access Key — picovoice.ai
- ElevenLabs API Key — elevenlabs.io
- Amazon Nova API Key — via AWS Bedrock or API proxy
Run the one-shot setup script from the repo root:
```bash
bash setup_mac.sh
```

This will:
- Verify Python 3.11+ is installed
- Create isolated `venv` environments inside `client/` and `server/`
- Install all dependencies from `requirements.txt`
- Copy `.env.example` → `.env` in both directories, prompting you to fill in your API keys
Then fill in your keys:
```bash
# server/.env
AMAZON_NOVA_API_KEY=your_key
NOVA_BASE_URL=https://api.nova.amazon.com/v1
OPENROUTER_API_KEY=your_key
ELEVENLABS_API_KEY=your_key

# client/.env
PICOVOICE_ACCESS_KEY=your_key
```

Manual step-by-step instructions:
Server:

```bash
cd server
python3.11 -m venv venv
venv/bin/pip install -r requirements.txt
cp .env.example .env  # then edit .env with your keys
```

Client:

```bash
cd client
python3.11 -m venv venv
venv/bin/pip install -r requirements.txt
cp .env.example .env  # then edit .env with your Picovoice key
```

Run the server and client in separate terminals:

```bash
# Terminal 1 — start the server
cd server && venv/bin/uvicorn main:app --host 0.0.0.0 --port 8080

# Terminal 2 — start the client
cd client && venv/bin/python wake_word.py
```

- Say "Hey Sai" — the animated border appears
- Give a command: "Open Spotify", "Go to twitter.com and turn off data sharing", "Answer this LeetCode problem"
- Watch Sai execute — then it returns to listening mode
| Command | Type | What Sai Does |
|---|---|---|
| "Open Chrome" | SIMPLE | Opens Spotlight → types "Chrome" → launches app (~1s) |
| "Go to github.com" | SIMPLE | Opens URL directly in the active browser (~1s) |
| "Turn off data sharing on Twitter" | ADVANCED | Navigates Settings → Privacy → toggles the correct switch (5-8 steps) |
| "Answer this LeetCode problem" | ADVANCED | Reads the problem from the screenshot → clicks code editor → types complete solution → clicks Submit (4-6 steps) |
| "Search for wireless headphones on Amazon" | ADVANCED | Clicks search bar → types query → submits search (3-4 steps) |
Accessibility. For users with motor disabilities, Sai transforms the entire macOS desktop into a voice-controlled interface — not just a handful of supported apps, but every app, every website, every workflow.
Universal Automation. Traditional RPA breaks when UIs change. Sai's visual approach is inherently resilient — it doesn't care if a button moved 50 pixels to the right or if a website redesigned its settings page. It sees the screen and adapts.
The Future of HCI. Sai demonstrates that the combination of multimodal AI (Nova Pro) with real-time voice (Nova Lite + ElevenLabs) creates an interaction paradigm that is fundamentally different from chatbots, command lines, or GUI automation scripts. The keyboard becomes optional. The mouse becomes agentic.
```
sai/
├── server/
│   ├── main.py            # FastAPI server — STT, routing, agent loop
│   ├── requirements.txt
│   └── .env               # API keys (not committed)
├── client/
│   ├── wake_word.py       # macOS client — wake word, capture, execution, overlay
│   ├── HeySai_mac.ppn     # Custom Porcupine wake word model
│   ├── requirements.txt
│   └── .env               # Picovoice key (not committed)
├── LICENSE                # MIT
└── README.md
```
Stop typing. Start speaking. Sai is the future of human-computer interaction.
Built with Amazon Nova for the Amazon Nova AI Hackathon #AmazonNova
