Empowering Accessibility through AI: A next-generation multimodal desktop assistant built for the Google Gemini Live Agent Challenge.
SUVI was born out of a profound need for digital accessibility. For individuals with physical disabilities—such as those without the use of their hands or with severe mobility impairments—interacting with standard computers can be an insurmountable challenge.
My vision for SUVI is to bridge this gap. By combining real-time, natural voice conversation with advanced computer vision and autonomous UI control, SUVI acts as a virtual pair of hands. A user can simply speak naturally to their computer ("Hey SUVI, please open my email and write a message to Mom saying I will be late"), and SUVI will visually navigate the screen, type, click, and complete the task—all while conversing with the user seamlessly.
(Hackathon Judges: Please watch this demo to see SUVI in action!)
SUVI breaks the "Text Box" paradigm and perfectly aligns with two primary tracks of the challenge:
- Live Agents: Utilizes the `gemini-2.5-flash-native-audio` model for real-time, low-latency, interruptible voice interactions.
- UI Navigator: Employs the `gemini-2.5-computer-use-preview` model to observe screen state via screenshots and execute complex UI navigation paths (clicks, scrolls, typing) autonomously.
- Models: Gemini 2.5 Flash Native Audio, Gemini 2.5 Computer Use, Gemini 3.1 Pro (Orchestrator).
- SDK: Built using the `google-genai` and `google-adk` libraries.
- Hosting/GCP: The system backend is deployed on Google Cloud Run (Live Gateway URL). It heavily utilizes Google Cloud services including Firestore, Cloud Logging, and Secret Manager.
```mermaid
graph TD
%% Define Styles
classDef user fill:#E1F5FE,stroke:#4285F4,stroke-width:2px,color:#000
classDef ui fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px,color:#000
classDef worker fill:#FFF3E0,stroke:#AB47BC,stroke-width:1px,color:#000
classDef core fill:#E8EAF6,stroke:#8E24AA,stroke-width:2px,color:#000
classDef service fill:#E8F5E9,stroke:#FF9800,stroke-width:1px,color:#000
classDef gateway fill:#E0F7FA,stroke:#4CAF50,stroke-width:2px,color:#000
classDef cloud fill:#FFFDE7,stroke:#FFC107,stroke-width:2px,color:#000
classDef gcp fill:#FBE9E7,stroke:#FFEB3B,stroke-width:2px,color:#000
User[👤 User]:::user
subgraph Desktop_Client ["Desktop Client (PyQt6)"]
    direction TB
    UI_Login[Login Window]:::ui
    UI_Chat[Chat Widget / Voice Overlay]:::ui
    Tray[System Tray]:::ui
    Core[SUVI App Controller]:::core
    subgraph Background_Workers ["Background Workers"]
        WW[Wake Word Worker <br> Porcupine]:::worker
        Voice[Voice Worker <br> Audio I/O]:::worker
        Replay[Replay Worker <br> Screen capture]:::worker
    end
    subgraph Local_Services ["Local Services"]
        LS[Gemini Live Service]:::service
        CS[Computer Use Service]:::service
        OS[Orchestrator Service]:::service
        MS[Memory Service]:::service
        GS[Gateway Service]:::service
        AS[Auth Service]:::service
        Env[Environment Scanner]:::service
    end
    subgraph Local_Execution ["Local Execution Engine"]
        Perm[Permissions Manager]:::service
        Exec[Dispatcher]:::service
        OS_Inter[OS Interfaces<br>PyAutoGUI / Playwright]:::service
    end
end
subgraph Cloud_Gateway ["Cloud Gateway (FastAPI / Cloud Run)"]
    Router[WebSocket Proxy]:::gateway
    AuthM[Auth Middleware]:::gateway
    O_Proxy[Orchestrator Proxy]:::gateway
    DB_Proxy[Firestore Service]:::gateway
    Log_Proxy[Cloud Logging]:::gateway
end
subgraph External_APIs ["Google Cloud & APIs"]
    Gemini_Live[Gemini Live API <br> native audio]:::cloud
    Gemini_CU[Gemini Computer Use API <br> 2.5 flash]:::cloud
    Gemini_Pro[Gemini Pro API <br> orchestrator]:::cloud
    Firebase[Firebase Auth]:::gcp
    Firestore[Cloud Firestore <br> Sessions & Memory]:::gcp
    GCS[Cloud Storage <br> Session Replays]:::gcp
end
%% Connections
User <-->|Speaks / Hears| Voice
User -->|Views / Clicks| UI_Chat
User -->|Says Hey SUVI| WW
UI_Login --> AS
UI_Chat <--> Core
Tray <--> Core
WW -->|Triggers| Core
Voice <--> LS
Replay -->|Uploads Video| GCS
Core <--> Background_Workers
Core <--> Local_Services
LS <-->|WebSockets / Audio Stream| Gemini_Live
CS <-->|Screenshots & Intent| Gemini_CU
OS <-->|Task Planning| Gemini_Pro
CS -->|Requested Actions| Perm
Perm -->|Validated Actions| Exec
Exec --> OS_Inter
OS_Inter -->|Moves Mouse, Types| Desktop_Client
Env -->|System Context| Core
GS <-->|WSS Auth & Requests| Router
AS -->|REST Auth| Firebase
MS <-->|User Profile Queries| DB_Proxy
Router --> AuthM
AuthM --> Firebase
Router --> O_Proxy
Router --> DB_Proxy
Router --> Log_Proxy
DB_Proxy <--> Firestore
O_Proxy <--> Gemini_Pro
```
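The `WSS Auth & Requests` edge between the Gateway Service and the WebSocket Proxy implies a message protocol over the socket. The actual SUVI wire format isn't documented in this README; as a hypothetical sketch, a JSON envelope for client-to-gateway frames might look like this (all field names are illustrative):

```python
import json

# Hypothetical message envelope for the desktop-client <-> gateway WebSocket.
# SUVI's real protocol is not shown in this README; fields are illustrative.
def make_envelope(msg_type: str, session_id: str, payload: dict) -> str:
    """Serialize one client->gateway message as a JSON text frame."""
    return json.dumps({
        "type": msg_type,        # e.g. "audio_chunk", "screenshot", "ui_action_result"
        "session_id": session_id,
        "payload": payload,
    })

def parse_envelope(frame: str) -> tuple[str, str, dict]:
    """Validate and unpack a frame received on the socket."""
    msg = json.loads(frame)
    for key in ("type", "session_id", "payload"):
        if key not in msg:
            raise ValueError(f"malformed frame: missing {key!r}")
    return msg["type"], msg["session_id"], msg["payload"]
```

Keeping every frame self-describing like this lets the gateway's WebSocket Proxy route by `type` without holding per-connection parsing state.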
SUVI operates on a sophisticated Multi-Agent Architecture divided into a Desktop Client and a GCP Cloud Gateway:
- Layer 1: Local Client (PyQt6)
  - Manages local microphone input and speaker output.
  - Houses background workers for wake-word detection ("Hey SUVI").
  - Connects to the GCP Gateway via secure WebSockets.
  - Runs the local execution engines (`PyAutoGUI`, `Playwright`) to physically move the mouse and type keys on the user's machine.
- Layers 2 & 3: Cloud Run Gateway + Vertex AI (Orchestration & Intelligence)
  - Live Agent (`live_session.py`): Handles the real-time voice stream and recognizes user intents.
  - Orchestrator Agent (`agents/orchestrator`): If a user asks for a desktop task, the Orchestrator (powered by Gemini Pro) creates a step-by-step plan.
  - Computer Use Agent (`computer_use_service.py`): Executes the plan in a continuous loop. It asks the local client for a screenshot, analyzes it using Gemini Computer Use, determines the next UI action (e.g., "Click at x:100, y:200"), and dispatches the command back to Layer 1.
- Layer 4: GCP Data Layer
  - Firestore manages long-term user memory and session state.
  - Cloud Logging provides audit trails for automated actions.
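The screenshot → analyze → act loop described above can be sketched as follows. This is a simplified stand-in, not SUVI's actual code: `run_task`, `capture_screenshot`, `plan_next_action`, and the action dictionaries are invented for illustration, with the Gemini Computer Use call stubbed out behind `plan_next_action`.

```python
# Simplified sketch of the Computer Use agent loop (Layers 2/3 driving Layer 1).
# All names are illustrative; in the real service, plan_next_action would call
# the Gemini Computer Use API with the goal and the latest screenshot.
from typing import Callable

def run_task(goal: str,
             capture_screenshot: Callable[[], bytes],
             plan_next_action: Callable[[str, bytes], dict],
             dispatch: Callable[[dict], None],
             max_steps: int = 20) -> list[dict]:
    """Loop: screenshot -> model proposes an action -> dispatch it locally."""
    executed = []
    for _ in range(max_steps):
        shot = capture_screenshot()            # request from the local client (Layer 1)
        action = plan_next_action(goal, shot)  # Gemini Computer Use (stubbed here)
        if action["kind"] == "done":           # model signals task completion
            break
        dispatch(action)                       # e.g. a PyAutoGUI click or type
        executed.append(action)
    return executed
```

The `max_steps` cap is one simple way to keep an autonomous UI loop from running away if the model never reports completion.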
To evaluate SUVI, you will need to run the Desktop Client locally on your Windows machine, as it requires physical access to your screen, mouse, and keyboard.
- Windows 10/11
- Python 3.10 or higher
- A valid Google Cloud Project with the Vertex AI API and Live API enabled.
```shell
git clone https://github.com/your-username/suvi.git
cd suvi
cd apps/desktop
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

Create a `.env` file in the `apps/desktop/` directory and add your Google credentials:
(Note: You must have a GCP Service Account JSON key)
```
GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\your\service_account.json"
PROJECT_ID="your-gcp-project-id"
LOCATION="us-central1"
GATEWAY_URL="wss://suvi-google-gemini-live-hackathon-722150734142.us-central1.run.app"
```

Then launch the client:

```shell
python -m apps.desktop.suvi
```

- The SUVI chat widget will appear on your desktop.
- You can click the microphone icon or say "Hey SUVI" (if enabled) to start talking!
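A client like this typically fails fast when the `.env` above is incomplete. As a minimal, hypothetical sketch (SUVI's actual loader may differ, e.g. it may use `python-dotenv`), the four required keys could be validated like this:

```python
# Hypothetical startup check for the .env settings listed above.
# SUVI's real loader is not shown in this README; this only illustrates
# fail-fast validation of the four required keys.
REQUIRED_KEYS = ("GOOGLE_APPLICATION_CREDENTIALS", "PROJECT_ID",
                 "LOCATION", "GATEWAY_URL")

def load_env(text: str) -> dict:
    """Parse simple KEY="value" lines and verify required keys are present."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    missing = [k for k in REQUIRED_KEYS if k not in env]
    if missing:
        raise RuntimeError(f"missing .env keys: {missing}")
    return env
```

Surfacing all missing keys at once, rather than crashing on the first lookup, makes setup errors much easier to diagnose.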
While SUVI was built for this hackathon, the journey doesn't end here. Future iterations will include:
- Cross-Platform Support: Expanding from Windows to macOS and Linux desktop environments.
- Mobile Companion App: Allowing users to control their desktop remotely via voice from their phone.
- Advanced Safety Rails: Implementing stricter bounding boxes and visual confirmation prompts for highly sensitive actions (e.g., deleting files, sending financial emails) to ensure absolute safety for visually or physically impaired users.
- Custom Persona Tuning: Allowing users to define SUVI's voice tone, speaking speed, and verbosity to better match their personal accessibility needs.
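The "Advanced Safety Rails" idea above could take the shape of a gate in the local Permissions Manager: clicks must land inside an approved bounding box, and sensitive action kinds require explicit confirmation. A hypothetical sketch (the action names and policy are invented for illustration, not SUVI's implemented behavior):

```python
# Hypothetical safety-rail check for the roadmap idea above: restrict spatial
# actions to an approved bounding box and require confirmation for sensitive
# kinds. Action names and policy are illustrative only.
SENSITIVE_KINDS = {"delete_file", "send_email"}

def allow_action(action: dict, bbox: tuple[int, int, int, int],
                 confirmed: bool = False) -> bool:
    """Return True if the action may be dispatched to the OS layer."""
    if action["kind"] in SENSITIVE_KINDS and not confirmed:
        return False                     # needs an explicit user confirmation
    if "x" in action and "y" in action:  # spatial actions must stay in-bounds
        left, top, right, bottom = bbox
        return left <= action["x"] <= right and top <= action["y"] <= bottom
    return True
```

Putting this gate on the client side (between the Permissions Manager and the Dispatcher) means even a misdirected model action can never reach the mouse or keyboard.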
- Google GenAI Team for the incredible Gemini 2.5 Live and Computer Use APIs.
- Built with ❤️ by Sumit Singh for the Gemini Live Agent Challenge 2026.
