
feat: expose OpenWork UI control plane and MCP bridge #1638

Closed

benjaminshafii wants to merge 9 commits into dev from feat/realtime-voice-control-mode

Conversation

benjaminshafii (Member) commented May 2, 2026

Summary

  • Add a provider-neutral OpenWork UI control plane so controllers can discover visible app state and execute semantic actions without DOM scraping.
  • Expose OpenWork session/composer capabilities through that control plane: list/open/rename/delete sessions, create tasks, read transcript/latest message, type/send/stop composer prompts, and scroll/focus the active session.
  • Add an OpenWork-owned local UI bridge plus the openwork-ui-mcp package so external MCP clients can use ui_status, ui_snapshot, ui_list_actions, and ui_execute_action.
  • Keep OpenAI Realtime as an optional Feature Preview driver that drives the generic control plane; the durable improvement is the OpenWork-owned action registry and MCP surface.
  • Extract the standalone HandsFree/Pilot app out of this repo so this PR now focuses on the actual OpenWork app/server/desktop improvements.

OpenWork improvements

Semantic UI control plane

  • New window.__openworkControl registry with snapshot(), listActions(), execute(), setEnabled(), and subscribe().
  • Domain-owned actions are registered by OpenWork UI/runtime code instead of by provider-specific automation.
  • Control state includes route/status narration and currently available actions, making automation safer and more inspectable.
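The registry surface described above can be sketched roughly as follows. This is a minimal illustration of the shape, assuming a simple in-memory action map; the real registration API, payload types, and snapshot fields in the PR may differ.

```typescript
// Hypothetical sketch of a window.__openworkControl-style registry.
// Action ids and the snapshot shape are assumptions, not the actual API.
type ControlAction = {
  id: string;                                   // e.g. "session.open"
  description: string;
  run: (input?: unknown) => Promise<unknown> | unknown;
};

type ControlSnapshot = { route: string; status: string; actions: string[] };

function createControlRegistry(initialRoute: string) {
  const actions = new Map<string, ControlAction>();
  const subscribers = new Set<(s: ControlSnapshot) => void>();
  let enabled = true;
  const route = initialRoute;

  const snapshot = (): ControlSnapshot => ({
    route,
    status: enabled ? "ready" : "disabled",
    actions: [...actions.keys()],
  });
  const notify = () => subscribers.forEach((fn) => fn(snapshot()));

  return {
    // Domain code (session UI, composer) registers its own actions here.
    register(action: ControlAction) {
      actions.set(action.id, action);
      notify();
    },
    snapshot,
    listActions: () =>
      [...actions.values()].map(({ id, description }) => ({ id, description })),
    async execute(id: string, input?: unknown) {
      if (!enabled) throw new Error("control plane disabled");
      const action = actions.get(id);
      if (!action) throw new Error(`unknown action: ${id}`);
      return action.run(input);
    },
    setEnabled(value: boolean) {
      enabled = value;
      notify();
    },
    subscribe(fn: (s: ControlSnapshot) => void) {
      subscribers.add(fn);
      return () => subscribers.delete(fn);       // unsubscribe handle
    },
  };
}
```

Because controllers only see action ids and descriptions, a driver can be swapped out without touching the domain code that registered the actions.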

Session and composer actions

  • Session controls include session.create_task, session.list_sessions, session.open, session.rename, session.delete, session.latest_message, and session.read_transcript.
  • Composer controls include composer.set_text, composer.send, and composer.stop.
  • Session surface also exposes scroll/focus actions used by controllers and future tests.
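To make the execution model concrete, here is an illustrative dispatch for the composer actions. The action ids match the list above, but the handler signatures, the in-memory composer state, and the `execute` helper are assumptions for the sketch:

```typescript
// Illustrative only: "composer.set_text" / "composer.send" ids come from
// the PR, everything else here is a stand-in for the real domain handlers.
type ActionHandler = (input?: unknown) => unknown;

const handlers = new Map<string, ActionHandler>();
const composerState = { text: "", sent: [] as string[] };

handlers.set("composer.set_text", (input) => {
  composerState.text = String((input as { text: string }).text);
});
handlers.set("composer.send", () => {
  composerState.sent.push(composerState.text);  // deliver the typed prompt
  composerState.text = "";                      // clear the composer
});

function execute(id: string, input?: unknown): unknown {
  const handler = handlers.get(id);
  if (!handler) throw new Error(`unknown action: ${id}`);
  return handler(input);
}

// A controller types and sends a prompt through the semantic actions:
execute("composer.set_text", { text: "I'll be there at 3" });
execute("composer.send");
```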

MCP-facing OpenWork bridge

  • OpenWork desktop starts a localhost, bearer-token-protected UI-control bridge and writes discovery metadata to Electron userData.
  • New packages/openwork-ui-mcp stdio MCP server proxies that bridge as MCP tools:
    • ui_status
    • ui_snapshot
    • ui_list_actions
    • ui_execute_action
  • New docs/mcp-ui-control-profile.md documents the intended semantic MCP profile for OpenWork UI control.
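A rough sketch of how the stdio MCP server might proxy a tool call to the localhost bridge. The tool names are from the list above; the discovery-metadata shape (`{ url, token }`) and the `/execute` route are assumptions, not the bridge's documented contract:

```typescript
// Hypothetical request construction for openwork-ui-mcp -> local bridge.
// The discovery file in Electron userData is assumed to contain url + token.
type BridgeInfo = { url: string; token: string };

type BridgeTool = "ui_status" | "ui_snapshot" | "ui_list_actions" | "ui_execute_action";

function buildBridgeRequest(
  info: BridgeInfo,
  tool: BridgeTool,
  args: Record<string, unknown> = {},
) {
  return {
    url: `${info.url}/execute`,                  // assumed route
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${info.token}`,     // bearer token from discovery metadata
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ tool, args }),
  };
}
```

Keeping the bridge bound to localhost and bearer-token-protected means any local MCP client can attach, but nothing off-machine can reach the UI surface.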

Optional Realtime preview driver

  • The Feature Preview Realtime controller remains isolated under shell/control-drivers/openai-realtime/.
  • It uses the generic OpenWork control plane rather than hard-wiring provider logic into session UI.
  • Server-side Realtime session creation stays isolated under apps/server/src/remote-control/openai-realtime.ts; long-lived OpenAI API keys do not go to the browser.
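The secret-minting pattern can be sketched as request construction only. The endpoint path and payload fields follow OpenAI's documented Realtime session-creation API at the time of writing, but model name and fields here are illustrative; verify against the current API reference before relying on them:

```typescript
// Hedged sketch: the server exchanges its long-lived key for an
// ephemeral client secret; only that short-lived secret reaches the browser.
function buildMintRequest(apiKey: string) {
  return {
    url: "https://api.openai.com/v1/realtime/sessions",
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${apiKey}`,        // long-lived key, server-side only
      "Content-Type": "application/json",
    },
    // Model/voice values are placeholders for the sketch.
    body: JSON.stringify({ model: "gpt-4o-realtime-preview", voice: "verse" }),
  };
}
```

The response's ephemeral `client_secret` is what the renderer uses for the WebRTC SDP exchange, so the env-store key never leaves `apps/server`.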

Architecture intent

This PR is not about making voice or OpenAI the foundation of OpenWork control. The durable layer is OpenWork-owned:

  1. semantic app state/action discovery,
  2. domain-owned action execution,
  3. an MCP-compatible bridge for external clients,
  4. replaceable drivers such as OpenAI Realtime, tests, demos, or HandsFree.

Screenshots

Feature Preview settings; Realtime activity pane
Composer typed through control plane; Status bar connector

Verification

Previously run on this branch:

  • pnpm --filter @openwork/app typecheck
  • pnpm --filter openwork-server typecheck
  • pnpm --filter openwork-server build:bin
  • pnpm --filter @openwork/desktop package:electron:dir
  • Packaged app smoke check via Chrome DevTools CDP:
    • control actions registered ✅
    • session.list_sessions returned 30 sessions ✅
    • Realtime connected with live mic in prior packaged verification ✅

Latest extraction/MCP sanity checks:

  • pnpm install --lockfile-only
  • node --check packages/openwork-ui-mcp/index.mjs
  • node --check apps/desktop/electron/main.mjs

…ion, and inline transcript panel

Add app-native voice control via OpenAI Realtime WebRTC so users can
drive visible UI actions hands-free through microphone input.

- Provider-neutral control surface (window.__openworkControl) with
  snapshot, listActions, execute, setEnabled, and subscribe
- OpenAI Realtime WebRTC bridge with mic input, server VAD, text output,
  and tool calling (snapshot, list_actions, execute_action, set_input,
  list_sessions, open_session)
- Server endpoint POST /remote/session mints ephemeral client secrets
  with key from env store; no secrets in browser
- Feature Preview settings tab with Realtime toggle, OpenAI key entry,
  mic selector/test, and transcript panel toggle
- Inline right-side Voice transcript pane (not overlay) showing user
  speech, assistant responses, and tool call lifecycle
- Session list/open control actions so voice can navigate by name
- Electron mic permission plumbing and macOS entitlements
- Stale mic device fallback (OverconstrainedError → system default)
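The tool-calling bridge in this commit can be pictured as a thin dispatch layer from Realtime tool names onto the generic control plane. The tool names mirror the list above; the `Control` interface and action-id mapping are assumptions for the sketch:

```typescript
// Assumption-laden sketch: Realtime data-channel tool calls dispatched
// onto the provider-neutral control surface.
type Control = {
  execute(id: string, input?: unknown): unknown;
  snapshot(): unknown;
};

function handleToolCall(
  control: Control,
  name: string,
  args: Record<string, unknown>,
) {
  switch (name) {
    case "snapshot":
      return control.snapshot();
    case "set_input":
      return control.execute("composer.set_text", args);
    case "execute_action":
      return control.execute(String(args.id), args.input);
    case "list_sessions":
      return control.execute("session.list_sessions");
    case "open_session":
      return control.execute("session.open", args);
    default:
      throw new Error(`unknown tool: ${name}`);
  }
}

// Demo with a recording fake in place of the real control plane:
const calls: string[] = [];
const fake: Control = {
  execute: (id) => (calls.push(id), id),
  snapshot: () => ({}),
};
handleToolCall(fake, "list_sessions", {});
```

Because the driver only ever emits action ids, replacing OpenAI Realtime with another driver (tests, demos, HandsFree) means swapping this dispatch table, not the UI code.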
vercel Bot (Contributor) commented May 2, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| openwork-app | Ready | Preview, Comment | May 4, 2026 6:53pm |
| openwork-den | Ready | Preview, Comment | May 4, 2026 6:53pm |
| openwork-den-worker-proxy | Ready | Preview, Comment | May 4, 2026 6:53pm |
| openwork-landing | Ready | Preview, Comment, Open in v0 | May 4, 2026 6:53pm |
| openwork-share | Ready | Preview, Comment | May 4, 2026 6:53pm |

github-actions Bot (Contributor) commented May 2, 2026

The following comment was generated by an LLM and may be inaccurate:

Add voice-accessible controls for renaming and deleting sessions, scrolling the current session to the top or bottom, and reading the latest visible message. Extend the Realtime tool surface so the model can use these actions directly while requiring explicit confirmation for deletion.
Reorganize the realtime voice-control PR so the generic OpenWork control surface lives independently from the OpenAI Realtime driver. Move session-owned control actions into the session domain, move the OpenAI browser driver and activity/status UI into a driver folder, and move backend Realtime session/tool setup out of server.ts.
…tatus bar, better mic test

Activity panel:
- Rename header from "Voice" to "Control" (generic surface, not voice-specific)
- Replace colored role bubbles with softer tints: structure before effects
- Add proper role labels ("You", "Assistant", "Tool") instead of raw role names
- Add relative timestamps on entries ("now", "12s ago", etc.)
- Add pending-dot animation for in-flight entries
- Add dismiss (X) button to hide the panel inline
- Add empty-state icon + descriptive copy
- Reduce width from 300px to 280px for tighter proportion
- Remove shell shadow on the aside (flat-first per DESIGN-LANGUAGE.md)

Status bar control:
- Replace round pill + text label with minimal icon-only button
- Show compact state text ("Listening", "Connecting…", "Error") without truncation
- Use MicOff icon for disconnect affordance
- Remove background fills; use text color only for state (flatter)

Feature Preview settings:
- Thinner mic level bar (1.5px → cleaner)
- Color-coded level: gray idle → accent low → green strong
- Show numeric percentage during test
- Remove Volume2 icon from test description
- Tighter copy for mic test prompt
…ession

The voice controller could list/open/rename sessions but couldn't read the
content of the currently active session. "What's the last message?" would
fail because the model didn't know it had access.

Changes:
- Add session.read_transcript control action (returns last N messages as
  readable text with session ID, title, and message count)
- Add read_transcript tool to OpenAI Realtime tool schema
- Add controller handler for read_transcript dispatching to the action
- Improve system instructions: tell the model it CAN see session content
  and should always call read_transcript/get_latest_message before saying
  it cannot see the session
- Better tool label for transcript reads in activity panel
…on composer

When the user says something like "tell them I'll be there at 3" or
"reply that looks good", the intent is to type and send that as a message
in the active OpenWork session — not to get a response from the voice
controller itself.

Add REPLY INTENT instructions that tell the model to:
1. read_transcript to understand the on-screen conversation
2. compose the reply from the user's spoken words
3. set_input → composer.set_text with the reply
4. execute_action → composer.send

Direct commands to the controller ("list sessions", "open settings")
still get handled directly. When ambiguous, default to treating spoken
input as a session reply — that's the most common intent when the user
is looking at a conversation.
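The four-step REPLY INTENT flow above can be sketched as a single controller helper. The `execute` signature, the `limit` parameter, and treating the spoken words directly as the reply are assumptions for the sketch:

```typescript
// Sketch of the REPLY INTENT sequence from the commit message.
async function replyInSession(
  execute: (id: string, input?: unknown) => Promise<void>,
  spoken: string,
) {
  await execute("session.read_transcript", { limit: 10 }); // 1. ground in on-screen context
  const reply = spoken;                                    // 2. compose from spoken words
  await execute("composer.set_text", { text: reply });     // 3. type into the composer
  await execute("composer.send");                          // 4. send as a session message
}
```

In the real system step 2 is done by the model (it rewrites "tell them I'll be there at 3" into a first-person reply); the helper above just fixes the ordering of control-plane calls.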
New standalone Electron menubar app at apps/pilot/ that controls macOS
via voice. Pilot is the top-level control surface; OpenWork and other
apps are connectable targets.

What's included:
- Electron main process: menubar tray, floating always-on-top panel,
  global hotkeys (⌘⇧; toggle panel, ⌘⇧L toggle listening)
- System control via AppleScript IPC:
  - list/activate/launch apps
  - frontmost app detection
  - keystroke/key-combo injection
  - clipboard read/write
  - open URL
- Preload bridge: window.__PILOT__.system.* for the UI and future
  Realtime driver
- Floating panel UI: dark vibrancy glass, transcript area, status,
  mic button, empty state with hotkey hints
- macOS entitlements: microphone + AppleScript automation
- LSUIElement: true (no dock icon, menubar-only)
- electron-builder config for packaging
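The AppleScript IPC listed above typically reduces to shelling out to `osascript` from the Electron main process (macOS only). This is a hedged sketch, not Pilot's actual code; the script string and helper name are illustrative:

```typescript
// Illustrative macOS AppleScript bridge for an Electron main process.
import { execFile } from "node:child_process";

function runAppleScript(script: string): Promise<string> {
  return new Promise((resolve, reject) => {
    // osascript -e evaluates the script and prints its result to stdout.
    execFile("osascript", ["-e", script], (err, stdout) =>
      err ? reject(err) : resolve(stdout.trim()),
    );
  });
}

// e.g. frontmost-app detection (requires the AppleScript automation
// entitlement mentioned above):
const FRONTMOST =
  'tell application "System Events" to get name of first process whose frontmost is true';
// runAppleScript(FRONTMOST).then((name) => console.log(name));
```

Keystroke injection, app launching, and clipboard access follow the same pattern with different `System Events` scripts, which is why a single IPC channel can cover the whole list.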

Verified: panel shows, detects frontmost app via AppleScript,
counts 18 running apps. System IPC bridge functional.

Next: wire up OpenAI Realtime driver with system tools, add OpenWork
app connector protocol.
Pilot now owns the Realtime voice driver as the standalone macOS control app.

What's included:
- Main-process OpenAI Realtime session creation with local API key persistence
  so long-lived OpenAI keys never enter the renderer
- Tool schema for macOS control: snapshot, list/frontmost apps, activate/launch
  app, type text, press key combo, clipboard read/write, and open URL
- Renderer WebRTC Realtime driver with microphone capture, SDP exchange,
  data-channel tool-call handling, transcript logging, and tool results
- Panel settings UI for saving the OpenAI key locally
- Panel states for ready/connecting/listening/error and Realtime transcript/tool
  activity
- Vite config so Pilot packages the static panel correctly

Verified:
- pnpm --filter @openwork/pilot build:ui
- pnpm --filter @openwork/pilot package:dir
- Launched packaged/dev Pilot panel; AppleScript frontmost-app and list-apps
  calls still work.
benjaminshafii (Member, Author) commented:

Superseded by the split PRs: #1644 for the OpenWork UI control/MCP bridge, and draft #1645 for built-in Realtime control.
