Summary
Add a co-browsing system that lets the OpenClaw agent browse websites while the user watches in real time inside the canvas iframe, with the voice conversation staying active throughout. The agent can "see" pages via screenshots or DOM extraction and interact via click/type/scroll commands, and the user sees every action live-streamed.
Problem
Two disconnected capabilities exist today:
| Capability | What Happens | Who Sees It |
| --- | --- | --- |
| Puppeteer (server-side headless Chrome) | Agent browses, takes screenshots, clicks, scrapes | Agent only; user sees nothing |
| Canvas iframe [CANVAS_URL:...] | External website loads in user's iframe | User only; agent is blind to it |
There's no way to combine these. The user can't watch the agent browse, and the agent can't see or interact with websites shown to the user.
Solution: CDP Screencast Co-Browsing
Use Chrome DevTools Protocol Page.startScreencast to stream live browser frames from a server-side Puppeteer instance to the user's canvas iframe via WebSocket. The agent controls the browser via Puppeteer and "sees" pages via screenshots sent to the LLM as vision input.
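As a rough illustration of the screencast handshake, the sketch below builds the CDP messages involved: `Page.startScreencast` to begin the JPEG stream, and a `Page.screencastFrameAck` for every `Page.screencastFrame` event. Chrome expects each frame to be acknowledged, so a relay that never acks will see the stream stall. The helper names (`cdp_command`, `start_screencast`, `handle_frame`) are hypothetical, and how the messages actually reach Chrome (Puppeteer's CDP session or a raw DevTools WebSocket) is left open.

```python
import base64
import json
from itertools import count

_ids = count(1)  # every CDP command carries a unique integer id


def cdp_command(method: str, params: dict) -> str:
    """Serialize one CDP command as the JSON text sent to Chrome."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def start_screencast(quality: int = 60, max_width: int = 1280, max_height: int = 720) -> str:
    """Build the Page.startScreencast command for a JPEG stream."""
    return cdp_command("Page.startScreencast", {
        "format": "jpeg",
        "quality": quality,
        "maxWidth": max_width,
        "maxHeight": max_height,
    })


def handle_frame(event: dict):
    """Turn a Page.screencastFrame event into (jpeg_bytes, ack_command).

    The JPEG bytes go to the viewer over the WebSocket; the ack goes back
    to Chrome so that it keeps producing frames.
    """
    params = event["params"]
    jpeg = base64.b64decode(params["data"])
    ack = cdp_command("Page.screencastFrameAck", {"sessionId": params["sessionId"]})
    return jpeg, ack
```

The defaults (quality 60, 1280x720) mirror the figures quoted under Resource Impact.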
Architecture
┌──────────────────────────────────────────────────┐
│  User's Browser (Remote)                         │
│  ┌────────────────────────────────────────────┐  │
│  │  OpenVoiceUI (voice stays active)          │  │
│  │  ┌──────────────────────────────────────┐  │  │
│  │  │  Canvas iframe: browse-viewer.html   │  │  │
│  │  │  ┌────────────────────────────────┐  │  │  │
│  │  │  │ <canvas>: live JPEG frames     │  │  │  │
│  │  │  │ Agent cursor overlay ●         │  │  │  │
│  │  │  │ User click capture → relay     │  │  │  │
│  │  │  └────────────────────────────────┘  │  │  │
│  │  │  URL bar + Back/Forward/Refresh      │  │  │
│  │  └──────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────┘  │
└─────────────────────────┬────────────────────────┘
                          │ WebSocket (frames ↓, events ↑)
                          ▼
┌──────────────────────────────────────────────────┐
│  VPS: Browse Service (routes/browse.py)          │
│  POST /api/browse/start      → Launch Puppeteer  │
│  WS   /api/browse/stream     → Screencast frames │
│  POST /api/browse/action     → Agent commands    │
│  GET  /api/browse/screenshot → Vision capture    │
│  GET  /api/browse/dom        → Text/links/buttons│
│  POST /api/browse/stop       → Cleanup           │
│                         │                        │
│                         ▼                        │
│  Headless Chrome (Puppeteer), already installed  │
└──────────────────────────────────────────────────┘
User Experience
- User says: "Go to amazon.com and find me a blue widget"
- Agent responds: "Opening Amazon now." [BROWSE:https://amazon.com]
- Canvas shows a live view of Amazon loading (~10fps JPEG stream)
- Agent "sees" a screenshot, clicks the search bar, types "blue widget", presses Enter
- User watches it all happen in real-time
- User can say "click the third one" — agent acts on it
- Voice conversation stays active the entire time
Components to Build
1. Browse Service API (routes/browse.py)
Flask blueprint managing Puppeteer sessions and exposing REST + WebSocket endpoints.
Endpoints:
| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/browse/start | Launch Puppeteer, navigate to URL, start screencast |
| WS | /api/browse/stream | WebSocket streaming CDP screencast frames to canvas viewer |
| POST | /api/browse/action | Agent sends click/type/scroll/goto/back/wait commands |
| GET | /api/browse/screenshot | Full PNG screenshot for agent vision analysis |
| GET | /api/browse/dom | Simplified page text + links + buttons + inputs |
| GET | /api/browse/status | Session info (URL, title, idle time) |
| POST | /api/browse/stop | Close browser, release memory |
Session rules:
- One session per user (starting a new one closes the previous one)
- Auto-timeout after 5 min idle
- Auto-close when the voice session ends
- Memory guard: refuse new sessions if VPS memory usage is above 85%
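The session rules above could be enforced by a small in-memory registry. This is a minimal sketch under stated assumptions, not the planned implementation: `BrowseSession`, `SessionRegistry`, and the injected `memory_percent` callable are all hypothetical names, and a real `close()` would also shut down the session's Chrome instance.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

IDLE_TIMEOUT_S = 5 * 60    # auto-timeout after 5 min idle
MEMORY_LIMIT_PCT = 85.0    # refuse new sessions above this memory usage


@dataclass
class BrowseSession:
    user_id: str
    url: str
    last_active: float = field(default_factory=time.monotonic)
    closed: bool = False

    def close(self) -> None:
        # Placeholder: a real session would also terminate Chrome here.
        self.closed = True


class SessionRegistry:
    def __init__(self, memory_percent: Callable[[], float]):
        # memory_percent: hypothetical callable returning used memory in percent
        self._memory_percent = memory_percent
        self._sessions: Dict[str, BrowseSession] = {}

    def start(self, user_id: str, url: str) -> BrowseSession:
        if self._memory_percent() > MEMORY_LIMIT_PCT:
            raise MemoryError("memory guard: refusing new browse session")
        self.stop(user_id)  # one session per user: close any previous one
        session = BrowseSession(user_id, url)
        self._sessions[user_id] = session
        return session

    def stop(self, user_id: str) -> None:
        """Close a user's session, e.g. when their voice session ends."""
        session = self._sessions.pop(user_id, None)
        if session is not None:
            session.close()

    def reap_idle(self, now: Optional[float] = None) -> List[str]:
        """Close sessions idle longer than IDLE_TIMEOUT_S; return their user ids."""
        now = time.monotonic() if now is None else now
        expired = [uid for uid, s in self._sessions.items()
                   if now - s.last_active > IDLE_TIMEOUT_S]
        for uid in expired:
            self.stop(uid)
        return expired
```

A periodic background task would call `reap_idle()`, and each handled action would refresh `last_active`.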
/api/browse/action payloads:
{"action": "click", "selector": "#search-btn"}
{"action": "click", "x": 500, "y": 300}
{"action": "type", "selector": "input.search", "text": "blue widget", "clear": true}
{"action": "scroll", "direction": "down", "amount": 500}
{"action": "goto", "url": "https://example.com"}
{"action": "back"}
{"action": "wait", "selector": ".results", "timeout": 10000}
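Before a payload reaches Puppeteer, the service would presumably want to reject malformed ones. The sketch below checks only the fields that appear in the examples above; `validate_action` and the `REQUIRED_FIELDS` table are hypothetical, and which fields are truly mandatory (for instance `timeout` on `wait`) is an assumption.

```python
# Fields each action must carry, inferred from the example payloads.
REQUIRED_FIELDS = {
    "click": [],                       # selector OR x/y, checked separately
    "type": ["selector", "text"],
    "scroll": ["direction", "amount"],
    "goto": ["url"],
    "back": [],
    "wait": ["selector"],
}


def validate_action(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload is usable."""
    action = payload.get("action")
    if action not in REQUIRED_FIELDS:
        return [f"unknown action: {action!r}"]
    errors = [f"missing field: {name}"
              for name in REQUIRED_FIELDS[action] if name not in payload]
    if action == "click":
        has_selector = "selector" in payload
        has_point = "x" in payload and "y" in payload
        if not (has_selector or has_point):
            errors.append("click needs a selector or x/y coordinates")
    return errors
```

Returning a list of messages rather than raising lets the endpoint report every problem in a single 400 response.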
/api/browse/dom response:
{
  "url": "https://amazon.com/s?k=blue+widget",
  "title": "Amazon.com: blue widget",
  "text": "...visible text...",
  "links": [{"text": "Blue Widget Pro", "href": "/dp/B0123", "index": 1}],
  "inputs": [{"type": "text", "id": "search", "placeholder": "Search"}],
  "buttons": [{"text": "Add to Cart", "selector": "#add-to-cart-button"}]
}
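For text-only analysis, the agent needs this response as compact text rather than raw JSON. One possible flattening is sketched below; `dom_to_prompt`, its output layout, and the 2000-character cap are assumptions made for illustration.

```python
def dom_to_prompt(dom: dict, max_text_chars: int = 2000) -> str:
    """Flatten a /api/browse/dom response into plain text for the agent."""
    lines = [
        f"URL: {dom['url']}",
        f"Title: {dom['title']}",
        dom.get("text", "")[:max_text_chars],   # cap long pages
    ]
    for link in dom.get("links", []):
        lines.append(f"[link {link['index']}] {link['text']} -> {link['href']}")
    for item in dom.get("inputs", []):
        lines.append(
            f"[input] type={item.get('type', '')} id={item.get('id', '')} "
            f"placeholder={item.get('placeholder', '')}"
        )
    for button in dom.get("buttons", []):
        lines.append(f"[button] {button['text']} ({button['selector']})")
    return "\n".join(lines)
```

Keeping the link index and button selector in the text lets the agent refer back to an element when it builds the next action payload.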
2. Browse Viewer Canvas Page (default-pages/browse-viewer.html)
Canvas page that connects to the browse stream WebSocket and renders live frames.
Features:
- <canvas> element draws JPEG frames from WebSocket (~10fps)
- URL bar showing current page address + title
- Back / Forward / Refresh / Stop navigation controls
- Agent cursor overlay showing where the agent clicked (colored dot/arrow)
- User click capture: clicks on canvas → map coordinates → relay via WebSocket
- Keyboard capture when canvas focused → relay keystrokes
- Status indicators: "Agent is browsing..." / "Loading..." / connection health
- Follows standard canvas page patterns (inline CSS, no CDN, auth bridge)
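The "map coordinates" step in the click-capture feature is a simple rescale from the displayed canvas size to the streamed page size. The viewer itself would do this in JavaScript; the arithmetic is shown here in Python for consistency with the service code. `canvas_to_page` is a hypothetical helper, and the 1280x720 page size is taken from the Resource Impact figures.

```python
def canvas_to_page(click_x: float, click_y: float,
                   canvas_w: float, canvas_h: float,
                   page_w: int = 1280, page_h: int = 720) -> dict:
    """Map a click on the displayed canvas to a coordinate click payload."""
    if canvas_w <= 0 or canvas_h <= 0:
        raise ValueError("canvas has no size")
    # Scale by the ratio between the page viewport and the displayed canvas.
    x = round(click_x * page_w / canvas_w)
    y = round(click_y * page_h / canvas_h)
    # Clamp so a click on the canvas border still lands inside the page.
    x = min(max(x, 0), page_w - 1)
    y = min(max(y, 0), page_h - 1)
    return {"action": "click", "x": x, "y": y}
```

For a canvas displayed at 640x360, a click at (250, 150) maps to the `{"action": "click", "x": 500, "y": 300}` payload used as an example earlier.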
3. [BROWSE:url] Action Tag (app.js modification)
New tag parsed by app.js alongside existing [CANVAS_URL:...]:
// Must run inside an async function because of the await below.
const browseMatch = text.match(/\[BROWSE:([^\]]+)\]/i);
if (browseMatch) {
  const url = browseMatch[1].trim();
  const resolved = resolveCanvasUrl(url); // reuse existing IP blocking
  if (resolved) {
    // Start the server-side browse session
    const res = await fetch('/api/browse/start', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({url: resolved})
    });
    // Load the browse viewer in the canvas only if the session started
    // (the service may refuse, e.g. when the memory guard trips)
    if (res.ok) {
      iframe.src = '/pages/browse-viewer.html';
      CanvasControl.show();
    }
  }
}
Also add [BROWSE_STOP] to close sessions.
4. Agent Tool Integration
Immediate (exec-based, no OpenClaw changes): Agent calls the browse API via fetch in exec scripts:
const r = await fetch('http://localhost:5000/api/browse/action', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({action: 'click', selector: '#search-btn'})
});
Future: Dedicated browse tool in OpenClaw tool config for cleaner UX.
TOOLS.md addition: Instructions telling the agent about [BROWSE:url], the action API, screenshot/DOM endpoints, and rules (tell user what you're doing, take screenshots after navigation, one session at a time).
5. Vision Auto-Capture
After each /api/browse/action, the service:
- Waits 500ms for page to settle
- Takes a PNG screenshot
- Returns it in the action response (base64)
- Conversation system includes it as vision content in next agent context
Also supports on-demand via GET /api/browse/screenshot.
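The act, settle, capture sequence could look roughly like the sketch below. `perform_action` and `take_screenshot` stand in for whatever Puppeteer-backed callables the service ends up with, and the response shape (`ok`, `action`, `screenshot.mime`, `screenshot.data`) is an assumption, not a defined schema.

```python
import base64
import time


def act_and_capture(perform_action, take_screenshot, payload: dict,
                    settle_s: float = 0.5, sleep=time.sleep) -> dict:
    """Run one browse action, wait for the page to settle, attach a screenshot.

    perform_action(payload) and take_screenshot() -> PNG bytes are
    hypothetical callables supplied by the browse session.
    """
    perform_action(payload)
    sleep(settle_s)                      # the 500 ms settle delay
    png = take_screenshot()
    return {
        "ok": True,
        "action": payload["action"],
        "screenshot": {
            "mime": "image/png",
            "data": base64.b64encode(png).decode("ascii"),
        },
    }
```

Passing `sleep` in as a parameter keeps the function testable without real delays. A fixed delay is the simplest settle strategy; waiting for a selector, as the `wait` action does, would be more reliable on slow pages.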
Implementation Phases
Phase 1: Core (MVP)
- routes/browse.py with start/stop/action/screenshot/stream
- browse-viewer.html canvas page with WebSocket frame display
- [BROWSE:url] tag parsing in app.js
- Browse instructions in TOOLS.md
- Result: Agent browses, user watches, agent interacts via API
Phase 2: User Interaction
- Click/keyboard capture in browse-viewer
- Relay user events via WebSocket → Puppeteer
- Agent cursor overlay and click ripple effects
- URL bar + nav controls
Phase 3: Vision Auto-Capture
- Auto-screenshot after each action
- Integration with conversation.py vision pipeline
- DOM extraction endpoint for text-only analysis
- Agent chooses vision vs text based on complexity
Phase 4: Polish & Safety
- Session timeout + memory guards
- Concurrent session limits
- URL allowlist/blocklist per client
- Bandwidth optimization (adaptive quality, skip static frames)
- Error handling (Chrome crashes, network timeouts)
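For the "skip static frames" item, one cheap approach is to hash each encoded frame and drop it when the hash matches the previous one. This is only a sketch of that idea; `FrameDeduper` is a hypothetical name, and it assumes an unchanged page yields byte-identical JPEGs, which would need to be verified against real screencast output.

```python
import hashlib


class FrameDeduper:
    """Drop a frame when it is byte-identical to the one sent just before it."""

    def __init__(self) -> None:
        self._last_digest = None

    def should_send(self, jpeg: bytes) -> bool:
        digest = hashlib.sha256(jpeg).digest()
        if digest == self._last_digest:
            return False          # same bytes as the previous frame: skip
        self._last_digest = digest
        return True
```

Only the most recent digest is kept, so a frame that reappears after a different one is sent again, which is the desired behaviour when the page changes and then changes back.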
Technical Notes
Why CDP Screencast?
| Approach | User Sees | Agent Sees | Agent Acts | Complexity |
| --- | --- | --- | --- | --- |
| CDP Screencast ✅ | Live frames | Screenshots | Full Puppeteer | Medium |
| VNC + noVNC | Live desktop | VNC captures | VNC input | High |
| Web proxy rewrite | Real website | DOM snapshots | Injected JS | Very high, brittle |
| Anthropic computer_use | Nothing | Screenshots | Mouse/keyboard | High latency |
Resource Impact
- Memory: ~200-300MB per Chrome instance per active browse session
- CPU: Moderate during active browsing, near-zero idle
- Bandwidth: ~300-500KB/s at 10fps JPEG quality 60 (1280x720)
- Mitigation: 1 session per user, 5 min idle timeout, memory guard
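The bandwidth estimate can be sanity-checked with a little arithmetic: 300-500 KB/s at 10 fps implies roughly 30-50 KB per JPEG frame, and about 18-30 MB per minute of active browsing. The helper names below are hypothetical, and the figures use decimal units (1 MB = 1000 KB).

```python
def frame_size_kb(rate_kb_per_s: float, fps: float) -> float:
    """Average size of one frame implied by a stream rate and frame rate."""
    return rate_kb_per_s / fps


def megabytes_per_minute(rate_kb_per_s: float) -> float:
    """Data sent to the viewer in one minute of streaming, in MB."""
    return rate_kb_per_s * 60 / 1000


for rate in (300, 500):
    print(rate, "KB/s:",
          frame_size_kb(rate, 10), "KB/frame,",
          megabytes_per_minute(rate), "MB/min")
```

These are upper bounds for a page that changes constantly; a mostly static page should cost far less once unchanged frames are skipped.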
Prerequisites
- Puppeteer + Chrome already installed in openvoiceui container (used for Remotion)
- Flask-SocketIO already configured (used for voice streaming)
- Canvas page system fully operational
- Vision pipeline exists (camera/face recognition)
Security
- Block private/internal IPs (reuse resolveCanvasUrl())
- Block file://, chrome://, data: URL schemes
- Isolate Chrome profile per session (no cookie leaks)
- Rate-limit actions (max 10/second)
- Disable Chrome extensions + disk downloads
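Two of the rules above, URL filtering and action rate limiting, are sketched below using only the standard library. This is a server-side illustration, not a port of resolveCanvasUrl() (whose behaviour is not shown here); `is_url_allowed` and `ActionRateLimiter` are hypothetical names. Note that a hostname passing this check can still resolve to a private address, so a real guard would also need to vet the resolved IP.

```python
import ipaddress
from collections import deque
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}   # rejects file://, chrome://, data:, etc.


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs that do not point at a literal non-public IP."""
    try:
        parts = urlsplit(url)
    except ValueError:                # e.g. malformed IPv6 literal
        return False
    host = parts.hostname
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True                   # a DNS name; resolved address not checked here
    return ip.is_global               # False for private, loopback, link-local


class ActionRateLimiter:
    """Sliding-window limiter: at most max_actions per window_s seconds."""

    def __init__(self, max_actions: int = 10, window_s: float = 1.0) -> None:
        self.max_actions = max_actions
        self.window_s = window_s
        self._stamps = deque()

    def allow(self, now: float) -> bool:
        # Forget timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False
        self._stamps.append(now)
        return True
```

The limiter takes `now` as an argument (for example `time.monotonic()`), which keeps it deterministic in tests; one limiter instance per session matches the one-session-per-user rule.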
File Locations
routes/browse.py ← Browse service API
src/app.js ← [BROWSE:] tag parsing (modify)
default-pages/browse-viewer.html ← Viewer canvas page
Full Design Document
Complete architecture document with all API schemas, code examples, and implementation details:
docs/jambot/co-browsing-system.md in the MIKE-AI repo.
Labels
enhancement, feature