feat: Co-browsing system — agent browses websites with live user view in canvas iframe #154

@MCERQUA

Summary

Add a co-browsing system that lets the OpenClaw agent browse websites while the user watches in real time inside the canvas iframe, with the voice conversation staying active throughout. The agent "sees" pages via screenshots and DOM extraction, interacts via click/type/scroll commands, and every action is live-streamed to the user.

Problem

Two disconnected capabilities exist today:

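Capability                              | What Happens                                      | Who Sees It
----------------------------------------|---------------------------------------------------|--------------------------------
Puppeteer (server-side headless Chrome) | Agent browses, takes screenshots, clicks, scrapes | Agent only; user sees nothing
Canvas iframe [CANVAS_URL:...]          | External website loads in user's iframe           | User only; agent is blind to it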

There's no way to combine these. The user can't watch the agent browse, and the agent can't see or interact with websites shown to the user.

Solution: CDP Screencast Co-Browsing

Use Chrome DevTools Protocol Page.startScreencast to stream live browser frames from a server-side Puppeteer instance to the user's canvas iframe via WebSocket. The agent controls the browser via Puppeteer and "sees" pages via screenshots sent to the LLM as vision input.
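A minimal sketch of the screencast hookup, assuming `client` is a Puppeteer CDP session (for example from `page.createCDPSession()`). The function name `startScreencast` and the `onFrame` callback are hypothetical; the quality and size defaults are taken from the Resource Impact figures in this issue. The CDP method and event names (`Page.startScreencast`, `Page.screencastFrame`, `Page.screencastFrameAck`) are the protocol's own.

```javascript
// Sketch only: `client` is any object with CDP-style on()/send(), such as a
// Puppeteer CDPSession. `onFrame` is a hypothetical callback that would
// forward each JPEG to the viewer's WebSocket.
function startScreencast(client, onFrame, opts = {}) {
  const { quality = 60, maxWidth = 1280, maxHeight = 720 } = opts;

  client.on('Page.screencastFrame', (event) => {
    // CDP delivers each frame as a base64 string; decode to raw JPEG bytes.
    onFrame(Buffer.from(event.data, 'base64'), event.metadata);
    // Every frame must be acknowledged, otherwise Chrome stops sending frames.
    Promise.resolve(
      client.send('Page.screencastFrameAck', { sessionId: event.sessionId })
    ).catch(() => { /* the session may already be closed */ });
  });

  return client.send('Page.startScreencast', {
    format: 'jpeg', quality, maxWidth, maxHeight, everyNthFrame: 1,
  });
}
```

Because the function only depends on `on()` and `send()`, it can be exercised with a stub client before any real browser is wired in.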

Architecture

┌────────────────────────────────────────────────┐
│  User's Browser (Remote)                        │
│  ┌──────────────────────────────────────────┐  │
│  │  OpenVoiceUI  (voice stays active)        │  │
│  │  ┌────────────────────────────────────┐  │  │
│  │  │  Canvas iframe: browse-viewer.html  │  │  │
│  │  │  ┌──────────────────────────────┐  │  │  │
│  │  │  │ <canvas> — live JPEG frames   │  │  │  │
│  │  │  │ Agent cursor overlay ●        │  │  │  │
│  │  │  │ User click capture → relay    │  │  │  │
│  │  │  └──────────────────────────────┘  │  │  │
│  │  │  URL bar + Back/Forward/Refresh    │  │  │
│  │  └────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────┘  │
└──────────────────────┬─────────────────────────┘
                       │ WebSocket (frames ↓, events ↑)
                       ▼
┌────────────────────────────────────────────────┐
│  VPS: Browse Service (routes/browse.py)         │
│  POST /api/browse/start     → Launch Puppeteer  │
│  WS   /api/browse/stream    → Screencast frames │
│  POST /api/browse/action    → Agent commands     │
│  GET  /api/browse/screenshot → Vision capture    │
│  GET  /api/browse/dom       → Text/links/buttons │
│  POST /api/browse/stop      → Cleanup            │
│           │                                      │
│           ▼                                      │
│  Headless Chrome (Puppeteer) — already installed │
└────────────────────────────────────────────────┘

User Experience

  1. User says: "Go to amazon.com and find me a blue widget"
  2. Agent responds: "Opening Amazon now." [BROWSE:https://amazon.com]
  3. Canvas shows a live view of Amazon loading (~10fps JPEG stream)
  4. Agent "sees" a screenshot, clicks the search bar, types "blue widget", presses Enter
  5. User watches it all happen in real time
  6. User can say "click the third one" — agent acts on it
  7. Voice conversation stays active the entire time

Components to Build

1. Browse Service API (routes/browse.py)

Flask blueprint managing Puppeteer sessions and exposing REST + WebSocket endpoints.

Endpoints:

Method | Path                   | Purpose
-------|------------------------|-----------------------------------------------------------
POST   | /api/browse/start      | Launch Puppeteer, navigate to URL, start screencast
WS     | /api/browse/stream     | WebSocket streaming CDP screencast frames to canvas viewer
POST   | /api/browse/action     | Agent sends click/type/scroll/goto/back/wait commands
GET    | /api/browse/screenshot | Full PNG screenshot for agent vision analysis
GET    | /api/browse/dom        | Simplified page text + links + buttons + inputs
GET    | /api/browse/status     | Session info (URL, title, idle time)
POST   | /api/browse/stop       | Close browser, release memory

Session rules:

  • One session per user (starting a new one closes the previous one)
  • Auto-timeout after 5 minutes idle
  • Auto-close when the voice session ends
  • Memory guard: refuse new sessions if VPS memory usage is above 85%
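The session rules above can be sketched as a small in-memory registry. This is an illustration in JavaScript to match the other snippets in this issue; the real service is planned for routes/browse.py, and the class name, method names, and injected `now` / `memoryFraction` functions are all hypothetical.

```javascript
// Hypothetical registry illustrating the session rules; not the actual service code.
class BrowseSessions {
  constructor({ idleMs = 5 * 60 * 1000, maxMemoryFraction = 0.85,
                now = Date.now, memoryFraction = () => 0 } = {}) {
    this.idleMs = idleMs;                       // 5 min idle timeout
    this.maxMemoryFraction = maxMemoryFraction; // 85% memory guard
    this.now = now;
    this.memoryFraction = memoryFraction;       // returns used/total, 0..1
    this.byUser = new Map();                    // userId -> { close, lastActive }
  }

  start(userId, close) {
    if (this.memoryFraction() > this.maxMemoryFraction) {
      return { ok: false, error: 'memory_guard' };
    }
    this.stop(userId); // one session per user: close the previous one first
    this.byUser.set(userId, { close, lastActive: this.now() });
    return { ok: true };
  }

  touch(userId) { // call on every action to reset the idle clock
    const session = this.byUser.get(userId);
    if (session) session.lastActive = this.now();
  }

  stop(userId) { // also the hook for "voice session ended"
    const session = this.byUser.get(userId);
    if (!session) return false;
    this.byUser.delete(userId);
    session.close();
    return true;
  }

  reapIdle() { // run periodically; returns how many sessions were closed
    const cutoff = this.now() - this.idleMs;
    let closed = 0;
    for (const [userId, session] of [...this.byUser]) {
      if (session.lastActive <= cutoff) { this.stop(userId); closed++; }
    }
    return closed;
  }
}
```

Injecting the clock and the memory probe keeps the rules testable without launching Chrome or waiting five minutes.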

/api/browse/action payloads:

{"action": "click", "selector": "#search-btn"}
{"action": "click", "x": 500, "y": 300}
{"action": "type", "selector": "input.search", "text": "blue widget", "clear": true}
{"action": "scroll", "direction": "down", "amount": 500}
{"action": "goto", "url": "https://example.com"}
{"action": "back"}
{"action": "wait", "selector": ".results", "timeout": 10000}

/api/browse/dom response:

{
  "url": "https://amazon.com/s?k=blue+widget",
  "title": "Amazon.com: blue widget",
  "text": "...visible text...",
  "links": [{"text": "Blue Widget Pro", "href": "/dp/B0123", "index": 1}],
  "inputs": [{"type": "text", "id": "search", "placeholder": "Search"}],
  "buttons": [{"text": "Add to Cart", "selector": "#add-to-cart-button"}]
}
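As an illustration of how the agent might use this response for a request like "click the third one", the sketch below looks a link up by its `index` field and turns it into a `goto` action. Both helper names are hypothetical, and treating `index` as the 1-based position the user refers to is an assumption based on the sample above.

```javascript
// Hypothetical helpers operating on the /api/browse/dom response shape.
function findLinkByIndex(dom, index) {
  const links = Array.isArray(dom.links) ? dom.links : [];
  return links.find((link) => link.index === index) ?? null;
}

function gotoActionForLink(dom, index) {
  const link = findLinkByIndex(dom, index);
  if (!link) return null;
  // Relative hrefs such as "/dp/B0123" are resolved against the page URL.
  return { action: 'goto', url: new URL(link.href, dom.url).href };
}
```

Navigating by URL sidesteps needing a selector for every link; a coordinate or selector click would be the alternative when the link triggers JavaScript rather than a plain navigation.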

2. Browse Viewer Canvas Page (default-pages/browse-viewer.html)

Canvas page that connects to the browse stream WebSocket and renders live frames.

Features:

  • <canvas> element draws JPEG frames from WebSocket (~10fps)
  • URL bar showing current page address + title
  • Back / Forward / Refresh / Stop navigation controls
  • Agent cursor overlay showing where the agent clicked (colored dot/arrow)
  • User click capture: clicks on canvas → map coordinates → relay via WebSocket
  • Keyboard capture when canvas focused → relay keystrokes
  • Status indicators: "Agent is browsing..." / "Loading..." / connection health
  • Follows standard canvas page patterns (inline CSS, no CDN, auth bridge)
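The "map coordinates" step in the click-capture feature needs to account for the canvas being displayed at a different size than the remote page. A minimal sketch, assuming the remote viewport is 1280x720 as in the Resource Impact notes; `canvasToPage` is a hypothetical name, and `rect` is expected to be the result of `canvas.getBoundingClientRect()`.

```javascript
// Map a click on the displayed canvas to coordinates in the remote page viewport.
function canvasToPage(clientX, clientY, rect, viewport = { width: 1280, height: 720 }) {
  const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));
  // Scale factor between the on-screen canvas size and the remote viewport.
  const scaleX = viewport.width / rect.width;
  const scaleY = viewport.height / rect.height;
  const x = Math.round((clientX - rect.left) * scaleX);
  const y = Math.round((clientY - rect.top) * scaleY);
  // Clamp so clicks on the canvas border never land outside the page.
  return {
    x: clamp(x, 0, viewport.width - 1),
    y: clamp(y, 0, viewport.height - 1),
  };
}
```

The result has the same `x` / `y` fields as the coordinate form of the click payload, so it could be relayed as `{action: 'click', ...canvasToPage(e.clientX, e.clientY, rect)}`.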

3. [BROWSE:url] Action Tag (app.js modification)

New tag parsed by app.js alongside existing [CANVAS_URL:...]:

// Assumes this runs inside an async function in app.js (await needs one).
const browseMatch = text.match(/\[BROWSE:([^\]]+)\]/i);
if (browseMatch) {
  const url = browseMatch[1].trim();
  const resolved = resolveCanvasUrl(url);  // reuse existing IP blocking
  if (resolved) {
    // Start server-side browse session
    const res = await fetch('/api/browse/start', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({url: resolved})
    });
    if (res.ok) {
      // Load browse viewer in canvas only once the session exists
      iframe.src = '/pages/browse-viewer.html';
      CanvasControl.show();
    } else {
      console.error('Browse session failed to start:', res.status);
    }
  }
}

Also add [BROWSE_STOP] to close sessions.
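One way to handle both tags in a single pass is sketched below. `parseBrowseTags` is a hypothetical helper, and stripping the tags from the text that is shown or spoken to the user is an assumption about how app.js treats action tags.

```javascript
// Hypothetical parser for [BROWSE:url] and [BROWSE_STOP] in an agent reply.
const BROWSE_TAG = /\[BROWSE:([^\]]+)\]/i;
const BROWSE_STOP_TAG = /\[BROWSE_STOP\]/i;
const ANY_BROWSE_TAG = /\[BROWSE(?:_STOP|:[^\]]+)\]/gi;

function parseBrowseTags(text) {
  const start = text.match(BROWSE_TAG);
  return {
    url: start ? start[1].trim() : null,  // still needs resolveCanvasUrl() checks
    stop: BROWSE_STOP_TAG.test(text),
    // Reply text with the tags removed and leftover whitespace collapsed.
    cleanText: text.replace(ANY_BROWSE_TAG, '').replace(/\s{2,}/g, ' ').trim(),
  };
}
```

When `stop` is true, the caller would POST to /api/browse/stop and hide the viewer; when `url` is set, it would follow the start flow shown earlier.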

4. Agent Tool Integration

Immediate (exec-based, no OpenClaw changes): Agent calls the browse API via fetch in exec scripts:

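// Runs inside an exec script (async context).
const r = await fetch('http://localhost:5000/api/browse/action', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({action: 'click', selector: '#search-btn'})
});
if (!r.ok) throw new Error(`browse action failed: HTTP ${r.status}`);
const result = await r.json();  // assumes the action endpoint replies with JSON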

Future: Dedicated browse tool in OpenClaw tool config for cleaner UX.

TOOLS.md addition: Instructions telling the agent about [BROWSE:url], the action API, screenshot/DOM endpoints, and rules (tell user what you're doing, take screenshots after navigation, one session at a time).

5. Vision Auto-Capture

After each /api/browse/action, the service:

  1. Waits 500 ms for the page to settle
  2. Takes a PNG screenshot
  3. Returns it in the action response (base64-encoded)
  4. The conversation system includes it as vision content in the next agent context

Also supports on-demand via GET /api/browse/screenshot.
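The first three steps could look roughly like the sketch below. It assumes `page` is a Puppeteer page (whose `screenshot()` accepts `type: 'png'` and `encoding: 'base64'`); `runActionWithCapture`, the `performAction` callback, and the `screenshot` field name in the response are hypothetical.

```javascript
// Sketch of steps 1-3: run the action, let the page settle, attach a screenshot.
async function runActionWithCapture(page, performAction, settleMs = 500) {
  const result = await performAction(page);
  // Fixed settle delay as described above; a smarter wait could replace it later.
  await new Promise((resolve) => setTimeout(resolve, settleMs));
  // With encoding 'base64', Puppeteer returns the PNG as a base64 string.
  const screenshot = await page.screenshot({ type: 'png', encoding: 'base64' });
  return { ...result, screenshot };
}
```

Keeping the capture in one wrapper means every action type gets the same settle-and-capture behavior without repeating it per handler.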


Implementation Phases

Phase 1: Core (MVP)

  • routes/browse.py with start/stop/action/screenshot/stream
  • browse-viewer.html canvas page with WebSocket frame display
  • [BROWSE:url] tag parsing in app.js
  • Browse instructions in TOOLS.md
  • Result: Agent browses, user watches, agent interacts via API

Phase 2: User Interaction

  • Click/keyboard capture in browse-viewer
  • Relay user events via WebSocket → Puppeteer
  • Agent cursor overlay and click ripple effects
  • URL bar + nav controls

Phase 3: Vision Auto-Capture

  • Auto-screenshot after each action
  • Integration with conversation.py vision pipeline
  • DOM extraction endpoint for text-only analysis
  • Agent chooses vision vs text based on complexity

Phase 4: Polish & Safety

  • Session timeout + memory guards
  • Concurrent session limits
  • URL allowlist/blocklist per client
  • Bandwidth optimization (adaptive quality, skip static frames)
  • Error handling (Chrome crashes, network timeouts)
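For the "skip static frames" item, one cheap approach is to drop a frame when its bytes are identical to the previous one. The sketch below is an assumption about how that could be done, not a settled design; `createFrameDeduper` is a hypothetical name, and a byte-level comparison only catches exact repeats, not visually similar frames.

```javascript
const nodeCrypto = require('node:crypto');

// Returns a function that reports whether a frame differs from the last one sent.
function createFrameDeduper() {
  let lastDigest = null;
  return function shouldSend(frameBytes) {
    const digest = nodeCrypto.createHash('sha1').update(frameBytes).digest('hex');
    if (digest === lastDigest) return false; // identical to previous frame: skip
    lastDigest = digest;
    return true;
  };
}
```

One deduper per browse session would sit between the screencast frame handler and the WebSocket send.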

Technical Notes

Why CDP Screencast?

Approach               | User Sees    | Agent Sees    | Agent Acts     | Complexity
-----------------------|--------------|---------------|----------------|--------------------
CDP Screencast         | Live frames  | Screenshots   | Full Puppeteer | Medium
VNC + noVNC            | Live desktop | VNC captures  | VNC input      | High
Web proxy rewrite      | Real website | DOM snapshots | Injected JS    | Very high, brittle
Anthropic computer_use | Nothing      | Screenshots   | Mouse/keyboard | High latency

Resource Impact

  • Memory: ~200-300MB per Chrome instance per active browse session
  • CPU: Moderate during active browsing, near-zero idle
  • Bandwidth: ~300-500KB/s at 10fps JPEG quality 60 (1280x720)
  • Mitigation: 1 session per user, 5 min idle timeout, memory guard
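The figures above imply a per-frame budget and a rough session capacity, which the following arithmetic makes explicit. The function names are hypothetical, and the 4096 MB VPS with 1500 MB already in use is an invented example, not a measurement of the actual host.

```javascript
// Average frame size implied by a bandwidth figure at a given frame rate.
function frameSizeKB(bandwidthKBps, fps) {
  return bandwidthKBps / fps;
}

// How many browse sessions fit before the memory guard would refuse new ones.
function sessionsWithinGuard(totalMB, usedMB, perSessionMB, guard = 0.85) {
  const budgetMB = totalMB * guard - usedMB;
  return Math.max(0, Math.floor(budgetMB / perSessionMB));
}

// 300-500 KB/s at 10 fps works out to roughly 30-50 KB per JPEG frame.
console.log(frameSizeKB(300, 10), frameSizeKB(500, 10)); // 30 50

// Example only: 4096 MB host, 1500 MB in use, 300 MB per Chrome instance.
console.log(sessionsWithinGuard(4096, 1500, 300)); // 6
```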

Prerequisites

  • Puppeteer + Chrome already installed in openvoiceui container (used for Remotion)
  • Flask-SocketIO already configured (used for voice streaming)
  • Canvas page system fully operational
  • Vision pipeline exists (camera/face recognition)

Security

  • Block private/internal IPs (reuse resolveCanvasUrl())
  • Block file://, chrome://, data: URL schemes
  • Isolate Chrome profile per session (no cookie leaks)
  • Rate-limit actions (max 10/second)
  • Disable Chrome extensions + disk downloads
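A server-side sketch of the URL and rate-limit rules is shown below. It is a standalone illustration, not the implementation of the existing `resolveCanvasUrl()`; `isAllowedBrowseUrl` and `createRateLimiter` are hypothetical names, and rejecting every IPv6 literal is a deliberately conservative assumption.

```javascript
const net = require('node:net');

function isPrivateIPv4(host) {
  const [a, b] = host.split('.').map(Number);
  return a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254); // link-local
}

function isAllowedBrowseUrl(raw) {
  let url;
  try { url = new URL(raw); } catch { return false; }
  // Allowing only http(s) also rejects file://, chrome:// and data: URLs.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  const host = url.hostname.replace(/^\[|\]$/g, ''); // strip IPv6 brackets
  if (host === 'localhost' || host.endsWith('.localhost')) return false;
  if (net.isIPv4(host)) return !isPrivateIPv4(host);
  if (net.isIPv6(host)) return false; // conservative: no IPv6 literals
  return true;
}

// Sliding one-second window; `now` is injectable for testing.
function createRateLimiter(maxPerSecond = 10, now = Date.now) {
  const stamps = [];
  return function allow() {
    const t = now();
    while (stamps.length && t - stamps[0] >= 1000) stamps.shift();
    if (stamps.length >= maxPerSecond) return false;
    stamps.push(t);
    return true;
  };
}
```

This check only inspects the URL text. A hostname that resolves to a private address would still pass, so a production guard would also need to check the resolved IP.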

File Locations

routes/browse.py              ← Browse service API
src/app.js                    ← [BROWSE:] tag parsing (modify)
default-pages/browse-viewer.html  ← Viewer canvas page

Full Design Document

Complete architecture document with all API schemas, code examples, and implementation details:
docs/jambot/co-browsing-system.md in the MIKE-AI repo.

Labels

enhancement, feature
