feat: Co-browsing system — agent browses websites with live user view in canvas iframe

## Summary

Add a co-browsing system that lets the OpenClaw agent browse websites while the user watches in real time inside the canvas iframe, with the voice conversation staying active throughout. The agent "sees" pages via screenshots and DOM extraction, interacts via click/type/scroll commands, and every action is live-streamed to the user.

## Problem

Two disconnected capabilities exist today:

| Capability | What Happens | Who Sees It |
|---|---|---|
| **Puppeteer** (server-side headless Chrome) | Agent browses, takes screenshots, clicks, scrapes | Agent only — user sees nothing |
| **Canvas iframe** `[CANVAS_URL:...]` | External website loads in user's iframe | User only — agent is blind to it |

There's no way to combine these. The user can't watch the agent browse, and the agent can't see or interact with websites shown to the user.

## Solution: CDP Screencast Co-Browsing

Use Chrome DevTools Protocol `Page.startScreencast` to stream live browser frames from a server-side Puppeteer instance to the user's canvas iframe via WebSocket. The agent controls the browser via Puppeteer and "sees" pages via screenshots sent to the LLM as vision input.
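
As a rough sketch of the CDP traffic involved, the snippet below builds the JSON messages for `Page.startScreencast`, decodes a `Page.screencastFrame` event, and produces the matching `Page.screencastFrameAck`. The helper names (`cdp_command`, `start_screencast`, `handle_event`) are hypothetical, the quality and size defaults are taken from the Resource Impact section, and the transport (the DevTools WebSocket) is deliberately left out. Note that CDP marks the screencast methods as experimental.

```python
import base64
import json
from itertools import count
from typing import Optional, Tuple

_ids = count(1)  # each CDP command carries a unique integer id


def cdp_command(method: str, params: Optional[dict] = None) -> str:
    """Serialize one CDP command as the JSON text sent over the DevTools socket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})


def start_screencast(quality: int = 60, width: int = 1280, height: int = 720) -> str:
    """Build the Page.startScreencast command (JPEG frames, bounded size)."""
    return cdp_command("Page.startScreencast", {
        "format": "jpeg",
        "quality": quality,
        "maxWidth": width,
        "maxHeight": height,
        "everyNthFrame": 1,
    })


def handle_event(raw: str) -> Optional[Tuple[bytes, str]]:
    """Return (jpeg_bytes, ack_command) for a screencast frame, else None."""
    msg = json.loads(raw)
    if msg.get("method") != "Page.screencastFrame":
        return None
    params = msg["params"]
    jpeg = base64.b64decode(params["data"])  # frames arrive base64-encoded
    # Chrome only keeps sending frames once the previous one is acknowledged.
    ack = cdp_command("Page.screencastFrameAck", {"sessionId": params["sessionId"]})
    return jpeg, ack
```

The decoded JPEG bytes are what the stream endpoint would forward to the viewer; the ack goes back to Chrome.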

### Architecture

```
┌────────────────────────────────────────────────┐
│  User's Browser (Remote)                        │
│  ┌──────────────────────────────────────────┐  │
│  │  OpenVoiceUI  (voice stays active)        │  │
│  │  ┌────────────────────────────────────┐  │  │
│  │  │  Canvas iframe: browse-viewer.html  │  │  │
│  │  │  ┌──────────────────────────────┐  │  │  │
│  │  │  │ <canvas> — live JPEG frames   │  │  │  │
│  │  │  │ Agent cursor overlay ●        │  │  │  │
│  │  │  │ User click capture → relay    │  │  │  │
│  │  │  └──────────────────────────────┘  │  │  │
│  │  │  URL bar + Back/Forward/Refresh    │  │  │
│  │  └────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────┘  │
└──────────────────────┬─────────────────────────┘
                       │ WebSocket (frames ↓, events ↑)
                       ▼
┌────────────────────────────────────────────────┐
│  VPS: Browse Service (routes/browse.py)         │
│  POST /api/browse/start     → Launch Puppeteer  │
│  WS   /api/browse/stream    → Screencast frames │
│  POST /api/browse/action    → Agent commands     │
│  GET  /api/browse/screenshot → Vision capture    │
│  GET  /api/browse/dom       → Text/links/buttons │
│  POST /api/browse/stop      → Cleanup            │
│           │                                      │
│           ▼                                      │
│  Headless Chrome (Puppeteer) — already installed │
└────────────────────────────────────────────────┘
```

### User Experience

1. User says: *"Go to amazon.com and find me a blue widget"*
2. Agent responds: `"Opening Amazon now." [BROWSE:https://amazon.com]`
3. Canvas shows a live view of Amazon loading (~10fps JPEG stream)
4. Agent "sees" a screenshot, clicks the search bar, types "blue widget", presses Enter
5. User watches it all happen in real time
6. User can say *"click the third one"* — agent acts on it
7. Voice conversation stays active the entire time

---

## Components to Build

### 1. Browse Service API (`routes/browse.py`)

Flask blueprint managing Puppeteer sessions and exposing REST + WebSocket endpoints.

**Endpoints:**

| Method | Path | Purpose |
|---|---|---|
| `POST` | `/api/browse/start` | Launch Puppeteer, navigate to URL, start screencast |
| `WS` | `/api/browse/stream` | WebSocket streaming CDP screencast frames to canvas viewer |
| `POST` | `/api/browse/action` | Agent sends click/type/scroll/goto/back/wait commands |
| `GET` | `/api/browse/screenshot` | Full PNG screenshot for agent vision analysis |
| `GET` | `/api/browse/dom` | Simplified page text + links + buttons + inputs |
| `GET` | `/api/browse/status` | Session info (URL, title, idle time) |
| `POST` | `/api/browse/stop` | Close browser, release memory |

**Session rules:**
- One session per user (starting a new one closes the previous one)
- Auto-timeout after 5 minutes idle
- Auto-close when voice session ends
- Memory guard: refuse new sessions if VPS memory > 85%
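
The session rules above could be enforced by a small in-memory registry along these lines. This is a minimal sketch, not the planned implementation: `SessionRegistry`, `BrowseSession`, and the injected `memory_pct` and `clock` callables are hypothetical names, and actually closing the Chrome instance is left out. In a real service `memory_pct` might be backed by something like `psutil.virtual_memory().percent`, if psutil is available.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

IDLE_TIMEOUT_S = 5 * 60    # auto-timeout after 5 minutes idle
MEMORY_LIMIT_PCT = 85.0    # refuse new sessions above this memory usage


@dataclass
class BrowseSession:
    user_id: str
    url: str
    last_active: float


class SessionRegistry:
    """Hypothetical registry: one session per user, idle reaping, memory guard."""

    def __init__(self, memory_pct: Callable[[], float],
                 clock: Callable[[], float] = time.monotonic):
        self._memory_pct = memory_pct
        self._clock = clock
        self._sessions: Dict[str, BrowseSession] = {}

    def start(self, user_id: str, url: str) -> Optional[BrowseSession]:
        """Start a session, or return None if the memory guard refuses it."""
        if self._memory_pct() > MEMORY_LIMIT_PCT:
            return None
        self.stop(user_id)  # starting a new session closes the previous one
        session = BrowseSession(user_id, url, self._clock())
        self._sessions[user_id] = session
        return session

    def get(self, user_id: str) -> Optional[BrowseSession]:
        return self._sessions.get(user_id)

    def touch(self, user_id: str) -> None:
        """Record activity so the idle timer restarts."""
        session = self._sessions.get(user_id)
        if session is not None:
            session.last_active = self._clock()

    def stop(self, user_id: str) -> bool:
        # A real implementation would also close the browser here.
        return self._sessions.pop(user_id, None) is not None

    def reap_idle(self) -> List[str]:
        """Stop every session idle longer than the timeout; return their user ids."""
        now = self._clock()
        expired = [uid for uid, s in self._sessions.items()
                   if now - s.last_active > IDLE_TIMEOUT_S]
        for uid in expired:
            self.stop(uid)
        return expired
```

Injecting the clock and memory reader keeps the rules testable without a real browser or a loaded machine. `reap_idle()` would be called periodically, and `stop()` from the voice-session teardown path.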

**`/api/browse/action` payloads:**
```json
{"action": "click", "selector": "#search-btn"}
{"action": "click", "x": 500, "y": 300}
{"action": "type", "selector": "input.search", "text": "blue widget", "clear": true}
{"action": "scroll", "direction": "down", "amount": 500}
{"action": "goto", "url": "https://example.com"}
{"action": "back"}
{"action": "wait", "selector": ".results", "timeout": 10000}
```
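
Since these payloads arrive from agent-written scripts, the endpoint will probably want to validate them before touching the browser. The sketch below infers the required fields from the examples above; `validate_action` and `REQUIRED` are hypothetical names, and optional fields such as `clear` and `timeout` are not checked.

```python
from typing import Any, Dict, List

NUMBER = (int, float)

# Required fields per action, read off the example payloads.
# "click" accepts either a selector or an (x, y) pair, so it is handled separately.
REQUIRED = {
    "type": {"selector": str, "text": str},
    "scroll": {"direction": str, "amount": NUMBER},
    "goto": {"url": str},
    "back": {},
    "wait": {"selector": str},
}


def validate_action(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload is acceptable."""
    action = payload.get("action")
    if action == "click":
        if "selector" in payload:
            required = {"selector": str}
        else:
            required = {"x": NUMBER, "y": NUMBER}
    elif action in REQUIRED:
        required = REQUIRED[action]
    else:
        return [f"unknown action: {action!r}"]
    return [
        f"{action}: missing or invalid field {name!r}"
        for name, types in required.items()
        if not isinstance(payload.get(name), types)
    ]
```

Returning a list of problems rather than raising lets the endpoint send every issue back to the agent in a single 400 response.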

**`/api/browse/dom` response:**
```json
{
  "url": "https://amazon.com/s?k=blue+widget",
  "title": "Amazon.com: blue widget",
  "text": "...visible text...",
  "links": [{"text": "Blue Widget Pro", "href": "/dp/B0123", "index": 1}],
  "inputs": [{"type": "text", "id": "search", "placeholder": "Search"}],
  "buttons": [{"text": "Add to Cart", "selector": "#add-to-cart-button"}]
}
```
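
For text-only analysis, a response like this would likely be flattened into a compact prompt before being handed to the agent. The following is one possible rendering, assuming exactly the field names shown in the example; `dom_to_prompt` and its `max_text_chars` cap are hypothetical.

```python
from typing import Any, Dict


def dom_to_prompt(dom: Dict[str, Any], max_text_chars: int = 2000) -> str:
    """Render a /api/browse/dom response as plain text for the agent's context."""
    lines = [
        f"URL: {dom['url']}",
        f"Title: {dom['title']}",
        "Text:",
        dom.get("text", "")[:max_text_chars],  # cap long pages
        "Links:",
    ]
    for link in dom.get("links", []):
        lines.append(f"  [{link['index']}] {link['text']} -> {link['href']}")
    lines.append("Inputs:")
    for inp in dom.get("inputs", []):
        lines.append(f"  #{inp['id']} ({inp['type']}) placeholder={inp.get('placeholder', '')!r}")
    lines.append("Buttons:")
    for btn in dom.get("buttons", []):
        lines.append(f"  {btn['text']} -> {btn['selector']}")
    return "\n".join(lines)
```

Keeping the link `index` visible gives the agent something concrete to resolve when the user says "click the third one".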

### 2. Browse Viewer Canvas Page (`default-pages/browse-viewer.html`)

Canvas page that connects to the browse stream WebSocket and renders live frames.

**Features:**
- `<canvas>` element draws JPEG frames from WebSocket (~10fps)
- URL bar showing current page address + title
- Back / Forward / Refresh / Stop navigation controls
- Agent cursor overlay showing where the agent clicked (colored dot/arrow)
- User click capture: clicks on canvas → map coordinates → relay via WebSocket
- Keyboard capture when canvas focused → relay keystrokes
- Status indicators: "Agent is browsing..." / "Loading..." / connection health
- Follows standard canvas page patterns (inline CSS, no CDN, auth bridge)
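
The "map coordinates" step in user click capture is a plain rescale from the canvas's displayed size to the remote page's viewport. The arithmetic is sketched in Python below to match the service code, although in the viewer it would be written in JavaScript. It assumes the frame is stretched to fill the canvas with no letterboxing and that the remote viewport is 1280x720 (the size quoted under Resource Impact); `canvas_to_page` is a hypothetical name.

```python
from typing import Tuple


def canvas_to_page(cx: float, cy: float,
                   canvas_w: float, canvas_h: float,
                   page_w: int = 1280, page_h: int = 720) -> Tuple[int, int]:
    """Map a click at (cx, cy) in canvas CSS pixels to page viewport pixels."""
    if canvas_w <= 0 or canvas_h <= 0:
        raise ValueError("canvas has no size")
    x = cx * page_w / canvas_w
    y = cy * page_h / canvas_h
    # Clamp so clicks on the very edge still land inside the viewport.
    x = min(max(x, 0), page_w - 1)
    y = min(max(y, 0), page_h - 1)
    return round(x), round(y)
```

For instance, a click at (250, 150) on a 640x360 canvas maps to (500, 300), which could then be relayed as `{"action": "click", "x": 500, "y": 300}`.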

### 3. `[BROWSE:url]` Action Tag (app.js modification)

New tag parsed by `app.js` alongside existing `[CANVAS_URL:...]`:

```javascript
// Assumes this runs inside an async function, as the await below requires.
const browseMatch = text.match(/\[BROWSE:([^\]]+)\]/i);
if (browseMatch) {
  const url = browseMatch[1].trim();
  const resolved = resolveCanvasUrl(url);  // reuse existing IP blocking
  if (resolved) {
    // Start server-side browse session
    const res = await fetch('/api/browse/start', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({url: resolved})
    });
    // Load the browse viewer only if the session actually started
    if (res.ok) {
      iframe.src = '/pages/browse-viewer.html';
      CanvasControl.show();
    }
  }
}
```

Also add `[BROWSE_STOP]` to close sessions.

### 4. Agent Tool Integration

**Immediate (exec-based, no OpenClaw changes):** Agent calls the browse API via fetch in exec scripts:
```javascript
const r = await fetch('http://localhost:5000/api/browse/action', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({action: 'click', selector: '#search-btn'})
});
```

**Future:** Dedicated `browse` tool in OpenClaw tool config for cleaner UX.

**TOOLS.md addition:** Instructions telling the agent about `[BROWSE:url]`, the action API, screenshot/DOM endpoints, and rules (tell user what you're doing, take screenshots after navigation, one session at a time).

### 5. Vision Auto-Capture

After each `/api/browse/action`, the service:
1. Waits 500 ms for the page to settle
2. Takes a PNG screenshot
3. Returns it in the action response (base64-encoded)
4. The conversation system includes it as vision content in the agent's next context

Also supports on-demand via `GET /api/browse/screenshot`.
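
The auto-capture steps above might be wrapped around every action roughly as follows. This is a sketch under stated assumptions: `perform` and `take_screenshot` stand in for whatever drives the browser, and the response shape (`ok`, `action`, `screenshot.mime`, `screenshot.base64`) is hypothetical, since the issue does not fix it.

```python
import base64
import time
from typing import Any, Callable, Dict

SETTLE_SECONDS = 0.5  # wait 500 ms for the page to settle


def run_action_with_capture(
    perform: Callable[[Dict[str, Any]], None],
    take_screenshot: Callable[[], bytes],
    payload: Dict[str, Any],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Run one browse action, wait for the page to settle, attach a PNG screenshot."""
    perform(payload)            # e.g. click / type / scroll in the browser
    sleep(SETTLE_SECONDS)       # fixed delay; no attempt to detect page load
    png = take_screenshot()     # raw PNG bytes
    return {
        "ok": True,
        "action": payload.get("action"),
        "screenshot": {
            "mime": "image/png",
            "base64": base64.b64encode(png).decode("ascii"),
        },
    }
```

The fixed 500 ms delay is the simplest option. Slow pages may still be mid-load when the screenshot is taken, which is one reason the explicit `wait` action exists.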

---

## Implementation Phases

### Phase 1: Core (MVP)
- `routes/browse.py` with start/stop/action/screenshot/stream
- `browse-viewer.html` canvas page with WebSocket frame display
- `[BROWSE:url]` tag parsing in app.js
- Browse instructions in TOOLS.md
- **Result:** Agent browses, user watches, agent interacts via API

### Phase 2: User Interaction
- Click/keyboard capture in browse-viewer
- Relay user events via WebSocket → Puppeteer
- Agent cursor overlay and click ripple effects
- URL bar + nav controls

### Phase 3: Vision Auto-Capture
- Auto-screenshot after each action
- Integration with conversation.py vision pipeline
- DOM extraction endpoint for text-only analysis
- Agent chooses vision vs text based on complexity

### Phase 4: Polish & Safety
- Session timeout + memory guards
- Concurrent session limits
- URL allowlist/blocklist per client
- Bandwidth optimization (adaptive quality, skip static frames)
- Error handling (Chrome crashes, network timeouts)
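
The two bandwidth optimizations named above can each be sketched in a few lines. `FrameDeduper` skips frames that are byte-identical to the last one sent, and `next_quality` nudges JPEG quality up or down based on how long the previous frame took to send. Both names and all thresholds (a 100 ms budget, matching one frame interval at 10 fps; a quality range of 30 to 80; a step of 10) are hypothetical illustrations, not tuned values.

```python
import hashlib
from typing import Optional


class FrameDeduper:
    """Skip frames whose bytes are identical to the last frame sent."""

    def __init__(self) -> None:
        self._last_digest: Optional[bytes] = None

    def should_send(self, jpeg: bytes) -> bool:
        digest = hashlib.sha256(jpeg).digest()
        if digest == self._last_digest:
            return False  # static page: nothing new to show
        self._last_digest = digest
        return True


def next_quality(current: int, send_ms: float, budget_ms: float = 100.0,
                 lo: int = 30, hi: int = 80, step: int = 10) -> int:
    """Lower quality when a frame overruns the budget; raise it when well under."""
    if send_ms > budget_ms:
        return max(lo, current - step)
    if send_ms < budget_ms / 2:
        return min(hi, current + step)
    return current
```

Exact-byte comparison only catches truly unchanged frames. A blinking cursor or an animation would still produce new frames, so this complements adaptive quality rather than replacing it.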

---

## Technical Notes

### Why CDP Screencast?

| Approach | User Sees | Agent Sees | Agent Acts | Complexity |
|---|---|---|---|---|
| **CDP Screencast** ✅ | Live frames | Screenshots | Full Puppeteer | Medium |
| VNC + noVNC | Live desktop | VNC captures | VNC input | High |
| Web proxy rewrite | Real website | DOM snapshots | Injected JS | Very high, brittle |
| Anthropic computer_use | Nothing | Screenshots | Mouse/keyboard | High latency |

### Resource Impact
- **Memory:** ~200-300MB per Chrome instance per active browse session
- **CPU:** Moderate during active browsing, near-zero idle
- **Bandwidth:** ~300-500 KB/s at 10 fps, JPEG quality 60 (1280x720)
- **Mitigation:** 1 session per user, 5 min idle timeout, memory guard
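
The figures above can be sanity-checked with a little arithmetic: 300-500 KB/s at 10 fps implies roughly 30-50 KB per frame, and a full 5-minute window of continuous streaming moves about 90-150 MB. The helpers below are a sketch of that calculation. `sessions_that_fit` is a rough capacity estimate using the upper 300 MB per-session figure and the 85% guard, and the 4096 MB machine in the usage note is a made-up example, not the actual VPS.

```python
import math


def kb_per_frame(kb_per_second: float, fps: float) -> float:
    """Average frame size implied by a stream rate."""
    return kb_per_second / fps


def stream_mb(kb_per_second: float, seconds: float) -> float:
    """Total data sent over a period, taking 1 MB = 1000 KB."""
    return kb_per_second * seconds / 1000


def sessions_that_fit(total_mb: float, used_mb: float,
                      per_session_mb: float = 300.0,
                      guard_pct: float = 85.0) -> int:
    """How many more Chrome sessions fit before memory crosses the guard line."""
    headroom_mb = total_mb * guard_pct / 100 - used_mb
    return max(0, math.floor(headroom_mb / per_session_mb))
```

On a hypothetical 4096 MB machine with 2048 MB already in use, `sessions_that_fit(4096, 2048)` gives 4. That suggests the concurrent session limit planned for Phase 4 matters more than bandwidth on a small VPS.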

### Prerequisites
- Puppeteer + Chrome already installed in openvoiceui container (used for Remotion)
- Flask-SocketIO already configured (used for voice streaming)
- Canvas page system fully operational
- Vision pipeline exists (camera/face recognition)

### Security
- Block private/internal IPs (reuse `resolveCanvasUrl()`)
- Block `file://`, `chrome://`, `data:` URL schemes
- Isolate Chrome profile per session (no cookie leaks)
- Rate-limit actions (max 10/second)
- Disable Chrome extensions + disk downloads
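
Because `resolveCanvasUrl()` runs in the client, the service will presumably need its own check before navigating. The sketch below is one way to do it with the standard library: it allowlists `http`/`https` (which also rejects `file://`, `chrome://`, and `data:`) and refuses literal private, loopback, or otherwise non-global IP addresses. `is_safe_browse_url` is a hypothetical name. The check does not resolve hostnames, so a domain that points at an internal address would still pass and needs a separate DNS-level check.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_browse_url(url: str) -> bool:
    """Return True if the URL uses an allowed scheme and a non-internal host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False
    host = parts.hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a hostname. DNS resolution is NOT checked here.
        return True
    return ip.is_global  # rejects private, loopback, and link-local ranges
```

Allowlisting schemes instead of blocklisting the three named ones is a deliberate choice in this sketch: anything unexpected, such as `javascript:` or `ftp:`, is rejected by default.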

### File Locations
```
routes/browse.py              ← Browse service API
src/app.js                    ← [BROWSE:] tag parsing (modify)
default-pages/browse-viewer.html  ← Viewer canvas page
```

---

## Full Design Document

Complete architecture document with all API schemas, code examples, and implementation details:
**[`docs/jambot/co-browsing-system.md`](https://github.com/MCERQUA/MIKE-AI/blob/main/docs/jambot/co-browsing-system.md)** in the MIKE-AI repo.

## Labels
`enhancement`, `feature`
