Version 8.8.0
WebBrain is a browser extension that gives an LLM control over the user's active browser tab. The user types a natural-language instruction in a side panel, and an autonomous agent loop calls the LLM, executes tool calls (click, type, navigate, screenshot, etc.), feeds results back to the LLM, and repeats until the task is done.
There are two builds that share almost all code:
- Chrome — Manifest V3, service worker, CDP-backed trusted events
- Firefox — Manifest V2, background page, synthetic events only
This doc covers the shared architecture and calls out where the builds diverge.
┌─────────────────────────────────────────────────────┐
│ Side Panel (UI) │
│ sidepanel.js · settings.js · traces.js │
│ locale: i18n.js / locales/*.js │
└──────────────┬──────────────────────────────────────┘
│ chrome.runtime.sendMessage({action, ...})
▼
┌─────────────────────────────────────────────────────┐
│ Background Script / Service Worker │
│ │
│ background.js — message router │
│ └─ agent.js — agent loop + executeTool() │
│ ├─ tools.js — tool schemas + system prompts│
│ ├─ adapters.js— per-site guidance │
│ ├─ credential-fields.js — secret detection │
│ ├─ captcha-solver.js — CapSolver integration │
│ ├─ loop-bucket.js — URL-family loop bucketing│
│ └─ pdf-tools.js — PDF text extraction │
│ ├─ providers/ — LLM provider abstraction │
│ ├─ network/ — fetch_url, downloads │
│ ├─ trace/ — optional IndexedDB recorder │
│ └─ recorder/ — tab recording orchestration │
│ │
│ Chrome only: │
│ ├─ cdp/ — Chrome DevTools Protocol │
│ └─ offscreen/ — fetch proxy + tab recorder │
└──────┬──────────────────────────────────────────────┘
│ chrome.scripting.executeScript / CDP
▼
┌─────────────────────────────────────────────────────┐
│ Content Scripts (injected) │
│ │
│ accessibility-tree.js — AX tree builder + ref_ids │
│ content.js — DOM reader, clicker, typer │
│ agent-visual-indicator.js — pulsing border + Stop │
└─────────────────────────────────────────────────────┘
The chat UI. Communicates with the background script via chrome.runtime.sendMessage (browser.runtime.sendMessage on Firefox). Supports two modes:
- Ask mode: read-only tools only (ASK_ONLY_TOOLS in tools.js). The agent can read, analyze, and summarize but never click, type, or navigate.
- Act mode: full tool set. The agent can take real actions in the browser.
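The mode split can be pictured as a filter over the tool list. The sketch below is illustrative only: the tool names come from this document, but the ALL_TOOLS array, the membership of ASK_ONLY_TOOLS, the mode string 'ask', and the body of getToolsForMode are assumptions, not the actual contents of tools.js.

```javascript
// Hypothetical tool registry; the real schemas live in tools.js.
const ALL_TOOLS = [
  { name: 'read_page' },
  { name: 'get_accessibility_tree' },
  { name: 'screenshot' },
  { name: 'click' },
  { name: 'type_text' },
  { name: 'navigate' },
];

// Assumed membership: only tools that cannot change page state.
const ASK_ONLY_TOOLS = new Set(['read_page', 'get_accessibility_tree', 'screenshot']);

// Ask mode sees the read-only subset; any other mode sees everything.
function getToolsForMode(mode) {
  if (mode === 'ask') {
    return ALL_TOOLS.filter((tool) => ASK_ONLY_TOOLS.has(tool.name));
  }
  return ALL_TOOLS;
}
```

Filtering the schemas before they reach the LLM means an Ask-mode model never sees an action tool it could try to call.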
The user types a message, the panel sends {action: 'chat', text, mode, tabId} to the background, then listens for agent_update events streamed back during the run. The panel renders tool calls, results, and the final answer incrementally.
The central message router. On Chrome it's a service worker (MV3); on Firefox it's a persistent background page (MV2). Responsibilities:
- Route messages between the side panel, content scripts, and the agent
- Manage the agent lifecycle: chat / chat_stream / continue / abort / clear_conversation
- Manage provider config: load, save, test, switch active provider
- Manage side panel visibility: per-window "WebBrain" tab group controls where the panel is enabled
- Expose Claude OAuth, tab recording, CAPTCHA, and other sub-features as message handlers
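A router of this kind is often written as a lookup table keyed by the message's action field. The following is a minimal sketch under that assumption; createRouter, the response envelope { ok, result, error }, and the stand-in handlers are hypothetical and do not describe the real background.js.

```javascript
// Hypothetical router factory: `handlers` maps an action name to an async function.
function createRouter(handlers) {
  const table = new Map(Object.entries(handlers));
  return async function handleMessage(message, sender) {
    const handler = table.get(message.action);
    if (!handler) {
      return { ok: false, error: `Unknown action: ${message.action}` };
    }
    try {
      return { ok: true, result: await handler(message, sender) };
    } catch (err) {
      // Report failures to the caller instead of leaving the panel waiting.
      return { ok: false, error: String(err?.message ?? err) };
    }
  };
}

// Stand-in handlers; real ones would call into agent.js.
const handleMessage = createRouter({
  chat: async ({ tabId, text, mode }) => `started ${mode} run on tab ${tabId}: ${text}`,
  abort: async ({ tabId }) => `aborted tab ${tabId}`,
});
```

In the extension, a function like handleMessage would be registered with chrome.runtime.onMessage.addListener; that wiring is omitted here so the sketch runs outside a browser.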
Injected into every page (<all_urls>). Two files loaded sequentially:
- accessibility-tree.js: exposes window.__generateAccessibilityTree() (DOM walker that produces the flat indented text tree), window.__wb_ax_lookup() (ref_id → Element resolver), and window.__wbElementMap (WeakRef-backed registry). Ships before content.js so the AX handlers are ready.
- content.js: DOM reader, interactive-element discovery, click/type/press_keys/scroll implementations, and iframe/frame support. Handlers for all content-script-dispatched tools.
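To illustrate what a WeakRef-backed ref_id registry might look like, here is a small sketch. The ElementRegistry class, its method names, and the ref_N id format are assumptions made for illustration; the real window.__wbElementMap and window.__wb_ax_lookup() may be structured differently.

```javascript
// Hypothetical registry in the spirit of window.__wbElementMap.
class ElementRegistry {
  constructor() {
    this.nextId = 1;
    this.refs = new Map();    // ref_id -> WeakRef(element)
    this.ids = new WeakMap(); // element -> ref_id, so repeat walks reuse ids
  }

  // Return the existing ref_id for an element, or mint a new one.
  register(element) {
    const known = this.ids.get(element);
    if (known !== undefined) return known;
    const refId = `ref_${this.nextId++}`;
    this.refs.set(refId, new WeakRef(element));
    this.ids.set(element, refId);
    return refId;
  }

  // Resolve a ref_id; returns null if unknown or already garbage-collected.
  lookup(refId) {
    const ref = this.refs.get(refId);
    if (!ref) return null;
    const element = ref.deref();
    if (element === undefined) {
      this.refs.delete(refId); // drop the dead entry
      return null;
    }
    return element;
  }
}
```

Holding elements through WeakRef lets the page drop removed nodes without the registry keeping them alive, while the WeakMap keeps ids stable across repeated tree walks.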
User types "create a product 'namaz' priced 500 CNY, recurring every 2 months"
sidepanel.js → chrome.runtime.sendMessage({
action: 'chat',
text: 'create a product ...',
mode: 'act',
tabId: 42
})
background.js handleMessage('chat')
→ agent.processMessage(tabId, text, onUpdate, mode)
_enrichUserMessageWithCurrentPage(tabId, messages, userMessage)
1. Collect URL + title via chrome.tabs.get(tabId)
2. If /allow-api set for this tab → inject [USER OVERRIDE] preamble
3. If site adapters enabled → getActiveAdapter(url) → inject adapter notes
4. If provider supports vision (or dedicated vision model configured):
a. Capture viewport screenshot via CDP
b. (Optional) Sub-call dedicated vision model for text description
c. Attach image_url block or vision description to first user message
5. Return enriched user message
while (steps < maxSteps) {
  steps++
  // 4a. Call LLM
  const result = await provider.chat(messages, {
    tools: getToolsForMode(mode),
    temperature: 0.3,
    maxTokens: 4096,
  })
  // 4b. Parse response
  if (result.toolCalls) {
    // 4c. Execute tool batch
    for (const tc of result.toolCalls) {
      const { name, args } = tc // assumes each tool call carries name + args
      const toolResult = await executeTool(tabId, name, args)
      // 4d. Loop detection
      const loop = _checkLoop(tabId, name, args, toolResult)
      if (loop.kind === 'stop') return loop.message
      // 4e. Auto-screenshot (if mode permits)
      if (_shouldAutoScreenshot(name)) {
        // capture CDP screenshot, then attach it as an image_url block
      }
      messages.push({ role: 'tool', content: toolResult })
    }
  } else {
    // 4f. Text-only response is the final answer
    return result.content
  }
}
executeTool(tabId, name, args, onUpdate) dispatches by name:
| Tool group | Handler | Where it runs |
|---|---|---|
| get_accessibility_tree, click_ax, type_ax, set_field, hover | content script message | Injected page context |
| click, type_text, press_keys, scroll, read_page, screenshot, etc. | content script message | Injected page context |
| navigate, new_tab | chrome.tabs API | Service worker |
| fetch_url, research_url, list_downloads, etc. | network-tools.js | Service worker |
| done | agent.js: captures verification screenshot + page state probe | Service worker + CDP |
| clarify | agent.js: pauses for user input | Service worker |
| solve_captcha | captcha-solver.js | Service worker + CapSolver API |
| record_tab, stop_recording | recorder/host.js | Service worker + offscreen doc |
| read_pdf | pdf-tools.js | Service worker |
| scratchpad_write | agent.js: in-memory pinned note | Service worker |
The agent calls onUpdate(type, data) for each event:
- tool_call: tool name + args
- tool_result: tool name + result JSON
- text / text_delta: assistant response tokens
- warning: loop detection, navigation warnings
- clarify: pending user question
- error: run errors
Background relays these via chrome.runtime.sendMessage to the side panel, which renders them incrementally.
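One way the side panel could fold these events into something renderable is a small reducer. This is a sketch under assumptions: applyUpdate, the state shape, and the payload shapes (for example data.name on tool_call, or a plain string for text_delta) are hypothetical and not taken from sidepanel.js.

```javascript
// Hypothetical reducer: folds onUpdate(type, data) events into a render model.
function applyUpdate(state, type, data) {
  switch (type) {
    case 'tool_call':
      return {
        ...state,
        steps: [...state.steps, { tool: data.name, args: data.args, result: null }],
      };
    case 'tool_result': {
      // Attach the result to the most recent unanswered call for that tool.
      const steps = state.steps.slice();
      for (let i = steps.length - 1; i >= 0; i--) {
        if (steps[i].tool === data.name && steps[i].result === null) {
          steps[i] = { ...steps[i], result: data.result };
          break;
        }
      }
      return { ...state, steps };
    }
    case 'text_delta':
      return { ...state, answer: state.answer + data };
    case 'text':
      return { ...state, answer: data };
    case 'warning':
    case 'error':
      return { ...state, notices: [...state.notices, { level: type, message: data }] };
    case 'clarify':
      return { ...state, pendingQuestion: data };
    default:
      return state; // ignore event types this sketch does not model
  }
}

const initialState = { steps: [], answer: '', notices: [], pendingQuestion: null };
```

Because each call returns a new state object, the panel can re-render after every event without tracking which fields changed.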
The scheduler lets the agent defer work to a future browser session using the browser's alarms API. It lives in src/chrome/src/agent/scheduler.js (and the Firefox mirror) and is instantiated as ScheduledJobManager in the background script.
Job kinds
| Kind | Created by | Behavior |
|---|---|---|
| resume | schedule_resume tool | Continues the current conversation in the same tab at a future time. Terminal tool: the current run ends when it fires. |
| task | schedule_task tool | Runs a standalone user-authored prompt at a future time, optionally recurring. |
Job lifecycle
pending → running → completed
↘ queued ↗ ↘ needs_user_input
↓
failed / cancelled / paused
- pending: alarm is set; waiting to fire.
- queued: alarm fired but the tab was busy; retries every 30 s (up to 120 deferrals before failing).
- running: agent is actively executing the job.
- needs_user_input: agent issued a clarify mid-run; waiting for the user's reply.
- paused: user or settings paused the job; no alarm is set.
- cancelled / failed / completed: terminal states.
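The pending/queued/running transitions described above can be sketched as a pure function applied when an alarm fires. The function name, the job fields (deferrals, retryAt), and the choice to fail once the deferral count would exceed 120 are assumptions; scheduler.js may count or store these differently.

```javascript
const RETRY_DELAY_MS = 30 * 1000; // queued jobs retry every 30 s
const MAX_DEFERRALS = 120;        // assumed: exceeding this count fails the job

// Hypothetical transition applied when a job's alarm fires.
// `job` is { status, deferrals }; `tabBusy` says whether the target tab is mid-run.
function onAlarmFired(job, tabBusy, now) {
  if (job.status !== 'pending' && job.status !== 'queued') {
    return job; // paused, running, or terminal jobs ignore stray alarms
  }
  if (!tabBusy) {
    return { ...job, status: 'running', retryAt: null };
  }
  const deferrals = job.deferrals + 1;
  if (deferrals > MAX_DEFERRALS) {
    return { ...job, status: 'failed', deferrals, retryAt: null };
  }
  return { ...job, status: 'queued', deferrals, retryAt: now + RETRY_DELAY_MS };
}
```

With a 30 s retry and 120 deferrals, a job can wait roughly an hour for a busy tab before it is marked failed.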
Targets
- current_tab: runs against the tab that was active when the job was created; fails if the tab is gone or has navigated away.
- url: opens (or reuses) a tab for a given http(s) URL at run time.
Schedule
- once: fires at a single run_at or after_seconds time. after_seconds: 0 starts the task immediately.
- recurring: fires repeatedly at interval_minutes (1 minute to 1 year); after each run completes, nextRunAt is advanced and the next alarm is set.
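As a rough illustration of the timing rules, the sketch below computes run times for both schedule kinds. Several details are assumptions rather than documented behavior: the type field on the schedule object, run_at being an epoch-millisecond number, "1 year" meaning 365 days, the first recurring run firing one interval after creation, and nextRunAt advancing from the completion time.

```javascript
const MIN_INTERVAL_MINUTES = 1;
const MAX_INTERVAL_MINUTES = 365 * 24 * 60; // "1 year", assumed to mean 365 days

// Hypothetical schedule shapes:
//   { type: 'once', run_at: <epoch ms> } or { type: 'once', after_seconds: <n> }
//   { type: 'recurring', interval_minutes: <n> }
function firstRunAt(schedule, now) {
  if (schedule.type === 'once') {
    if (typeof schedule.run_at === 'number') return schedule.run_at;
    return now + schedule.after_seconds * 1000; // after_seconds: 0 means "now"
  }
  if (schedule.type === 'recurring') {
    const minutes = schedule.interval_minutes;
    if (!(minutes >= MIN_INTERVAL_MINUTES && minutes <= MAX_INTERVAL_MINUTES)) {
      throw new RangeError(`interval_minutes out of range: ${minutes}`);
    }
    return now + minutes * 60000;
  }
  throw new TypeError(`unknown schedule type: ${schedule.type}`);
}

// After a recurring run completes, advance from the completion time.
function nextRunAfterCompletion(schedule, completedAt) {
  return schedule.type === 'recurring'
    ? completedAt + schedule.interval_minutes * 60000
    : null; // one-shot jobs have no next run
}
```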
Persistence
Jobs are stored in chrome.storage.local under the key wb_scheduled_jobs as a JSON array. On background restart, any jobs in running/needs_user_input are demoted to queued and retried, so no run is silently lost.
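The restart rule amounts to a single pass over the stored array. A minimal sketch, assuming each job object carries a status field; recoverJobsAfterRestart is a hypothetical name, and reading or writing chrome.storage.local is left out so the function stays pure.

```javascript
// Hypothetical recovery pass over the array stored under wb_scheduled_jobs.
function recoverJobsAfterRestart(jobs) {
  const interrupted = new Set(['running', 'needs_user_input']);
  return jobs.map((job) =>
    interrupted.has(job.status) ? { ...job, status: 'queued' } : job
  );
}

// Example: only the interrupted jobs change state.
const recovered = recoverJobsAfterRestart([
  { id: 'a', status: 'running' },
  { id: 'b', status: 'pending' },
  { id: 'c', status: 'needs_user_input' },
  { id: 'd', status: 'completed' },
]);
```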
Settings
| Key | Default | Effect |
|---|---|---|
| scheduledTasksEnabled | true | If false, pending jobs are paused instead of executed when their alarm fires. |
| scheduledRequireConsequentialConfirmation | true | Passes a policy flag to the agent requiring explicit user confirmation before consequential scheduled actions. |
LLM tools
| Tool | When to use |
|---|---|
| schedule_resume({after_seconds\|run_at, reason, resume_instruction}) | Durable pause for the current task when blocked on an external event (CI build, email, deploy). Terminal: the run ends after calling it. |
| schedule_task({title, prompt, schedule, target, mode}) | Create a standalone one-shot or recurring task. after_seconds: 0 starts now; any nonzero delay must be at least 60 seconds. Use only when the user explicitly asks for scheduled work. |
58+ adapters inject site-specific guidance into the first user message (and re-inject on navigation to a different matched site). Only ONE adapter fires at a time (getActiveAdapter(url) returns the first match). See docs/site-adapters.md for how to write one.
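First-match selection can be sketched as a linear scan over an ordered list. The adapter shape ({ name, match, notes }), the example.com entries, and this body of getActiveAdapter are hypothetical; docs/site-adapters.md describes the real format.

```javascript
// Hypothetical adapter shape: { name, match: RegExp, notes: string }.
const ADAPTERS = [
  {
    name: 'example-shop',
    match: /^https:\/\/shop\.example\.com\//,
    notes: 'Use the "Add product" button in the header.',
  },
  {
    name: 'example-any',
    match: /^https:\/\/[^/]*\.example\.com\//,
    notes: 'Generic example.com guidance.',
  },
];

// First match wins, so more specific adapters should be listed earlier.
function getActiveAdapter(url, adapters = ADAPTERS) {
  return adapters.find((adapter) => adapter.match.test(url)) ?? null;
}
```

Under this ordering rule, a broad pattern placed before a specific one would shadow it, which is why list order matters when only one adapter may fire.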
The primary page-interaction path. Produces a flat, indented text tree of the page where each node has a stable ref_id. Tools: get_accessibility_tree, click_ax, type_ax, set_field. See docs/accessibility-tree-and-refs.md.
Wraps chrome.debugger API for:
- Trusted events: Input.dispatchMouseEvent, Input.dispatchKeyEvent (event.isTrusted === true)
- Screenshots: Page.captureScreenshot with clip/scale control
- DOM queries: Runtime.evaluate for shadow DOM piercing, DOM.getDocument for closed roots
Without CDP (Firefox), all events are synthetic (el.click(), new KeyboardEvent()).
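A trusted click over CDP is typically a move, press, and release sequence of Input.dispatchMouseEvent commands. The sketch below shows that sequence with the command sender injected, so it runs without a browser. The trustedClick helper is hypothetical and is not the code in cdp/. In the extension, the sender would presumably wrap chrome.debugger.sendCommand for a tab that already has the debugger attached.

```javascript
// Sends the CDP commands for one trusted left click at viewport coordinates (x, y).
// `send(method, params)` is injected; in the extension it could wrap
// chrome.debugger.sendCommand({ tabId }, method, params).
async function trustedClick(send, x, y) {
  const base = { x, y, button: 'left', clickCount: 1 };
  await send('Input.dispatchMouseEvent', { type: 'mouseMoved', x, y });
  await send('Input.dispatchMouseEvent', { ...base, type: 'mousePressed' });
  await send('Input.dispatchMouseEvent', { ...base, type: 'mouseReleased' });
}

// Usage with a recording stub instead of a real debugger session.
const sentCommands = [];
const clickDone = trustedClick(async (method, params) => {
  sentCommands.push({ method, params });
}, 10, 20);
```

The Firefox path has no equivalent: el.click() and constructed events always report isTrusted as false, which some pages check before reacting.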
Abstracts LLM backends behind a common interface (BaseLLMProvider):
chat(messages, options) → { content, toolCalls, usage }
chatStream(messages, options) → async generator
supportsTools → boolean
supportsVision → boolean
useCompactPrompt → boolean
testConnection() → { ok, error, model }
See docs/providers-and-models.md.
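To make the interface shape concrete, here is a scripted test double that satisfies it. ScriptedProvider is hypothetical and does not extend the real BaseLLMProvider; the plain-property capability flags, the single-chunk stream, and the completion_tokens field are assumptions made for illustration.

```javascript
// Hypothetical test double that follows the interface shape listed above.
class ScriptedProvider {
  constructor(replies, model = 'scripted-model') {
    this.replies = [...replies]; // queued { content, toolCalls } results
    this.model = model;
    this.supportsTools = true;
    this.supportsVision = false;
    this.useCompactPrompt = false;
  }

  // Returns the next queued reply; an empty queue yields a plain text answer.
  async chat(_messages, _options = {}) {
    const next = this.replies.shift() ?? { content: 'done' };
    return {
      content: next.content ?? '',
      toolCalls: next.toolCalls ?? null,
      usage: { prompt_tokens: 0, completion_tokens: 0 },
    };
  }

  // Streams the same reply as a single chunk.
  async *chatStream(messages, options = {}) {
    yield await this.chat(messages, options);
  }

  async testConnection() {
    return { ok: true, error: null, model: this.model };
  }
}
```

A double like this lets the agent loop be exercised deterministically: queue one reply with toolCalls followed by one text reply, and the loop should execute a tool and then return.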
Three independent detectors run after every tool call:
- General repeat — last 6 tool calls by (name + args hash + outcome). Nudge at 3 identical or ABAB. Stop at 8 nudges without 2 healthy calls between.
- Coordinate click — 5px-bucketed. Nudge at 5 same-bucket clicks. Stop at 8.
- Navigation — snapshot URL before click/navigate/iframe_click, compare after.
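The nudge conditions of the first two detectors can be sketched as follows. This is a simplified reading, not the real _checkLoop: it uses JSON.stringify in place of the args hash, counts identical calls anywhere in the 6-call window rather than strictly consecutive ones, and leaves out the stop thresholds and the healthy-call reset.

```javascript
const WINDOW = 6;

// Signature = name + args + outcome; JSON.stringify stands in for the args hash.
function signature(name, args, outcome) {
  return `${name}|${JSON.stringify(args)}|${outcome}`;
}

// Returns 'nudge' when the recent window shows 3 identical calls or an ABAB pattern.
function checkRepeat(history, name, args, outcome) {
  history.push(signature(name, args, outcome));
  if (history.length > WINDOW) history.shift();

  const last = history[history.length - 1];
  const identical = history.filter((s) => s === last).length;
  if (identical >= 3) return 'nudge';

  const n = history.length;
  if (n >= 4) {
    const [a, b, c, d] = history.slice(n - 4);
    if (a === c && b === d && a !== b) return 'nudge';
  }
  return 'ok';
}

// Coordinate clicks are compared per 5 px bucket rather than per exact pixel.
function clickBucket(x, y, size = 5) {
  return `${Math.floor(x / size)}:${Math.floor(y / size)}`;
}
```

Bucketing means clicks at (10, 10) and (12, 14) count as the same target, so a model that jitters its coordinates by a pixel or two is still caught.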
- Auto-compaction (_manageContext): runs both at the start of each user turn and at the top of every agent-loop iteration, so a long autonomous run compacts mid-flight ("when it's due"), not only between turns. Triggers on whichever fires first:
  - message count > 50, or raw chars > 80,000, or
  - token budget: the running input-token count crossing contextCompactRatio (0.75) of the active provider's contextWindow (providers/base.js; category-aware default of 16k for local backends and 128k for cloud/router, overridable per provider via config.contextWindow). The token count prefers the provider's reported usage.prompt_tokens (which includes the system prompt + tool schemas) and falls back to a chars/4 estimate on the streaming path.
  - On compaction it keeps system prompt + original user task + LLM-summarized old messages + last 30 verbatim, then emits onUpdate('context_compacted', …). The side panel renders an inline "Context automatically compacted" separator so the user knows history was summarized, not lost.
- Emergency trim on context overflow: keeps only last 6 messages (the hard fallback when a provider still rejects the request after auto-compaction)
- Image pruning: strips base64 images from all but the last 4 messages before each LLM call
- Tool result cap: individual results truncated at 8,000 chars
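The thresholds above can be collected into a small decision helper. This sketch assumes string message content, treats "crossing" the ratio as greater-than-or-equal, and invents the LIMITS object, the function names, and the truncation marker; it does not reproduce _manageContext.

```javascript
const LIMITS = {
  maxMessages: 50,
  maxRawChars: 80000,
  contextCompactRatio: 0.75,
  toolResultCap: 8000,
};

// Total characters across messages (assumes string content).
function rawChars(messages) {
  return messages.reduce((sum, m) => sum + String(m.content).length, 0);
}

// Chars/4 fallback for when the provider did not report usage.prompt_tokens.
function estimateTokens(messages) {
  return Math.ceil(rawChars(messages) / 4);
}

// True when any of the three triggers fires.
function shouldCompact(messages, contextWindow, reportedPromptTokens) {
  const tokens = reportedPromptTokens ?? estimateTokens(messages);
  return (
    messages.length > LIMITS.maxMessages ||
    rawChars(messages) > LIMITS.maxRawChars ||
    tokens >= LIMITS.contextCompactRatio * contextWindow
  );
}

// Truncate one tool result to the per-result cap.
function capToolResult(text) {
  return text.length > LIMITS.toolResultCap
    ? text.slice(0, LIMITS.toolResultCap) + ' [truncated]'
    : text;
}
```

With the 16k default for local backends, the token trigger fires at 12,000 input tokens, well before the 80,000-character limit would.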
MV3 service workers can die between turns, so conversations are persisted to chrome.storage.session (writes debounced by 300 ms) and rehydrated on the first message to a tab. Conversations are isolated per tab.
| Area | Chrome (MV3) | Firefox (MV2) |
|---|---|---|
| Background | Service worker (ephemeral) | Background page (persistent) |
| Events | CDP-trusted (isTrusted=true) | Synthetic (isTrusted=false) |
| Screenshots | CDP Page.captureScreenshot | browser.tabs.captureVisibleTab() |
| Full-page screenshot | CDP scroll+stitch | Not available |
| Conversation persistence | chrome.storage.session | In-memory only |
| Offscreen document | Yes (fetch proxy + recorder) | Not available |
| Trace recorder | IndexedDB (opt-in) | IndexedDB (opt-in) — same trace/recorder.js |
| Duplicate-submit guard | Yes | Not available |
| execute_js | Blocked by CSP | Available |
| Shadow DOM piercing | CDP for closed roots | Open roots only |
| Localhost CORS | Offscreen proxy fallback | Server must set CORS headers |
| Tab recording | chrome.tabCapture + offscreen | Not available |
| Side panel | sidePanel API (MV3) | sidebar_action (MV2) |
| File upload | CDP-powered | Manual dispatch |
Everything else (agent loop, tools, adapters, providers, loop detection, context management, system prompts) is architecturally identical between the two builds.
src/
├── chrome/ # Chromium build (MV3)
│ ├── manifest.json
│ └── src/
│ ├── agent/ # agent.js, tools.js, adapters.js, scheduler.js, ...
│ ├── cdp/ # CDP client (Chrome only)
│ ├── content/ # accessibility-tree.js, content.js, ...
│ ├── network/ # network-tools.js
│ ├── offscreen/# Fetch proxy + tab recorder (Chrome only)
│ ├── providers/# BaseLLMProvider + implementations
│ ├── recorder/ # Tab recording orchestration
│ ├── trace/ # IndexedDB recorder
│ └── ui/ # sidepanel, settings, traces, i18n
├── firefox/ # Firefox build (MV2)
│ ├── manifest.json
│ └── src/ # Same structure, minus cdp/, offscreen/, recorder/
└── vendor/ # Third-party libs (pdfjs, katex)
Both builds share the same adapter set, provider implementations, accessibility tree, and most tool code. The src/shared/ pattern is intentionally avoided — files are duplicated between chrome/ and firefox/ so each build is self-contained and can be loaded directly without a build step for development.
See docs/security-model.md and src/chrome/ARCHITECTURE.md for details.
Key points:
- Extension runs with <all_urls> + debugger permissions: full browser access
- No additional auth: the agent IS the user's browser session
- /allow-api flag gates destructive HTTP methods via fetch_url
- Tool results capped at 8 KB to limit prompt-injection surface
- strictSecretMode prevents the model from quoting credentials in summaries
- Trace data is local-only (IndexedDB), never transmitted
- Offscreen proxy only forwards provider SDK traffic
- Finance adapters inject extra confirmation guidance