WebBrain's agent acts inside the user's authenticated browser session: it can click, type, navigate, run JS, and submit forms as the logged-in user. So any text it reads from a web page is attacker-controllable — a malicious tweet, a shared doc, an email, an issue comment, a PDF. The whole point of the defenses below is: page content is DATA, never instructions, and consequential actions need a human in the loop.
If you add a tool, a new way to read the page, or a new place that feeds page-derived bytes to the model, read this first. The unit tests can enforce membership of the registries, but not whether you classified a thing correctly — that's on you and the reviewer.
The code lives in both builds (src/firefox/... and src/chrome/...). Keep
them in sync — the test suite asserts the pure modules are byte-identical.
- Untrusted-content wrapping (Layer 1). Tool results that carry page-derived
bytes are wrapped in
<untrusted_page_content id="<nonce>">…</…>markers, with any literal marker in the content stripped (breakout defense).- Code:
agent.js→_wrapUntrusted(name, content); the setUNTRUSTED_CONTENT_TOOLSinpermission-gate.js.
- Code:
- System-prompt contract (Layer 2). The prompts tell the model that
anything in those markers is data, never instructions, and that only the
system prompt and the user's own chat/
clarifymessages are authoritative.- Code:
tools.js->SYSTEM_PROMPT_ASK(5-bullet block),SYSTEM_PROMPT_ACT(7-bullet block),SYSTEM_PROMPT_ACT_COMPACT(condensed opt-in compact prompt in both browser builds).
- Code:
- Capability × origin permission gate (Layer 3). Before a consequential
tool runs, the agent checks a
(capability, host)grant and prompts the user (Allow once / Always / Deny) if there isn't one. No text inspection, no LLM — the human is the trust anchor.- Code:
permission-gate.js(capabilityFor,requiredHosts,PermissionManager); the gate loop inagent.js _executeToolBatch. - User control: Settings → Permissions (review/revoke grants + the master switch "Ask before consequential actions").
- Code:
- Output sanitizer (Layer 4). Model output is HTML-escaped and only
[label](url)markdown becomes an allowlisted (http/https/mailto) link — no auto-loading images, no bare-URL linkification.- Code:
ui/markdown-link.js.
- Code:
Treat all of the following as attacker-controllable:
- DOM text and HTML — including hidden / off-screen text, ARIA labels,
alt,titleattributes, HTML comments, and text styled invisible. - OCR / vision-model transcriptions of a screenshot (
desc.text). - Fetched / downloaded documents — PDF extracted text, downloaded file
contents,
fetch_url/research_urlbodies. - URLs and hosts the page controls —
href/src, an iframe's URL, a redirect target. (These drive permission decisions, see Layer 3.) - Tool results that embed page-derived verification/probe fields — e.g. the
doneresult includespageTitle/pageState(dialog titles, live-region text). Non-obvious, easy to miss —donewas mis-classified once for exactly this reason.
Model-authored text (a tool's own status string, the agent's summary) and the
user's messages (including clarify answers) are trusted.
Add its name to UNTRUSTED_CONTENT_TOOLS in permission-gate.js (both builds).
The exhaustiveness test will fail until every act-mode tool is classified.
Map it in permission-gate.js:
- add it to
TOOL_CAPABILITY(or handle it incapabilityForif the capability depends on args — seeset_field/press_keys/fetch_url); - make sure
hostForCapability/requiredHostsresolves the real target host (destination URL for navigate/network/download; current page for click/type; the frame host for iframe tools; every host for a multi-URL tool likedownload_files); - if the host can't be determined, return
''/[]so the gate fails closed (see the iframe-without-urlFiltercase).
Some page-derived text reaches the model outside the normal tool-result path
— it's interpolated into a role:'user' or role:'tool' message the agent
builds itself. Those must be wrapped explicitly:
const wrapped = this._wrapUntrusted('screenshot', desc.text); // nonce + strip
messages.push({ role: 'user', content: `[…]\n${wrapped}` });
⚠️ A prose "this is untrusted" label is NOT the boundary. The boundary is the nonce-delimited<untrusted_page_content>markers that_wrapUntrustedproduces (and the breakout-stripping it does). Always route page-derived text through_wrapUntrusted, not just a[warning]prefix.
Known non-tool ingestion points (keep this list current):
- auto-screenshot re-injection (vision description + interactive-elements list);
- the "Initial viewport description" in
_enrichUserMessageWithCurrentPage; - PDF passthrough: the raw PDF
documentblock can't be text-wrapped, so its accompanying note carries explicit untrusted framing and the attacker- controlleddocTitleis sanitized before interpolation; - the
donetool-result push (special-cased before the normal wrap).
The master switch (Settings → Permissions) disables Layer 3 only (the prompts). Layers 1, 2, and 4 stay on always — they cost nothing and are what protect the user on the trusted sites where injected content actually lives (a reputable domain is anti-correlated with safe content). Never gate Layers 1/2/4 behind a setting.
node test/run.js— pure-logic unit tests, including:- the exhaustiveness guard: every
getToolsForMode('act')tool must be gated (capabilityFor), untrusted-read (UNTRUSTED_CONTENT_TOOLS), or on theKNOWN_SAFE_TOOLSallowlist (defined intest/run.js) — else CI fails. - capability mapping, host resolution,
requiredHosts,frameHostMatches, grant storage /hydrateFrom, content-wrap breakout-stripping.
- the exhaustiveness guard: every
test/manual-permissions.md— the in-browser checklist (the 3-option permission card and the Settings → Permissions tab) that the unit suite can't cover.
The guard checks that tools are listed, not that they're listed
correctly. If a tool's result carries page-derived bytes, it belongs in
UNTRUSTED_CONTENT_TOOLS even if it's "just a status tool" (see done). When in
doubt, wrap it — wrapping a trusted field is harmless; leaving a page-derived
field unwrapped is a hole.
These are conscious trade-offs, not oversights.
-
Generic interaction is charged to the top-level page host, not the frame it lands in.
click({x,y})(CDP coordinate clicks),type_text, andpress_keysgo to whatever pixel/element is targeted or focused — which can be inside a cross-origin iframe (e.g. an embedded Stripe/PayPal frame). The gate charges these to the page host, so a grant formerchant.comalso covers a coordinate click that lands in an embeddedstripe.comframe.- Why accepted: (1) selector/text clicks can't reach cross-origin frames
(same-origin policy blocks
querySelectorfrom piercing them), so this is limited to coordinate clicks (Chrome/CDP only — Firefox clicks the<iframe>element, not into it) and focus-based typing; (2) for legitimate embedded flows the user grants the merchant page expecting checkout — including its payment iframe — to work, so prompting for the provider's host mid-flow is arguably worse UX than the residual risk. The explicitiframe_click/iframe_typetools DO gate on the frame host (frameHostMatches), because there the model deliberately names a frame. - If you want to close it: resolve the target frame for coordinate clicks (CDP hit-test) and the focused-frame for keystrokes, then gate on that frame host or fail closed when it's cross-origin. Non-trivial and Chrome/CDP-specific; needs real-browser testing.
- Why accepted: (1) selector/text clicks can't reach cross-origin frames
(same-origin policy blocks
-
solve_captchais ungated (on theKNOWN_SAFE_TOOLSallowlist). It spends CapSolver quota and injects a token (firing the widget'sdata-callback, which on some sites auto-submits). Accepted because the cost is bounded, the consequential submit is otherwise gated, and gating it adds a prompt to a precursor the user wants when blocked by a CAPTCHA. Revisit if quota abuse becomes a real concern. -
hoveris ungated — synthetic hover reveals menus/tooltips and commits nothing. -
An LLM is not used anywhere in the gate. Intent is never inferred from page or prompt text (that approach was tried and removed — it was English-only and leaky). The gate is deterministic capability×origin with the human as the trust anchor.