
WebGPU + ONNX provider: Qwen 3 0.6B in-browser (7.4.0)#66

Open
esokullu wants to merge 33 commits into main from webgpu

Conversation

@esokullu

Collaborator

Summary

Adds a new `webgpu` provider type — a fourth "local" provider that runs Qwen 3 0.6B entirely in the browser via WebGPU + ONNX (`@huggingface/transformers`). Unlike llama.cpp / Ollama / LM Studio, it needs nothing installed: model weights download from HuggingFace on first use (~500MB q4), cached in IndexedDB, inference runs in the extension's offscreen document on the user's GPU.

Version bump 7.3.1 → 7.4.0.

Why

The other "local" providers all require the user to install + run a separate server. That's a real onboarding cliff. WebGPU + ONNX gives us a zero-install local option — useful for trying out webbrain without committing to a heavier setup, and as a privacy-preserving fallback when a user just wants to ask quick questions about a page without their data hitting any third party.

Architecture

```
service worker (background)
   └─ WebGPUProvider.chat() ──┐
                               ▼ chrome.runtime.sendMessage
offscreen document
   └─ @huggingface/transformers
      pipeline('text-generation', 'onnx-community/Qwen3-0.6B-ONNX',
               { device: 'webgpu', dtype: 'q4' })
```

  • Service workers have no WebGPU; the offscreen document does. We reuse the existing offscreen doc (already hosting the local-network fetch proxy) and add two new message handlers: `webgpu-chat` and `webgpu-probe`.
  • Pipeline is loaded lazily on the first chat call and cached for the offscreen doc's lifetime. Previous pipeline is disposed before loading a new model to avoid OOM on integrated GPUs.
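
A minimal sketch of the lazy load-and-cache behaviour in the second bullet, not the code in `offscreen.js`. `createPipelineCache` and the injected `loadPipeline` function are hypothetical names; in the extension, `loadPipeline` would wrap the `pipeline(...)` call from `@huggingface/transformers`. The sketch assumes the cache should be keyed on model, dtype and device together, and that the old pipeline exposes an async `dispose()`.

```javascript
// Hypothetical helper: holds at most one pipeline, keyed by every setting
// that changes which weights or backend get loaded.
function createPipelineCache(loadPipeline) {
  let activeKey = null;
  let activePipeline = null;

  return async function getPipeline(modelId, dtype, device) {
    const key = JSON.stringify([modelId, dtype, device]);
    if (activePipeline && activeKey === key) return activePipeline;

    // Dispose the old pipeline first so two models never share GPU memory.
    if (activePipeline && typeof activePipeline.dispose === 'function') {
      await activePipeline.dispose();
    }
    activePipeline = null;
    activeKey = null;

    activePipeline = await loadPipeline(modelId, { dtype, device });
    activeKey = key;
    return activePipeline;
  };
}
```

The sketch does not guard against two overlapping `getPipeline()` calls for different keys; a real implementation would probably serialize loads behind a shared promise.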

Tool use

Enabled. Qwen 3's chat template knows how to render `tools=[...]` into the system prompt; the model emits `<tool_call>{...}</tool_call>` blocks. Offscreen.js parses them back into OpenAI-format `tool_calls` so webbrain's loop detector + dispatch treat WebGPU exactly like any other provider.

Reliability at 0.6B is mixed — this is small-model territory. A follow-up UI hint should nudge users toward Ask mode (similar to how we handle the existing small-model warnings).

Streaming

Deferred. v1 returns the full response in one shot. The 0.6B model finishes a normal turn in seconds, so this is acceptable; the background↔offscreen chunked-message router is the kind of plumbing that wants its own PR. A comment in `chatStream()` marks the upgrade target.

Library vendoring

`@huggingface/transformers` is ~5MB JS + ~30MB ONNX-runtime-web WASM. Too big to commit. `src/chrome/vendor/transformers/README.md` documents the one-command vendoring flow:

```bash
npm install @huggingface/transformers
cp node_modules/@huggingface/transformers/dist/transformers.min.js \
   src/chrome/vendor/transformers/
# + matching ort-wasm-simd-threaded.* files
```

The provider fails fast with a clear "library not vendored" message if the file is missing, so the failure mode is obvious to anyone testing.

`.gitignore` excludes `*.js`/`*.wasm`/`*.mjs` inside the vendor dirs so an accidental `git add .` doesn't commit 30MB of WASM. The README stays tracked.

Reviewer notes

  • Default `enabled: false` because the first-run download is ~500MB. We don't want to auto-burn bandwidth on extension install.
  • Firefox: stub. Firefox has no `browser.offscreen` and its extension-context WebGPU exposure is its own can of worms (gated, prefs-only on release at the time of writing). Stub fails fast; config stays so the categorization parity test passes. Real Firefox implementation is its own future PR.
  • CSP unchanged. Existing `script-src 'self' 'wasm-unsafe-eval'; connect-src *` already allows everything the library needs: same-origin imports, WASM eval, and HuggingFace Hub fetches.
  • Manifest unchanged. Vendor dir is same-origin with the offscreen doc, no `web_accessible_resources` needed.
  • Tool-call parsing is text-format, not structured JSON-output. Qwen-3 small models can produce slightly malformed blocks; `extractToolCalls()` JSON.parse'es each block in a try/catch and silently drops the malformed ones rather than crashing the turn.
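
For reviewers who want a picture of the text-format parsing in the last bullet, here is a rough sketch. It is not the `extractToolCalls()` in this PR. It assumes each block holds a JSON object with `name` and `arguments` keys, and the `call_N` id scheme and the `{ content, toolCalls }` return shape are made up for illustration.

```javascript
// Sketch only: the real extractToolCalls() in offscreen.js may differ.
function extractToolCalls(text) {
  const toolCalls = [];
  const blockPattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let match;
  while ((match = blockPattern.exec(text)) !== null) {
    try {
      const parsed = JSON.parse(match[1].trim());
      if (!parsed || typeof parsed.name !== 'string') continue;
      toolCalls.push({
        id: `call_${toolCalls.length}`, // hypothetical id scheme
        type: 'function',
        function: {
          name: parsed.name,
          // OpenAI-format tool calls carry arguments as a JSON string.
          arguments: JSON.stringify(parsed.arguments ?? {}),
        },
      });
    } catch {
      // Malformed JSON from the model: drop this block, keep the turn alive.
    }
  }
  // Whatever sits outside the blocks is treated as the plain-text reply.
  const content = text.replace(blockPattern, '').trim();
  return { content, toolCalls: toolCalls.length > 0 ? toolCalls : null };
}
```

Dropping a bad block silently keeps the turn alive, at the cost of the agent never learning that the model tried to call a tool.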

Test plan

  • Tests pass (`node test/run.js` → 130/130, 4 new for the webgpu provider).
  • Open Chrome → Settings → Providers. Filter to "Local" → webgpu_qwen3 card visible alongside llama.cpp / Ollama / LM Studio.
  • Vendor `@huggingface/transformers` per the README (one-time setup).
  • Click webgpu card → "Test Connection" → reports library version + WebGPU availability without downloading model weights.
  • Set webgpu_qwen3 active → ask a simple Ask-mode question on any page. First run downloads the model (~500MB, about a minute on a fast connection); subsequent runs load the weights from the IndexedDB cache.
  • Switch to Act mode → try a simple click task. Verify tool-call parsing produces a sensible `click_ax` or similar.
  • Switch active provider away from webgpu → memory freed (pipeline disposed on next webgpu chat for a different model, or after page close).
  • Firefox: install Firefox build → webgpu_qwen3 card visible, "Test Connection" reports "not yet supported on Firefox".

🤖 Generated with Claude Code

esokullu and others added 2 commits May 22, 2026 15:52
Adds a fourth "local" provider alongside llama.cpp / Ollama / LM Studio.
Unlike those, this one needs nothing installed — model weights download
from HuggingFace on first use (~500MB for q4 Qwen 3 0.6B), cached in
IndexedDB by @huggingface/transformers, inference runs on the user's
GPU in the extension's offscreen document.

Architecture:

  service worker
     └─ WebGPUProvider.chat() ──┐
                                 ▼ chrome.runtime.sendMessage
  offscreen document
     └─ @huggingface/transformers
        pipeline('text-generation', 'onnx-community/Qwen3-0.6B-ONNX',
                 { device: 'webgpu', dtype: 'q4' })

Service workers have no WebGPU; the offscreen document does. We reuse
the existing offscreen doc (already hosting the local-network fetch
proxy) and add new message handlers `webgpu-chat` and `webgpu-probe`.

Tool use: enabled. Qwen 3's chat template renders `tools=[...]` into
the system prompt and the model emits `<tool_call>{...}</tool_call>`
blocks; offscreen.js parses them into OpenAI-format tool_calls so the
agent's loop detector / dispatch see WebGPU exactly like any other
provider. Reliability at 0.6B is mixed — the settings card will nudge
users toward Ask mode in a follow-up.

Streaming: v1 returns the full response (no per-token streaming yet).
The 0.6B model finishes a normal turn in seconds; the round-trip-and-
yield simplification let us ship the provider without first solving
the background↔offscreen chunked-message router. Comment in
webgpu.js's chatStream() flags the upgrade target.

Default-disabled (`enabled:false`) because the first-run download is
substantial and the library has to be vendored locally — see
src/chrome/vendor/transformers/README.md for the one-command vendoring
flow. The provider returns a clear "library not vendored" error when
the file is missing, so the failure mode is obvious.

Firefox: stub that fails fast with "not yet supported on Firefox".
Firefox doesn't have browser.offscreen and its extension-context WebGPU
exposure is its own can of worms — wiring those is its own future PR.
Stub stays so the categorization parity test stays green.

Tests: 4 new (130 total, all passing). webgpu provider present + local
+ disabled by default; no network fields (truly in-browser);
_createProvider wires the right class; chrome/firefox provider sets
stay in sync. The actual chat() path can't be exercised in Node (no
chrome.offscreen, no WebGPU), but the wiring + classification can.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 22, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| webbrain | Ready | Preview, Comment | May 26, 2026 3:20pm |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7ed0192e5


Comment thread src/chrome/src/providers/webgpu.js Outdated
Comment on lines +107 to +109
type: result.toolCalls ? 'tool_call' : 'text',
content: result.content,
toolCalls: result.toolCalls,


P1 Badge Emit tool calls in streaming chunk content

When chatStream() returns a tool call, this chunk sets content to result.content (text) instead of the tool-call array. The streaming agent path (processMessageStream) iterates chunk.content as tool-call deltas, so WebGPU tool calls are dropped/misparsed and Act-mode tool execution fails whenever the chat_stream workflow is used.

Collaborator Author


fixed

Comment thread src/chrome/src/offscreen/offscreen.js Outdated
}

async function getPipeline(modelId, dtype, device) {
if (_activePipeline && _activeModelId === modelId) return _activePipeline;


P2 Badge Rebuild cached pipeline when dtype or device changes

The pipeline cache key only checks modelId, but users can edit dtype and device in provider settings. After one successful load, changing quantization/backend (for example q4 → q8 or webgpu → wasm) will silently keep using the old pipeline, so configuration changes do not take effect until the offscreen document is recreated.

Collaborator Author


fixed

esokullu and others added 2 commits May 22, 2026 16:27
The first time a user picks the WebGPU provider, ~500MB of Qwen 3
weights pull from HF Hub — a ~30-60s wait the existing UI doesn't
hint at at all. Renders as a frozen "thinking…" spinner, which is
indistinguishable from a hang.

Add a progress card at the top of the messages container:

  ┌──────────────────────────────────────────────────┐
  │ Downloading onnx-community/Qwen3-0.6B-ONNX —     │
  │ 142 / 487 MB                                     │
  │ █████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
  │ onnx/model_q4.onnx                               │
  └──────────────────────────────────────────────────┘

  - Aggregates loaded/total across files (model has ~8 parallel
    downloads — weights, tokenizer, config, etc.).
  - Bar fills to 100% + flips green on the 'ready' event, then
    auto-dismisses ~1.8s later so the user sees confirmation.
  - Throttled to one progress update per file per 200ms so the
    message channel doesn't drown in callbacks.
  - Fire-and-forget broadcast from offscreen → sidepanel (.catch
    swallows "no listener" errors when no panel is open).
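
A sketch of the aggregation + throttling described in the list above, under
assumptions: `createProgressTracker` is a hypothetical name, and each progress
event is assumed to carry `file`, `loaded` and `total` fields. The clock is
injectable so the throttle can be tested without timers.

```javascript
// Hypothetical tracker; names are illustrative, not taken from this PR.
function createProgressTracker({ intervalMs = 200, now = Date.now } = {}) {
  const files = new Map();    // file name -> { loaded, total }
  const lastEmit = new Map(); // file name -> time of last forwarded update

  return function onProgress({ file, loaded, total }) {
    files.set(file, { loaded, total }); // always record the latest numbers

    const t = now();
    const last = lastEmit.get(file);
    if (last !== undefined && t - last < intervalMs) return null; // throttled
    lastEmit.set(file, t);

    // Sum across every file seen so far so the bar shows one overall figure.
    let sumLoaded = 0;
    let sumTotal = 0;
    for (const f of files.values()) {
      sumLoaded += f.loaded;
      sumTotal += f.total;
    }
    const percent = sumTotal > 0 ? Math.round((sumLoaded / sumTotal) * 100) : 0;
    return { file, loaded: sumLoaded, total: sumTotal, percent };
  };
}
```

A non-null return value is what would be broadcast to the side panel; `null`
means the update was swallowed by the per-file throttle.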

Firefox side has the same listener + renderer for parity, even
though the Firefox WebGPU provider itself is still stubbed — once
the Firefox path is wired up, the progress UI is ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commits the three files the WebGPU provider needs:

  src/chrome/vendor/transformers/
    ├─ transformers.web.min.js          (422 KB — browser ESM bundle)
    ├─ ort-wasm-simd-threaded.jsep.mjs  ( 46 KB — WASM loader shim)
    └─ ort-wasm-simd-threaded.jsep.wasm ( 25 MB — WebGPU ONNX runtime)

Yes, the .wasm is 25MB. It's the cost of shipping a real local LLM
runtime — there's no smaller variant that does WebGPU. The trade-off:
the extension grows from ~2MB to ~28MB on disk, in exchange for a
provider that works straight from `git clone` with zero per-developer
setup. Previous behaviour was "vendor the library yourself per the
README" which is realistic for a 1-person team and friction for any
larger group.

Implementation details:

  - We vendor the .web.min.js variant (not the dual ESM/CJS
    transformers.min.js or the Node builds). Smaller, browser-only,
    matches our actual import path.

  - env.backends.onnx.wasm.wasmPaths is pinned to the vendor dir's
    chrome-extension:// URL. Without this the loader resolves the
    WASM path relative to transformers.web.min.js's URL — which
    happens to work today because they're siblings, but only by
    accident. Setting it explicitly makes the wiring obvious and
    survives future re-vendoring at different paths. Wrapped in
    try/catch so library shape changes between versions fall back
    to default resolution.

  - The CPU-fallback WASM variants (.wasm / .asyncify.wasm /
    .jspi.wasm) are intentionally NOT vendored; systems without
    WebGPU get a clear "WebGPU not available" error instead. Saves
    ~40MB of WASM we don't use. Add them later if CPU fallback
    becomes a real ask.

  - Firefox vendor dir stays empty (gitignored) — the Firefox WebGPU
    provider is still a stub; no point shipping 25MB of WASM it
    doesn't reach. Comment in .gitignore flags this for whoever
    wires the Firefox path next.

  - package.json now lists @huggingface/transformers as a regular
    dep (not devDep) — semantically wrong for an ESM file we commit,
    but useful: `npm install` keeps the version pinned for whoever
    needs to update the vendored files later. The README documents
    the update flow.

The README in the vendor dir reflects the new "it's checked in"
reality and explains the update procedure for next time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
specifier)

Real bug shipped with the previous "vendor the library" commit:
transformers.web.min.js contains a dynamic
`import("onnxruntime-web/webgpu")` — a bare module specifier. The
browser can't resolve bare specifiers without a build step or an
import map, and the WebGPU provider failed on first chat with:

    Failed to resolve module specifier "onnxruntime-web/webgpu".
    Relative references must start with either "/", "./", or "../".

Two-line fix:

  1. Vendor onnxruntime-web/dist/ort.webgpu.bundle.min.mjs (111KB,
     fully self-bundled — no further bare imports inside it).
  2. Rewrite the bare specifier in our vendored transformers.web.min.js
     to "./ort.webgpu.bundle.min.mjs" so it resolves as a relative
     URL against the patched file's own location. One sed replace,
     verified the count goes 1→0.

Why not an import map: MV3's CSP `script-src 'self'` can block inline
`<script type="importmap">` on some Chrome versions. Patching the
specifier sidesteps the CSP question entirely.

The webgpu bundle is self-contained (the bundled variant inlines all
ONNX-runtime dependencies it needs at WebGPU-init time), so no
external WASM fetch happens during normal WebGPU inference. The
existing jsep.wasm + jsep.mjs files stay vendored as a defensive
fallback path in case env.backends.onnx.wasm.wasmPaths ever gets
hit at runtime — they're never loaded for WebGPU, but cost nothing
since they're already there.

Vendor README updated with the sed step + verification command so
re-vendoring a future library version doesn't reintroduce the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes folded into one:

1. CHROME WEB STORE / AMO compatibility. The previous commit vendored
   .min.js builds; both stores want readable source for review and can
   reject or stall reviews of minified blobs. Switch to the unminified
   variants:

     transformers.web.min.js     →  transformers.web.js    (~422K → 1.1M)
     ort.webgpu.bundle.min.mjs   →  ort.webgpu.mjs         (~111K → 662K)

   Total vendor dir grows from ~26MB to ~27MB. Negligible at runtime
   (the JS still parses in microseconds), worth a lot for review
   process. The 25MB WASM stays where it was — it's already
   not-text-readable by nature.

2. THE BARE-SPECIFIER FIX, BUT AGAINST THE RIGHT FILE. The previous
   commit sed-patched transformers.web.MIN.js — but offscreen.js
   actually loads transformers.web.js after this commit. The minified
   sibling never loaded, so the fix never ran. Reported as "still the
   same error" by the user.

   In the unminified .web.js the bare import is a STATIC import (not
   the dynamic form the minifier emits):

     import * as ONNX_WEB from "onnxruntime-web/webgpu";  // line 7547

   sed -i 's|"onnxruntime-web/webgpu"|"./ort.webgpu.mjs"|' \
     src/chrome/vendor/transformers/transformers.web.js

   One occurrence, replaced, verified count goes 1→0 with grep.

   Why not the "bundle" variant of onnxruntime-web/webgpu (the .bundle
   .min.mjs that inlines everything)? It's only available minified.
   The plain ort.webgpu.mjs is unminified and has no bare imports of
   its own (only Node-specific `node:fs` / `node:os` requires that
   never fire in browsers).

Vendor README updated end-to-end:
  - "What's here" table reflects the new file names + sizes
  - Adds a "Why unminified" callout pointing at store policy
  - Update procedure has the new cp + sed lines
  - "Files NOT vendored" explains why we skip the .bundle variants

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported same error message but a different specifier:
"Failed to resolve module specifier 'onnxruntime-common'".

Root cause: transformers.web.js has TWO bare specifiers, not one.
The first fix (`onnxruntime-web/webgpu` → relative path) resolved
one, but at line 7605 there's a second:

  import { Tensor } from "onnxruntime-common";

onnxruntime-common is a separate npm package providing Tensor +
session types. It's a transitive dep of @huggingface/transformers
(via onnxruntime-web).

Fix: wholesale-vendor its ESM tree, sed-patch the import.

  - Copy node_modules/onnxruntime-common/dist/esm/*.js (21 small
    files, ~85KB total) into vendor/transformers/onnxruntime-common/.
    The ESM tree is self-contained — all inter-file imports are
    already relative, no further patches needed.
  - sed: "onnxruntime-common" → "./onnxruntime-common/index.js" in
    transformers.web.js. One occurrence, replaced, verified.

Also added a defensive whole-tree bare-specifier sweep to the
vendor README's verification step — catches future versions that
introduce a THIRD bare import without needing a debug-runtime
round-trip.
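
A sketch of what such a sweep could look like in Node. It is not the command
in the vendor README, and `findBareSpecifiers` is a hypothetical name. It is
regex-based, so it also reports matches inside comments (like the JSDoc
example mentioned below), and it deliberately treats scheme-style specifiers
such as `node:fs` as non-bare.

```javascript
// Sketch: list import specifiers a browser cannot resolve without an
// import map. Regex-based, so matches inside comments are reported too.
function findBareSpecifiers(source) {
  const pattern = /(?:\bfrom\s*|\bimport\s*\(?\s*)(["'])([^"']+)\1/g;
  const bare = new Set();
  for (const match of source.matchAll(pattern)) {
    const spec = match[2];
    const isRelative =
      spec.startsWith('./') || spec.startsWith('../') || spec.startsWith('/');
    // https:, chrome-extension:, node:, ... all carry a scheme prefix.
    const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(spec);
    if (!isRelative && !hasScheme) bare.add(spec);
  }
  return [...bare];
}
```

Run over the text of each vendored file, an empty result (or only known
false positives) would mean no further sed patches are needed.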

The remaining "@huggingface/transformers" hit at line ~10667 is a
JSDoc example string inside a comment block, not a real import.
README documents this so future maintainers don't get spooked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ONNX Runtime Web dynamically picks a WASM variant at load time. For
Qwen 3 on WebGPU, ops that can't run on the GPU fall back to CPU,
which needs ort-wasm-simd-threaded.asyncify.{mjs,wasm} — without
this pair the runtime errors with "no available backend found,
Failed to fetch dynamically imported module .../asyncify.mjs".

Add both files (~23MB wasm + 47KB loader) and document why .jspi
and the plain variant are still skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'q4' uses 4-bit weights with fp32 activations. The activation buffers
for Qwen 3 0.6B mid-inference overrun the WASM 2GB heap, producing
'std::bad_alloc' out of OrtRun on most laptops. 'q4f16' (4-bit
weights + fp16 activations) cuts the activation footprint in half and
is the dtype the transformers.js team recommends for Qwen on WebGPU.

Update the default in WebGPUProvider, the seed config in
providers/manager.js, and the placeholder text in the dtype settings
field — both chrome and firefox builds.

NOTE: existing users with a stored dtype:'q4' need to either remove
and re-add the WebGPU provider, or edit the dtype field in Settings.
The first run after switching will re-download ~500MB (q4f16 weights);
the old q4 weights stay in IndexedDB but go unused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some Chrome/GPU combos hit 'Integer overflow' from safeint.h during
OrtRun on Qwen 3 + q4f16. The mixed-precision quantization kernels
take an int32-shape code path that overflows for the model's
attention buffer math. fp16 runs every kernel at one uniform precision
and sidesteps the issue at the cost of ~1.2GB download (vs ~500MB).

- Note the workaround in offscreen.js's pipeline-load comment.
- Add a Troubleshooting table to the vendor README covering the full
  error cascade we've walked through: bare-specifier, asyncify-mjs,
  bad_alloc, integer-overflow, no-backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If WebGPU silently falls back to a software adapter (SwiftShader on
Windows when discrete GPU is power-saved, Lavapipe on Linux without
a Vulkan driver, etc.), inference burns 500MB on a download then OOMs
the WASM heap with std::bad_alloc on first token. From the user's
side this looks like dtype/model bugs.

Make the offscreen probe call requestAdapter() and report
isFallbackAdapter. webgpu.js#testConnection turns that into a
specific error message naming chrome://flags. The pipeline loader
also logs adapter info + onnx backend keys to the offscreen DevTools
console so we can diagnose future "all dtypes OOM" reports without
another round-trip.
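
A sketch of the probe described above, not the code in offscreen.js.
`probeAdapter` and its return shape are hypothetical. Depending on browser
version the fallback flag is exposed on the adapter itself or on
`adapter.info`, so the sketch checks both.

```javascript
// Sketch: `gpu` is navigator.gpu in the offscreen document (or a stand-in).
async function probeAdapter(gpu) {
  if (!gpu || typeof gpu.requestAdapter !== 'function') {
    return { available: false, reason: 'navigator.gpu is not exposed here' };
  }
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    return { available: false, reason: 'requestAdapter() returned null' };
  }
  // Older builds expose the flag on the adapter, newer ones on adapter.info.
  const isFallbackAdapter = Boolean(
    adapter.isFallbackAdapter ?? adapter.info?.isFallbackAdapter
  );
  return {
    available: !isFallbackAdapter,
    isFallbackAdapter,
    reason: isFallbackAdapter ? 'software (fallback) adapter' : null,
  };
}
```

Calling `probeAdapter(navigator.gpu)` before any weights are fetched is what
lets Test Connection fail early instead of after the download.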

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers.js's init code auto-sets wasmPaths to the .asyncify
variant for non-Safari browsers (line ~7786 of transformers.web.js).
The asyncify wasm has Asyncify stack-switching support but NO JSEP
(JavaScript Execution Provider) exports — and the WebGPU EP is
plumbed THROUGH JSEP.

ort.webgpu.mjs calls things like `wasm2.jsepOnCreateSession?.()`
with optional chaining; when those exports are undefined, WebGPU
initialization SILENTLY no-ops. The runtime then runs the entire
model on the WASM CPU backend, blowing the 2GB heap on any sub-1B
model. From the user's side this looks like 'std::bad_alloc on every
dtype' even though chrome://gpu shows WebGPU is hardware-accelerated.

Fix: set wasmPaths to the {mjs, wasm} object form pointing at the
.jsep files. The urlOverride path in ort.webgpu.mjs uses them
directly, bypassing the asyncify default. .jsep.wasm exports the
jsep* functions the WebGPU EP needs.
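
A sketch of the object-form override, under assumptions: `pinWasmPaths` is a
hypothetical helper, `getURL` stands in for `chrome.runtime.getURL`, and
`vendor/transformers/` is assumed to be the vendor dir's path relative to the
extension root. The try/catch mirrors the earlier "fall back to default
resolution" behaviour.

```javascript
// Sketch: point onnxruntime-web at one specific loader + wasm pair.
function pinWasmPaths(env, getURL, baseName = 'ort-wasm-simd-threaded.jsep') {
  const dir = 'vendor/transformers/'; // assumed path inside the extension
  try {
    env.backends.onnx.wasm.wasmPaths = {
      mjs: getURL(`${dir}${baseName}.mjs`),
      wasm: getURL(`${dir}${baseName}.wasm`),
    };
    return true;
  } catch {
    // Library shape changed between versions: keep default resolution.
    return false;
  }
}
```

In the extension this would be called as
`pinWasmPaths(env, (p) => chrome.runtime.getURL(p))`; the `baseName`
parameter is there so a different variant can be selected without touching
the helper.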

Add hasWebgpuBackend + wasmPaths to the diagnostic log so a future
regression is one line to spot. Update the troubleshooting table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The asyncify wasm (which is the WebGPU-capable build in
onnxruntime-web 1.20+ — webgpuInit / webgpuRegisterDevice live there,
NOT in the jsep wasm) uses threading for its heap allocator. Threading
needs SharedArrayBuffer. SharedArrayBuffer needs crossOriginIsolated.
That needs cross_origin_embedder_policy + cross_origin_opener_policy
in the manifest.

Without isolation, the wasm falls back to a plain ArrayBuffer heap
that's tiny — and inference std::bad_allocs on any 100MB+ allocation
even when chrome://gpu shows WebGPU is hardware-accelerated and
navigator.gpu hands out a real adapter. Confusing because the surface
error looks like model-too-big rather than configuration.

Add COOP/COEP to the chrome manifest. Also revert the wasmPaths
override to point at .asyncify.{mjs,wasm} (the previous commit
mistakenly pointed at .jsep, which lacks the webgpu* exports and gave
us 'webgpuInit is not a function' instead). Add crossOriginIsolated +
SharedArrayBuffer presence to the diagnostic log so the manifest
change is verifiable without DevTools spelunking.
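
A sketch of the isolation check that could feed that diagnostic log.
`isolationDiagnostics` and its field names are hypothetical; the only
platform facts it relies on are the `crossOriginIsolated` global and the
presence of `SharedArrayBuffer`.

```javascript
// Sketch: report whether threaded WASM has what it needs in this context.
function isolationDiagnostics(scope = globalThis) {
  const crossOriginIsolated = scope.crossOriginIsolated === true;
  const hasSharedArrayBuffer = typeof scope.SharedArrayBuffer === 'function';
  return {
    crossOriginIsolated,
    hasSharedArrayBuffer,
    // Both must hold for a shared-memory heap, per the reasoning above.
    threadedWasmLikely: crossOriginIsolated && hasSharedArrayBuffer,
  };
}
```

Logging `isolationDiagnostics()` from the offscreen document before and after
the manifest change should flip `crossOriginIsolated` from false to true if
the COOP/COEP keys took effect.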

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Past the WASM-heap saga — WebGPU is now actually running the model
(input log shows past_conv.0 / past_recurrent.0 state tensors,
confirming Qwen 3.5 0.8B's hybrid Mamba+Transformer architecture is
executing). The new error is a Dawn-side "Failed to allocate memory
for buffer mapping" during mapAsync, which reads GPU buffers back
to CPU.

Hybrid/vision models like Qwen 3.5 have past_conv, past_recurrent,
AND transformer KV cache — that's a lot of buffers to map back and
forth between GPU and CPU on every forward pass. Setting
preferredOutputLocation: 'gpu-buffer' keeps the outputs as GPU
buffers, so the next forward pass can feed them directly without
the round-trip and Dawn doesn't run out of mapping staging memory.

transformers.js attempts this automatically for kv-cache outputs
when the model config provides cache_config, but the wiring doesn't
always populate the right names for hybrid/VL models. Setting it
globally is the safe override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
esokullu and others added 2 commits May 26, 2026 03:59
# Conflicts:
#	manifest.json
#	package-lock.json
#	package.json
#	src/chrome/ARCHITECTURE.md
#	src/chrome/manifest.json
#	src/chrome/src/ui/settings.js
#	src/firefox/ARCHITECTURE.md
#	src/firefox/manifest.json
#	src/firefox/src/ui/settings.js
#	test/run.js
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@esokullu

Collaborator Author

@codex what's wrong here? i'm getting:

Error: webgpu: The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly.

@chatgpt-codex-connector


Summary

  • Fixed the WebGPU inference worker to stop forcing preferredOutputLocation: 'gpu-buffer' on all generation outputs, which can trigger the exact runtime error you reported (The data is not on CPU...) when transformers.js post-processing expects CPU-readable tensors. (src/chrome/src/offscreen/inference-worker.js, L122-L130)
  • Added an inline explanatory comment so future changes preserve this constraint in the chat-generation path. (src/chrome/src/offscreen/inference-worker.js, L125-L129)
  • Committed the change on the current branch (ed8e6ce) and created a follow-up PR message with the rationale and validation steps.

Testing

  • node test/run.js (197 passed, 0 failed)

