
fix: embed --stale pulls all chunks every cycle (3 TB egress regression) #775

Open
kyledeanjackson wants to merge 3 commits into garrytan:master from kyledeanjackson:fix/embed-stale-egress

Conversation


@kyledeanjackson kyledeanjackson commented May 9, 2026

TL;DR

Three compounding bugs in embed --stale + the autopilot launchd plist
caused ~3 TB/month of Postgres egress on a fully-embedded brain
(~682 pages) because every autopilot cycle re-pulled all chunks across
the wire just to discover there was nothing to embed. This PR fixes
all three layers and adds an early-exit fast path so steady-state
brains do near-zero work per cycle.

Verified on production data — autopilot cycle time goes from ~10s
fetching ~100 MB of vectors to <1s pulling a few hundred bytes.

What I observed

Supabase pooler egress on a 329-page brain (~1443 chunks) climbed to
3,062 GB / 250 GB quota = 1,225% in a single billing cycle. Chart
showed a sharp transition: zero egress before the day the brain hit
100% embedded coverage, then 400–600 GB/day sustained.

03 May  04 May  05 May  06 May  07 May  08 May
 200GB   480GB   450GB   410GB   450GB   610GB

Cached egress: 0 (nothing was being served from the pooler cache).
Storage: 0 (not file traffic). Realtime: 0 (no WebSocket fanout).
Edge functions: 0. Pure PostgREST/pooler row traffic.

Realtime concurrent connections peaked at 4 — so this wasn't volume
from many clients; it was a small number of clients pulling the same
rows over and over.

Root cause

Autopilot:  KeepAlive=true + no sleep    →  ~4200 cycles/day
   ×
embedAll:   iterates all 682 pages       →  682 getChunks/cycle
   ×
getChunks:  SELECT cc.* including        →  ~30 KB/chunk over the wire
            the 1536-dim embedding          (vector marshalled as JSON
            column                          text balloons vs. binary)

= ~100 MB per cycle × 4200 cycles/day ≈ 420 GB/day  ✓ matches chart

Three independent contributors, each amplifying the others:

1. KeepAlive=true autopilot is a hot loop, not a periodic task

The plist generated by gbrain autopilot --install had no
StartInterval, no internal sleep, just KeepAlive=true. launchd
restarts the wrapper as soon as it exits, so cycles run back-to-back.
On one user's machine: 254,050 cycles in ~60 days = ~4,233 per day
= one every ~20 seconds.

Log evidence (every cycle):

[autopilot wrapper] starting; openai_key=set
[autopilot wrapper] sync source=default …
[autopilot wrapper] sync source=bh-brain …
[autopilot wrapper] sync source=bh-vault …
[autopilot wrapper] embed --stale --all
Embedded 0 chunks across 682 pages    ← all the work, none of the value
[autopilot wrapper] cycle complete

KeepAlive is the wrong launchd primitive for a periodic task — that's
what StartInterval is for.
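
For reference, the one-key swap in the generated
com.gbrain.autopilot plist (surrounding keys omitted; see the
migration notes below for the full paths):

<!-- Before: launchd respawns the wrapper the moment it exits -->
<key>KeepAlive</key>
<true/>

<!-- After: run at most once every 300 s, however fast a cycle exits -->
<key>StartInterval</key>
<integer>300</integer>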

2. embedAll calls getChunks for every page, then filters in memory

async function embedAll(engine, staleOnly) {
  const pages = await engine.listPages({ limit: 100000 });
  // ...
  async function embedOnePage(page) {
    const chunks = await engine.getChunks(page.slug);   // 682 round-trips
    const toEmbed = staleOnly
      ? chunks.filter(c => !c.embedded_at)              // filter AFTER fetch
      : chunks;
    // ...
  }
}

When staleOnly is true and the brain is fully embedded, toEmbed
is empty for every page — but every page's chunks are pulled across
the wire first, just to be discarded.

3. getChunks SELECTs the embedding column unnecessarily

async getChunks(slug: string): Promise<Chunk[]> {
  const rows = await sql`
    SELECT cc.* FROM content_chunks cc       -- ← includes 1536-dim vector
    JOIN pages p ON p.id = cc.page_id
    WHERE p.slug = ${slug}
    ORDER BY cc.chunk_index
  `;
  return rows.map((r) => rowToChunk(r));     // rowToChunk(includeEmbedding=false)
}                                              // already drops it on parse

rowToChunk() defaults to includeEmbedding=false and discards the
vector after fetching. So the bytes were pulled across the network
only to be thrown away. A separate getChunksWithEmbeddings() already
exists for the legitimate caller (migrate-engine.ts).
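
For concreteness, here is the shape of both engine-side changes. This
is a reconstruction, not the literal diff: the projected column names
(cc.id, cc.content, cc.embedded_at) and the slug join in
listStalePageSlugs() are my guesses from the schema shown above.

async getChunks(slug: string): Promise<Chunk[]> {
  const rows = await sql`
    SELECT cc.id, cc.page_id, cc.chunk_index, cc.content, cc.embedded_at
    FROM content_chunks cc                 -- no cc.embedding: drops ~30 KB/row
    JOIN pages p ON p.id = cc.page_id
    WHERE p.slug = ${slug}
    ORDER BY cc.chunk_index
  `;
  return rows.map((r) => rowToChunk(r));
}

async listStalePageSlugs(): Promise<string[]> {
  const rows = await sql`
    SELECT DISTINCT p.slug
    FROM content_chunks cc
    JOIN pages p ON p.id = cc.page_id
    WHERE cc.embedding IS NULL
  `;
  return rows.map((r) => r.slug);          // one round-trip, bytes not megabytes
}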

The fix

Three small commits, each addressing one layer:

  1. fix(engine) — getChunks projects only the columns
    rowToChunk actually reads. Adds a listStalePageSlugs() engine
    method (one query: SELECT DISTINCT page_id FROM content_chunks WHERE embedding IS NULL).

  2. fix(embed) — embedAll(staleOnly=true) calls
    listStalePageSlugs() first. If empty, log and return. Otherwise
    filter pages to only those with stale chunks, then iterate
    normally (see the sketch after this list). The non-stale (--all)
    path is unchanged.

  3. fix(autopilot) — plist template uses
    StartInterval=300 instead of KeepAlive=true. ~288 cycles/day
    max. Tunable per-user.
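
As referenced in item 2, a sketch of the fast path. Control flow
follows the commit message; the logging and Set-based filtering are
illustrative wiring, not the literal diff:

async function embedAll(engine, staleOnly) {
  if (staleOnly) {
    // Fast path: one cheap query; an empty result means nothing to embed.
    const staleSlugs = await engine.listStalePageSlugs();
    if (staleSlugs.length === 0) {
      console.log("Embedded 0 chunks across 0 pages");
      return;
    }
    // Iterate only the pages that actually have stale chunks.
    const stale = new Set(staleSlugs);
    const pages = (await engine.listPages({ limit: 100000 }))
      .filter((p) => stale.has(p.slug));
    // ... embed per page as before
    return;
  }
  // --all path unchanged: iterate every page in the brain.
}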

Verification

Tests

$ bun run typecheck
$ bun test test/embed.test.ts test/autopilot-install.test.ts test/pglite-engine.test.ts
 96 pass  0 fail  189 expect() calls

Full suite: 2086 pass / 18 fail / 3 errors. The 18 failures are all
pre-existing "beforeEach hook timed out" / "PGLite not connected"
flakes in dream.test.ts, orphans.test.ts, multi-source-integration.test.ts,
etc. — confirmed by re-running on main (clean tree), where the same
files all pass in isolation. None touch any of the files in this PR.

Real-world

Tested against the production brain (682 pages, fully embedded):

$ time gbrain embed --stale
[embed.pages] start
[embed.pages] 6/6 (100%)
Embedded 0 chunks across 6 pages
[embed.pages] 6/6 (100%) done

real    0m0.893s

Six pages had transient stale chunks from a recent ingest. Old code
would have done 682 round-trips returning vectors; new code did one
small query and returned in <1s. After re-embedding those 6 pages,
subsequent runs early-exit:

$ time gbrain embed --stale
Embedded 0 chunks across 0 pages

real    0m0.02s

Production cutover

Re-enabled the autopilot with the patched code on the same machine
that was bleeding 400–600 GB/day. The Supabase egress chart will
confirm the fix over the next 24h; the chart and a follow-up will
land in the linked issue.

Behavior changes (intentional)

EmbedResult semantics change in --stale mode:

Field                  Old (--stale)                            New (--stale)
pages_processed        every page in brain                      pages with at least one stale chunk
total_chunks           every chunk in those pages               only chunks on stale pages
skipped                every already-embedded chunk anywhere    skipped chunks on visited pages
embedded               unchanged                                unchanged
would_embed (dry-run)  unchanged                                unchanged

The new semantics are arguably more useful — they describe work done,
not work considered. The --all path is unchanged.

JSDoc on EmbedResult updated to call this out.
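
Roughly what the updated JSDoc now calls out — field names are from
the table above, but the numeric types and exact wording are my
paraphrase, not the committed doc comments:

interface EmbedResult {
  /** --stale: pages with at least one stale chunk. --all: every page. */
  pages_processed: number;
  /** --stale: only chunks on stale pages. --all: every chunk considered. */
  total_chunks: number;
  /** --stale: skipped chunks on visited pages only. */
  skipped: number;
  /** Chunks actually embedded this run (unchanged). */
  embedded: number;
  /** Dry-run only: chunks that would have been embedded (unchanged). */
  would_embed?: number;
}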

Migration notes for existing installs

The --install template change only affects new installs. Existing
users have plists with KeepAlive=true already deployed. They can:

  • Easy: gbrain autopilot --uninstall && gbrain autopilot --install
    to regenerate from the new template, OR
  • Manual: edit ~/Library/LaunchAgents/com.gbrain.autopilot.plist
    to replace <key>KeepAlive</key><true/> with
    <key>StartInterval</key><integer>300</integer>, then
    launchctl bootout gui/$UID/com.gbrain.autopilot && launchctl bootstrap gui/$UID ~/Library/LaunchAgents/com.gbrain.autopilot.plist.

A doctor / heal step that detects the legacy plist and rewrites it
in place would be a nice follow-up. Happy to add it if you'd like.
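
The detection half of that follow-up could be a few lines; this is a
hypothetical sketch, not code in this PR:

// Hypothetical doctor check — not part of this PR.
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";

const plist = `${homedir()}/Library/LaunchAgents/com.gbrain.autopilot.plist`;
if (existsSync(plist) && readFileSync(plist, "utf8").includes("<key>KeepAlive</key>")) {
  console.warn("Legacy hot-loop plist detected; re-run `gbrain autopilot --install`.");
}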

Out of scope

  • pgvector index tuning (HNSW vs. IVFFLAT) — separate concern
  • Search/query path — searchVector() already projects correctly,
    no change needed there
  • Per-row JSON-vs-binary encoding — postgres-js handles this; the
    fix is to not fetch the column at all

kyledeanjackson and others added 3 commits May 9, 2026 13:16
…PageSlugs

Two related changes that lay the groundwork for fixing a hot-loop
egress regression in `embed --stale`:

1. `getChunks(slug)` previously did `SELECT cc.*` which includes the
   1536-dim `embedding` vector column. `rowToChunk()` already defaults
   to `includeEmbedding=false` and discards the column on parse, so the
   bytes were pulled across the wire only to be thrown away. Switched
   to an explicit projection that excludes `embedding`. Callers that
   actually need the vector (re-rank, similarity) already have a
   dedicated `getChunksWithEmbeddings()` method.

2. New `listStalePageSlugs()` returns the slugs of pages with at least
   one chunk where `embedding IS NULL`. Used by the next commit's
   embed --stale fast-path to avoid iterating every page in the brain
   when nothing is stale.

Both engines updated for parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
`gbrain embed --stale` previously iterated every page in the brain and
called `getChunks(slug)` on each one before filtering chunks where
`embedded_at IS NULL`. On a steady-state brain (everything already
embedded) this is wasted work — every cycle does N getChunks
round-trips just to discover there's nothing to do.

In production this manifested as 3 TB/month of Postgres egress on a
fully-embedded brain (~682 pages) when the daemon polled rapidly.
The autopilot plist's KeepAlive=true (separate fix) was the
trigger; this is the underlying multiplier.

Fix: when `staleOnly=true`, query `listStalePageSlugs()` first. If
empty, return immediately. If non-empty, iterate only those pages —
not every page in the brain.

Behavior changes intentionally:
- `pages_processed` and `total_chunks` in `--stale` mode now reflect
  the filtered (stale-only) set, not the entire brain. Test updated
  to assert the new semantics. The non-`--stale` path (`--all`) is
  unchanged.

Combined with the `getChunks` projection fix in the previous commit,
egress per cycle drops from ~100 MB to a few hundred bytes when the
brain is fully embedded. Verified on a 682-page brain: cycle time
0.9s, log shows "Embedded 0 chunks across 0 pages" instead of
"Embedded 0 chunks across 682 pages".

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…=true

The launchd plist generated by `gbrain autopilot --install` set
KeepAlive=true with no internal sleep in the wrapper script. launchd
restarts the wrapper as soon as it exits, so each cycle runs back-
to-back with effectively no delay. On one machine this produced
~4200 cycles/day (~one cycle every 20s) against a fully-embedded
brain — combined with two query-side bugs (fixed in the prior two
commits), it drove a 3 TB/month Postgres egress overage.

Switching to StartInterval=300 caps the cadence at 288 cycles/day
(one cycle per 5 minutes) regardless of how fast a single cycle
exits. This is the correct launchd primitive for "run periodically
on a schedule" — KeepAlive is for "respawn if the process dies",
which the wrapper isn't. Tunable per-user by editing the generated
plist directly.

Existing user installs need to either re-run `gbrain autopilot
--install` (which regenerates the plist) or hand-edit the deployed
plist to swap KeepAlive for StartInterval.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>