
fix: embed --stale pulls all chunks every cycle (3 TB egress regression) #775

Open
kyledeanjackson wants to merge 3 commits into garrytan:master from kyledeanjackson:fix/embed-stale-egress

Conversation


@kyledeanjackson kyledeanjackson commented May 9, 2026

TL;DR

Three compounding bugs in embed --stale + the autopilot launchd plist
caused ~3 TB/month of Postgres egress on a fully-embedded brain
(~682 pages) because every autopilot cycle re-pulled all chunks across
the wire just to discover there was nothing to embed. This PR fixes
all three layers and adds an early-exit fast path so steady-state
brains do near-zero work per cycle.

Verified on production data — autopilot cycle time goes from ~10s
fetching ~100 MB of vectors to <1s pulling a few hundred bytes.

What I observed

Supabase pooler egress on a 329-page brain (~1443 chunks) climbed to
3,062 GB / 250 GB quota = 1,225% in a single billing cycle. Chart
showed a sharp transition: zero egress before the day the brain hit
100% embedded coverage, then 400–600 GB/day sustained.

03 May  04 May  05 May  06 May  07 May  08 May
 200GB   480GB   450GB   410GB   450GB   610GB

Cached egress: 0 (nothing was being served from the pooler cache).
Storage: 0 (not file traffic). Realtime: 0 (no WebSocket fanout).
Edge functions: 0. Pure PostgREST/pooler row traffic.

Realtime concurrent connections peaked at 4 — so this wasn't volume
from many clients; it was a small number of clients pulling the same
rows over and over.

Root cause

Autopilot:  KeepAlive=true + no sleep    →  ~4200 cycles/day
   ×
embedAll:   iterates all 682 pages       →  682 getChunks/cycle
   ×
getChunks:  SELECT cc.* including        →  ~30 KB/chunk over the wire
            the 1536-dim embedding          (vector marshalled as JSON
            column                          text balloons vs. binary)

= ~100 MB per cycle × 4200 cycles/day ≈ 420 GB/day  ✓ matches chart

Three independent contributors, each amplifying the others:

1. KeepAlive=true autopilot is a hot loop, not a periodic task

The plist generated by gbrain autopilot --install had no
StartInterval, no internal sleep, just KeepAlive=true. launchd
restarts the wrapper as soon as it exits, so cycles run back-to-back.
On one user's machine: 254,050 cycles in ~60 days = ~4,233 per day
= one every ~20 seconds.

Log evidence (every cycle):

[autopilot wrapper] starting; openai_key=set
[autopilot wrapper] sync source=default …
[autopilot wrapper] sync source=bh-brain …
[autopilot wrapper] sync source=bh-vault …
[autopilot wrapper] embed --stale --all
Embedded 0 chunks across 682 pages    ← all the work, none of the value
[autopilot wrapper] cycle complete

KeepAlive is the wrong launchd primitive for a periodic task — that's
what StartInterval is for.
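
For reference, the one-key swap in the generated
com.gbrain.autopilot plist (surrounding keys omitted; see the
migration notes below for the full paths):

<!-- Before: launchd respawns the wrapper the moment it exits -->
<key>KeepAlive</key>
<true/>

<!-- After: run at most once every 300 s, however fast a cycle exits -->
<key>StartInterval</key>
<integer>300</integer>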

2. embedAll calls getChunks for every page, then filters in memory

async function embedAll(engine, staleOnly) {
  const pages = await engine.listPages({ limit: 100000 });
  // ...
  async function embedOnePage(page) {
    const chunks = await engine.getChunks(page.slug);   // 682 round-trips
    const toEmbed = staleOnly
      ? chunks.filter(c => !c.embedded_at)              // filter AFTER fetch
      : chunks;
    // ...
  }
}

When staleOnly is true and the brain is fully embedded, toEmbed
is empty for every page — but every page's chunks are pulled across
the wire first, just to be discarded.

3. getChunks SELECTs the embedding column unnecessarily

async getChunks(slug: string): Promise<Chunk[]> {
  const rows = await sql`
    SELECT cc.* FROM content_chunks cc       -- ← includes 1536-dim vector
    JOIN pages p ON p.id = cc.page_id
    WHERE p.slug = ${slug}
    ORDER BY cc.chunk_index
  `;
  return rows.map((r) => rowToChunk(r));     // rowToChunk(includeEmbedding=false)
}                                              // already drops it on parse

rowToChunk() defaults to includeEmbedding=false and discards the
vector after fetching. So the bytes were pulled across the network
only to be thrown away. A separate getChunksWithEmbeddings() already
exists for the legitimate caller (migrate-engine.ts).
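
For concreteness, here is the shape of both engine-side changes. This
is a reconstruction, not the literal diff: the projected column names
(cc.id, cc.content, cc.embedded_at) and the slug join in
listStalePageSlugs() are my guesses from the schema shown above.

async getChunks(slug: string): Promise<Chunk[]> {
  const rows = await sql`
    SELECT cc.id, cc.page_id, cc.chunk_index, cc.content, cc.embedded_at
    FROM content_chunks cc                 -- no cc.embedding: drops ~30 KB/row
    JOIN pages p ON p.id = cc.page_id
    WHERE p.slug = ${slug}
    ORDER BY cc.chunk_index
  `;
  return rows.map((r) => rowToChunk(r));
}

async listStalePageSlugs(): Promise<string[]> {
  const rows = await sql`
    SELECT DISTINCT p.slug
    FROM content_chunks cc
    JOIN pages p ON p.id = cc.page_id
    WHERE cc.embedding IS NULL
  `;
  return rows.map((r) => r.slug);          // one round-trip, bytes not megabytes
}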

The fix

Three small commits, each addressing one layer:

  1. fix(engine) — getChunks projects only the columns
    rowToChunk actually reads. Adds a listStalePageSlugs() engine
    method (one query: SELECT DISTINCT page_id FROM content_chunks WHERE embedding IS NULL).

  2. fix(embed) — embedAll(staleOnly=true) calls
    listStalePageSlugs() first. If empty, log and return. Otherwise
    filter pages to only those with stale chunks, then iterate
    normally (see the sketch after this list). The non-stale (--all)
    path is unchanged.

  3. fix(autopilot) — plist template uses
    StartInterval=300 instead of KeepAlive=true. ~288 cycles/day
    max. Tunable per-user.
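
As referenced in item 2, a sketch of the fast path. Control flow
follows the commit message; the logging and Set-based filtering are
illustrative wiring, not the literal diff:

async function embedAll(engine, staleOnly) {
  if (staleOnly) {
    // Fast path: one cheap query; an empty result means nothing to embed.
    const staleSlugs = await engine.listStalePageSlugs();
    if (staleSlugs.length === 0) {
      console.log("Embedded 0 chunks across 0 pages");
      return;
    }
    // Iterate only the pages that actually have stale chunks.
    const stale = new Set(staleSlugs);
    const pages = (await engine.listPages({ limit: 100000 }))
      .filter((p) => stale.has(p.slug));
    // ... embed per page as before
    return;
  }
  // --all path unchanged: iterate every page in the brain.
}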

Verification

Tests

$ bun run typecheck
$ bun test test/embed.test.ts test/autopilot-install.test.ts test/pglite-engine.test.ts
 96 pass  0 fail  189 expect() calls

Full suite: 2086 pass / 18 fail / 3 errors. The 18 failures are all
pre-existing "beforeEach hook timed out" / "PGLite not connected"
flakes in dream.test.ts, orphans.test.ts, multi-source-integration.test.ts,
etc. — confirmed by re-running on main (clean tree), where the same
files all pass in isolation. None touch any of the files in this PR.

Real-world

Tested against the production brain (682 pages, fully embedded):

$ time gbrain embed --stale
[embed.pages] start
[embed.pages] 6/6 (100%)
Embedded 0 chunks across 6 pages
[embed.pages] 6/6 (100%) done

real    0m0.893s

Six pages had transient stale chunks from a recent ingest. Old code
would have done 682 round-trips returning vectors; new code did one
small query and returned in <1s. After re-embedding those 6 pages,
subsequent runs early-exit:

$ time gbrain embed --stale
Embedded 0 chunks across 0 pages

real    0m0.02s

Production cutover

Re-enabled the autopilot with the patched code on the same machine
that was bleeding 400–600 GB/day. The Supabase egress chart will
confirm the fix over the next 24h; the chart and a follow-up will
land in the linked issue.

Behavior changes (intentional)

EmbedResult semantics change in --stale mode:

Field                  Old (--stale)                            New (--stale)
pages_processed        every page in brain                      pages with at least one stale chunk
total_chunks           every chunk in those pages               only chunks on stale pages
skipped                every already-embedded chunk anywhere    skipped chunks on visited pages
embedded               unchanged                                unchanged
would_embed (dry-run)  unchanged                                unchanged

The new semantics are arguably more useful — they describe work done,
not work considered. The --all path is unchanged.

JSDoc on EmbedResult updated to call this out.
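
Roughly what the updated JSDoc now calls out — field names are from
the table above, but the numeric types and exact wording are my
paraphrase, not the committed doc comments:

interface EmbedResult {
  /** --stale: pages with at least one stale chunk. --all: every page. */
  pages_processed: number;
  /** --stale: only chunks on stale pages. --all: every chunk considered. */
  total_chunks: number;
  /** --stale: skipped chunks on visited pages only. */
  skipped: number;
  /** Chunks actually embedded this run (unchanged). */
  embedded: number;
  /** Dry-run only: chunks that would have been embedded (unchanged). */
  would_embed?: number;
}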

Migration notes for existing installs

The --install template change only affects new installs. Existing
users have plists with KeepAlive=true already deployed. They can:

  • Easy: gbrain autopilot --uninstall && gbrain autopilot --install
    to regenerate from the new template, OR
  • Manual: edit ~/Library/LaunchAgents/com.gbrain.autopilot.plist
    to replace <key>KeepAlive</key><true/> with
    <key>StartInterval</key><integer>300</integer>, then
    launchctl bootout gui/$UID/com.gbrain.autopilot && launchctl bootstrap gui/$UID ~/Library/LaunchAgents/com.gbrain.autopilot.plist.

A doctor / heal step that detects the legacy plist and rewrites it
in place would be a nice follow-up. Happy to add it if you'd like.
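
The detection half of that follow-up could be a few lines; this is a
hypothetical sketch, not code in this PR:

// Hypothetical doctor check — not part of this PR.
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";

const plist = `${homedir()}/Library/LaunchAgents/com.gbrain.autopilot.plist`;
if (existsSync(plist) && readFileSync(plist, "utf8").includes("<key>KeepAlive</key>")) {
  console.warn("Legacy hot-loop plist detected; re-run `gbrain autopilot --install`.");
}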

Out of scope

  • pgvector index tuning (HNSW vs. IVFFLAT) — separate concern
  • Search/query path — searchVector() already projects correctly,
    no change needed there
  • Per-row JSON-vs-binary encoding — postgres-js handles this; the
    fix is to not fetch the column at all

kyledeanjackson and others added 3 commits May 9, 2026 13:16
…PageSlugs

Two related changes that lay the groundwork for fixing a hot-loop
egress regression in `embed --stale`:

1. `getChunks(slug)` previously did `SELECT cc.*` which includes the
   1536-dim `embedding` vector column. `rowToChunk()` already defaults
   to `includeEmbedding=false` and discards the column on parse, so the
   bytes were pulled across the wire only to be thrown away. Switched
   to an explicit projection that excludes `embedding`. Callers that
   actually need the vector (re-rank, similarity) already have a
   dedicated `getChunksWithEmbeddings()` method.

2. New `listStalePageSlugs()` returns the slugs of pages with at least
   one chunk where `embedding IS NULL`. Used by the next commit's
   embed --stale fast-path to avoid iterating every page in the brain
   when nothing is stale.

Both engines updated for parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
`gbrain embed --stale` previously iterated every page in the brain and
called `getChunks(slug)` on each one before filtering chunks where
`embedded_at IS NULL`. On a steady-state brain (everything already
embedded) this is wasted work — every cycle does N getChunks
round-trips just to discover there's nothing to do.

In production this manifested as 3 TB/month of Postgres egress on a
fully-embedded brain (~682 pages) when the daemon polled rapidly.
The autopilot plist's KeepAlive=true (separate fix) was the
trigger; this is the underlying multiplier.

Fix: when `staleOnly=true`, query `listStalePageSlugs()` first. If
empty, return immediately. If non-empty, iterate only those pages —
not every page in the brain.

Behavior changes intentionally:
- `pages_processed` and `total_chunks` in `--stale` mode now reflect
  the filtered (stale-only) set, not the entire brain. Test updated
  to assert the new semantics. The non-`--stale` path (`--all`) is
  unchanged.

Combined with the `getChunks` projection fix in the previous commit,
egress per cycle drops from ~100 MB to a few hundred bytes when the
brain is fully embedded. Verified on a 682-page brain: cycle time
0.9s, log shows "Embedded 0 chunks across 0 pages" instead of
"Embedded 0 chunks across 682 pages".

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…=true

The launchd plist generated by `gbrain autopilot --install` set
KeepAlive=true with no internal sleep in the wrapper script. launchd
restarts the wrapper as soon as it exits, so each cycle runs back-
to-back with effectively no delay. On one machine this produced
~4200 cycles/day (~one cycle every 20s) against a fully-embedded
brain — combined with two query-side bugs (fixed in the prior two
commits), it drove a 3 TB/month Postgres egress overage.

Switching to StartInterval=300 caps the cadence at 288 cycles/day
(one cycle per 5 minutes) regardless of how fast a single cycle
exits. This is the correct launchd primitive for "run periodically
on a schedule" — KeepAlive is for "respawn if the process dies",
which the wrapper isn't. Tunable per-user by editing the generated
plist directly.

Existing user installs need to either re-run `gbrain autopilot
--install` (which regenerates the plist) or hand-edit the deployed
plist to swap KeepAlive for StartInterval.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>