fix: embed --stale pulls all chunks every cycle (3 TB egress regression) #775
Open
kyledeanjackson wants to merge 3 commits into garrytan:master from
Conversation
…PageSlugs

Two related changes that lay the groundwork for fixing a hot-loop egress regression in `embed --stale`:

1. `getChunks(slug)` previously did `SELECT cc.*`, which includes the 1536-dim `embedding` vector column. `rowToChunk()` already defaults to `includeEmbedding=false` and discards the column on parse, so the bytes were pulled across the wire only to be thrown away. Switched to an explicit projection that excludes `embedding`. Callers that actually need the vector (re-rank, similarity) already have a dedicated `getChunksWithEmbeddings()` method.

2. New `listStalePageSlugs()` returns the slugs of pages with at least one chunk where `embedding IS NULL`. Used by the next commit's `embed --stale` fast path to avoid iterating every page in the brain when nothing is stale.

Both engines updated for parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
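A minimal sketch of the projection change, assuming hypothetical table and column names (`content_chunks`, `page_id`, and so on; the real schema may differ). The point is that the hot-path query never names `embedding`, while the vector-needing callers keep a separate query:

```typescript
// Hypothetical column list mirroring what rowToChunk() reads.
// Names are illustrative, not the actual gbrain schema.
const CHUNK_COLUMNS = ["id", "page_id", "chunk_index", "content", "embedded_at"];

// Explicit projection instead of `SELECT cc.*`, so the 1536-dim
// `embedding` vector never crosses the wire for callers that discard it.
function buildGetChunksQuery(table: string): string {
  const cols = CHUNK_COLUMNS.map((c) => `cc.${c}`).join(", ");
  return `SELECT ${cols} FROM ${table} cc WHERE cc.page_id = $1`;
}

// Callers that genuinely need vectors get a dedicated query that adds the column.
function buildGetChunksWithEmbeddingsQuery(table: string): string {
  const cols = [...CHUNK_COLUMNS, "embedding"].map((c) => `cc.${c}`).join(", ");
  return `SELECT ${cols} FROM ${table} cc WHERE cc.page_id = $1`;
}
```

An explicit column list also makes schema drift visible: if `rowToChunk()` starts reading a new column, the projection has to be updated in one obvious place rather than silently riding along on `SELECT *`.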
`gbrain embed --stale` previously iterated every page in the brain and called `getChunks(slug)` on each one before filtering chunks where `embedded_at IS NULL`. On a steady-state brain (everything already embedded) this is wasted work — every cycle does N getChunks round-trips just to discover there's nothing to do.

In production this manifested as 3 TB/month of Postgres egress on a fully-embedded brain (~682 pages) when the daemon polled rapidly. The autopilot plist's KeepAlive=true (separate fix) was the trigger; this is the underlying multiplier.

Fix: when `staleOnly=true`, query `listStalePageSlugs()` first. If empty, return immediately. If non-empty, iterate only those pages — not every page in the brain.

Behavior changes intentionally:

- `pages_processed` and `total_chunks` in `--stale` mode now reflect the filtered (stale-only) set, not the entire brain. Test updated to assert the new semantics.

The non-`--stale` path (`--all`) is unchanged.

Combined with the `getChunks` projection fix in the previous commit, egress per cycle drops from ~100 MB to a few hundred bytes when the brain is fully embedded. Verified on a 682-page brain: cycle time 0.9s, log shows "Embedded 0 chunks across 0 pages" instead of "Embedded 0 chunks across 682 pages".

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…=true

The launchd plist generated by `gbrain autopilot --install` set KeepAlive=true with no internal sleep in the wrapper script. launchd restarts the wrapper as soon as it exits, so each cycle runs back-to-back with effectively no delay. On one machine this produced ~4200 cycles/day (~one cycle every 20s) against a fully-embedded brain — combined with two query-side bugs (fixed in the prior two commits), it drove a 3 TB/month Postgres egress overage.

Switching to StartInterval=300 caps the cadence at 288 cycles/day (one cycle per 5 minutes) regardless of how fast a single cycle exits. This is the correct launchd primitive for "run periodically on a schedule" — KeepAlive is for "respawn if the process dies", which the wrapper isn't. Tunable per-user by editing the generated plist directly.

Existing user installs need to either re-run `gbrain autopilot --install` (which regenerates the plist) or hand-edit the deployed plist to swap KeepAlive for StartInterval.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
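In plist terms the change amounts to swapping one key pair. `KeepAlive` and `StartInterval` are standard launchd keys; the surrounding plist (label, program arguments, log paths) is omitted here:

```xml
<!-- Before: launchd respawns the wrapper the instant it exits (hot loop). -->
<key>KeepAlive</key>
<true/>

<!-- After: launchd starts the job once every 300 seconds, regardless of
     how quickly a single cycle finishes. -->
<key>StartInterval</key>
<integer>300</integer>
```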
TL;DR

Three compounding bugs in `embed --stale` plus the autopilot launchd plist caused ~3 TB/month of Postgres egress on a fully-embedded brain (~682 pages), because every autopilot cycle re-pulled all chunks across the wire just to discover there was nothing to embed. This PR fixes all three layers and adds an early-exit fast path so steady-state brains do near-zero work per cycle.

Verified on production data — autopilot cycle time goes from ~10s fetching ~100 MB of vectors to <1s pulling a few hundred bytes.
What I observed
Supabase pooler-egress on a 329-page brain (~1443 chunks) climbed to
3,062 GB / 250 GB quota = 1,225% in a single billing cycle. Chart
showed a sharp transition: zero egress before the day the brain hit
100% embedded coverage, then 400–600 GB/day sustained.
- Cached egress: 0 (nothing was being served from the pooler cache).
- Storage: 0 (not file traffic).
- Realtime: 0 (no WebSocket fanout).
- Edge functions: 0.

Pure PostgREST/pooler row traffic. Realtime concurrent connections peaked at 4 — so it wasn't volume from many clients, it was a small number of clients pulling the same rows over and over.
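The observed numbers are internally consistent. A quick back-of-envelope check, using figures taken from this PR's own logs and TL;DR (cycle counts and ~100 MB per cycle; MB to GB at 1024):

```typescript
// Figures from the PR text; everything below is arithmetic.
const cyclesTotal = 254_050;   // wrapper invocations logged over the window
const days = 60;               // ~60 days

const cyclesPerDay = cyclesTotal / days;            // ≈ 4,234 cycles/day
const secondsBetweenCycles = 86_400 / cyclesPerDay; // ≈ 20 s between cycles

const mbPerCycle = 100;        // ~100 MB of vectors pulled per cycle
const gbPerDay = (cyclesPerDay * mbPerCycle) / 1024; // ≈ 413 GB/day
```

That estimate lands inside the 400–600 GB/day band the egress chart showed, which is what you'd expect if the hot loop and the vector over-fetch were the whole story.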
Root cause
Three independent contributors, each amplifying the others:
1. KeepAlive=true — the autopilot is a hot loop, not a periodic task

   The plist generated by `gbrain autopilot --install` had no StartInterval, no internal sleep, just KeepAlive=true. launchd restarts the wrapper as soon as it exits, so cycles run back-to-back. On one user's machine: 254,050 cycles in ~60 days = ~4,233 per day = one every ~20 seconds.

   Log evidence (every cycle):

   KeepAlive is the wrong launchd primitive for a periodic task — that's what StartInterval is for.

2. `embedAll` calls `getChunks` for every page, then filters in memory

   When `staleOnly` is true and the brain is fully embedded, `toEmbed` is empty for every page — but every page's chunks are pulled across the wire first, just to be discarded.

3. `getChunks` SELECTs the embedding column unnecessarily

   `rowToChunk()` defaults to `includeEmbedding=false` and discards the vector after fetching. So the bytes were pulled across the network only to be thrown away. A separate `getChunksWithEmbeddings()` already exists for the legitimate caller (`migrate-engine.ts`).

The fix
Three small commits, each addressing one layer:

- fix(engine) — `getChunks` projects only the columns `rowToChunk` actually reads. Adds a `listStalePageSlugs()` engine method (one query: `SELECT DISTINCT page_id FROM content_chunks WHERE embedding IS NULL`).
- fix(embed) — `embedAll(staleOnly=true)` calls `listStalePageSlugs()` first. If empty, log + return. Otherwise filter `pages` to only those with stale chunks, then iterate normally. The non-`--stale` path is unchanged.
- fix(autopilot) — plist template uses StartInterval=300 instead of KeepAlive=true. ~288 cycles/day max. Tunable per-user.
Verification
Tests
Full suite: 2086 pass / 18 fail / 3 errors. The 18 failures are all pre-existing "beforeEach hook timed out" / "PGLite not connected" flakes in dream.test.ts, orphans.test.ts, multi-source-integration.test.ts, etc. — confirmed by re-running on main (clean tree), where the same files all pass in isolation. None touch any of the files in this PR.
Real-world
Tested against the production brain (682 pages, fully embedded):
Six pages had transient stale chunks from a recent ingest. Old code
would have done 682 round-trips returning vectors; new code did one
small query and returned in <1s. After re-embedding those 6 pages,
subsequent runs early-exit:
Production cutover
Re-enabled the autopilot with the patched code on the same machine
that was bleeding 400–600 GB/day. The Supabase egress chart will
confirm the fix over the next 24h; the chart and a follow-up will
land in the linked issue.
Behavior changes (intentional)

EmbedResult semantics change in --stale mode: pages_processed, total_chunks, skipped, embedded, and would_embed (dry-run) now reflect the filtered (stale-only) set rather than the entire brain.

The new semantics are arguably more useful — they describe work done, not work considered. The --all path is unchanged. JSDoc on EmbedResult updated to call this out.

Migration notes for existing installs
The --install template change only affects new installs. Existing users have plists with KeepAlive=true already deployed. They can either:

- run `gbrain autopilot --uninstall && gbrain autopilot --install` to regenerate from the new template, OR
- hand-edit `~/Library/LaunchAgents/com.gbrain.autopilot.plist` to replace `<key>KeepAlive</key><true/>` with `<key>StartInterval</key><integer>300</integer>`, then `launchctl bootout gui/$UID/com.gbrain.autopilot && launchctl bootstrap gui/$UID ~/Library/LaunchAgents/com.gbrain.autopilot.plist`.

A doctor / heal step that detects the legacy plist and rewrites it in place would be a nice follow-up. Happy to add it if you'd like.
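The hand-edit half of that migration can be sketched as a small shell exercise. This runs the sed edit against a throwaway copy rather than the real `~/Library/LaunchAgents` path, and the launchctl bootout/bootstrap steps are macOS-only, so they appear as comments:

```shell
# Sketch of the in-place plist rewrite; paths and label are illustrative.
PLIST=$(mktemp)
cat > "$PLIST" <<'EOF'
<key>KeepAlive</key>
<true/>
EOF

# Swap KeepAlive=true for StartInterval=300 (GNU sed; use `sed -i ''` on macOS).
sed -i 's|<key>KeepAlive</key>|<key>StartInterval</key>|; s|<true/>|<integer>300</integer>|' "$PLIST"

cat "$PLIST"
# Then, on macOS, reload the job:
# launchctl bootout gui/$UID/com.gbrain.autopilot
# launchctl bootstrap gui/$UID ~/Library/LaunchAgents/com.gbrain.autopilot.plist
```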
Out of scope

- searchVector() already projects correctly, no change needed there.
- fix is to not fetch the column at all