fix(multi-source): thread source_id through per-page tx surface (closes Postgres 21000 mid-import)#707
fix(multi-source): thread source_id through per-page tx surface (closes Postgres 21000 mid-import)#707mdcruz88 wants to merge 6 commits intogarrytan:masterfrom
Conversation
Multi-source brains crashed mid-import with Postgres 21000 ("more than one
row returned by a subquery used as an expression"). Root cause: putPage's
INSERT column list omitted source_id, so writes intended for a non-default
source (e.g. 'jarvis-memory') silently fabricated a duplicate row at
(default, slug). The schema has UNIQUE(source_id, slug) but DEFAULT 'default'
for source_id; calling putPage(slug, page) without source_id landed at
(default, slug) and ON CONFLICT updated the wrong row, leaving the intended
source row stale. Subsequent bare-slug subqueries inside the same tx —
(SELECT id FROM pages WHERE slug = $1) in getTags / removeTag / deleteChunks
/ removeLink / addLink (cross-product) — then matched 2 rows and crashed
with 21000, rolling back the entire import. Observed: 18 sync failures
against a 'jarvis-memory'-sourced brain.
Fix:
- putPage adds source_id to the INSERT column list (defaults 'default' for
back-compat).
- Every bare-slug page-id subquery becomes source-qualified
(AND source_id = $X) in both engines: createVersion, upsertChunks,
getChunks, addTag, removeTag, getTags, deleteChunks, removeLink,
addTimelineEntry, deletePage, updateSlug.
- addLink rewritten away from FROM pages f, pages t cross-product into a
VALUES + JOIN-on-(slug, source_id) shape mirroring addLinksBatch.
- engine.ts interface: 11 method signatures gain optional opts.sourceId
(or opts.{from,to,origin}SourceId for addLink/removeLink). All optional;
existing callers default to source='default' and behave identically.
- import-file.ts: importFromContent / importFromFile / importCodeFile take
opts.sourceId and thread txOpts = { sourceId } through every per-page tx
call. engine.getPage callsite source-scoped for accurate idempotency.
- commands/sync.ts: thread opts.sourceId at importFile (line 581 + 641),
un-syncable cleanup (487-498), delete phase (557), rename phase (574),
and post-sync extract phase (815-816).
- commands/reindex-code.ts: thread opts.sourceId at importCodeFile call.
- commands/extract.ts: extractLinksForSlugs / extractTimelineForSlugs accept
opts.sourceId and propagate via linkOpts / entryOpts.
- commands/reconcile-links.ts: ReconcileLinksOpts.sourceId was declared but
ignored end-to-end; now wired through getPage + addLink calls.
- commands/migrate-engine.ts: --force wipe switched to executeRaw('DELETE
FROM pages') to preserve the pre-PR all-sources semantic after deletePage
became default-source-scoped.
Regression test: test/source-id-tx-regression.test.ts (19 tests). Validates
two sources × same slug coexist; getTags/addTag/removeTag/deleteChunks/
upsertChunks/createVersion/addLink/addTimelineEntry/deletePage/updateSlug
source-scoped writes don't 21000; back-compat without opts targets
source='default'; addLink fail-fast on missing source-qualified endpoint;
importFromContent end-to-end tx thread without fabricating duplicate.
Adversarial review: Codex (gpt-5.5 reviewer) + Grok (xAI flagship reviewer)
3-round crew loop. Round 1: 2 HIGH (addTimelineEntry + extract.ts thread)
+ 2 MED. Round 2: 1 CRITICAL + 1 HIGH (deletePage + updateSlug bare-slug)
+ 2 MED. Round 3: 2 HIGH (getChunks + migrate-engine semantic regression
introduced by R2 fix). Round 4: both reviewers CLEAR.
Deferred to follow-up PRs (noted as TODO):
- src/commands/embed.ts source-aware threading (auto-embed at sync.ts:823
has a TODO; try/catch swallows the failure as best-effort).
- src/core/postgres-engine.ts:1511 / pglite-engine.ts:1446 putRawData
bare-slug (lower-impact metadata path).
- Read-surface bare-slug consistency cleanup (getLinks/getBacklinks/
getTimeline/getRawData/getVersions): non-mutating, won't 21000.
- reconcile-links.ts CLI --source flag exposure (internal opt is wired;
CLI parser is a UX feature for later).
Existing rows in production written under (default, slug) by the old
putPage when caller meant another source remain misrouted. Backfill
heuristics need install-specific knowledge of intended source and are
outside this PR's scope; surface as a deployment-side cleanup task.
bun run typecheck clean, bun run build clean, 19/19 regression tests pass,
4082 unit pass / 1 pre-existing fail (BrainRegistry test depending on
test-env ~/.gbrain/ absence — fails on untouched main, unrelated).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…n#707 gap) PR garrytan#707 fixed source_id routing for sync's incremental loop (lines 581/641) but performFullSync (line 922) calls runImport without threading sourceId. Result: full syncs route pages to default even with --source <id>. Verified on v0.30.1 by direct PGLite probe after `gbrain sync --source X --full`: all pages landed in default, not the named source. Fix: - runImport accepts sourceId in opts (programmatic only — no CLI flag, preserving PR garrytan#707's design intent of `gbrain import` being default-only). - runImport threads sourceId to importFile + importImageFile. - performFullSync passes opts.sourceId to runImport. - ImportImageOptions type accepts sourceId for runImport branch (importImageFile body wiring deferred — image imports out of scope for current use case; TS error fix only). Verified: real sync test against /tmp/test-sync routes 1 page to "testsync" source, 0 to default (post-fix). 19/19 source-id regression tests still pass. Typecheck clean. Co-Authored-By: Claude Opus 4.7 <[email protected]>
|
👋 Wanted to share verification + a small gap I found while applying this in production. What I did: Rebased your branch onto current Verification:
One gap I found: Small follow-up patch (3 files, +18/-4) preserves your design intent of Branch with rebase + gap-fix: Either way — really clean fix. The bare-slug subquery analysis caught a real Postgres 21000 surface I'd never have spotted. The 19 regression tests are the kind of thoroughness that makes adoption easy. Thanks 🙏 |
|
@jeremyknows really appreciate you driving the rebase and locking in the text/code |
PR garrytan#707's existing 19-test suite at test/source-id-tx-regression.test.ts covers the engine-layer transaction surface (putPage / addTag / etc.) but does NOT exercise commands/sync.ts:performFullSync. Verified via `grep -c 'performFullSync' test/source-id-tx-regression.test.ts → 0`. This means the +18/-4 fix at sync.ts:892 (performFullSync passing sourceId to runImport) had no automated coverage. Adds 2 PGLite-only regression tests: 1. `performFullSync with --source routes pages to named source (not default)` — fixture: temp git repo with 2 markdown files. Calls performSync with { full: true, sourceId: 'testsrc-pfs', noPull: true, noEmbed: true }. Asserts pages.source_id = 'testsrc-pfs', not 'default'. Pre-fix: FAILS (verified by checking out 46cd197 — rebased PR garrytan#707 only, without my gap-fix — and running this test). Post-fix: PASSES. 2. `performFullSync WITHOUT --source still targets default (back-compat)` — same fixture, no sourceId opt. Asserts pages.source_id = 'default'. Both pre-fix and post-fix: PASSES (back-compat preserved by the fix). Verified: 21/21 tests pass on this branch (19 from PR garrytan#707 + 2 new). `bun run typecheck` clean. `bun run verify` clean (8 guard checks pass). Co-Authored-By: Claude Opus 4.7 <[email protected]>
|
Quick update — based on the verification work I'd offered earlier in this thread, I've now opened #757 as a standalone PR with the rebased commits + the gap-fix + a regression test that exercises The branch (
If you'd prefer to pull Thanks again for the thoroughness in #707 — the bare-slug subquery analysis was the kind of work I wouldn't have spotted on my own. |
…test Test docstring carried references to a private downstream consumer's internal vocabulary (PRISM Round 2, Atlas Terminal agent, PR-E acceptance criterion). Scrubbed for upstream readability — test logic unchanged.
Follow-up to PR garrytan#707's text/code source_id thread + the performFullSync gap-fix: the image-import path (importImageFile + withImportTransaction) still routed all image-page, chunk, file, and sibling-link writes to source='default' even when the caller passed sourceId. Reachable from performFullSync via collectMarkdownFiles when GBRAIN_EMBEDDING_MULTIMODAL=true, recreating the exact silent-misroute bug PR garrytan#707 was meant to close. Fix: - ImportTransactionSpec gains sourceId?: string. withImportTransaction threads it to tx.createVersion / putPage / getPage (file lookup) / upsertChunks / deleteChunks, and merges it into FileSpec.source_id when spec.file is present. - importImageFile pre-existence check (engine.getPage(imageSlug)) source-scopes via opts.sourceId so identical-named images across sources don't false-skip. - importImageFile after-hook (image_of sibling auto-link) source-scopes both the candidate lookup and the addLink edge endpoints, so a multi-source brain can't cross-link an image in source X to a same-slug text page in source Y. - Test: 2 new cases in test/import-image-file.test.ts — sourceId='img-src' routes page+chunk+file to that source (not default); omitting sourceId still targets default (back-compat preserved). Co-Authored-By: Claude Opus 4.7 <[email protected]>
…ull sync Follow-up to PR garrytan#707's text/code source_id thread + the performFullSync gap-fix: runImport unconditionally wrote three global (non-source-scoped) config keys on success — sync.last_commit / sync.last_run / sync.repo_path. When called from performFullSync with a sourceId, those writes overwrote the global keys with the source's values; a later bare `gbrain sync` then read X's repo path as the default-source repo and imported X content into source='default'. Silent contamination, hidden behind a correct-looking sources.X.last_sync_at. Fix: - runImport gains `writeSyncConfig?: boolean` (default true) on its programmatic opts. When false, the three global setConfig calls are skipped — but failure recording (sync-failures.jsonl) still runs unconditionally so doctor surfaces problems regardless of source. - performFullSync passes `writeSyncConfig: !opts.sourceId` so default-source full syncs preserve legacy back-compat (global writes happen) while named-source full syncs delegate sync-state ownership to performFullSync's own source-scoped writeSyncAnchor calls. - Test: 2 new cases in test/performfullsync-source-id.test.ts — sentinel values for sync.repo_path / sync.last_commit survive a `--source X --full` run; back-compat full sync (no --source) still updates them. Co-Authored-By: Claude Opus 4.7 <[email protected]>
4823ddb to
f6535bd
Compare
|
Rebased onto v0.30.1 + folded in @jeremyknows's commits + added the two scoping pieces flagged in my earlier reply. PR is now MERGEABLE. Stack (6 commits on top of v0.30.1 master,
Test posture (local):
Co-authorship: all 3 of @jeremyknows's commits land verbatim with their original authorship (rebase preserved attribution on commit 1; commits 2 + 3 keep his author trailer). My follow-up commits (4, 5, 6) reference his work in their bodies. Out-of-scope follow-ups captured but explicitly NOT in this PR:
@jeremyknows — feel free to close #757 whenever; everything from your branch is now in #707 with attribution preserved. Thanks for the rebase + the regression test that empirically proved the gap. @garrytan — closes #497, #540 + the multi-source full-sync silent-misroute class. Happy to split this into smaller PRs if the scope is too wide for a single review; let me know. |
Summary
Multi-source brains crash mid-import with Postgres
21000("more than one row returned by a subquery used as an expression"), rolling back the entire transaction.Root cause:
putPage's INSERT column list omitssource_id, so writes intended for a non-default source (e.g. a memory page sourced from'jarvis-memory') silently fabricate a duplicate row at(default, slug). The schema hasUNIQUE(source_id, slug)butDEFAULT 'default'forsource_id. ON CONFLICT then targets the default row, leaving the intended-source row stale, and a duplicate accumulates. Bare-slug subqueries inside the same tx —(SELECT id FROM pages WHERE slug = $1)ingetTags,removeTag,deleteChunks,removeLink, plusaddLink'sFROM pages f, pages tcross-product — match >1 row and crash with21000.Observed in production: 18 sync failures against a non-default-sourced brain.
Patch shape
source_idto the INSERT column list (defaults'default'for back-compat).AND source_id = $X) in both engines:createVersion,upsertChunks,getChunks,addTag,removeTag,getTags,deleteChunks,removeLink,addTimelineEntry,deletePage,updateSlug.FROM pages f, pages tcross-product into aVALUES + JOIN-on-(slug, source_id)shape mirroringaddLinksBatch.opts.sourceId(oropts.{from,to,origin}SourceIdforaddLink/removeLink). All optional; existing callers default tosource='default'and behave identically.importFromContent/importFromFile/importCodeFiletakeopts.sourceIdand threadtxOpts = { sourceId }through every per-page tx call.engine.getPagecallsite source-scoped for accurate idempotency.opts.sourceIdatimportFile(line 581 + 641), un-syncable cleanup (487-498), delete phase (557), rename phase (574), and post-sync extract phase (815-816).opts.sourceIdatimportCodeFilecall.extractLinksForSlugs/extractTimelineForSlugsacceptopts.sourceIdand propagate vialinkOpts/entryOpts.ReconcileLinksOpts.sourceIdwas declared but ignored end-to-end; now wired throughgetPage+addLink.--forcewipe switched toexecuteRaw('DELETE FROM pages')to preserve the pre-PR all-sources semantic afterdeletePagebecame default-source-scoped.Backwards compatibility
optsparameter is optional. Callers that don't passoptscontinue to targetsource='default'(the schema default) and behave identically to pre-fix.put_pageop handler (src/core/operations.ts) andgbrain import(src/commands/import.ts) are deliberately unchanged — both write to the default source by design and have no--sourceflag.Tests
New regression suite at
test/source-id-tx-regression.test.ts— 19 tests covering:putPage(no21000).getTags/addTag/removeTag/deleteChunks/upsertChunks/createVersion/addLink/removeLink/addTimelineEntry/deletePage/updateSlugsource-scoped writes don't21000.optstargetssource='default'.addLinkfail-fast on missing source-qualified endpoint (was: silent cross-product hit on the wrong source).importFromContentend-to-end transaction thread without fabricating a duplicate.Validation
bun run typecheckcleanbun run buildcleanBrainRegistry — lazy init > empty/null/undefined id routes to hostand reproduces on untouchedmaster(verified viagit stash); the test depends on~/.gbrain/being absent in the test environment, unrelated to this PR.Adversarial review
Codex (gpt-5.5 reviewer) + Grok (xAI flagship reviewer) 4-round adversarial crew loop:
addTimelineEntry+ post-sync extract bypass) + 2 MEDIUM. Fixed.deletePage+updateSlugbare-slug across both engines + sync callsites) + 2 MEDIUM. Fixed.getChunksbare-slug +migrate-engine.ts --forcesemantic regression introduced by R2'sdeletePagesource-scoping) + 3 MEDIUM. Fixed.Deferred to follow-up PRs
Surfaced for visibility; not blocking this fix:
src/commands/embed.tssource-aware threading. The auto-embed phase atsync.ts:823re-entersrunEmbedwhich callsupsertChunksdefaulting tosource='default'. For non-default-source syncs,runEmbedeither fails with "Page not found" or updates the wrong source's chunks. Currently swallowed by best-effort try/catch (TODO comment insync.ts:823).putRawDatabare-slug atpostgres-engine.ts:1511/pglite-engine.ts:1446. Same family; lower-impact (raw-data path is metadata, not the import-tx hot path).getLinks/getBacklinks/getTimeline/getRawData/getVersions. Non-mutating, will not21000, but inconsistent with the new convention.reconcile-links.tsCLI--sourceflag. The internalsourceIdopt is now wired; exposing it to CLI users is a UX feature.(default, slug)by the pre-fixputPage. Backfill heuristics need install-specific knowledge of each row's intended source, so we leave it as a deployment-side cleanup task rather than embed an assumption-laden migration in upstream.Test plan
bun run typecheckpassesbun run buildpassesbun test test/source-id-tx-regression.test.ts— 19/19 passbash scripts/run-unit-parallel.sh— 4082 pass / 1 pre-existing fail / 0 skip21000, no duplicate row fabricated🤖 Generated with Claude Code
Need help on this PR? Tag
@codesmithwith what you need.