
fix(multi-source): thread source_id through per-page tx surface (closes Postgres 21000 mid-import)#707

Closed
mdcruz88 wants to merge 6 commits into garrytan:master from mdcruz88:fix/source-id-tx-thread


@mdcruz88

mdcruz88 commented May 7, 2026

Summary

Multi-source brains crash mid-import with Postgres 21000 ("more than one row returned by a subquery used as an expression"), rolling back the entire transaction.

Root cause: putPage's INSERT column list omits source_id, so a write intended for a non-default source (e.g. a memory page sourced from 'jarvis-memory') silently lands at (default, slug) instead. The schema has UNIQUE(source_id, slug) with DEFAULT 'default' for source_id, so ON CONFLICT targets the default-source row, the intended-source row is left stale, and a duplicate row for the same slug accumulates. Bare-slug subqueries inside the same tx, namely (SELECT id FROM pages WHERE slug = $1) in getTags, removeTag, deleteChunks, and removeLink, plus addLink's FROM pages f, pages t cross-product, then match more than one row and crash with 21000.

Observed in production: 18 sync failures against a non-default-sourced brain.
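To make the failure mode concrete, here is a small TypeScript model of the table, not gbrain code: `PagesModel`, its methods, and the slug `notes/today` are hypothetical. It assumes only what the summary states, namely UNIQUE(source_id, slug) with a 'default' fallback, and Postgres raising 21000 when a scalar subquery returns more than one row.

```typescript
// Hypothetical in-memory stand-in for `pages`: UNIQUE(source_id, slug), with
// source_id falling back to 'default' when a write omits it (the schema default).
interface PageRow {
  id: number;
  sourceId: string;
  slug: string;
}

class PagesModel {
  private rows: PageRow[] = [];
  private nextId = 1;

  // Like INSERT ... ON CONFLICT (source_id, slug): reuse the row for this key or add one.
  upsert(slug: string, sourceId: string = "default"): number {
    const hit = this.rows.find((r) => r.slug === slug && r.sourceId === sourceId);
    if (hit) return hit.id;
    const row: PageRow = { id: this.nextId++, sourceId, slug };
    this.rows.push(row);
    return row.id;
  }

  // Like a scalar subquery (SELECT id FROM pages WHERE slug = $1 [AND source_id = $2]):
  // zero rows yields NULL, more than one row raises SQLSTATE 21000.
  scalarPageId(slug: string, sourceId?: string): number | null {
    const matches = this.rows.filter(
      (r) => r.slug === slug && (sourceId === undefined || r.sourceId === sourceId),
    );
    if (matches.length > 1) {
      throw new Error("21000: more than one row returned by a subquery used as an expression");
    }
    return matches.length === 1 ? matches[0].id : null;
  }
}
```

Writing `notes/today` under 'jarvis-memory' and then again without a source (the pre-fix putPage path) leaves two rows. The source-qualified `scalarPageId('notes/today', 'jarvis-memory')` still resolves exactly one id, while the bare-slug `scalarPageId('notes/today')` throws.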

Patch shape

  • putPage adds source_id to the INSERT column list (defaults 'default' for back-compat).
  • Every bare-slug page-id subquery becomes source-qualified (AND source_id = $X) in both engines: createVersion, upsertChunks, getChunks, addTag, removeTag, getTags, deleteChunks, removeLink, addTimelineEntry, deletePage, updateSlug.
  • addLink rewritten away from FROM pages f, pages t cross-product into a VALUES + JOIN-on-(slug, source_id) shape mirroring addLinksBatch.
  • engine.ts interface: 11 method signatures gain optional opts.sourceId (or opts.{from,to,origin}SourceId for addLink/removeLink). All optional; existing callers default to source='default' and behave identically.
  • import-file.ts: importFromContent / importFromFile / importCodeFile take opts.sourceId and thread txOpts = { sourceId } through every per-page tx call. engine.getPage callsite source-scoped for accurate idempotency.
  • commands/sync.ts: thread opts.sourceId at importFile (lines 581 and 641), un-syncable cleanup (487-498), delete phase (557), rename phase (574), and post-sync extract phase (815-816).
  • commands/reindex-code.ts: thread opts.sourceId at importCodeFile call.
  • commands/extract.ts: extractLinksForSlugs / extractTimelineForSlugs accept opts.sourceId and propagate via linkOpts / entryOpts.
  • commands/reconcile-links.ts: ReconcileLinksOpts.sourceId was declared but ignored end-to-end; now wired through getPage + addLink.
  • commands/migrate-engine.ts: --force wipe switched to executeRaw('DELETE FROM pages') to preserve the pre-PR all-sources semantic after deletePage became default-source-scoped.
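A rough sketch of the two SQL shapes listed above, written as TypeScript string builders. pages(id, slug, source_id) comes from the summary; the `links` table, its columns, the parameter order, and the zero-row check used for fail-fast are assumptions for illustration, not the PR's actual queries.

```typescript
// Sketch only. pages(id, slug, source_id) is from the PR text; the links table
// and its columns (from_page_id, to_page_id, link_type) are hypothetical.

// Source-qualified page-id subquery. $n placeholders are 1-based parameter indexes.
function pageIdSubquery(slugParam: number, sourceParam: number): string {
  return `(SELECT id FROM pages WHERE slug = $${slugParam} AND source_id = $${sourceParam})`;
}

// addLink as VALUES + JOIN on (slug, source_id) rather than FROM pages f, pages t.
// Expected params: $1 from_slug, $2 from_source, $3 to_slug, $4 to_source, $5 link_type.
const ADD_LINK_SQL = [
  "INSERT INTO links (from_page_id, to_page_id, link_type)",
  "SELECT f.id, t.id, v.link_type",
  "FROM (VALUES ($1::text, $2::text, $3::text, $4::text, $5::text))",
  "  AS v(from_slug, from_source, to_slug, to_source, link_type)",
  "JOIN pages f ON f.slug = v.from_slug AND f.source_id = v.from_source",
  "JOIN pages t ON t.slug = v.to_slug AND t.source_id = v.to_source",
].join("\n");

// Fail fast when either source-qualified endpoint is missing: the JOIN then
// produces no rows, so the INSERT affects zero rows.
function assertLinkInserted(rowCount: number, from: string, to: string): void {
  if (rowCount === 0) {
    throw new Error(`addLink: endpoint not found (${from} -> ${to})`);
  }
}
```

Because each endpoint is resolved by joining on (slug, source_id), a missing endpoint yields zero rows instead of silently matching a same-slug page in another source.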

Backwards compatibility

  • Every method's opts parameter is optional. Callers that don't pass opts continue to target source='default' (the schema default) and behave identically to pre-fix.
  • The MCP put_page op handler (src/core/operations.ts) and gbrain import (src/commands/import.ts) are deliberately unchanged — both write to the default source by design and have no --source flag.
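The back-compat rule amounts to resolving an absent sourceId to the schema default. A minimal sketch, where `SourceOpts`, `resolveSourceId`, and `getTagsParams` are hypothetical helpers rather than the engine.ts interface:

```typescript
// Hypothetical opts shape; the PR text only says each method gains an optional opts.sourceId.
interface SourceOpts {
  sourceId?: string;
}

// Matches the schema's DEFAULT 'default' for pages.source_id.
const DEFAULT_SOURCE_ID = "default";

function resolveSourceId(opts?: SourceOpts): string {
  // `??` only falls back when sourceId is absent (undefined or null).
  return opts?.sourceId ?? DEFAULT_SOURCE_ID;
}

// Example: bind parameters for a source-qualified getTags-style lookup.
function getTagsParams(slug: string, opts?: SourceOpts): [string, string] {
  return [slug, resolveSourceId(opts)];
}
```

`getTagsParams('people/alice')` binds `['people/alice', 'default']`, so callers that pass no opts keep targeting source='default', while passing `{ sourceId: 'jarvis-memory' }` binds that source instead.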

Tests

New regression suite at test/source-id-tx-regression.test.ts — 19 tests covering:

  • Two sources × same slug coexist via putPage (no 21000).
  • getTags / addTag / removeTag / deleteChunks / upsertChunks / createVersion / addLink / removeLink / addTimelineEntry / deletePage / updateSlug source-scoped writes don't 21000.
  • Back-compat: every method without opts targets source='default'.
  • addLink fail-fast on missing source-qualified endpoint (was: silent cross-product hit on the wrong source).
  • importFromContent end-to-end transaction thread without fabricating a duplicate.

Validation

  • bun run typecheck clean
  • bun run build clean
  • 19/19 regression tests pass in 866ms
  • Full unit suite: 4082 pass / 1 fail / 0 skip. The 1 fail is BrainRegistry — lazy init > empty/null/undefined id routes to host and reproduces on untouched master (verified via git stash); the test depends on ~/.gbrain/ being absent in the test environment, unrelated to this PR.

Adversarial review

Codex (gpt-5.5 reviewer) + Grok (xAI flagship reviewer) 4-round adversarial crew loop:

  • R1: 2 HIGH (addTimelineEntry + post-sync extract bypass) + 2 MEDIUM. Fixed.
  • R2: 1 CRITICAL + 1 HIGH (deletePage + updateSlug bare-slug across both engines + sync callsites) + 2 MEDIUM. Fixed.
  • R3: 2 HIGH (getChunks bare-slug + migrate-engine.ts --force semantic regression introduced by R2's deletePage source-scoping) + 3 MEDIUM. Fixed.
  • R4: Both reviewers CLEAR. Convergence reached.

Deferred to follow-up PRs

Surfaced for visibility; not blocking this fix:

  1. src/commands/embed.ts source-aware threading. The auto-embed phase at sync.ts:823 re-enters runEmbed which calls upsertChunks defaulting to source='default'. For non-default-source syncs, runEmbed either fails with "Page not found" or updates the wrong source's chunks. Currently swallowed by best-effort try/catch (TODO comment in sync.ts:823).
  2. putRawData bare-slug at postgres-engine.ts:1511 / pglite-engine.ts:1446. Same family; lower-impact (raw-data path is metadata, not the import-tx hot path).
  3. Read-surface consistency cleanup for getLinks / getBacklinks / getTimeline / getRawData / getVersions. Non-mutating, will not 21000, but inconsistent with the new convention.
  4. reconcile-links.ts CLI --source flag. The internal sourceId opt is now wired; exposing it to CLI users is a UX feature.
  5. Backfill / cleanup migration for production rows misrouted to (default, slug) by the pre-fix putPage. Backfill heuristics need install-specific knowledge of each row's intended source, so we leave it as a deployment-side cleanup task rather than embed an assumption-laden migration in upstream.

Test plan

  • bun run typecheck passes
  • bun run build passes
  • bun test test/source-id-tx-regression.test.ts — 19/19 pass
  • bash scripts/run-unit-parallel.sh — 4082 pass / 1 pre-existing fail / 0 skip
  • Manual: PGLite multi-source flow (two sources × same slug, importFromContent under non-default source) — no 21000, no duplicate row fabricated

🤖 Generated with Claude Code



Michael Dela Cruz and others added 2 commits May 8, 2026 17:21
Multi-source brains crashed mid-import with Postgres 21000 ("more than one
row returned by a subquery used as an expression"). Root cause: putPage's
INSERT column list omitted source_id, so writes intended for a non-default
source (e.g. 'jarvis-memory') silently fabricated a duplicate row at
(default, slug). The schema has UNIQUE(source_id, slug) but DEFAULT 'default'
for source_id; calling putPage(slug, page) without source_id landed at
(default, slug) and ON CONFLICT updated the wrong row, leaving the intended
source row stale. Subsequent bare-slug subqueries inside the same tx —
(SELECT id FROM pages WHERE slug = $1) in getTags / removeTag / deleteChunks
/ removeLink / addLink (cross-product) — then matched 2 rows and crashed
with 21000, rolling back the entire import. Observed: 18 sync failures
against a 'jarvis-memory'-sourced brain.

Fix:
- putPage adds source_id to the INSERT column list (defaults 'default' for
  back-compat).
- Every bare-slug page-id subquery becomes source-qualified
  (AND source_id = $X) in both engines: createVersion, upsertChunks,
  getChunks, addTag, removeTag, getTags, deleteChunks, removeLink,
  addTimelineEntry, deletePage, updateSlug.
- addLink rewritten away from FROM pages f, pages t cross-product into a
  VALUES + JOIN-on-(slug, source_id) shape mirroring addLinksBatch.
- engine.ts interface: 11 method signatures gain optional opts.sourceId
  (or opts.{from,to,origin}SourceId for addLink/removeLink). All optional;
  existing callers default to source='default' and behave identically.
- import-file.ts: importFromContent / importFromFile / importCodeFile take
  opts.sourceId and thread txOpts = { sourceId } through every per-page tx
  call. engine.getPage callsite source-scoped for accurate idempotency.
- commands/sync.ts: thread opts.sourceId at importFile (line 581 + 641),
  un-syncable cleanup (487-498), delete phase (557), rename phase (574),
  and post-sync extract phase (815-816).
- commands/reindex-code.ts: thread opts.sourceId at importCodeFile call.
- commands/extract.ts: extractLinksForSlugs / extractTimelineForSlugs accept
  opts.sourceId and propagate via linkOpts / entryOpts.
- commands/reconcile-links.ts: ReconcileLinksOpts.sourceId was declared but
  ignored end-to-end; now wired through getPage + addLink calls.
- commands/migrate-engine.ts: --force wipe switched to executeRaw('DELETE
  FROM pages') to preserve the pre-PR all-sources semantic after deletePage
  became default-source-scoped.

Regression test: test/source-id-tx-regression.test.ts (19 tests). Validates
two sources × same slug coexist; getTags/addTag/removeTag/deleteChunks/
upsertChunks/createVersion/addLink/addTimelineEntry/deletePage/updateSlug
source-scoped writes don't 21000; back-compat without opts targets
source='default'; addLink fail-fast on missing source-qualified endpoint;
importFromContent end-to-end tx thread without fabricating duplicate.

Adversarial review: Codex (gpt-5.5 reviewer) + Grok (xAI flagship reviewer)
4-round crew loop. Round 1: 2 HIGH (addTimelineEntry + extract.ts thread)
+ 2 MED. Round 2: 1 CRITICAL + 1 HIGH (deletePage + updateSlug bare-slug)
+ 2 MED. Round 3: 2 HIGH (getChunks + migrate-engine semantic regression
introduced by R2 fix). Round 4: both reviewers CLEAR.

Deferred to follow-up PRs (noted as TODO):
- src/commands/embed.ts source-aware threading (auto-embed at sync.ts:823
  has a TODO; try/catch swallows the failure as best-effort).
- src/core/postgres-engine.ts:1511 / pglite-engine.ts:1446 putRawData
  bare-slug (lower-impact metadata path).
- Read-surface bare-slug consistency cleanup (getLinks/getBacklinks/
  getTimeline/getRawData/getVersions): non-mutating, won't 21000.
- reconcile-links.ts CLI --source flag exposure (internal opt is wired;
  CLI parser is a UX feature for later).

Existing rows in production written under (default, slug) by the old
putPage when caller meant another source remain misrouted. Backfill
heuristics need install-specific knowledge of intended source and are
outside this PR's scope; surface as a deployment-side cleanup task.

bun run typecheck clean, bun run build clean, 19/19 regression tests pass,
4082 unit pass / 1 pre-existing fail (BrainRegistry test depending on
test-env ~/.gbrain/ absence — fails on untouched main, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…n#707 gap)

PR garrytan#707 fixed source_id routing for sync's incremental loop (lines 581/641)
but performFullSync (line 922) calls runImport without threading sourceId.
Result: full syncs route pages to default even with --source <id>. Verified
on v0.30.1 by direct PGLite probe after `gbrain sync --source X --full`:
all pages landed in default, not the named source.

Fix:
- runImport accepts sourceId in opts (programmatic only — no CLI flag,
  preserving PR garrytan#707's design intent of `gbrain import` being default-only).
- runImport threads sourceId to importFile + importImageFile.
- performFullSync passes opts.sourceId to runImport.
- ImportImageOptions type accepts sourceId for runImport branch (importImageFile
  body wiring deferred — image imports out of scope for current use case;
  TS error fix only).
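A simplified sketch of the threading this commit describes. The real runImport, importFile, and importImageFile take an engine and further options; here they are collapsed into hypothetical stand-ins that just record which source each file would be imported into, and the extension-based image check is an assumption.

```typescript
// Hypothetical stand-ins: real runImport/importFile/importImageFile take an engine
// and more options. Only the sourceId threading is modeled here.
interface RunImportOpts {
  commit?: string;
  sourceId?: string; // programmatic only; `gbrain import` exposes no flag for it
}

interface ImportCall {
  kind: "file" | "image";
  path: string;
  sourceId?: string;
}

function runImport(paths: string[], opts: RunImportOpts): ImportCall[] {
  return paths.map((path): ImportCall => ({
    // Hypothetical routing rule; the real importer may decide this differently.
    kind: /\.(png|jpe?g|webp|avif)$/i.test(path) ? "image" : "file",
    path,
    // Threaded to both importFile and importImageFile.
    sourceId: opts.sourceId,
  }));
}

function performFullSync(paths: string[], opts: { sourceId?: string }, headCommit: string): ImportCall[] {
  // Pre-fix call was runImport(engine, importArgs, { commit: headCommit }), dropping sourceId.
  return runImport(paths, { commit: headCommit, sourceId: opts.sourceId });
}
```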

Verified: real sync test against /tmp/test-sync routes 1 page to "testsync"
source, 0 to default (post-fix). 19/19 source-id regression tests still pass.
Typecheck clean.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@jeremyknows

👋 Wanted to share verification + a small gap I found while applying this in production.

What I did: Rebased your branch onto current master (v0.30.1) — your branch was at v0.28.5 base, so 5 versions behind. Resolved 7 mechanical conflicts (combining current master's v0.29.1 effective_date / effective_date_source / import_filename columns into putPage's INSERT alongside your source_id plumbing, in both pglite-engine.ts + postgres-engine.ts + import-file.ts).

Verification:

  • bun run typecheck clean post-rebase
  • ✅ All 19/19 source-id-tx-regression tests pass post-rebase
  • ✅ Full unit suite 4517/4520 (3 failures pre-existing per your PR body, confirmed not from rebase)
  • ✅ Real-corpus test on 3,171-page brain (atlas operational diaries + a separate VeeFriends KB git repo): without the patch, sync of veefriends-kb routed all 59 pages to default; with the patch, all 59 routed to veefriends-kb correctly. Direct PGLite SQL probe confirms.

One gap I found: commands/sync.ts:performFullSync (line 922 on current master) calls runImport(engine, importArgs, { commit: headCommit }) — does not pass sourceId. Your fix correctly threads sourceId through the incremental sync loop (lines 581/641) but performFullSync (which --full invokes) still drops it. So gbrain sync --source X --full still routed pages to default for me.

Small follow-up patch (3 files, +18/-4) preserves your design intent of gbrain import CLI being default-only — adds sourceId to runImport's opts as a programmatic-only parameter (no CLI flag), then performFullSync passes opts.sourceId to runImport. Also extends ImportImageOptions type to accept sourceId (TS-only fix; image-import body wiring deferred). After this, full-sync routes correctly end-to-end.

Branch with rebase + gap-fix: jeremyknows/gbrain @ fix-707-performfullsync-gap-rebase. Two commits: rebased version of your 4823ddb (now 46cd197) + my follow-up 694f9c9. Happy to either open this as a separate PR after yours merges, or push to your branch if you'd prefer a single squashed PR.

Either way — really clean fix. The bare-slug subquery analysis caught a real Postgres 21000 surface I'd never have spotted. The 19 regression tests are the kind of thoroughness that makes adoption easy. Thanks 🙏

@mdcruz88
Author

mdcruz88 commented May 8, 2026

@jeremyknows really appreciate you driving the rebase and locking in the text/code sourceId threading — huge help getting us this far. After verifying on jk-rebase with my crew, we also need to wire sourceId through the image-import body (it's reachable under GBRAIN_EMBEDDING_MULTIMODAL=true from performFullSync) and add a writeSyncConfig?: boolean opt to runImport so full source syncs don't clobber the global sync.repo_path / sync.last_commit keys before performFullSync writes its source-scoped anchor. I'll fold your changes in with Co-authored-by: jeremyknows, implement those two pieces, rebase onto v0.30.1, add a performFullSync regression test, and ship as one tight PR.

PR garrytan#707's existing 19-test suite at test/source-id-tx-regression.test.ts
covers the engine-layer transaction surface (putPage / addTag / etc.)
but does NOT exercise commands/sync.ts:performFullSync. Verified via
`grep -c 'performFullSync' test/source-id-tx-regression.test.ts → 0`.

This means the +18/-4 fix at sync.ts:892 (performFullSync passing
sourceId to runImport) had no automated coverage.

Adds 2 PGLite-only regression tests:

1. `performFullSync with --source routes pages to named source (not default)`
   — fixture: temp git repo with 2 markdown files. Calls performSync with
   { full: true, sourceId: 'testsrc-pfs', noPull: true, noEmbed: true }.
   Asserts pages.source_id = 'testsrc-pfs', not 'default'. Pre-fix: FAILS
   (verified by checking out 46cd197 — rebased PR garrytan#707 only, without my
   gap-fix — and running this test). Post-fix: PASSES.

2. `performFullSync WITHOUT --source still targets default (back-compat)`
   — same fixture, no sourceId opt. Asserts pages.source_id = 'default'.
   Both pre-fix and post-fix: PASSES (back-compat preserved by the fix).

Verified: 21/21 tests pass on this branch (19 from PR garrytan#707 + 2 new).
`bun run typecheck` clean. `bun run verify` clean (8 guard checks pass).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@jeremyknows

Quick update: based on the verification work I'd offered earlier in this thread, I've now opened #757 as a standalone PR with the rebased commits, the gap-fix, and a regression test that exercises performFullSync. The existing 19-test suite at test/source-id-tx-regression.test.ts doesn't cover that path (verified via grep), and the new test fails on 46cd197, i.e. it fails as expected when run pre-fix.

The branch (jeremyknows/gbrain @ fix-707-performfullsync-gap-rebase) includes:

  1. Your 4823ddb rebased onto current master as 46cd197 (resolved 7 mechanical conflicts combining v0.29.1 effective_date columns with your source_id plumbing)
  2. My follow-up 694f9c9 (+18/-4: runImport accepts opts.sourceId programmatically; performFullSync passes it through; ImportImageOptions accepts sourceId for typecheck; preserves your design intent of gbrain import CLI being default-only)
  3. New 37dc598 (+139): regression test pair — one verifies --source routes correctly, one verifies back-compat without --source

If you'd prefer to pull 694f9c9 + 37dc598 into PR #707 and close #757, happy to do that — or close #757 myself once you incorporate the commits. #757 stands alone if PR #707 stays as-is, or rebases trivially after PR #707 merges.

Thanks again for the thoroughness in #707 — the bare-slug subquery analysis was the kind of work I wouldn't have spotted on my own.

Michael Dela Cruz and others added 3 commits May 9, 2026 09:16
…test

Test docstring carried references to a private downstream consumer's internal vocabulary (PRISM Round 2, Atlas Terminal agent, PR-E acceptance criterion). Scrubbed for upstream readability — test logic unchanged.
Follow-up to PR garrytan#707's text/code source_id thread + the performFullSync gap-fix:
the image-import path (importImageFile + withImportTransaction) still routed
all image-page, chunk, file, and sibling-link writes to source='default' even
when the caller passed sourceId. Reachable from performFullSync via
collectMarkdownFiles when GBRAIN_EMBEDDING_MULTIMODAL=true, recreating the
exact silent-misroute bug PR garrytan#707 was meant to close.

Fix:
- ImportTransactionSpec gains sourceId?: string. withImportTransaction threads
  it to tx.createVersion / putPage / getPage (file lookup) / upsertChunks /
  deleteChunks, and merges it into FileSpec.source_id when spec.file is present.
- importImageFile pre-existence check (engine.getPage(imageSlug)) source-scopes
  via opts.sourceId so identical-named images across sources don't false-skip.
- importImageFile after-hook (image_of sibling auto-link) source-scopes both
  the candidate lookup and the addLink edge endpoints, so a multi-source brain
  can't cross-link an image in source X to a same-slug text page in source Y.
- Test: 2 new cases in test/import-image-file.test.ts — sourceId='img-src'
  routes page+chunk+file to that source (not default); omitting sourceId still
  targets default (back-compat preserved).
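A sketch of the two scoping rules above. Only `ImportTransactionSpec.sourceId` and `FileSpec.source_id` are named in the commit; the other fields, the precedence order when merging, and the link direction in the hypothetical `imageOfEndpoints` helper are assumptions.

```typescript
// Hypothetical shapes; only ImportTransactionSpec.sourceId and FileSpec.source_id
// are named in the commit message.
interface FileSpec {
  path: string;
  source_id?: string;
}

interface ImportTransactionSpec {
  slug: string;
  sourceId?: string;
  file?: FileSpec;
}

// Assumption: the transaction's sourceId wins, then any source_id already on the
// FileSpec, then the schema default.
function fileSpecForTx(spec: ImportTransactionSpec): FileSpec | undefined {
  if (!spec.file) return undefined;
  return { ...spec.file, source_id: spec.sourceId ?? spec.file.source_id ?? "default" };
}

// image_of sibling link: both endpoints share one source, so an image in source X
// cannot link to a same-slug text page in source Y. Direction is an assumption.
function imageOfEndpoints(imageSlug: string, textSlug: string, sourceId = "default") {
  return {
    from: { slug: imageSlug, sourceId },
    to: { slug: textSlug, sourceId },
  };
}
```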

Co-Authored-By: Claude Opus 4.7 <[email protected]>
…ull sync

Follow-up to PR garrytan#707's text/code source_id thread + the performFullSync gap-fix:
runImport unconditionally wrote three global (non-source-scoped) config keys
on success — sync.last_commit / sync.last_run / sync.repo_path. When called
from performFullSync with a sourceId, those writes overwrote the global keys
with the source's values; a later bare `gbrain sync` then read X's repo path
as the default-source repo and imported X content into source='default'.
Silent contamination, hidden behind a correct-looking sources.X.last_sync_at.

Fix:
- runImport gains `writeSyncConfig?: boolean` (default true) on its programmatic
  opts. When false, the three global setConfig calls are skipped — but failure
  recording (sync-failures.jsonl) still runs unconditionally so doctor surfaces
  problems regardless of source.
- performFullSync passes `writeSyncConfig: !opts.sourceId` so default-source
  full syncs preserve legacy back-compat (global writes happen) while
  named-source full syncs delegate sync-state ownership to performFullSync's
  own source-scoped writeSyncAnchor calls.
- Test: 2 new cases in test/performfullsync-source-id.test.ts — sentinel
  values for sync.repo_path / sync.last_commit survive a `--source X --full`
  run; back-compat full sync (no --source) still updates them.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
mdcruz88 force-pushed the fix/source-id-tx-thread branch from 4823ddb to f6535bd May 9, 2026 00:43
@mdcruz88
Author

mdcruz88 commented May 9, 2026

Rebased onto v0.30.1 + folded in @jeremyknows's commits + added the two scoping pieces flagged in my earlier reply. PR is now MERGEABLE.

Stack (6 commits on top of v0.30.1 master, 4823ddbc → f6535bdd):

  1. 46cd1977 (mdcruz88) — original engine-layer source_id thread, rebased onto v0.30.1 with @jeremyknows's 7-conflict resolution combining the v0.29.1 effective_date / effective_date_source / import_filename columns.
  2. 694f9c97 (jeremyknows) — performFullSync gap-fix (+18/-4): runImport accepts opts.sourceId, threads through to importFile + importImageFile; performFullSync passes it through.
  3. 37dc5982 (jeremyknows) — regression test pair for performFullSync source routing (+ verifies back-compat without --source).
  4. 9b4d4413 (mdcruz88) — scrub a downstream consumer's internal vocabulary from the test docstring (kept the test logic identical; just removed in-house jargon).
  5. 8f31bfd9 (mdcruz88) — wire sourceId through the image-import body. ImportTransactionSpec gains sourceId?; withImportTransaction threads it to tx.createVersion / putPage / getPage / upsertChunks / deleteChunks and sets FileSpec.source_id. importImageFile source-scopes the pre-existence check, the withImportTransaction call, and the image_of sibling-link tx.getPage + tx.addLink endpoints. 2 new test cases.
  6. f6535bdd (mdcruz88) — runImport gains writeSyncConfig?: boolean (default true). When performFullSync calls with sourceId, it passes false so the legacy global writes (sync.last_commit / sync.last_run / sync.repo_path) don't overwrite the global keys with that source's values. Without this gate, a --source X --full run leaves global sync.repo_path pointing at X, so a later bare gbrain sync reads X's repo as the default-source repo. 2 new isolation tests.

Test posture (local):

  • Targeted regression set: 71/71 pass across source-id-tx-regression.test.ts (19) + performfullsync-source-id.test.ts (4) + import-image-file.test.ts (9) + sync-failures.test.ts (39).
  • bun run typecheck — 2 pre-existing Cannot find module '@jsquash/avif/decode.js' / 'exifr' errors on lines 781 / 816 of import-file.ts, unchanged from the rebase base. None of my edits introduce new TS errors.
  • bun run verify — 6 of 8 guards pass; check:wasm exited with code 137 (OOM under bun build --compile) on my local machine — environmental. CI will run the full gate.

Co-authorship: all 3 of @jeremyknows's commits land verbatim with their original authorship (rebase preserved attribution on commit 1; commits 2 + 3 keep his author trailer). My follow-up commits (4, 5, 6) reference his work in their bodies.

Out-of-scope follow-ups captured but explicitly NOT in this PR:

@jeremyknows — feel free to close #757 whenever; everything from your branch is now in #707 with attribution preserved. Thanks for the rebase + the regression test that empirically proved the gap.

@garrytan — closes #497, #540 + the multi-source full-sync silent-misroute class. Happy to split this into smaller PRs if the scope is too wide for a single review; let me know.

@garrytan
Owner

garrytan commented May 9, 2026

Thanks for the contribution! This is being closed because it was superseded by #757 in #776 (full superset including the performFullSync gap). Real fix shipped; your name will appear in the v0.30.3 release notes attribution where applicable. Thank you again.

garrytan closed this May 9, 2026
