Multi-source: pages always written with source_id='default' (federation appears empty) #1015

@joshwilks111-max

Summary

On a multi-source brain (1 default + 19 federated), every page lands in source_id='default' regardless of which source the sync ran against. gbrain sources list reports correct paths and recent last_sync_at timestamps for all federated sources, but every one shows 0 pages. The source-aware ranking machinery added in v0.22.0 is effectively running against a single-source corpus.

Tested on gbrain CLI 0.26.0 against Postgres (Supabase pooler). Brain has 2992 pages total; all 2992 attribute to default.

Reproducer

gbrain sources add propdevnz --path /path/to/propdev
cd /path/to/propdev
gbrain sources attach propdevnz   # writes .gbrain-source dotfile
gbrain sync
gbrain sources list               # propdevnz still shows 0 pages despite recent last_sync_at

Direct probe confirms:

SELECT source_id, COUNT(*) FROM pages GROUP BY source_id;
-- default | 2992
-- (no other rows)

Per-source state IS being maintained — sources.last_sync_at, last_commit, repo_path, chunker_version all update correctly via the source-scoped sync-state helpers. Only the actual page writes are mis-attributed.

Root cause — three layers, one missing thread

  1. Engine layer: src/core/postgres-engine.ts:302-314 (and the PGLite mirror). putPage's INSERT INTO pages doesn't include source_id; it relies on the schema DEFAULT 'default'. The code's own comment at line 300 marks the gap explicitly:

    // v0.18.0 Step 2: source_id relies on schema DEFAULT 'default'. ON
    // CONFLICT target becomes (source_id, slug) since global UNIQUE(slug)
    // was dropped in migration v17. See pglite-engine.ts for matching
    // notes; multi-source sync (Step 5) will surface an explicit sourceId.

  2. Import layer: src/core/import-file.ts:338-343. The importFromFile opts type is { noEmbed?: boolean; inferFrontmatter?: boolean }, with no sourceId field, so tx.putPage(slug, {...}) at line 272 has no source attribution to pass.

  3. Sync + import callers: src/commands/sync.ts:527, 582 and src/commands/import.ts:108. performSync correctly resolves opts.sourceId (used for readSyncAnchor/writeSyncAnchor), but the importFile(engine, filePath, to, { noEmbed }) calls drop it on the floor.

So per-source sync state is tracked precisely (PropDev was last synced at headCommit XYZ at 5:37am) — and then the file content gets written under default. The routing infrastructure exists; the data plane was never wired.

Why it didn't get caught

  • E2E coverage for sync is single-source PGLite. There's no test asserting INSERT INTO sources (id, local_path) ...; sync; SELECT COUNT(*) FROM pages WHERE source_id = $new_source is non-zero.
  • gbrain sources list reports last_sync_at from the sources row — which IS being updated correctly — so the dashboard surface looks healthy. The 0-pages number is the only tell, and it reads as "haven't synced this one yet" rather than "data plane disconnected."
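
A check for that "only tell" could look something like the sketch below. It is not existing gbrain code: the row shapes and the function name findSyncedButEmptySources are hypothetical, and it assumes the page counts have already been converted to numbers (some Postgres drivers return COUNT(*) as a string).

```typescript
// Hypothetical row shapes, loosely modelled on the sources table and on
// the result of: SELECT source_id, COUNT(*) FROM pages GROUP BY source_id
interface SourceRow {
  id: string;
  last_sync_at: string | null; // null means the source has never synced
}

interface PageCountRow {
  source_id: string;
  count: number; // assumed already parsed to a number
}

// Returns the ids of sources that report a sync but own zero pages.
// That combination is the symptom described in this issue.
function findSyncedButEmptySources(
  sources: SourceRow[],
  counts: PageCountRow[],
): string[] {
  const bySource = new Map<string, number>();
  counts.forEach((row) => bySource.set(row.source_id, row.count));

  return sources
    .filter((s) => s.last_sync_at !== null && (bySource.get(s.id) ?? 0) === 0)
    .map((s) => s.id);
}
```

Run against the state described above (default holding all 2992 pages), this would flag every federated source that has a last_sync_at, which is a harder signal to misread than a bare 0 in the list output.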

Proposed fix

Forward fix is small (~30 LOC + tests):

  1. Add sourceId?: string to importFromFile opts (and importCodeFile, importFromContent).
  2. Add source_id?: string to PageInput; pass through to tx.putPage.
  3. Engine putPage: include source_id in INSERT (COALESCE-to-default keeps back-compat). The (source_id, slug) UNIQUE is already the conflict target so no schema change.
  4. performSync and runImport thread the resolved opts.sourceId into importFile.
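
To make the shape of those four steps concrete, here is a minimal in-memory sketch. Everything in it is a simplified stand-in rather than the real code: MemoryEngine replaces the Postgres/PGLite engines, the PageInput fields other than source_id are invented, and the real importFromContent and performSync signatures are assumed to differ.

```typescript
// Simplified stand-in for the real PageInput; only source_id matters here.
interface PageInput {
  title: string;
  body: string;
  source_id?: string; // step 2: optional source attribution
}

interface ImportOpts {
  noEmbed?: boolean;
  inferFrontmatter?: boolean;
  sourceId?: string; // step 1: new optional field
}

type StoredPage = PageInput & { slug: string; source_id: string };

// In-memory engine keyed by (source_id, slug), mimicking the
// UNIQUE(source_id, slug) conflict target.
class MemoryEngine {
  readonly pages = new Map<string, StoredPage>();

  putPage(slug: string, page: PageInput): void {
    // step 3: fall back to 'default' when no source is given,
    // which is the back-compat role COALESCE plays in the SQL version.
    const source_id = page.source_id ?? "default";
    this.pages.set(`${source_id}\u0000${slug}`, { ...page, slug, source_id });
  }

  countBySource(): Record<string, number> {
    const counts: Record<string, number> = {};
    this.pages.forEach((p) => {
      counts[p.source_id] = (counts[p.source_id] ?? 0) + 1;
    });
    return counts;
  }
}

function importFromContent(
  engine: MemoryEngine,
  slug: string,
  body: string,
  opts: ImportOpts = {},
): void {
  const page: PageInput = { title: slug, body };
  if (opts.sourceId !== undefined) page.source_id = opts.sourceId;
  engine.putPage(slug, page);
}

// step 4: the caller threads its resolved sourceId into every import call.
function performSync(
  engine: MemoryEngine,
  files: Record<string, string>,
  opts: { sourceId?: string } = {},
): void {
  Object.keys(files).forEach((slug) => {
    const importOpts: ImportOpts = { noEmbed: true };
    if (opts.sourceId !== undefined) importOpts.sourceId = opts.sourceId;
    importFromContent(engine, slug, files[slug] ?? "", importOpts);
  });
}
```

Calling performSync with sourceId 'propdevnz' and then again with no sourceId leaves the two runs in separate buckets in countBySource(), even when they share a slug. That is the behaviour the missing E2E test would assert.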

Backfill is the trickier part. The existing pages sit under ('default', slug), so a plain re-sync after the fix would not match them on the (source_id, slug) conflict target and would insert duplicates under the new source rather than updating them in place. Cleanest path I can see is a filesystem-probe backfill: for each non-default source, walk local_path, slugify each file, and run UPDATE pages SET source_id = ? WHERE source_id = 'default' AND slug = ? for matches. Deterministic, and it leaves genuinely-default pages alone.
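
The planning half of that backfill can be sketched as a pure function, with each planned move mapping to one of the UPDATE statements above. The slugify rule here is a guess (strip the extension, lower-case, normalise separators) and would need to be swapped for gbrain's real slug function. Setting aside slugs claimed by more than one source is an extra safety assumption of mine, not part of the proposal.

```typescript
interface SourceFiles {
  sourceId: string;
  relativePaths: string[]; // files found by walking the source's local_path
}

interface Reassignment {
  slug: string;
  toSourceId: string;
}

// Hypothetical slug rule; the project's actual slugify may differ.
function slugify(relativePath: string): string {
  return relativePath
    .replace(/\\/g, "/")       // normalise Windows separators
    .replace(/\.[^./]+$/, "")  // drop the final extension
    .toLowerCase();
}

// Plans which pages currently under 'default' should move to which source.
// Slugs claimed by more than one source are reported instead of moved.
function planBackfill(
  defaultSlugs: Set<string>,
  sources: SourceFiles[],
): { moves: Reassignment[]; ambiguous: string[] } {
  const claims = new Map<string, Set<string>>();

  sources.forEach((src) => {
    if (src.sourceId === "default") return; // genuinely-default pages stay put
    src.relativePaths.forEach((path) => {
      const slug = slugify(path);
      if (!defaultSlugs.has(slug)) return; // nothing to reassign
      const owners = claims.get(slug) ?? new Set<string>();
      owners.add(src.sourceId);
      claims.set(slug, owners);
    });
  });

  const moves: Reassignment[] = [];
  const ambiguous: string[] = [];
  claims.forEach((owners, slug) => {
    const ids = Array.from(owners);
    const only = ids[0];
    if (ids.length === 1 && only !== undefined) {
      moves.push({ slug, toSourceId: only });
    } else {
      ambiguous.push(slug);
    }
  });
  return { moves, ambiguous };
}
```

Separating the plan from the writes would also make a dry-run mode cheap: print the moves and the ambiguous slugs, then apply the updates in a single transaction once the plan looks right.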

Offer

Happy to put up a PR if useful — the diagnosis is the hard part and that's done. Let me know if you'd prefer to handle it yourself, or if there's a different shape you want for the fix (e.g. require explicit migration vs auto-backfill).

Diagnosed in a Claude Code session against my live brain; happy to provide additional probes if you want specific queries run.
