gbrain sync deletes pages by slug only, ignoring source_id - cross-source data loss #705

@orendi84

Summary

gbrain sync against source A can delete brain pages that came from source B if they share a slug. The deletion in src/commands/sync.ts:425-440 filters the source's git diff for un-syncable files, then calls engine.deletePage(slug) without checking whether that brain page actually originated from this source.

I hit this in production: a one-line linkedin-brain README page (slug readme, ingested April 10 from ~/linkedin-brain via gbrain import) was silently deleted by a misconfigured gbrain sync targeting ~/gbrain-prod. gbrain-prod's README.md is un-syncable (no YAML frontmatter), so sync's deletion logic treated the brain page slug readme as a stale orphan and removed it - even though the page belonged to a completely different source.

Reproduction

  1. Two source repos with same-slug files:
    • ~/source-a/README.md (no YAML frontmatter, un-syncable)
    • ~/source-b/README.md (well-formed, syncable - or just any source that imported a readme slug)
  2. Source B was ingested at some point: gbrain import ~/source-b → creates page slug readme.
  3. Some prior gbrain sync ran against source A and bookmarked last_commit (sets sync.last_commit global key, since no --source was passed).
  4. Run gbrain sync again (no --source, no --repo) - it picks up source A from the global sync.repo_path fallback, runs git pull, computes manifest, finds A's README.md was modified between bookmarks.
  5. runPhaseSync filters A's manifest.modified → un-syncable files → calls engine.deletePage('readme').
  6. The readme page from source B is gone, and the delete cascades to its content_chunks, links, tags, page_versions, etc.

Root cause

src/commands/sync.ts:425-440 (v0.27.0):

const unsyncableModified = manifest.modified.filter(p => !isSyncable(p, syncOpts));
for (const path of unsyncableModified) {
  const slug = resolveSlugForPath(path);
  try {
    const existing = await engine.getPage(slug);
    if (existing) {
      await engine.deletePage(slug);
      console.log(`  Deleted un-syncable page: ${slug}`);
    }
  } catch { /* ignore */ }
}

engine.getPage(slug) and engine.deletePage(slug) operate on the slug across ALL sources. They don't take a sourceId parameter. Even though runPhaseSync already resolved sourceId upstream at line ~534 to thread into performSync, the deletion path doesn't use it.
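The slug-only behavior described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not gbrain's code: `Page`, `SlugOnlyEngine`, and the `sourceId` field are hypothetical names, and the real engine is asynchronous and database-backed.

```typescript
// Hypothetical in-memory model; Page and SlugOnlyEngine are illustrative names,
// not gbrain's actual types.
interface Page {
  slug: string;
  sourceId: string;
  body: string;
}

class SlugOnlyEngine {
  private pages: Page[] = [];

  putPage(page: Page): void {
    this.pages.push(page);
  }

  // Mirrors the reported behavior: the lookup ignores which source owns the page.
  getPage(slug: string): Page | undefined {
    return this.pages.filter(p => p.slug === slug)[0];
  }

  // Mirrors the reported behavior: every page with this slug is removed.
  deletePage(slug: string): void {
    this.pages = this.pages.filter(p => p.slug !== slug);
  }
}

const engine = new SlugOnlyEngine();
engine.putPage({ slug: "readme", sourceId: "source-b", body: "linkedin-brain" });

// A sync of source A finds its own un-syncable README.md and resolves it to "readme".
const slugFromSourceA = "readme";
if (engine.getPage(slugFromSourceA)) {
  engine.deletePage(slugFromSourceA);
}

// The only "readme" page belonged to source B, and it no longer exists.
const survivor = engine.getPage("readme");
console.log(survivor === undefined ? "source-b page deleted" : "source-b page kept");
```

Because neither call receives a source identifier, the existence check passes for any page with a matching slug, and the delete that follows cannot tell the two sources apart.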

Suggested fix

Filter the deletion by source provenance. Two options:

Option 1 (preferred): use ingest_log to derive provenance.
For each candidate slug, check the most recent ingest_log entry that lists this slug in pages_updated. If the entry's source_ref (or its mapped sourceId) doesn't match the current sync's source, skip the deletion. Source-agnostic deletes only fire when sync is explicitly source-agnostic (pre-v0.18 brains).
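A minimal sketch of the Option 1 provenance check, assuming ingest_log rows can be loaded into memory. Only `source_ref` and `pages_updated` are named in this issue; the `created_at` field, the `IngestLogEntry` type, and both function names are hypothetical.

```typescript
// Hypothetical row shape: only source_ref and pages_updated come from this issue;
// the timestamp field is an assumption.
interface IngestLogEntry {
  source_ref: string;
  pages_updated: string[];
  created_at: number; // assumed: epoch milliseconds
}

// Returns the source_ref of the most recent entry that lists the slug, or undefined.
function latestSourceForSlug(log: IngestLogEntry[], slug: string): string | undefined {
  let latest: IngestLogEntry | undefined;
  for (const entry of log) {
    if (entry.pages_updated.indexOf(slug) === -1) continue;
    if (latest === undefined || entry.created_at > latest.created_at) latest = entry;
  }
  return latest === undefined ? undefined : latest.source_ref;
}

// Allow the delete only when the slug's latest writer is the source being synced.
// An undefined currentSourceRef models an explicitly source-agnostic sync.
// A slug with no ingest_log entry is kept; that is a conservative assumption.
function mayDelete(
  log: IngestLogEntry[],
  slug: string,
  currentSourceRef: string | undefined,
): boolean {
  if (currentSourceRef === undefined) return true;
  return latestSourceForSlug(log, slug) === currentSourceRef;
}

const ingestLog: IngestLogEntry[] = [
  { source_ref: "source-b", pages_updated: ["readme"], created_at: 1 },
  { source_ref: "source-a", pages_updated: ["notes"], created_at: 2 },
];

const deleteReadmeFromA = mayDelete(ingestLog, "readme", "source-a");
const deleteNotesFromA = mayDelete(ingestLog, "notes", "source-a");
console.log(deleteReadmeFromA, deleteNotesFromA); // false true
```

In the sync loop, a guard like this would sit between the existence check and the delete, so a slug last written by another source is skipped.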

Option 2: extend deletePage to take a sourceId.
engine.deletePage(slug, sourceId) does DELETE FROM pages WHERE slug = $1 AND source_id = $2. The sync deletion path threads its sourceId (already resolved). Same-slug pages from other sources survive.
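Option 2 can be sketched with an in-memory stand-in for the pages table. The `ScopedPage` type and `SourceScopedEngine` class are hypothetical; the sketch assumes pages are unique per (source_id, slug) pair, and it is synchronous for brevity where the real engine is not.

```typescript
// Hypothetical types for illustration; not the BrainEngine interface itself.
interface ScopedPage {
  slug: string;
  sourceId: string;
}

class SourceScopedEngine {
  private pages: ScopedPage[] = [];

  putPage(page: ScopedPage): void {
    // Replace any existing page with the same (sourceId, slug) pair.
    this.deletePage(page.slug, page.sourceId);
    this.pages.push(page);
  }

  getPage(slug: string, sourceId: string): ScopedPage | undefined {
    return this.pages.filter(p => p.slug === slug && p.sourceId === sourceId)[0];
  }

  // In-memory analogue of: DELETE FROM pages WHERE slug = $1 AND source_id = $2
  // Returns the number of pages removed.
  deletePage(slug: string, sourceId: string): number {
    const before = this.pages.length;
    this.pages = this.pages.filter(p => !(p.slug === slug && p.sourceId === sourceId));
    return before - this.pages.length;
  }
}

const scoped = new SourceScopedEngine();
scoped.putPage({ slug: "readme", sourceId: "source-a" });
scoped.putPage({ slug: "readme", sourceId: "source-b" });

// A sync of source A removes only its own "readme" page.
const removed = scoped.deletePage("readme", "source-a");
const sourceBPage = scoped.getPage("readme", "source-b");
console.log(removed, sourceBPage !== undefined); // 1 true
```

Scoping both the lookup and the delete by source also covers the case where source A never owned a page with that slug: the delete matches nothing.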

Option 2 is cleaner but is a breaking change to the BrainEngine interface. Option 1 is self-contained but adds a query per un-syncable candidate.

Impact

Severity: data loss across sources. It is hard to diagnose because the page disappears silently with a single log line, its page_versions rows are cascade-deleted (so neither gbrain doctor nor version history can recover it), and the only audit trail, ingest_log, shows the original write but does not capture the deletion.

This most likely affects users with multiple sources, or users who run gbrain import <dir> once and forget the global sync.* config keys it leaves behind. The same deletion-by-slug ambiguity also affects sync's manifest.deleted path on line 422 (same shape).

Workaround

For users who hit this:

  • The deleted page is recoverable only if the source content is still on disk: gbrain put <slug> < ~/source-b/README.md reinstates the page.
  • To prevent recurrence, delete the bare-fallback global config keys: DELETE FROM config WHERE key IN ('sync.repo_path', 'sync.last_commit', 'sync.last_run'). After this, gbrain sync without --source errors out instead of silently targeting the wrong repo.

Version

gbrain 0.27.0 (binary). Source viewed at commit ee9ceb3 (v0.27 release base) and b325f28 (v0.28.6 master). The behavior in sync.ts:425-440 is unchanged on master.

Engagement context

Discovered during a pollution-cleanup engagement after a misconfigured autopilot --install --repo ~/gbrain-prod (instead of ~/youtube-knowledge). The cleanup successfully deleted 176 polluted rows, using gbrain sync --source default --no-pull for the recovery sync.
