
fix(extract): dry-run reports net-new rows, not raw candidates (closes #397)#914

Open
vinsew wants to merge 1 commit into garrytan:master from vinsew:fix-extract-dryrun-multisource

Conversation


@vinsew vinsew commented May 12, 2026

Summary

`gbrain extract links|timeline --source db --dry-run` reports the candidate count, not the would-be net-new row count. On a second run, after a real extract has populated `links` / `timeline_entries`, every candidate already exists in the DB, yet dry-run still reports all of them as "would create N." On a real run, `ON CONFLICT DO NOTHING` in `addLinksBatch` / `addTimelineEntriesBatch` would silently no-op the lot.

This is the problem #397 set out to fix. That PR landed before v0.32.8's multi-source threading (#860) reshaped both extractors, so the old patch can't be cherry-picked onto master:

  • Candidate dedup keys now carry source ids (6 segments: `from_source_id::from_slug::to_source_id::to_slug::link_type::link_source`).
  • `engine.getLinks()` does not return `f.source_id` / `t.source_id`, and `Link` has no source-id fields. Comparing a 6-segment candidate key against a 5-segment DB row key would false-positive cross-source rows.

This PR closes #397.
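To make the second bullet concrete, here is a minimal sketch (illustrative helper names and values, not code from this repo) of why a key without the source-id segments collapses cross-source rows:

```typescript
// Builds a 6-segment candidate key of the shape described above.
function candidateKey(
  fromSourceId: number, fromSlug: string,
  toSourceId: number, toSlug: string,
  linkType: string, linkSource: string,
): string {
  return [fromSourceId, fromSlug, toSourceId, toSlug, linkType, linkSource].join('::');
}

// Same slugs in two different sources: distinct candidates.
const a = candidateKey(1, 'home', 1, 'about', 'wiki', 'markdown');
const b = candidateKey(2, 'home', 2, 'about', 'wiki', 'markdown');

// A row key missing the two source-id segments cannot tell them apart:
function withoutSourceIds(key: string): string {
  return key.split('::').filter((_, i) => i !== 0 && i !== 2).join('::');
}
// withoutSourceIds(a) === withoutSourceIds(b), so a source-id-less DB key
// would mark the source-2 candidate as "already exists": a false positive.
```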

Approach

Per-extractor inline SQL that returns source ids alongside the link / timeline shape, plus a per-(from-page, source) cache:

```typescript
async function existingLinkKeysForFrom(fromSlug: string, fromSourceId: number): Promise<Set<string>> {
  const rows = await engine.executeRaw(`
    SELECT f.slug AS from_slug, t.slug AS to_slug,
           l.link_type, l.link_source,
           f.source_id AS from_source_id, t.source_id AS to_source_id
    FROM links l
    JOIN pages f ON f.id = l.from_page_id
    JOIN pages t ON t.id = l.to_page_id
    WHERE f.slug = $1 AND f.source_id = $2
  `, [fromSlug, fromSourceId]);
  return new Set(rows.map(r =>
    `${r.from_source_id}::${r.from_slug}::${r.to_source_id}::${r.to_slug}::${r.link_type}::${r.link_source ?? 'markdown'}`
  ));
}
```

The key shape is byte-for-byte identical to the candidate dedup key the extractor already builds (extract.ts line ~872), so the existing/seen sets compose cleanly.
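The per-(from-page, source) cache is not shown above; its shape could look like the following sketch (the names `memoizeKeys` / `KeyFetcher` are illustrative, not the PR's code):

```typescript
// Sketch only: a per-(fromSlug, fromSourceId) memo so repeated dry-run
// passes over the same page cost a single DB round-trip.
// `KeyFetcher` stands in for a query helper like existingLinkKeysForFrom.
type KeyFetcher = (fromSlug: string, fromSourceId: number) => Promise<Set<string>>;

function memoizeKeys(fetch: KeyFetcher): KeyFetcher {
  const cache = new Map<string, Promise<Set<string>>>();
  return (fromSlug, fromSourceId) => {
    const cacheKey = `${fromSourceId}::${fromSlug}`;
    let pending = cache.get(cacheKey);
    if (!pending) {
      // Store the promise, not the resolved set, so concurrent callers
      // for the same page share one in-flight query.
      pending = fetch(fromSlug, fromSourceId);
      cache.set(cacheKey, pending);
    }
    return pending;
  };
}
```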

Why inline SQL instead of extending engine.getLinks()

The cleaner-looking alternative is to add `from_source_id` / `to_source_id` to `Link` and to `getLinks`' SELECT. That ripples to five callers outside extract.ts:

  • src/core/operations.ts:734 (auto-link reconciliation)
  • src/core/operations.ts:1418 (per-page link API)
  • src/core/output/validators/back-link.ts:27 (back-link validator)
  • src/core/output/validators/back-link.ts:37 (back-link validator)
  • src/commands/migrate-engine.ts:234 (engine migration)

None of those need the source ids. Widening the interface across all five callers for a feature only extract --dry-run consumes is the wrong shape. Inline SQL keeps the blast radius at the two extractor functions.

Test plan

4 new cases in test/extract-db.test.ts:

  • links: dry-run after a real run reports zero new add_link rows
  • links: dry-run reports a newly-added candidate after the page body changes
  • timeline: dry-run after a real run reports zero new add_timeline rows
  • timeline: dry-run reports a newly-added entry after timeline section changes

Each pair pins both halves of the contract — the dedup must remove "already in DB" candidates AND must not eat genuinely new candidates.

Run results:

```
$ bun run typecheck
$ tsc --noEmit
(clean)

$ bun test test/extract-db.test.ts
 16 pass
 0 fail
 36 expect() calls
Ran 16 tests across 1 file. [1.79s]
```

Scope

Unchanged: real-run output, JSON shape, command-line surface, engine APIs.

Only changes:

  • src/commands/extract.ts — adds 2 cached helpers (existingLinkKeysForFrom, existingTimelineKeysForSlug) and a 4-line continue guard inside each extractor's dry-run branch.
  • test/extract-db.test.ts — 4 new tests.
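For orientation, the guard's logic reduces to a filter like this (a sketch under my own naming, not the PR's actual diff):

```typescript
// Illustrative shape of the dry-run guard: drop candidates whose key is
// already in the DB-backed set, and dedupe within the run, so dry-run
// counts only net-new rows.
function netNewCandidates(candidateKeys: string[], existing: Set<string>): string[] {
  const seen = new Set<string>();
  const netNew: string[] = [];
  for (const key of candidateKeys) {
    if (existing.has(key)) continue; // already a DB row: not net-new
    if (seen.has(key)) continue;     // duplicate candidate in this run
    seen.add(key);
    netNew.push(key);
  }
  return netNew;
}
```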

🤖 Generated with Claude Code



…garrytan#397)

`gbrain extract links|timeline --source db --dry-run` reports the
candidate count, not the would-be net-new row count: a second run after
the first one populated `links` / `timeline_entries` reports the full
candidate set again, even though every row already exists. `ON CONFLICT
DO NOTHING` in `addLinksBatch` / `addTimelineEntriesBatch` would silently
no-op all of them on a real run, so the dry-run output overshoots by
exactly that amount.

This was the original problem garrytan#397 set out to fix. That PR landed before
v0.32.8's multi-source threading (garrytan#860) reshaped both extractors. The
old patch can't cherry-pick onto master:

- candidate dedup keys now carry source ids (6 segments: from_source_id ::
  from_slug :: to_source_id :: to_slug :: link_type :: link_source).
- `engine.getLinks()` does not return `f.source_id` / `t.source_id`,
  and `Link` has no source-id fields. Comparing a 6-segment candidate
  key against a 5-segment DB row key would false-positive cross-source
  rows.

Fix: inline a multi-source-aware SQL query per (from_slug, from_source_id)
or (slug, source_id) tuple, build a key set with the exact same 6/4
segment shape the candidate path emits, and skip dry-run candidates whose
key matches a DB row. Cached per from-page (and per from-slug for
timeline) so re-iteration over the same page costs one round-trip.

Why inline SQL instead of extending `engine.getLinks()`: that path
would change the public `Link` interface and ripple to five callers
outside extract.ts (operations.ts × 2, back-link.ts × 2, migrate-engine.ts × 1).
The inline query stays scoped to the extractor and keeps the blast
radius at the two affected functions.

Tests in test/extract-db.test.ts (4 new cases):

  links: dry-run after a real run reports zero new add_link rows
  links: dry-run reports a newly-added candidate after the page body changes
  timeline: dry-run after a real run reports zero new add_timeline rows
  timeline: dry-run reports a newly-added entry after timeline section changes

Each pair pins both halves of the contract — the dedup must remove
"already in DB" candidates AND must not eat genuinely new candidates.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
