
fix(extract): dry-run reports net-new rows, not raw candidates (closes #397)#914

Open
vinsew wants to merge 1 commit into garrytan:master from vinsew:fix-extract-dryrun-multisource

Conversation


@vinsew vinsew commented May 12, 2026

Summary

`gbrain extract links|timeline --source db --dry-run` reports the candidate count, not the would-be net-new row count. On a second run, after a real extract has populated `links` / `timeline_entries`, every candidate already exists in the DB, yet dry-run still reports all of them as "would create N." On a real run, `ON CONFLICT DO NOTHING` in `addLinksBatch` / `addTimelineEntriesBatch` would silently no-op the lot.

This is the problem #397 set out to fix. That PR landed before v0.32.8's multi-source threading (#860) reshaped both extractors, so the old patch can't be cherry-picked onto master:

  • Candidate dedup keys now carry source ids (6 segments: `from_source_id::from_slug::to_source_id::to_slug::link_type::link_source`).
  • `engine.getLinks()` does not return `f.source_id` / `t.source_id`, and `Link` has no source-id fields. Comparing a 6-segment candidate key against a 5-segment DB row key would false-positive cross-source rows.

This PR closes #397.
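To make the second bullet concrete, here is a minimal sketch (illustrative helper names and values, not code from this repo) of why a key without the source-id segments collapses cross-source rows:

```typescript
// Builds a 6-segment candidate key of the shape described above.
function candidateKey(
  fromSourceId: number, fromSlug: string,
  toSourceId: number, toSlug: string,
  linkType: string, linkSource: string,
): string {
  return [fromSourceId, fromSlug, toSourceId, toSlug, linkType, linkSource].join('::');
}

// Same slugs in two different sources: distinct candidates.
const a = candidateKey(1, 'home', 1, 'about', 'wiki', 'markdown');
const b = candidateKey(2, 'home', 2, 'about', 'wiki', 'markdown');

// A row key missing the two source-id segments cannot tell them apart:
function withoutSourceIds(key: string): string {
  return key.split('::').filter((_, i) => i !== 0 && i !== 2).join('::');
}
// withoutSourceIds(a) === withoutSourceIds(b), so a source-id-less DB key
// would mark the source-2 candidate as "already exists": a false positive.
```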

Approach

Per-extractor inline SQL that returns source ids alongside the link / timeline shape, plus a per-(from-page, source) cache:

```typescript
async function existingLinkKeysForFrom(fromSlug: string, fromSourceId: number): Promise<Set<string>> {
  const rows = await engine.executeRaw(`
    SELECT f.slug AS from_slug, t.slug AS to_slug,
           l.link_type, l.link_source,
           f.source_id AS from_source_id, t.source_id AS to_source_id
    FROM links l
    JOIN pages f ON f.id = l.from_page_id
    JOIN pages t ON t.id = l.to_page_id
    WHERE f.slug = $1 AND f.source_id = $2
  `, [fromSlug, fromSourceId]);
  return new Set(rows.map(r =>
    `${r.from_source_id}::${r.from_slug}::${r.to_source_id}::${r.to_slug}::${r.link_type}::${r.link_source ?? 'markdown'}`
  ));
}
```

The key shape is byte-for-byte identical to the candidate dedup key the extractor already builds (extract.ts line ~872), so the existing/seen sets compose cleanly.
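The per-(from-page, source) cache is not shown above; its shape could look like the following sketch (the names `memoizeKeys` / `KeyFetcher` are illustrative, not the PR's code):

```typescript
// Sketch only: a per-(fromSlug, fromSourceId) memo so repeated dry-run
// passes over the same page cost a single DB round-trip.
// `KeyFetcher` stands in for a query helper like existingLinkKeysForFrom.
type KeyFetcher = (fromSlug: string, fromSourceId: number) => Promise<Set<string>>;

function memoizeKeys(fetch: KeyFetcher): KeyFetcher {
  const cache = new Map<string, Promise<Set<string>>>();
  return (fromSlug, fromSourceId) => {
    const cacheKey = `${fromSourceId}::${fromSlug}`;
    let pending = cache.get(cacheKey);
    if (!pending) {
      // Store the promise, not the resolved set, so concurrent callers
      // for the same page share one in-flight query.
      pending = fetch(fromSlug, fromSourceId);
      cache.set(cacheKey, pending);
    }
    return pending;
  };
}
```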

Why inline SQL instead of extending engine.getLinks()

The cleaner-looking alternative is to add `from_source_id` / `to_source_id` to `Link` and to `getLinks`' SELECT. That ripples to five callers outside extract.ts:

  • src/core/operations.ts:734 (auto-link reconciliation)
  • src/core/operations.ts:1418 (per-page link API)
  • src/core/output/validators/back-link.ts:27 (back-link validator)
  • src/core/output/validators/back-link.ts:37 (back-link validator)
  • src/commands/migrate-engine.ts:234 (engine migration)

None of those need the source ids. Widening the interface across all five callers for a feature only extract --dry-run consumes is the wrong shape. Inline SQL keeps the blast radius at the two extractor functions.

Test plan

4 new cases in test/extract-db.test.ts:

  • links: dry-run after a real run reports zero new add_link rows
  • links: dry-run reports a newly-added candidate after the page body changes
  • timeline: dry-run after a real run reports zero new add_timeline rows
  • timeline: dry-run reports a newly-added entry after timeline section changes

Each pair pins both halves of the contract — the dedup must remove "already in DB" candidates AND must not eat genuinely new candidates.

Run results:

```
$ bun run typecheck
$ tsc --noEmit
(clean)

$ bun test test/extract-db.test.ts
 16 pass
 0 fail
 36 expect() calls
Ran 16 tests across 1 file. [1.79s]
```

Scope

Unchanged: real-run output, JSON shape, command-line surface, engine APIs.

Only changes:

  • src/commands/extract.ts — adds 2 cached helpers (existingLinkKeysForFrom, existingTimelineKeysForSlug) and a 4-line continue guard inside each extractor's dry-run branch.
  • test/extract-db.test.ts — 4 new tests.
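For orientation, the guard's logic reduces to a filter like this (a sketch under my own naming, not the PR's actual diff):

```typescript
// Illustrative shape of the dry-run guard: drop candidates whose key is
// already in the DB-backed set, and dedupe within the run, so dry-run
// counts only net-new rows.
function netNewCandidates(candidateKeys: string[], existing: Set<string>): string[] {
  const seen = new Set<string>();
  const netNew: string[] = [];
  for (const key of candidateKeys) {
    if (existing.has(key)) continue; // already a DB row: not net-new
    if (seen.has(key)) continue;     // duplicate candidate in this run
    seen.add(key);
    netNew.push(key);
  }
  return netNew;
}
```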

🤖 Generated with Claude Code



…garrytan#397)

`gbrain extract links|timeline --source db --dry-run` reports the
candidate count, not the would-be net-new row count: a second run after
the first one populated `links` / `timeline_entries` reports the full
candidate set again, even though every row already exists. `ON CONFLICT
DO NOTHING` in `addLinksBatch` / `addTimelineEntriesBatch` would silently
no-op all of them on a real run, so the dry-run output overshoots by
exactly that amount.

This was the original problem garrytan#397 set out to fix. That PR landed before
v0.32.8's multi-source threading (garrytan#860) reshaped both extractors. The
old patch can't cherry-pick onto master:

- candidate dedup keys now carry source ids (6 segments: from_source_id ::
  from_slug :: to_source_id :: to_slug :: link_type :: link_source).
- `engine.getLinks()` does not return `f.source_id` / `t.source_id`,
  and `Link` has no source-id fields. Comparing a 6-segment candidate
  key against a 5-segment DB row key would false-positive cross-source
  rows.

Fix: inline a multi-source-aware SQL query per (from_slug, from_source_id)
or (slug, source_id) tuple, build a key set with the exact same 6/4
segment shape the candidate path emits, and skip dry-run candidates whose
key matches a DB row. Cached per from-page (and per from-slug for
timeline) so re-iteration over the same page costs one round-trip.

Why inline SQL instead of extending `engine.getLinks()`: that path
would change the public `Link` interface and ripple to five callers
outside extract.ts (operations.ts × 2, back-link.ts × 2, migrate-engine.ts × 1).
The inline query stays scoped to the extractor and keeps the blast
radius at the two affected functions.

Tests in test/extract-db.test.ts (4 new cases):

  links: dry-run after a real run reports zero new add_link rows
  links: dry-run reports a newly-added candidate after the page body changes
  timeline: dry-run after a real run reports zero new add_timeline rows
  timeline: dry-run reports a newly-added entry after timeline section changes

Each pair pins both halves of the contract — the dedup must remove
"already in DB" candidates AND must not eat genuinely new candidates.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
