fix(extract): dry-run reports net-new rows, not raw candidates (closes #397) #914
Open
vinsew wants to merge 1 commit into
…garrytan#397)

`gbrain extract links|timeline --source db --dry-run` reports the candidate count, not the would-be net-new row count: a second run after the first one populated `links` / `timeline_entries` reports the full candidate set again, even though every row already exists. `ON CONFLICT DO NOTHING` in `addLinksBatch` / `addTimelineEntriesBatch` would silently no-op all of them on a real run, so the dry-run output overshoots by exactly that amount.

This was the original problem garrytan#397 set out to fix. That PR landed before v0.32.8's multi-source threading (garrytan#860) reshaped both extractors. The old patch can't cherry-pick onto master:

- candidate dedup keys now carry source ids (6 segments: from_source_id :: from_slug :: to_source_id :: to_slug :: link_type :: link_source).
- `engine.getLinks()` does not return `f.source_id` / `t.source_id`, and `Link` has no source-id fields. Comparing a 6-segment candidate key against a 5-segment DB row key would false-positive cross-source rows.

Fix: inline a multi-source-aware SQL query per (from_slug, from_source_id) or (slug, source_id) tuple, build a key set with the exact same 6/4 segment shape the candidate path emits, and skip dry-run candidates whose key matches a DB row. Cached per from-page (and per from-slug for timeline) so re-iteration over the same page costs one round-trip.

Why inline SQL instead of extending `engine.getLinks()`: that path would change the public `Link` interface and ripple to 5 callers outside extract.ts (operations.ts × 2, back-link.ts × 2, migrate-engine.ts × 1). The inline query stays scoped to the extractor and keeps the blast radius at the two affected functions.

Tests in test/extract-db.test.ts (4 new cases):

- links: dry-run after a real run reports zero new add_link rows
- links: dry-run reports a newly-added candidate after the page body changes
- timeline: dry-run after a real run reports zero new add_timeline rows
- timeline: dry-run reports a newly-added entry after timeline section changes

Each pair pins both halves of the contract: the dedup must remove "already in DB" candidates AND must not eat genuinely new candidates.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
`gbrain extract links|timeline --source db --dry-run` reports the candidate count, not the would-be net-new row count. On the second run after a real extract has populated `links` / `timeline_entries`, every candidate already exists in the DB, but dry-run still reports them all as "would create N." `ON CONFLICT DO NOTHING` in `addLinksBatch` / `addTimelineEntriesBatch` would silently no-op the lot on a real run.

This is the problem #397 set out to fix. That PR landed before v0.32.8's multi-source threading (#860) reshaped both extractors, so the old patch can't be cherry-picked onto master:

- Candidate dedup keys now carry source ids (6 segments: `from_source_id::from_slug::to_source_id::to_slug::link_type::link_source`).
- `engine.getLinks()` does not return `f.source_id` / `t.source_id`, and `Link` has no source-id fields. Comparing a 6-segment candidate key against a 5-segment DB row key would false-positive cross-source rows.

This PR closes #397.
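To make the false-positive risk concrete, here is a minimal sketch. The `LinkRow` type, both key helpers, and the sample values are hypothetical, and the source-blind key simply drops the two source ids; the exact layout of the older key is not given in this description.

```typescript
// Hypothetical row shape for illustration; this is not the project's `Link` interface.
interface LinkRow {
  fromSourceId: string;
  fromSlug: string;
  toSourceId: string;
  toSlug: string;
  linkType: string;
  linkSource: string;
}

// Source-aware key: the six segments listed above, joined with "::".
function sourceAwareKey(r: LinkRow): string {
  return [r.fromSourceId, r.fromSlug, r.toSourceId, r.toSlug, r.linkType, r.linkSource].join("::");
}

// Source-blind key (assumed layout): the same fields minus both source ids.
function sourceBlindKey(r: LinkRow): string {
  return [r.fromSlug, r.toSlug, r.linkType, r.linkSource].join("::");
}

// A row already stored for source "src-a".
const inDb: LinkRow = {
  fromSourceId: "src-a",
  fromSlug: "people/alice",
  toSourceId: "src-a",
  toSlug: "companies/acme",
  linkType: "works_at",
  linkSource: "markdown",
};

// Same slugs, different source: a distinct link that a real run would insert.
const candidate: LinkRow = { ...inDb, fromSourceId: "src-b", toSourceId: "src-b" };

// A source-blind comparison treats the candidate as already present (false positive).
const blindCollision = sourceBlindKey(inDb) === sourceBlindKey(candidate);
// A source-aware comparison keeps the two links apart.
const awareCollision = sourceAwareKey(inDb) === sourceAwareKey(candidate);

console.log({ blindCollision, awareCollision });
```

With a source-blind comparison the dry-run would undercount by hiding the `src-b` link, which is the opposite error from the overshoot this PR fixes.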
Approach
Per-extractor inline SQL that returns source ids alongside the link / timeline shape, plus a per-(from-page, source) cache.
The key shape is byte-for-byte identical to the candidate dedup key the extractor already builds (extract.ts line ~872), so the existing/seen sets compose cleanly.
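The caching and filtering described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `extract.ts`: `LinkKeyParts`, `makeExistingKeyLookup`, `netNewCandidates`, and the injected `fetchExisting` function are hypothetical names, and the fetch is synchronous here only to keep the sketch short (a real database query would be asynchronous).

```typescript
// Hypothetical shape carrying the six key segments.
interface LinkKeyParts {
  fromSourceId: string;
  fromSlug: string;
  toSourceId: string;
  toSlug: string;
  linkType: string;
  linkSource: string;
}

// Stand-in for the inline SQL query: rows already stored for one from-page.
type FetchExisting = (fromSlug: string, fromSourceId: string) => LinkKeyParts[];

// Same six-segment shape for DB rows and candidates, so the sets compose.
function linkKey(l: LinkKeyParts): string {
  return [l.fromSourceId, l.fromSlug, l.toSourceId, l.toSlug, l.linkType, l.linkSource].join("::");
}

// Returns a lookup that fetches each (from_slug, from_source_id) pair at most once.
function makeExistingKeyLookup(fetchExisting: FetchExisting) {
  const cache = new Map<string, Set<string>>();
  return (fromSlug: string, fromSourceId: string): Set<string> => {
    const cacheKey = `${fromSourceId}::${fromSlug}`;
    let keys = cache.get(cacheKey);
    if (keys === undefined) {
      keys = new Set(fetchExisting(fromSlug, fromSourceId).map((row) => linkKey(row)));
      cache.set(cacheKey, keys);
    }
    return keys;
  };
}

// Dry-run filter: drop candidates whose key already exists in the DB.
function netNewCandidates(
  candidates: LinkKeyParts[],
  lookup: (fromSlug: string, fromSourceId: string) => Set<string>,
): LinkKeyParts[] {
  const fresh: LinkKeyParts[] = [];
  for (const c of candidates) {
    if (lookup(c.fromSlug, c.fromSourceId).has(linkKey(c))) continue;
    fresh.push(c);
  }
  return fresh;
}

// Usage with an in-memory "table" and a counter for simulated round-trips.
const base = { fromSourceId: "src-a", fromSlug: "people/alice", toSourceId: "src-a", linkType: "mentions", linkSource: "markdown" };
const stored: LinkKeyParts[] = [{ ...base, toSlug: "companies/acme" }];
let roundTrips = 0;
const lookup = makeExistingKeyLookup((fromSlug, fromSourceId) => {
  roundTrips++;
  return stored.filter((r) => r.fromSlug === fromSlug && r.fromSourceId === fromSourceId);
});

const fresh = netNewCandidates(
  [{ ...base, toSlug: "companies/acme" }, { ...base, toSlug: "companies/initech" }],
  lookup,
);
console.log(fresh.map((c) => c.toSlug), roundTrips);
```

Both candidates share one from-page, so the second lookup is served from the cache and the simulated query runs once.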
Why inline SQL instead of extending `engine.getLinks()`
The cleaner-looking alternative is to add `from_source_id` / `to_source_id` to `Link` and to `getLinks`' SELECT. That ripples to 5 callers outside extract.ts:

- `src/core/operations.ts:734` (auto-link reconciliation)
- `src/core/operations.ts:1418` (per-page link API)
- `src/core/output/validators/back-link.ts:27` (back-link validator)
- `src/core/output/validators/back-link.ts:37` (back-link validator)
- `src/commands/migrate-engine.ts:234` (engine migration)

None of those need the source ids. Widening the interface across all five callers for a feature only `extract --dry-run` consumes is the wrong shape. Inline SQL keeps the blast radius at the two extractor functions.
Test plan
4 new cases in `test/extract-db.test.ts`:

- `links: dry-run after a real run reports zero new add_link rows`
- `links: dry-run reports a newly-added candidate after the page body changes`
- `timeline: dry-run after a real run reports zero new add_timeline rows`
- `timeline: dry-run reports a newly-added entry after timeline section changes`

Each pair pins both halves of the contract: the dedup must remove "already in DB" candidates AND must not eat genuinely new candidates.
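The two halves of that contract can be modeled with an in-memory table. This is only a sketch: `FakeTable`, `insertIgnore`, `dryRunCount`, and the sample keys are hypothetical, and the real tests run against the database rather than a `Set`.

```typescript
// In-memory stand-in for a table with a unique key. `insertIgnore` mimics
// INSERT ... ON CONFLICT DO NOTHING by skipping keys that already exist.
class FakeTable {
  private keys = new Set<string>();

  insertIgnore(batch: string[]): number {
    let inserted = 0;
    for (const k of batch) {
      if (!this.keys.has(k)) {
        this.keys.add(k);
        inserted++;
      }
    }
    return inserted;
  }

  has(k: string): boolean {
    return this.keys.has(k);
  }
}

// Dry-run count: distinct candidates that are not already in the table.
function dryRunCount(table: FakeTable, candidates: string[]): number {
  return new Set(candidates.filter((k) => !table.has(k))).size;
}

const table = new FakeTable();
const firstPass = [
  "src-a::people/alice::src-a::companies/acme::mentions::markdown",
  "src-a::people/alice::src-a::companies/initech::mentions::markdown",
];

const beforeRealRun = dryRunCount(table, firstPass); // nothing stored yet
table.insertIgnore(firstPass); // the "real run"
const afterRealRun = dryRunCount(table, firstPass); // every row already exists

// The page body changes and yields one extra candidate.
const secondPass = [...firstPass, "src-a::people/alice::src-a::companies/globex::mentions::markdown"];
const afterEdit = dryRunCount(table, secondPass);

console.log({ beforeRealRun, afterRealRun, afterEdit });
```

The property worth pinning is that the dry-run count equals the number of rows a real run would actually insert at that moment.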
Run results:
Scope
Unchanged: real-run output, JSON shape, command-line surface, engine APIs.
Only changes:

- `src/commands/extract.ts`: adds 2 cached helpers (`existingLinkKeysForFrom`, `existingTimelineKeysForSlug`) and a 4-line `continue` guard inside each extractor's dry-run branch.
- `test/extract-db.test.ts`: 4 new tests.

🤖 Generated with Claude Code