Content dedup on import: re-import is fully idempotent #133
Merged
Member dedup left everything members own appending on re-import. This extends exact-match dedup to all content sections across all six importers: tags and groups by name, fronts by (interval, member set), journals, revisions, messages, polls, reminders, and notification config by the source timestamps every importer preserves. Active under both skip and update strategies; create keeps append-everything. New shared module import_content_dedup provides the match indexes, an occurrence-counted index for message keys (coarse source timestamps mean distinct messages can legitimately share a key, and Postgres freezes now() per transaction, so plain key-sets drop real rows), and pre-seeded pair guards for association tables. The pair guards also fix a latent crash: re-importing custom-field values into a system that already had them violated the UNIQUE(field_id, member_id) constraint in the SimplyPlural, PluralSpace, and Prism importers (and group_members in Prism), failing the whole job. Skipped rows are never mutated: parent links and reply pointers are only written onto rows created in the current run. Every section reports a *_skipped count on the import detail page. Closes #350
Follow-up to #131 (member dedup) and #350. Member dedup stopped re-imports from doubling the roster, but everything members own still appended: groups, fronts, journals, messages, polls, reminders, and notification config all duplicated on a second import. This extends exact-match dedup to every content section across all six importers, so importing the same export twice adds nothing the second time.
Matching
Titles and bodies are encrypted at rest (non-deterministically), so content rows match on plaintext columns that round-trip the export unchanged: tags, groups, and custom-field definitions by name; fronts by (started_at, ended_at, resolved member set); journals by (member_id, created_at); revisions by (target type, resolved target, created_at); messages by (board_member_id, created_at); polls, reminders, watch tokens, and channels by created_at. created_at works as a key because every importer preserves the source timestamp. Member-linked keys line up because the member-dedup pass resolves skipped members onto their existing rows first.
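As a rough illustration of the front key, the sketch below builds a (started_at, ended_at, member set) key and checks an incoming row against an index of existing ones. FrontRow, front_key, and the source_to_local map are hypothetical names, not taken from import_content_dedup; the map stands in for the member-dedup pass that resolves skipped members onto existing rows.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FrontRow:
    """Hypothetical plaintext view of a front (not the real model)."""
    started_at: datetime
    ended_at: Optional[datetime]
    member_ids: tuple  # member ids as they appear on the row


def front_key(row, resolve=None):
    """Build an order-insensitive match key for a front.

    `resolve` maps source member ids to existing local ids (assumed to come
    from the member-dedup pass). Ids without a mapping pass through unchanged.
    """
    resolve = resolve or {}
    members = frozenset(resolve.get(m, m) for m in row.member_ids)
    return (row.started_at, row.ended_at, members)


# Rows already in the system use local ids.
existing = [FrontRow(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 12), (10, 11))]
index = {front_key(r) for r in existing}

# The incoming row lists the same members in another order, under source ids.
source_to_local = {"a": 10, "b": 11}
incoming = FrontRow(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 12), ("b", "a"))
is_duplicate = front_key(incoming, source_to_local) in index
```

Using a frozenset for the member set keeps the key independent of member order, which is one plausible reading of "resolved member set".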
Dedup is active under both skip and update strategies (an in-place update of an exact-match row is meaningless by construction); create keeps the old append-everything behaviour. Every section reports a *_skipped count on the import detail page.
The interesting bit: counted skips
Plain key-set dedup is wrong for message-like sections in two ways: foreign chat timestamps (PluralSpace, Prism) are second-precision, so distinct messages legitimately share a key, and Postgres freezes now() per transaction, so sibling rows created in one import share created_at exactly. A key-set would silently drop real rows on first import. The new CountedIndex consumes one existing occurrence per skip instead: a re-import skips exactly as many rows per key as the system already holds and creates the rest. Idempotent across runs, lossless on the first pass. The native importer pairs it with a row map so a new reply still threads onto a skipped parent.
Latent crash fix
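The counted-skip behaviour described above under "counted skips" can be sketched as follows. This is a minimal illustration built on collections.Counter, not the actual CountedIndex from this PR: the consume method, the import_messages helper, and the toy (board_member_id, created_at) keys are all assumptions made for the example.

```python
from collections import Counter


class CountedIndex:
    """Occurrence-counted match index (illustrative, not the real class)."""

    def __init__(self, existing_keys):
        # One count per row that existed before this import run started.
        self._remaining = Counter(existing_keys)

    def consume(self, key):
        """Return True (skip) if an unconsumed existing occurrence remains."""
        if self._remaining[key] > 0:
            self._remaining[key] -= 1
            return True
        return False


def import_messages(stored, incoming):
    """Append the rows from `incoming` that `stored` does not already hold."""
    index = CountedIndex(stored)  # snapshot taken before any row is created
    created = skipped = 0
    for key in incoming:
        if index.consume(key):
            skipped += 1
        else:
            stored.append(key)
            created += 1
    return created, skipped


# Two distinct messages share a second-precision key.
export = [("m1", "12:00:05"), ("m1", "12:00:05"), ("m1", "12:00:09")]
store = []
first = import_messages(store, export)   # (3, 0): nothing dropped on first import
second = import_messages(store, export)  # (0, 3): re-import adds nothing
```

A plain set of keys would have created only two rows on the first run, because the second ("m1", "12:00:05") would look like a duplicate of the first.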
Pre-seeding the association guards from the DB also fixes a real bug: re-importing custom-field values into a system that already had them violated the UNIQUE(field_id, member_id) constraint in the SimplyPlural, PluralSpace, and Prism importers (group_members composite PK in Prism too) and failed the whole job. Regression-tested.
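A minimal sketch of a pre-seeded pair guard, assuming a guard that is simply a set of (field_id, member_id) pairs loaded from the database before the import loop. SQLite stands in for Postgres here so the example runs on its own; the field_values table and the import_values helper are hypothetical names, not the project's schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE field_values ("
    " field_id INTEGER, member_id INTEGER, value TEXT,"
    " UNIQUE (field_id, member_id))"
)
# Row left behind by an earlier import.
conn.execute("INSERT INTO field_values VALUES (1, 10, 'blue')")


def import_values(conn, rows):
    """Insert custom-field values, skipping pairs that already exist."""
    # Pre-seed the guard from the table, not only from pairs seen in this run.
    seen = set(conn.execute("SELECT field_id, member_id FROM field_values"))
    created = skipped = 0
    for field_id, member_id, value in rows:
        pair = (field_id, member_id)
        if pair in seen:
            skipped += 1  # the existing row is left untouched
            continue
        conn.execute(
            "INSERT INTO field_values VALUES (?, ?, ?)",
            (field_id, member_id, value),
        )
        seen.add(pair)
        created += 1
    return created, skipped


result = import_values(conn, [(1, 10, "blue"), (2, 10, "tall")])  # (1, 1)
```

With an empty guard, the first row would reach the INSERT and raise an integrity error on the UNIQUE constraint, which is the failure mode the paragraph above describes.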
Invariants
Known limitation
Prism media attachments still duplicate blobs on re-import; attachments have no stable identity in the source format short of content-hashing the bytes. Out of scope here.
Tests
New idempotence tests per importer, headlined by a native all-sections round-trip: build a system exercising every export section, import twice, assert run 2 imports zero rows everywhere with exact skip counts and unchanged API row counts. Plus PK (groups + switch history), TB (groups), SP (crash regression + front history), PluralSpace, and Prism re-import tests. Full importer + parity sweep: 138 passed.