Content dedup on import: re-import is fully idempotent by SiteRelEnby · Pull Request #133 · sheaf-project/sheaf

SiteRelEnby · 2026-06-11T20:07:53Z

Follow-up to #131 (member dedup) and #350. Member dedup stopped re-imports from doubling the roster, but everything members own still appended: groups, fronts, journals, messages, polls, reminders, and notification config all duplicated on a second import. This extends exact-match dedup to every content section across all six importers, so importing the same export twice adds nothing the second time.

Matching

Titles and bodies are encrypted at rest (non-deterministically), so content rows match on plaintext columns that round-trip the export unchanged: tags, groups, and custom-field definitions by name; fronts by (started_at, ended_at, resolved member set); journals by (member_id, created_at); revisions by (target type, resolved target, created_at); messages by (board_member_id, created_at); polls, reminders, watch tokens, and channels by created_at. created_at works as a key because every importer preserves the source timestamp. Member-linked keys line up because the member-dedup pass resolves skipped members onto their existing rows first.
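As a rough illustration of matching on plaintext keys, here is a minimal Python sketch for the fronts section, where keys are expected to be unique. The `Front` dataclass, `front_key`, and `split_new_and_skipped` are hypothetical names invented for this example, not the PR's code, and it assumes member ids have already been resolved onto existing rows by the member-dedup pass.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Front:
    # Hypothetical row shape; only the plaintext columns used for matching.
    started_at: datetime
    ended_at: Optional[datetime]
    member_ids: frozenset  # assumed already resolved onto existing member rows


def front_key(front: Front) -> tuple:
    """Plaintext match key: (started_at, ended_at, resolved member set)."""
    return (front.started_at, front.ended_at, front.member_ids)


def split_new_and_skipped(existing, incoming):
    """Return (rows_to_create, skipped_count) using exact-match keys."""
    seen = {front_key(f) for f in existing}
    to_create, skipped = [], 0
    for front in incoming:
        if front_key(front) in seen:
            skipped += 1
        else:
            to_create.append(front)
    return to_create, skipped
```

Using a `frozenset` for the member set makes the key independent of member order, so the same front matches regardless of how the export lists its members.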

Dedup is active under both skip and update strategies (an in-place update of an exact-match row is meaningless by construction); create keeps the old append-everything behaviour. Every section reports a *_skipped count on the import detail page.

The interesting bit: counted skips

Plain key-set dedup is wrong for message-like sections in two ways: foreign chat timestamps (PluralSpace, Prism) are second-precision, so distinct messages legitimately share a key, and Postgres freezes now() per transaction, so sibling rows created in one import share created_at exactly. A key-set would silently drop real rows on first import. The new CountedIndex consumes one existing occurrence per skip instead: a re-import skips exactly as many rows per key as the system already holds and creates the rest. Idempotent across runs, lossless on the first pass. The native importer pairs it with a row map so a new reply still threads onto a skipped parent.
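The counting idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the module's actual code: only the name `CountedIndex` comes from the description above, while the `consume` method, the `import_messages` helper, and the sample keys are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


class CountedIndex:
    """Occurrence-counted match index (sketch; the method name is an assumption)."""

    def __init__(self, existing_keys: Iterable[Hashable]) -> None:
        # One count per row the system already holds for each key.
        self._remaining = Counter(existing_keys)

    def consume(self, key: Hashable) -> bool:
        """Return True (skip) while an unconsumed existing occurrence remains."""
        if self._remaining[key] > 0:
            self._remaining[key] -= 1
            return True
        return False


def import_messages(db_keys, incoming_keys):
    """Hypothetical importer loop: returns (created_keys, skipped_count)."""
    index = CountedIndex(db_keys)
    created, skipped = [], 0
    for key in incoming_keys:
        if index.consume(key):
            skipped += 1
        else:
            created.append(key)
    return created, skipped


# Two distinct messages share a second-precision key.
source = [("board-1", "12:00:05"), ("board-1", "12:00:05"), ("board-1", "12:00:09")]
first = import_messages([], source)         # empty system: all three are created
second = import_messages(first[0], source)  # re-import: all three are skipped
```

Because the index is built from the rows that existed before the run and is never fed the rows created during it, two incoming rows with the same key are both created on a first import, which is the case a plain key-set gets wrong.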

Latent crash fix

Pre-seeding the association guards from the DB also fixes a real bug: re-importing custom-field values into a system that already had them violated the UNIQUE(field_id, member_id) constraint in the SimplyPlural, PluralSpace, and Prism importers (group_members composite PK in Prism too) and failed the whole job. Regression-tested.
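A pre-seeded pair guard might look roughly like the following. This sketch uses an in-memory SQLite table as a stand-in for the real Postgres schema; the `PairGuard` class, the `custom_field_values` table layout, and `import_values` are assumptions made for illustration.

```python
import sqlite3


class PairGuard:
    """Tracks (left_id, right_id) pairs that already exist in an association table."""

    def __init__(self, existing_pairs=()):
        self._seen = set(existing_pairs)

    def claim(self, left_id, right_id) -> bool:
        """Return True if the pair is new and may be inserted, False otherwise."""
        pair = (left_id, right_id)
        if pair in self._seen:
            return False
        self._seen.add(pair)
        return True


# Stand-in schema (assumed): one value per (field_id, member_id).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE custom_field_values ("
    " field_id INTEGER, member_id INTEGER, value TEXT,"
    " UNIQUE(field_id, member_id))"
)
conn.execute("INSERT INTO custom_field_values VALUES (1, 10, 'existing')")


def import_values(conn, rows):
    """Insert only pairs the guard has not seen; returns (created, skipped)."""
    # Pre-seed from the database so pairs from earlier imports are known.
    guard = PairGuard(
        conn.execute("SELECT field_id, member_id FROM custom_field_values")
    )
    created = skipped = 0
    for field_id, member_id, value in rows:
        if guard.claim(field_id, member_id):
            conn.execute(
                "INSERT INTO custom_field_values VALUES (?, ?, ?)",
                (field_id, member_id, value),
            )
            created += 1
        else:
            skipped += 1
    return created, skipped


rows = [(1, 10, "dup"), (2, 10, "new")]
first = import_values(conn, rows)
second = import_values(conn, rows)
```

With a guard that starts empty, the first row would hit the UNIQUE constraint and raise `sqlite3.IntegrityError`, which corresponds to the failure mode described above.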

Invariants

Skip never mutates an existing row: parent links and reply pointers are only written onto rows created in the current run, so a skipped group keeps its hierarchy and a skipped channel keeps its rules.
Rows without a created_at always create (no guessing).
All loaders are read-only and system-scoped (revisions scope by user, matching the model).
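The first two invariants, together with the row map mentioned for the native importer, can be sketched as follows. The `Message` shape, `import_thread`, and the parent-first ordering of the incoming rows are all assumptions made for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    # Hypothetical row shape: a local id, a match key, and a reply pointer.
    id: int
    key: Optional[tuple]
    reply_to_id: Optional[int] = None


def import_thread(existing, incoming):
    """incoming: (source_id, key, source_parent_id) tuples, parents first.

    Returns the rows created in this run; rows in `existing` are never modified.
    """
    by_key = defaultdict(deque)
    for row in existing:
        by_key[row.key].append(row)
    next_id = max((r.id for r in existing), default=0) + 1
    row_map = {}  # source id -> local row, whether skipped or created
    created = []
    for source_id, key, source_parent in incoming:
        # Rows without a key (no created_at) always create.
        if key is not None and by_key[key]:
            # Skip: remember which existing row this maps to, but do not touch it.
            row_map[source_id] = by_key[key].popleft()
            continue
        parent = row_map.get(source_parent)
        row = Message(next_id, key, parent.id if parent else None)
        next_id += 1
        row_map[source_id] = row
        created.append(row)
    return created
```

Reply pointers are only ever set in the constructor call for a new row, so a skipped parent keeps whatever links it already had while a new reply can still thread onto it through `row_map`.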

Known limitation

Prism media attachments still duplicate blobs on re-import; attachments have no stable identity in the source format short of content-hashing the bytes. Out of scope here.

Tests

New idempotence tests per importer, headlined by a native all-sections round-trip: build a system exercising every export section, import it twice, and assert that run 2 imports zero rows everywhere, with exact skip counts and unchanged API row counts. Plus PluralKit (groups + switch history), Tupperbox (groups), SimplyPlural (crash regression + front history), PluralSpace, and Prism re-import tests. Full importer + parity sweep: 138 passed.
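The shape of such a round-trip test might look like this toy version. It runs against an in-memory dictionary rather than the real importers or API, and `run_import`, the section names, and the sample keys are all hypothetical.

```python
from collections import Counter


def run_import(store, export):
    """Toy importer: appends new rows to `store`, returns {section: (imported, skipped)}."""
    report = {}
    for section, keys in export.items():
        rows = store.setdefault(section, [])
        remaining = Counter(rows)  # rows held before this run
        imported = skipped = 0
        for key in keys:
            if remaining[key] > 0:
                remaining[key] -= 1
                skipped += 1
            else:
                rows.append(key)
                imported += 1
        report[section] = (imported, skipped)
    return report


def test_reimport_is_idempotent():
    export = {
        "groups": ["alpha", "beta"],
        "messages": [("board-1", "12:00:05"), ("board-1", "12:00:05")],
        "polls": ["2026-01-01T00:00:00Z"],
    }
    store = {}
    first = run_import(store, export)
    counts_after_first = {s: len(rows) for s, rows in store.items()}

    second = run_import(store, export)

    # Run 1 creates everything, including rows that share a key.
    assert first == {s: (len(keys), 0) for s, keys in export.items()}
    # Run 2 imports nothing, with exact skip counts per section.
    assert second == {s: (0, len(keys)) for s, keys in export.items()}
    # Row counts are unchanged by the second run.
    assert {s: len(rows) for s, rows in store.items()} == counts_after_first


test_reimport_is_idempotent()
```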

Member dedup left everything members own appending on re-import. This extends exact-match dedup to all content sections across all six importers: tags and groups by name, fronts by (interval, member set), journals, revisions, messages, polls, reminders, and notification config by the source timestamps every importer preserves. Active under both skip and update strategies; create keeps append-everything. New shared module import_content_dedup provides the match indexes, an occurrence-counted index for message keys (coarse source timestamps mean distinct messages can legitimately share a key, and Postgres freezes now() per transaction, so plain key-sets drop real rows), and pre-seeded pair guards for association tables. The pair guards also fix a latent crash: re-importing custom-field values into a system that already had them violated the UNIQUE(field_id, member_id) constraint in the SimplyPlural, PluralSpace, and Prism importers (and group_members in Prism), failing the whole job. Skipped rows are never mutated: parent links and reply pointers are only written onto rows created in the current run. Every section reports a *_skipped count on the import detail page. Closes #350

SiteRelEnby added 3 commits June 11, 2026 11:24

Merge remote-tracking branch 'origin/main' into import-content-dedup

9a6ba25

Merge remote-tracking branch 'origin/main' into import-content-dedup

603ca03

SiteRelEnby enabled auto-merge June 11, 2026 20:16

SiteRelEnby merged commit ed6136e into main Jun 11, 2026
4 checks passed

SiteRelEnby deleted the import-content-dedup branch June 11, 2026 20:27
