Content dedup on import: re-import is fully idempotent #133

Merged. SiteRelEnby merged 3 commits into main from import-content-dedup on Jun 11, 2026.

@SiteRelEnby

Contributor

Follow-up to #131 (member dedup) and #350. Member dedup stopped re-imports from doubling the roster, but everything members own still appended: groups, fronts, journals, messages, polls, reminders, and notification config all duplicated on a second import. This extends exact-match dedup to every content section across all six importers, so importing the same export twice adds nothing the second time.

Matching

Titles and bodies are encrypted at rest (non-deterministically), so content rows match on plaintext columns that round-trip the export unchanged:

  • tags, groups, and custom-field definitions: name
  • fronts: (started_at, ended_at, resolved member set)
  • journals: (member_id, created_at)
  • revisions: (target type, resolved target, created_at)
  • messages: (board_member_id, created_at)
  • polls, reminders, watch tokens, and channels: created_at

created_at works as a key because every importer preserves the source timestamp. Member-linked keys line up because the member-dedup pass resolves skipped members onto their existing rows first.
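As an illustration of the front key, here is a minimal sketch of how an incoming record could be keyed after member resolution. The function names, the record shape, and the `member_map` dictionary are hypothetical and not taken from the PR; the only point is that the member set is compared after mapping source ids onto existing row ids, and that member order does not matter.

```python
from datetime import datetime, timezone


def front_key(started_at, ended_at, member_ids):
    """Key a front by its interval plus an order-independent member set."""
    return (started_at, ended_at, frozenset(member_ids))


def incoming_front_key(record, member_map):
    """Resolve source member ids to existing row ids before building the key.

    member_map is a hypothetical {source_id: existing_row_id} mapping of the
    kind a member-dedup pass would produce.
    """
    resolved = (member_map[m] for m in record["members"])
    return front_key(record["started_at"], record["ended_at"], resolved)


t0 = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
t1 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)

# Keys already present in the system, built from existing rows.
existing = {front_key(t0, t1, [11, 12])}

member_map = {"src-a": 11, "src-b": 12}
record = {"started_at": t0, "ended_at": t1, "members": ["src-b", "src-a"]}

is_duplicate = incoming_front_key(record, member_map) in existing
still_open = dict(record, ended_at=None)
is_open_duplicate = incoming_front_key(still_open, member_map) in existing
```

With these inputs the first record matches the existing front, while the same record with no `ended_at` produces a different key and would be created.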

Dedup is active under both skip and update strategies (an in-place update of an exact-match row is meaningless by construction); create keeps the old append-everything behaviour. Every section reports a *_skipped count on the import detail page.

The interesting bit: counted skips

Plain key-set dedup is wrong for message-like sections for two reasons: foreign chat timestamps (PluralSpace, Prism) are second-precision, so distinct messages legitimately share a key, and Postgres freezes now() per transaction, so sibling rows created in one import share created_at exactly. A key-set would silently drop real rows on first import. The new CountedIndex instead consumes one existing occurrence per skip: a re-import skips exactly as many rows per key as the system already holds and creates the rest. This is idempotent across runs and lossless on the first pass. The native importer pairs it with a row map so a new reply still threads onto a skipped parent.
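The counting idea can be sketched in a few lines. This is not the PR's CountedIndex, only a minimal stand-in built on `collections.Counter`; the `consume` method, the `import_messages` helper, and the list-of-keys store are assumptions made for illustration.

```python
from collections import Counter


class CountedIndex:
    """Occurrence-counted match index (illustrative stand-in, not the PR's class)."""

    def __init__(self, existing_keys):
        # One count per row the system already holds for each key.
        self._remaining = Counter(existing_keys)

    def consume(self, key):
        """Return True (skip) if an unconsumed existing row matches key."""
        if key is None:
            # Rows without a usable key always create.
            return False
        if self._remaining[key] > 0:
            self._remaining[key] -= 1
            return True
        return False


def import_messages(store, incoming):
    """Toy importer: store is a list of keys standing in for message rows."""
    index = CountedIndex(store)
    created = skipped = 0
    for key in incoming:
        if index.consume(key):
            skipped += 1
        else:
            store.append(key)
            created += 1
    return created, skipped


# Two distinct messages share a second-precision timestamp.
export = [("board-1", "10:00:00"), ("board-1", "10:00:00"), ("board-1", "10:00:05")]

store = []
first = import_messages(store, export)    # (3, 0): nothing dropped on first import
second = import_messages(store, export)   # (0, 3): re-import adds nothing
third = import_messages(store, export + [("board-1", "10:00:00")])  # (1, 3)
```

A plain set built from the same keys would have treated the second `10:00:00` message as a duplicate of the first during the initial import, which is the data loss the counted approach avoids.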

Latent crash fix

Pre-seeding the association guards from the DB also fixes a real bug: re-importing custom-field values into a system that already had them violated the UNIQUE(field_id, member_id) constraint in the SimplyPlural, PluralSpace, and Prism importers (and the group_members composite primary key in Prism) and failed the whole job. Regression-tested.
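A small reproduction of the guard pattern, using an in-memory SQLite table as a stand-in for the real Postgres schema. The table name `field_values`, its columns beyond `(field_id, member_id)`, and both helper functions are hypothetical; the sketch only shows why seeding the guard from existing rows keeps a re-import from hitting the unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE field_values ("
    " field_id INTEGER NOT NULL,"
    " member_id INTEGER NOT NULL,"
    " value TEXT,"
    " UNIQUE (field_id, member_id))"
)
# A value that already exists before the import runs.
conn.execute("INSERT INTO field_values VALUES (1, 10, 'she/her')")


def load_pair_guard(conn):
    """Read-only loader: every (field_id, member_id) pair already stored."""
    rows = conn.execute("SELECT field_id, member_id FROM field_values")
    return set(rows.fetchall())


def import_field_values(conn, incoming):
    guard = load_pair_guard(conn)  # pre-seeded from the DB, not started empty
    created = skipped = 0
    for field_id, member_id, value in incoming:
        pair = (field_id, member_id)
        if pair in guard:
            skipped += 1
            continue
        conn.execute(
            "INSERT INTO field_values VALUES (?, ?, ?)",
            (field_id, member_id, value),
        )
        guard.add(pair)
        created += 1
    return created, skipped


incoming = [(1, 10, "she/her"), (1, 11, "they/them")]
first = import_field_values(conn, incoming)    # (1, 1)
second = import_field_values(conn, incoming)   # (0, 2)
row_count = conn.execute("SELECT COUNT(*) FROM field_values").fetchone()[0]
```

If the guard started as an empty set, the first insert of `(1, 10)` would raise `sqlite3.IntegrityError`, which mirrors the failure mode described above.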

Invariants

  • Skip never mutates an existing row: parent links and reply pointers are only written onto rows created in the current run, so a skipped group keeps its hierarchy and a skipped channel keeps its rules.
  • Rows without a created_at always create (no guessing).
  • All loaders are read-only and system-scoped (revisions scope by user, matching the model).
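
The first invariant, together with the row map mentioned for the native importer, can be sketched as a two-pass import. Everything here is an assumption for illustration: rows are plain dicts, `src_id` and `key` are hypothetical field names, and existing occurrences are claimed in id order. Skipped records are only recorded in the row map; reply pointers are written solely onto rows created in the current run.

```python
from collections import defaultdict, deque


def import_messages(db, incoming):
    """db: list of row dicts; incoming: export records. Returns (created, skipped)."""
    # Existing row ids per key, claimed one at a time as records are skipped.
    unclaimed = defaultdict(deque)
    for row in db:
        unclaimed[row["key"]].append(row["id"])

    row_map = {}   # source id -> row id, for skipped and created rows alike
    created = []   # (new row, source parent id) pairs
    next_id = max((row["id"] for row in db), default=0) + 1

    for msg in incoming:
        key = msg["key"]
        if key is not None and unclaimed[key]:
            # Skip: remember which existing row this record maps to, touch nothing.
            row_map[msg["src_id"]] = unclaimed[key].popleft()
            continue
        row = {"id": next_id, "key": key, "reply_to": None}
        next_id += 1
        db.append(row)
        row_map[msg["src_id"]] = row["id"]
        created.append((row, msg["reply_to"]))

    # Second pass: reply pointers go only onto rows created in this run.
    for row, src_parent in created:
        if src_parent is not None:
            row["reply_to"] = row_map.get(src_parent)

    return len(created), len(incoming) - len(created)


db = []
export = [
    {"src_id": "a", "key": ("board-1", 100), "reply_to": None},
    {"src_id": "b", "key": ("board-1", 101), "reply_to": "a"},
]
first = import_messages(db, export)           # (2, 0)
snapshot = [dict(row) for row in db]

# The export later gains a new reply to the already-imported message "a".
export.append({"src_id": "c", "key": ("board-1", 102), "reply_to": "a"})
second = import_messages(db, export)          # (1, 2)
```

After the second run the new row points at the existing parent through the row map, and the two previously imported rows are identical to the snapshot taken before it.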

Known limitation

Prism media attachments still duplicate blobs on re-import; attachments have no stable identity in the source format short of content-hashing the bytes. Out of scope here.

Tests

New idempotence tests per importer, headlined by a native all-sections round-trip: build a system exercising every export section, import twice, assert run 2 imports zero rows everywhere with exact skip counts and unchanged API row counts. Plus PK (groups + switch history), TB (groups), SP (crash regression + front history), PluralSpace, and Prism re-import tests. Full importer + parity sweep: 138 passed.

Member dedup left everything members own appending on re-import. This
extends exact-match dedup to all content sections across all six
importers: tags and groups by name, fronts by (interval, member set),
journals, revisions, messages, polls, reminders, and notification
config by the source timestamps every importer preserves. Active under
both skip and update strategies; create keeps append-everything.

New shared module import_content_dedup provides the match indexes, an
occurrence-counted index for message keys (coarse source timestamps
mean distinct messages can legitimately share a key, and Postgres
freezes now() per transaction, so plain key-sets drop real rows), and
pre-seeded pair guards for association tables.

The pair guards also fix a latent crash: re-importing custom-field
values into a system that already had them violated the
UNIQUE(field_id, member_id) constraint in the SimplyPlural,
PluralSpace, and Prism importers (and group_members in Prism), failing
the whole job.

Skipped rows are never mutated: parent links and reply pointers are
only written onto rows created in the current run. Every section
reports a *_skipped count on the import detail page.

Closes #350
@SiteRelEnby SiteRelEnby enabled auto-merge June 11, 2026 20:16
@SiteRelEnby SiteRelEnby merged commit ed6136e into main Jun 11, 2026
4 checks passed
@SiteRelEnby SiteRelEnby deleted the import-content-dedup branch June 11, 2026 20:27