sheaf-project · SiteRelEnby · Jun 11, 2026
@@ -8,8 +8,9 @@ All notable changes to Sheaf are documented here. The format is based on [Keep a
 
 ### Added
 
+- **Re-import is now fully idempotent (content dedup).** Import deduplication now covers non-member content as well as members: tags and groups match by name; fronts by their exact interval and member set; and journals, edit-history revisions, board messages, polls, reminders, and notification config by the source timestamps every importer already preserves. Importing the same export twice adds nothing the second time, whatever the source (native JSON, the with-images archive, PluralKit, SimplyPlural, Tupperbox, PluralSpace, or Prism); each section reports a `*_skipped` count on the import detail page instead. `conflict_strategy=create` keeps the old append-everything behaviour. This also fixes a crash on re-import in the SimplyPlural, PluralSpace, and Prism importers: re-inserting a value for a reused custom-field definition and an already-present (deduped) member violated the one-value-per-(field, member) constraint and failed the whole job. The value and group-membership guards are now pre-seeded with what the system already has.
 - **Export-with-images archive import.** The zip the async export job produces (export.json + images/) can now be uploaded straight back through Settings -> Import: avatars, markdown image embeds, and journal/revision image attachments are re-uploaded to the importing account through the same pipeline as regular uploads (format sniff, EXIF strip + dimension cap, storage quota) and every reference is rewritten to the new keys. Previously the zip's images were carry-your-own: the JSON imported but image references were stripped and re-uploading was manual. The plain JSON import is unchanged. Restore is quota-aware (stops cleanly with a warning when full), runs the member-cap check before writing any blob so a failed import never strands storage, and discards uploads nothing ends up referencing (e.g. when the dedup pass skips an already-present member) so re-imports don't leak quota. The image-ingest pipeline itself is now shared code (`import_media`) used by the PluralSpace and Prism importers too, instead of three private copies.
-- **Import deduplication.** Every importer (PluralKit, SimplyPlural, Tupperbox, PluralSpace, Prism, and Sheaf native re-import) now matches each incoming member against the system's existing roster before writing, so re-importing the same export no longer doubles your members. Matching is by PluralKit ID where present (exact, so PK round-trips cleanly) and otherwise by name, scoped so a member and a custom front sharing a name never collide. A new `conflict_strategy` option chooses what happens on a match: `skip` (default - leave the existing member untouched and add nothing), `update` (overwrite the existing member's importable fields from the export), or `create` (the old append-everything behaviour, kept as an escape hatch). The tier member cap now counts only the members an import would actually create, so re-importing into a near-full system no longer trips the cap on members that already exist. Deduplication is member-scoped: fronts, groups, journals, messages, polls, and reminders are still appended on re-import, so re-importing those sections over existing data can still duplicate them. The PluralKit member HID is now also confirmed to land in each member's `pluralkit_id` field, which doubles as the dedup key.
+- **Import deduplication.** Every importer (PluralKit, SimplyPlural, Tupperbox, PluralSpace, Prism, and Sheaf native re-import) now matches each incoming member against the system's existing roster before writing, so re-importing the same export no longer doubles your members. Matching is by PluralKit ID where present (exact, so PK round-trips cleanly) and otherwise by name, scoped so a member and a custom front sharing a name never collide. A new `conflict_strategy` option chooses what happens on a match: `skip` (default - leave the existing member untouched and add nothing), `update` (overwrite the existing member's importable fields from the export), or `create` (the old append-everything behaviour, kept as an escape hatch). The tier member cap now counts only the members an import would actually create, so re-importing into a near-full system no longer trips the cap on members that already exist. The PluralKit member HID is now also confirmed to land in each member's `pluralkit_id` field, which doubles as the dedup key. (Non-member content dedup landed alongside - see the entry above.)
 ### Fixed
 
 - **Build provenance for local compose builds.** `GET /v1/version` reports the commit/tag/build-time the backend was built from; CI-built ghcr images already set these, but a local `docker compose build` left them null because the compose `args` didn't forward them. The app service now accepts `GIT_COMMIT` / `GIT_TAG` / `BUILD_TIME` from the host environment (documented in SELFHOSTING.md), so a compose build can identify itself too. Unset values stay null, same as before.
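
The re-import crash fixed in the first Added entry comes down to inserting a second value for a (field, member) pair that already has one. A minimal sketch of the pre-seeding idea follows, using a plain in-memory stand-in rather than the project's importer code; the `FieldValueGuard` class, the `import_values` helper, and the sample IDs are all hypothetical.

```python
class FieldValueGuard:
    """Hypothetical stand-in for a (field_id, member_id) existence guard."""

    def __init__(self, existing=None):
        # Pre-seed with pairs the system already holds, so a re-import
        # never attempts a second value for the same (field, member).
        self._pairs = set(existing or ())

    def add(self, pair):
        """Return True the first time a pair is seen, False afterwards."""
        if pair in self._pairs:
            return False
        self._pairs.add(pair)
        return True


def import_values(guard, incoming):
    """Return only the (field_id, member_id, value) rows safe to insert."""
    return [row for row in incoming if guard.add((row[0], row[1]))]


# Pairs already stored by a previous import (illustrative IDs).
already_stored = {("field-1", "member-a")}
export_rows = [
    ("field-1", "member-a", "she/her"),    # duplicate of a stored value
    ("field-1", "member-b", "they/them"),  # genuinely new
]

# An empty guard lets the duplicate through (the old crash path);
# a pre-seeded guard filters it out before any insert is attempted.
unseeded = import_values(FieldValueGuard(), export_rows)
seeded = import_values(FieldValueGuard(already_stored), export_rows)
```

With an empty guard both rows pass, and the first would hit the uniqueness constraint in a real database; with the guard seeded from existing pairs only the new row remains.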

@@ -90,7 +90,11 @@ class PluralspaceImportResult(BaseModel):
     groups_imported: int = 0
     custom_fields_imported: int = 0
     fronts_imported: int = 0
+    fronts_skipped: int = 0
     journals_imported: int = 0
+    journals_skipped: int = 0
     messages_imported: int = 0
+    messages_skipped: int = 0
     polls_imported: int = 0
+    polls_skipped: int = 0
     warnings: list[str] = Field(default_factory=list)
@@ -92,10 +92,15 @@ class PrismImportResult(BaseModel):
     custom_fields_imported: int = 0
     custom_field_values_imported: int = 0
     fronts_imported: int = 0
+    fronts_skipped: int = 0
     journals_imported: int = 0
+    journals_skipped: int = 0
     messages_imported: int = 0
+    messages_skipped: int = 0
     board_posts_imported: int = 0
+    board_posts_skipped: int = 0
     polls_imported: int = 0
+    polls_skipped: int = 0
     reminders_imported: int = 0
     media_attachments_imported: int = 0
     warnings: list[str] = Field(default_factory=list)
@@ -54,7 +54,10 @@ class SPImportResult(BaseModel):
     members_skipped: int = 0
     members_updated: int = 0
     fronts_imported: int = 0
+    fronts_skipped: int = 0
     groups_imported: int = 0
+    groups_skipped: int = 0
     custom_fields_imported: int = 0
+    custom_fields_skipped: int = 0
     notes_skipped: int = 0  # Until journal feature exists
     warnings: list[str] = []
@@ -47,4 +47,5 @@ class TBImportResult(BaseModel):
     members_skipped: int = 0
     members_updated: int = 0
     groups_imported: int = 0
+    groups_skipped: int = 0
     warnings: list[str] = []
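
The new `*_skipped` fields pair each `*_imported` counter with a count of rows the dedup pass left alone. A rough sketch of how such a pair might be tallied for name-matched groups is shown below. It uses a dataclass instead of the pydantic models in the diff so that it runs with the standard library only; `GroupImportResult`, `import_groups`, and the `dedup` flag (standing in for "strategy is not `create`") are assumptions for illustration, not the importers' actual code.

```python
from dataclasses import dataclass, field


@dataclass
class GroupImportResult:
    """Hypothetical stand-in for the pydantic result models in the diff."""

    groups_imported: int = 0
    groups_skipped: int = 0
    warnings: list[str] = field(default_factory=list)


def import_groups(existing_names, incoming_names, dedup=True):
    """Tally imported vs. skipped groups, matching by exact name."""
    result = GroupImportResult()
    seen = set(existing_names)
    for name in incoming_names:
        if dedup and name in seen:
            # Already present (in the system or earlier in this batch).
            result.groups_skipped += 1
            continue
        seen.add(name)
        result.groups_imported += 1
    return result


# First import into an empty system, then the same export again.
first = import_groups(set(), ["Littles", "Protectors"])
second = import_groups({"Littles", "Protectors"}, ["Littles", "Protectors"])
# dedup=False mimics the append-everything behaviour of the create strategy.
appended = import_groups({"Littles", "Protectors"}, ["Littles"], dedup=False)
```

On the second run every group lands in `groups_skipped` and nothing is imported, which is the behaviour the import detail page is described as reporting.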
@@ -0,0 +1,340 @@
+"""Exact-match dedup for non-member import content.
+
+`import_dedup` makes re-imported MEMBERS resolve onto the existing
+roster; this module does the same for everything members own, so a full
+re-import (especially the Sheaf native restore path) is idempotent
+instead of appending a second copy of every group, front, journal,
+message, poll, reminder, and notification channel.
+
+Matching is exact and system-scoped. Titles and bodies are encrypted
+at rest (non-deterministically), so content rows match on plaintext
+columns that round-trip the export unchanged:
+
+- tags / groups / custom-field definitions: name (+ field type)
+- fronts: (started_at, ended_at, the resolved member set)
+- journals: (member_id, created_at)
+- content revisions: (target_type, resolved target id, created_at)
+- messages: (board_member_id, created_at)
+- polls / reminders / watch tokens / channels: created_at
+
+created_at works as a key because every importer preserves the source
+timestamp on the row it creates, so the same export always lands on
+the same instants. Member-linked keys work because the member-dedup
+pass resolves skipped members onto their existing rows first - the
+front/journal keys then line up with what is already in the DB.
+
+Semantics: rows dedup under both SKIP and UPDATE strategies - an
+in-place "update" of an exact-match content row is meaningless by
+construction (the key IS the content identity). CREATE keeps the old
+append-everything behaviour. Callers gate on
+`strategy != ImportConflictStrategy.CREATE`.
+
+Association tables (group_members, member_tags, reminder scopes) get
+pair-set guards: a skipped group plus a newly-created member must
+still link, while a skipped group plus a skipped member must not
+violate the composite primary key.
+"""
+
+from __future__ import annotations
+
+import uuid
+from datetime import datetime
+from typing import Any
+
+from sqlalchemy import select
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from sheaf.models.content_revision import ContentRevision
+from sheaf.models.custom_field import CustomFieldDefinition, CustomFieldValue
+from sheaf.models.front import Front
+from sheaf.models.group import Group
+from sheaf.models.journal_entry import JournalEntry
+from sheaf.models.member import front_members, group_members, member_tags
+from sheaf.models.message import Message
+from sheaf.models.notification_channel import NotificationChannel
+from sheaf.models.poll import Poll
+from sheaf.models.reminder import Reminder
+from sheaf.models.tag import Tag
+from sheaf.models.watch_token import WatchToken
+
+
+class ContentMatchIndex:
+    """One section's exact-match index.
+
+    Maps match key -> an opaque value: the existing ORM row where the
+    caller links downstream records against it (groups, journals,
+    field definitions, messages), or just True where existence is all
+    that matters (fronts, polls, reminders).
+
+    `get(key)` returns the existing value (a duplicate - skip the
+    candidate, link against the returned row if applicable) or None
+    when the key is new. The caller then builds the row and calls
+    `register(key, row)` so later rows in the same import batch dedup
+    against it too.
+    """
+
+    def __init__(self, rows: dict | None = None):
+        self._rows: dict = dict(rows) if rows else {}
+
+    def get(self, key: Any) -> Any | None:
+        return self._rows.get(key)
+
+    def register(self, key: Any, value: Any = True) -> None:
+        self._rows[key] = value
+
+    def __contains__(self, key: Any) -> bool:
+        return key in self._rows
+
+
+class CountedIndex:
+    """Occurrence-counted dedup for sections whose match key can
+    legitimately collide WITHIN one import.
+
+    Chat-style messages carry second-precision source timestamps, so
+    two distinct messages in the same second share a key. A plain
+    key-set would wrongly drop the second one on first import. Instead,
+    `should_skip(key)` consumes one existing occurrence per call: a
+    re-import skips exactly as many rows per key as the system already
+    holds and creates the rest, which is idempotent across runs while
+    never dropping distinct same-second rows on the first pass.
+    """
+
+    def __init__(self, counts: dict | None = None):
+        self._remaining: dict = dict(counts) if counts else {}
+
+    def should_skip(self, key: Any) -> bool:
+        remaining = self._remaining.get(key, 0)
+        if remaining > 0:
+            self._remaining[key] = remaining - 1
+            return True
+        return False
+
+
+class PairGuard:
+    """Existence guard for association-table inserts.
+
+    `add(pair)` returns True exactly once per pair (insert it), False
+    on repeats (it already exists in the DB or earlier in this batch).
+    """
+
+    def __init__(self, existing: set | None = None):
+        self._pairs: set = set(existing) if existing else set()
+
+    def add(self, pair: tuple) -> bool:
+        if pair in self._pairs:
+            return False
+        self._pairs.add(pair)
+        return True
+
+
+# --- Section loaders ---------------------------------------------------------
+#
+# Each returns one section's existing match keys for one system (revisions:
+# one user), shaped for ContentMatchIndex / CountedIndex / PairGuard. Read-only.
+
+
+async def load_tag_index(db: AsyncSession, system_id: uuid.UUID) -> ContentMatchIndex:
+    rows = await db.execute(select(Tag).where(Tag.system_id == system_id))
+    return ContentMatchIndex({t.name: t for t in rows.scalars().all()})
+
+
+async def load_group_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    rows = await db.execute(select(Group).where(Group.system_id == system_id))
+    return ContentMatchIndex({g.name: g for g in rows.scalars().all()})
+
+
+async def load_field_def_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    """Custom-field definitions keyed by (name, type) - the same dedup
+    key the native importer has always used for definitions."""
+    rows = await db.execute(
+        select(CustomFieldDefinition).where(
+            CustomFieldDefinition.system_id == system_id
+        )
+    )
+    return ContentMatchIndex(
+        {(fd.name, str(fd.field_type)): fd for fd in rows.scalars().all()}
+    )
+
+
+async def load_field_value_guard(
+    db: AsyncSession, system_id: uuid.UUID
+) -> PairGuard:
+    """(field_id, member_id) pairs already holding a value - the
+    UNIQUE(field_id, member_id) constraint makes a blind re-insert a
+    hard error, so every importer must consult this before adding."""
+    rows = await db.execute(
+        select(CustomFieldValue.field_id, CustomFieldValue.member_id).join(
+            CustomFieldDefinition,
+            CustomFieldValue.field_id == CustomFieldDefinition.id,
+        ).where(CustomFieldDefinition.system_id == system_id)
+    )
+    return PairGuard({(r.field_id, r.member_id) for r in rows})
+
+
+async def load_front_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    """Fronts keyed by (started_at, ended_at, frozenset of member ids)."""
+    front_rows = await db.execute(
+        select(Front.id, Front.started_at, Front.ended_at).where(
+            Front.system_id == system_id
+        )
+    )
+    fronts = {r.id: (r.started_at, r.ended_at) for r in front_rows}
+    members_by_front: dict[uuid.UUID, set[uuid.UUID]] = {}
+    if fronts:
+        assoc = await db.execute(
+            select(front_members.c.front_id, front_members.c.member_id).where(
+                front_members.c.front_id.in_(fronts)
+            )
+        )
+        for r in assoc:
+            members_by_front.setdefault(r.front_id, set()).add(r.member_id)
+    return ContentMatchIndex(
+        {
+            (started, ended, frozenset(members_by_front.get(fid, set()))): True
+            for fid, (started, ended) in fronts.items()
+        }
+    )
+
+
+def front_key(
+    started_at: datetime,
+    ended_at: datetime | None,
+    member_ids: set[uuid.UUID] | frozenset[uuid.UUID],
+) -> tuple:
+    return (started_at, ended_at, frozenset(member_ids))
+
+
+async def load_journal_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    """Journals keyed by (member_id-or-None, created_at), mapping to the
+    row so revisions on a skipped journal attach to the existing one."""
+    rows = await db.execute(
+        select(JournalEntry).where(JournalEntry.system_id == system_id)
+    )
+    return ContentMatchIndex(
+        {(j.member_id, j.created_at): j for j in rows.scalars().all()}
+    )
+
+
+async def load_revision_index(
+    db: AsyncSession, user_id: uuid.UUID
+) -> ContentMatchIndex:
+    """Revisions scope by user (no system_id column on the model)."""
+    rows = await db.execute(
+        select(
+            ContentRevision.target_type,
+            ContentRevision.target_id,
+            ContentRevision.created_at,
+        ).where(ContentRevision.user_id == user_id)
+    )
+    return ContentMatchIndex(
+        {(str(r.target_type), r.target_id, r.created_at): True for r in rows}
+    )
+
+
+async def load_message_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    """Messages keyed by (board_member_id, created_at), mapping to the
+    row so a new reply whose parent was skipped still threads onto the
+    existing parent. Used by the native importer, whose timestamps are
+    microsecond-precision DB round-trips."""
+    rows = await db.execute(
+        select(Message).where(Message.system_id == system_id)
+    )
+    return ContentMatchIndex(
+        {(m.board_member_id, m.created_at): m for m in rows.scalars().all()}
+    )
+
+
+async def load_message_count_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> CountedIndex:
+    """Occurrence counts of (board_member_id, created_at) message keys.
+
+    For foreign importers (PluralSpace chat, Prism conversations and
+    board posts) whose source timestamps are coarse enough that
+    distinct messages legitimately share a key - see CountedIndex.
+    """
+    rows = await db.execute(
+        select(Message.board_member_id, Message.created_at).where(
+            Message.system_id == system_id
+        )
+    )
+    counts: dict[tuple, int] = {}
+    for r in rows:
+        key = (r.board_member_id, r.created_at)
+        counts[key] = counts.get(key, 0) + 1
+    return CountedIndex(counts)
+
+
+async def load_poll_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    rows = await db.execute(
+        select(Poll.created_at).where(Poll.system_id == system_id)
+    )
+    return ContentMatchIndex({r.created_at: True for r in rows})
+
+
+async def load_reminder_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    rows = await db.execute(
+        select(Reminder.created_at).where(Reminder.system_id == system_id)
+    )
+    return ContentMatchIndex({r.created_at: True for r in rows})
+
+
+async def load_watch_token_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    rows = await db.execute(
+        select(WatchToken).where(WatchToken.system_id == system_id)
+    )
+    return ContentMatchIndex(
+        {t.created_at: t for t in rows.scalars().all()}
+    )
+
+
+async def load_channel_index(
+    db: AsyncSession, system_id: uuid.UUID
+) -> ContentMatchIndex:
+    """Channels keyed by created_at, scoped via their watch token."""
+    rows = await db.execute(
+        select(NotificationChannel)
+        .join(WatchToken, NotificationChannel.watch_token_id == WatchToken.id)
+        .where(WatchToken.system_id == system_id)
+    )
+    return ContentMatchIndex(
+        {c.created_at: c for c in rows.scalars().all()}
+    )
+
+
+async def load_group_member_guard(
+    db: AsyncSession, system_id: uuid.UUID
+) -> PairGuard:
+    rows = await db.execute(
+        select(group_members.c.group_id, group_members.c.member_id).join(
+            Group, group_members.c.group_id == Group.id
+        ).where(Group.system_id == system_id)
+    )
+    return PairGuard({(r.group_id, r.member_id) for r in rows})
+
+
+async def load_member_tag_guard(
+    db: AsyncSession, system_id: uuid.UUID
+) -> PairGuard:
+    rows = await db.execute(
+        select(member_tags.c.tag_id, member_tags.c.member_id).join(
+            Tag, member_tags.c.tag_id == Tag.id
+        ).where(Tag.system_id == system_id)
+    )
+    return PairGuard({(r.tag_id, r.member_id) for r in rows})
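
The occurrence-counting behaviour described in the `CountedIndex` docstring can be checked with a small in-memory run. The class body below is copied from the diff so the sketch is self-contained; the `run_import` helper, the list standing in for the messages table, and the sample keys are hypothetical.

```python
from datetime import datetime


class CountedIndex:
    """Copied from the diff above so the sketch runs on its own."""

    def __init__(self, counts=None):
        self._remaining = dict(counts) if counts else {}

    def should_skip(self, key):
        remaining = self._remaining.get(key, 0)
        if remaining > 0:
            self._remaining[key] = remaining - 1
            return True
        return False


def run_import(stored, incoming):
    """Hypothetical import pass: append only rows the index does not skip."""
    counts = {}
    for key in stored:
        counts[key] = counts.get(key, 0) + 1
    index = CountedIndex(counts)
    created = [key for key in incoming if not index.should_skip(key)]
    return stored + created


ts = datetime(2026, 6, 11, 12, 0, 0)
# Two distinct messages by the same author in the same second, plus a later one.
export = [
    ("member-a", ts),
    ("member-a", ts),
    ("member-a", ts.replace(second=1)),
]

after_first = run_import([], export)            # nothing stored yet
after_second = run_import(after_first, export)  # same export again
```

The first pass keeps both same-second messages because no occurrences exist to consume; the second pass consumes one stored occurrence per incoming row and creates nothing, so the stored list is unchanged.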