Merged
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -8,8 +8,9 @@ All notable changes to Sheaf are documented here. The format is based on [Keep a

### Added

- **Re-import is now fully idempotent (content dedup).** Import deduplication covers everything, not just members: tags and groups match by name, fronts by their exact interval and member set, and journals, edit-history revisions, board messages, polls, reminders, and notification config by the source timestamps every importer already preserves. Importing the same export twice - any source: native JSON, the with-images archive, PluralKit, SimplyPlural, Tupperbox, PluralSpace, or Prism - adds nothing the second time; each section reports a `*_skipped` count on the import detail page instead. `conflict_strategy=create` keeps the old append-everything behaviour. This also fixes a crash on re-import in the SimplyPlural, PluralSpace, and Prism importers: a reused custom-field definition plus an already-present (deduped) member violated the one-value-per-(field, member) constraint and failed the whole job; the value and group-membership guards are now pre-seeded with what the system already has.
- **Export-with-images archive import.** The zip the async export job produces (export.json + images/) can now be uploaded straight back through Settings -> Import: avatars, markdown image embeds, and journal/revision image attachments are re-uploaded to the importing account through the same pipeline as regular uploads (format sniff, EXIF strip + dimension cap, storage quota) and every reference is rewritten to the new keys. Previously the zip's images were carry-your-own: the JSON imported but image references were stripped and re-uploading was manual. The plain JSON import is unchanged. Restore is quota-aware (stops cleanly with a warning when full), runs the member-cap check before writing any blob so a failed import never strands storage, and discards uploads nothing ends up referencing (e.g. when the dedup pass skips an already-present member) so re-imports don't leak quota. The image-ingest pipeline itself is now shared code (`import_media`) used by the PluralSpace and Prism importers too, instead of three private copies.
- **Import deduplication.** Every importer (PluralKit, SimplyPlural, Tupperbox, PluralSpace, Prism, and Sheaf native re-import) now matches each incoming member against the system's existing roster before writing, so re-importing the same export no longer doubles your members. Matching is by PluralKit ID where present (exact, so PK round-trips cleanly) and otherwise by name, scoped so a member and a custom front sharing a name never collide. A new `conflict_strategy` option chooses what happens on a match: `skip` (default - leave the existing member untouched and add nothing), `update` (overwrite the existing member's importable fields from the export), or `create` (the old append-everything behaviour, kept as an escape hatch). The tier member cap now counts only the members an import would actually create, so re-importing into a near-full system no longer trips the cap on members that already exist. Deduplication is member-scoped: fronts, groups, journals, messages, polls, and reminders are still appended on re-import, so re-importing those sections over existing data can still duplicate them. The PluralKit member HID is now also confirmed to land in each member's `pluralkit_id` field, which doubles as the dedup key.
- **Import deduplication.** Every importer (PluralKit, SimplyPlural, Tupperbox, PluralSpace, Prism, and Sheaf native re-import) now matches each incoming member against the system's existing roster before writing, so re-importing the same export no longer doubles your members. Matching is by PluralKit ID where present (exact, so PK round-trips cleanly) and otherwise by name, scoped so a member and a custom front sharing a name never collide. A new `conflict_strategy` option chooses what happens on a match: `skip` (default - leave the existing member untouched and add nothing), `update` (overwrite the existing member's importable fields from the export), or `create` (the old append-everything behaviour, kept as an escape hatch). The tier member cap now counts only the members an import would actually create, so re-importing into a near-full system no longer trips the cap on members that already exist. The PluralKit member HID is now also confirmed to land in each member's `pluralkit_id` field, which doubles as the dedup key. (Non-member content dedup landed alongside - see the entry above.)
### Fixed

- **Build provenance for local compose builds.** `GET /v1/version` reports the commit/tag/build-time the backend was built from; CI-built ghcr images already set these, but a local `docker compose build` left them null because the compose `args` didn't forward them. The app service now accepts `GIT_COMMIT` / `GIT_TAG` / `BUILD_TIME` from the host environment (documented in SELFHOSTING.md), so a compose build can identify itself too. Unset values stay null, same as before.
4 changes: 4 additions & 0 deletions sheaf/schemas/pluralspace_import.py
@@ -90,7 +90,11 @@ class PluralspaceImportResult(BaseModel):
    groups_imported: int = 0
    custom_fields_imported: int = 0
    fronts_imported: int = 0
    fronts_skipped: int = 0
    journals_imported: int = 0
    journals_skipped: int = 0
    messages_imported: int = 0
    messages_skipped: int = 0
    polls_imported: int = 0
    polls_skipped: int = 0
    warnings: list[str] = Field(default_factory=list)
5 changes: 5 additions & 0 deletions sheaf/schemas/prism_import.py
@@ -92,10 +92,15 @@ class PrismImportResult(BaseModel):
    custom_fields_imported: int = 0
    custom_field_values_imported: int = 0
    fronts_imported: int = 0
    fronts_skipped: int = 0
    journals_imported: int = 0
    journals_skipped: int = 0
    messages_imported: int = 0
    messages_skipped: int = 0
    board_posts_imported: int = 0
    board_posts_skipped: int = 0
    polls_imported: int = 0
    polls_skipped: int = 0
    reminders_imported: int = 0
    media_attachments_imported: int = 0
    warnings: list[str] = Field(default_factory=list)
3 changes: 3 additions & 0 deletions sheaf/schemas/sp_import.py
@@ -54,7 +54,10 @@ class SPImportResult(BaseModel):
    members_skipped: int = 0
    members_updated: int = 0
    fronts_imported: int = 0
    fronts_skipped: int = 0
    groups_imported: int = 0
    groups_skipped: int = 0
    custom_fields_imported: int = 0
    custom_fields_skipped: int = 0
    notes_skipped: int = 0  # Until journal feature exists
    warnings: list[str] = []
1 change: 1 addition & 0 deletions sheaf/schemas/tb_import.py
@@ -47,4 +47,5 @@ class TBImportResult(BaseModel):
    members_skipped: int = 0
    members_updated: int = 0
    groups_imported: int = 0
    groups_skipped: int = 0
    warnings: list[str] = []
340 changes: 340 additions & 0 deletions sheaf/services/import_content_dedup.py
@@ -0,0 +1,340 @@
"""Exact-match dedup for non-member import content.

`import_dedup` makes re-imported MEMBERS resolve onto the existing
roster; this module does the same for everything members own, so a full
re-import (especially the Sheaf native restore path) is idempotent
instead of appending a second copy of every group, front, journal,
message, poll, reminder, and notification channel.

Matching is exact and system-scoped. Titles and bodies are encrypted
at rest (non-deterministically), so content rows match on plaintext
columns that round-trip the export unchanged:

- tags / groups / custom-field definitions: name (+ field type)
- fronts: (started_at, ended_at, the resolved member set)
- journals: (member_id, created_at)
- content revisions: (target_type, resolved target id, created_at)
- messages: (board_member_id, created_at)
- polls / reminders / watch tokens / channels: created_at

created_at works as a key because every importer preserves the source
timestamp on the row it creates, so the same export always lands on
the same instants. Member-linked keys work because the member-dedup
pass resolves skipped members onto their existing rows first - the
front/journal keys then line up with what is already in the DB.

Semantics: rows dedup under both SKIP and UPDATE strategies - an
in-place "update" of an exact-match content row is meaningless by
construction (the key IS the content identity). CREATE keeps the old
append-everything behaviour. Callers gate on
`strategy != ImportConflictStrategy.CREATE`.

Association tables (group_members, member_tags, reminder scopes) get
pair-set guards: a skipped group plus a newly-created member must
still link, while a skipped group plus a skipped member must not
violate the composite primary key.
"""

from __future__ import annotations

import uuid
from datetime import datetime
from typing import Any

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from sheaf.models.content_revision import ContentRevision
from sheaf.models.custom_field import CustomFieldDefinition, CustomFieldValue
from sheaf.models.front import Front
from sheaf.models.group import Group
from sheaf.models.journal_entry import JournalEntry
from sheaf.models.member import front_members, group_members, member_tags
from sheaf.models.message import Message
from sheaf.models.notification_channel import NotificationChannel
from sheaf.models.poll import Poll
from sheaf.models.reminder import Reminder
from sheaf.models.tag import Tag
from sheaf.models.watch_token import WatchToken


class ContentMatchIndex:
    """One section's exact-match index.

    Maps match key -> an opaque value: the existing ORM row where the
    caller links downstream records against it (groups, journals,
    field definitions, messages), or just True where existence is all
    that matters (fronts, polls, reminders).

    `get(key)` returns the existing value (a duplicate - skip the
    candidate, link against the returned row if applicable) or None
    when the key is new. The caller then builds the row and calls
    `register(key, row)` so later rows in the same import batch dedup
    against it too.
    """

    def __init__(self, rows: dict | None = None):
        self._rows: dict = dict(rows) if rows else {}

    def get(self, key: Any) -> Any | None:
        return self._rows.get(key)

    def register(self, key: Any, value: Any = True) -> None:
        self._rows[key] = value

    def __contains__(self, key: Any) -> bool:
        return key in self._rows


class CountedIndex:
    """Occurrence-counted dedup for sections whose match key can
    legitimately collide WITHIN one import.

    Chat-style messages carry second-precision source timestamps, so
    two distinct messages in the same second share a key. A plain
    key-set would wrongly drop the second one on first import. Instead,
    `should_skip(key)` consumes one existing occurrence per call: a
    re-import skips exactly as many rows per key as the system already
    holds and creates the rest, which is idempotent across runs while
    never dropping distinct same-second rows on the first pass.
    """

    def __init__(self, counts: dict | None = None):
        self._remaining: dict = dict(counts) if counts else {}

    def should_skip(self, key: Any) -> bool:
        remaining = self._remaining.get(key, 0)
        if remaining > 0:
            self._remaining[key] = remaining - 1
            return True
        return False


class PairGuard:
    """Existence guard for association-table inserts.

    `add(pair)` returns True exactly once per pair (insert it), False
    on repeats (it already exists in the DB or earlier in this batch).
    """

    def __init__(self, existing: set | None = None):
        self._pairs: set = set(existing) if existing else set()

    def add(self, pair: tuple) -> bool:
        if pair in self._pairs:
            return False
        self._pairs.add(pair)
        return True


# --- Section loaders ---------------------------------------------------------
#
# Each returns the section's existing match keys for one system, shaped
# for ContentMatchIndex / PairGuard. All read-only.


async def load_tag_index(db: AsyncSession, system_id: uuid.UUID) -> ContentMatchIndex:
    rows = await db.execute(select(Tag).where(Tag.system_id == system_id))
    return ContentMatchIndex({t.name: t for t in rows.scalars().all()})


async def load_group_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    rows = await db.execute(select(Group).where(Group.system_id == system_id))
    return ContentMatchIndex({g.name: g for g in rows.scalars().all()})


async def load_field_def_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    """Custom-field definitions keyed by (name, type) - the same dedup
    key the native importer has always used for definitions."""
    rows = await db.execute(
        select(CustomFieldDefinition).where(
            CustomFieldDefinition.system_id == system_id
        )
    )
    return ContentMatchIndex(
        {(fd.name, str(fd.field_type)): fd for fd in rows.scalars().all()}
    )


async def load_field_value_guard(
    db: AsyncSession, system_id: uuid.UUID
) -> PairGuard:
    """(field_id, member_id) pairs already holding a value - the
    UNIQUE(field_id, member_id) constraint makes a blind re-insert a
    hard error, so every importer must consult this before adding."""
    rows = await db.execute(
        select(CustomFieldValue.field_id, CustomFieldValue.member_id).join(
            CustomFieldDefinition,
            CustomFieldValue.field_id == CustomFieldDefinition.id,
        ).where(CustomFieldDefinition.system_id == system_id)
    )
    return PairGuard({(r.field_id, r.member_id) for r in rows})


async def load_front_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    """Fronts keyed by (started_at, ended_at, frozenset of member ids)."""
    front_rows = await db.execute(
        select(Front.id, Front.started_at, Front.ended_at).where(
            Front.system_id == system_id
        )
    )
    fronts = {r.id: (r.started_at, r.ended_at) for r in front_rows}
    members_by_front: dict[uuid.UUID, set[uuid.UUID]] = {}
    if fronts:
        assoc = await db.execute(
            select(front_members.c.front_id, front_members.c.member_id).where(
                front_members.c.front_id.in_(fronts)
            )
        )
        for r in assoc:
            members_by_front.setdefault(r.front_id, set()).add(r.member_id)
    return ContentMatchIndex(
        {
            (started, ended, frozenset(members_by_front.get(fid, set()))): True
            for fid, (started, ended) in fronts.items()
        }
    )


def front_key(
    started_at: datetime,
    ended_at: datetime | None,
    member_ids: set[uuid.UUID] | frozenset[uuid.UUID],
) -> tuple:
    return (started_at, ended_at, frozenset(member_ids))


async def load_journal_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    """Journals keyed by (member_id-or-None, created_at), mapping to the
    row so revisions on a skipped journal attach to the existing one."""
    rows = await db.execute(
        select(JournalEntry).where(JournalEntry.system_id == system_id)
    )
    return ContentMatchIndex(
        {(j.member_id, j.created_at): j for j in rows.scalars().all()}
    )


async def load_revision_index(
    db: AsyncSession, user_id: uuid.UUID
) -> ContentMatchIndex:
    """Revisions scope by user (no system_id column on the model)."""
    rows = await db.execute(
        select(
            ContentRevision.target_type,
            ContentRevision.target_id,
            ContentRevision.created_at,
        ).where(ContentRevision.user_id == user_id)
    )
    return ContentMatchIndex(
        {(str(r.target_type), r.target_id, r.created_at): True for r in rows}
    )


async def load_message_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    """Messages keyed by (board_member_id, created_at), mapping to the
    row so a new reply whose parent was skipped still threads onto the
    existing parent. Used by the native importer, whose timestamps are
    microsecond-precision DB round-trips."""
    rows = await db.execute(
        select(Message).where(Message.system_id == system_id)
    )
    return ContentMatchIndex(
        {(m.board_member_id, m.created_at): m for m in rows.scalars().all()}
    )


async def load_message_count_index(
    db: AsyncSession, system_id: uuid.UUID
) -> CountedIndex:
    """Occurrence counts of (board_member_id, created_at) message keys.

    For foreign importers (PluralSpace chat, Prism conversations and
    board posts) whose source timestamps are coarse enough that
    distinct messages legitimately share a key - see CountedIndex.
    """
    rows = await db.execute(
        select(Message.board_member_id, Message.created_at).where(
            Message.system_id == system_id
        )
    )
    counts: dict[tuple, int] = {}
    for r in rows:
        key = (r.board_member_id, r.created_at)
        counts[key] = counts.get(key, 0) + 1
    return CountedIndex(counts)


async def load_poll_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    rows = await db.execute(
        select(Poll.created_at).where(Poll.system_id == system_id)
    )
    return ContentMatchIndex({r.created_at: True for r in rows})


async def load_reminder_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    rows = await db.execute(
        select(Reminder.created_at).where(Reminder.system_id == system_id)
    )
    return ContentMatchIndex({r.created_at: True for r in rows})


async def load_watch_token_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    rows = await db.execute(
        select(WatchToken).where(WatchToken.system_id == system_id)
    )
    return ContentMatchIndex(
        {t.created_at: t for t in rows.scalars().all()}
    )


async def load_channel_index(
    db: AsyncSession, system_id: uuid.UUID
) -> ContentMatchIndex:
    """Channels keyed by created_at, scoped via their watch token."""
    rows = await db.execute(
        select(NotificationChannel)
        .join(WatchToken, NotificationChannel.watch_token_id == WatchToken.id)
        .where(WatchToken.system_id == system_id)
    )
    return ContentMatchIndex(
        {c.created_at: c for c in rows.scalars().all()}
    )


async def load_group_member_guard(
    db: AsyncSession, system_id: uuid.UUID
) -> PairGuard:
    rows = await db.execute(
        select(group_members.c.group_id, group_members.c.member_id).join(
            Group, group_members.c.group_id == Group.id
        ).where(Group.system_id == system_id)
    )
    return PairGuard({(r.group_id, r.member_id) for r in rows})


async def load_member_tag_guard(
    db: AsyncSession, system_id: uuid.UUID
) -> PairGuard:
    rows = await db.execute(
        select(member_tags.c.tag_id, member_tags.c.member_id).join(
            Tag, member_tags.c.tag_id == Tag.id
        ).where(Tag.system_id == system_id)
    )
    return PairGuard({(r.tag_id, r.member_id) for r in rows})
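
A minimal sketch of how the occurrence-counted index and the pair guard behave across two runs of the same import, with no database involved. The two classes are trimmed copies of `CountedIndex` and `PairGuard` from this diff. Everything else is an assumption made for illustration: `import_messages` is a hypothetical stand-in for an importer loop, and the string board, group, and member IDs stand in for real UUIDs.

```python
from datetime import datetime, timezone
from typing import Any, Optional


class CountedIndex:
    """Trimmed copy: consumes one existing occurrence per should_skip() call."""

    def __init__(self, counts: Optional[dict] = None):
        self._remaining: dict = dict(counts) if counts else {}

    def should_skip(self, key: Any) -> bool:
        remaining = self._remaining.get(key, 0)
        if remaining > 0:
            self._remaining[key] = remaining - 1
            return True
        return False


class PairGuard:
    """Trimmed copy: add() returns True exactly once per pair."""

    def __init__(self, existing: Optional[set] = None):
        self._pairs: set = set(existing) if existing else set()

    def add(self, pair: tuple) -> bool:
        if pair in self._pairs:
            return False
        self._pairs.add(pair)
        return True


def import_messages(existing_keys, incoming_keys):
    """Hypothetical importer loop: returns (keys held after import, skipped count)."""
    # Count what the system already holds, as load_message_count_index does.
    counts = {}
    for key in existing_keys:
        counts[key] = counts.get(key, 0) + 1
    index = CountedIndex(counts)

    rows = list(existing_keys)
    skipped = 0
    for key in incoming_keys:
        if index.should_skip(key):
            skipped += 1  # already present: report it, write nothing
        else:
            rows.append(key)  # new occurrence: create the row
    return rows, skipped


# Two distinct messages on the same board within the same second.
ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
export = [("board-1", ts), ("board-1", ts)]

first_rows, first_skipped = import_messages([], export)
second_rows, second_skipped = import_messages(first_rows, export)
print(len(first_rows), first_skipped)    # 2 0
print(len(second_rows), second_skipped)  # 2 2

# A skipped group plus a skipped member must not re-insert the link,
# while a skipped group plus a new member links exactly once.
guard = PairGuard({("group-1", "member-1")})
linked = [
    guard.add(("group-1", "member-1")),
    guard.add(("group-1", "member-2")),
    guard.add(("group-1", "member-2")),
]
print(linked)  # [False, True, False]
```

The first run creates both same-second messages because the index starts empty; the second run consumes the two existing occurrences and creates nothing, which is the idempotence the `CountedIndex` docstring describes.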