fix(profile/flush-scheduler): probe before destructive oplog reset (Audit #333 C3)#361

Merged: vrogojin merged 1 commit into main from fix/issue-333-c3-orbitdb-auto-drop-guard on May 30, 2026

@vrogojin
Contributor

Re-filed from closed PR #348 — closed by GitHub when its base branch (fix/issue-333-c2-migration-flush-before-cleanup) was auto-deleted on #347's merge. The commit content + tests are unchanged; only the PR is new.

Closes the C3 critical finding of audit #333 — auto-db.drop() on a transient block-load error. Before this PR, FlushScheduler.addBundleWithOplogAutoReset (profile/profile-token-storage/flush-scheduler.ts:1722) transitioned straight from "extractLostHeadCid matched the error" to resetCorruptedLog() (which calls db.drop()). The matcher cannot distinguish a permanently-corrupt head (Helia GC ran) from a transiently-unreachable one (gateway blip / propagation lag) — a momentary fetch failure wiped all OUTBOX/SENT/disposition/finalization entries not yet captured in a pinned bundle.

Fix

Before the destructive reset, probe configured IPFS gateways for the lost head CID with exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline, 5 s per-attempt HEAD timeout — matching the Issue #239 shutdown-gate convention).
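As a rough illustration of the probe loop described above, here is a minimal TypeScript sketch. It is not the project's verifyCidAccessibleWithRetry: the name `probeCidWithBackoff`, the `HeadCheck` callback, the injectable `now`/`sleep` hooks, and the `baseDelayMs`/`maxDelayMs` defaults are all assumptions made for this example. Only the 30 s deadline and the 5 s per-attempt timeout come from this PR.

```typescript
// Hypothetical sketch; names and backoff defaults are NOT the project's API.
type HeadCheck = (gatewayUrl: string, cid: string, timeoutMs: number) => Promise<boolean>;

interface ProbeOptions {
  deadlineMs: number;          // 30_000 in this PR
  perAttemptTimeoutMs: number; // 5_000 in this PR
  baseDelayMs?: number;        // assumed default, not from the PR
  maxDelayMs?: number;         // assumed default, not from the PR
  now?: () => number;          // injectable clock, for tests
  sleep?: (ms: number) => Promise<void>;
}

interface ProbeResult {
  ok: boolean;
  attempts: number;
}

async function probeCidWithBackoff(
  gateways: readonly string[],
  cid: string,
  headCheck: HeadCheck,
  opts: ProbeOptions,
): Promise<ProbeResult> {
  // No gateways means there is nothing to probe.
  if (gateways.length === 0) return { ok: false, attempts: 0 };

  const now = opts.now ?? Date.now;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const maxDelayMs = opts.maxDelayMs ?? 8_000;
  const deadline = now() + opts.deadlineMs;
  let delay = opts.baseDelayMs ?? 250;
  let attempts = 0;

  while (true) {
    attempts++;
    for (const gateway of gateways) {
      const remaining = deadline - now();
      if (remaining <= 0) return { ok: false, attempts };
      // Never let a single HEAD run past the overall deadline.
      const timeout = Math.min(opts.perAttemptTimeoutMs, remaining);
      // A gateway that errors counts as "did not serve the CID".
      const served = await headCheck(gateway, cid, timeout).catch(() => false);
      if (served) return { ok: true, attempts };
    }
    const remaining = deadline - now();
    if (remaining <= 0) return { ok: false, attempts };
    await sleep(Math.min(delay, remaining));
    delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
  }
}
```

Injecting `now` and `sleep` keeps the loop testable without real timers. A caller would pass a `headCheck` that issues an HTTP HEAD against the gateway and resolves to `true` when the CID is served.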

| Probe outcome | Post-probe addBundle retry | Action |
| --- | --- | --- |
| ok: true | succeeds | SKIP reset (transient miss recovered) |
| ok: true | fails with same auto-reset signature | reset with the freshest lostHeadCid |
| ok: true | fails with DIFFERENT error class | re-throw the new error (no reset) |
| ok: false | n/a | reset (existing behaviour) |
| probe itself throws | n/a | reset (cannot prove recoverability) |
| no gateways configured | n/a | reset (pre-fix behaviour; no recovery surface) |

The effectiveLostHeadCid local rebinding ensures the reset reason, marker, and event data carry the freshest unreachable CID when the probe-retry surfaces a different head.
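The outcome matrix can also be read as a small pure function. The sketch below is illustrative only: `decideAction` and the `ProbeOutcome`, `RetryOutcome`, and `Action` types are hypothetical names invented for this example and do not appear in flush-scheduler.ts. The CID carried by a `reset` action plays the role of effectiveLostHeadCid.

```typescript
// Hypothetical types; only the six outcomes themselves come from the PR.
type ProbeOutcome =
  | { kind: "ok" }
  | { kind: "unreachable" }   // probe returned ok: false
  | { kind: "threw" }         // probe itself threw
  | { kind: "no-gateways" };  // probe skipped entirely

type RetryOutcome =
  | { kind: "succeeded" }
  | { kind: "same-signature"; lostHeadCid: string }
  | { kind: "different-error"; error: Error };

type Action =
  | { kind: "skip-reset" }
  | { kind: "reset"; lostHeadCid: string }
  | { kind: "rethrow"; error: Error };

function decideAction(
  originalLostHeadCid: string,
  probe: ProbeOutcome,
  retry?: RetryOutcome,
): Action {
  // Any outcome other than a successful probe falls through to the reset,
  // reported against the CID from the original error.
  if (probe.kind !== "ok") {
    return { kind: "reset", lostHeadCid: originalLostHeadCid };
  }
  if (retry === undefined) {
    throw new Error("a retry outcome is required when the probe succeeded");
  }
  switch (retry.kind) {
    case "succeeded":
      return { kind: "skip-reset" };
    case "same-signature":
      // Report the freshest unreachable CID, which may differ from the original.
      return { kind: "reset", lostHeadCid: retry.lostHeadCid };
    case "different-error":
      return { kind: "rethrow", error: retry.error };
  }
}
```

Keeping the decision separate from the I/O makes each row of the matrix checkable in isolation, which is roughly what the seven regression tests below exercise against the real scheduler.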

Cost analysis

  • Worst case: +30 s before destructive teardown on a genuine corruption.
  • Best case (transient blip): ~200 ms — single successful gateway HEAD, immediate addBundle retry, no reset, zero data loss.
  • Trade-off: up to 30 s of added latency on the genuine-corruption path, in exchange for avoiding permanent loss of operational state on the false-positive path.

Test plan

  • 7 new C3 regression tests in tests/unit/profile/flush-scheduler-c3-oplog-reset-probe.test.ts:
    • Transient blip → no reset
    • Permanent loss → reset
    • Probe ok but retry fails same signature → reset with freshest CID
    • Probe ok but retry fails DIFFERENT signature → re-throw, no reset
    • No gateways configured → preserve pre-fix behaviour
    • Probe itself throws → fall through to reset
    • Unrelated errors → no probe, no reset
  • All 7 existing flush-scheduler-oplog-reset.test.ts tests pass unchanged
  • tsc --noEmit clean

Audit traceability

Stack note

This is the third PR in the audit #333 stack. Stack order: #346 (C1), then #347 (C2), then this PR (C3).

…oplog reset

FlushScheduler.addBundleWithOplogAutoReset transitioned straight from
"extractLostHeadCid matched the error" to resetCorruptedLog() (which
calls db.drop()). The matcher cannot distinguish a permanently-corrupt
head (Helia GC ran on a memory-blockstore wallet) from a transiently-
unreachable one (gateway blip, peer offline, propagation lag). A
momentary fetch failure thus wiped all OUTBOX/SENT/disposition/
finalization entries not yet captured in a pinned bundle — permanent
data loss on a recoverable error.

Fix:

  - Probe configured IPFS gateways for the lost head CID with
    exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline,
    5 s per-attempt HEAD timeout — matches the Issue #239 shutdown gate
    convention for testnet propagation) BEFORE the destructive reset.

  - On a successful probe, retry addBundle ONCE. If the retry
    succeeds the reset is SKIPPED — no operational state is lost. This
    is the primary recovery path for the gateway-blip / propagation-
    lag class of failure.

  - On probe failure (no gateway served the CID within the deadline)
    OR a post-probe retry that still hits the same auto-reset signature
    (local Helia is the bottleneck, not the network), fall through to
    the existing reset path. The freshest lostHeadCid is propagated
    into the resetCorruptedLog reason payload and event data.

  - On a post-probe retry that hits a DIFFERENT error class (e.g.,
    POINTER_MONOTONICITY_VIOLATION), re-throw the new error — resetting
    the log would not help.

  - With no gateways configured, the probe is skipped entirely and the
    pre-fix behaviour is preserved (no recovery surface, destructive
    reset is the only forward path).

  - When the probe itself throws (e.g., gateway URL validation failed
    before any HEAD ran), fall through to reset — we cannot prove the
    head is recoverable, so the safe-on-error default is the existing
    behaviour.

Worst-case cost: +30 s before destructive teardown on a genuine
corruption. Small price relative to the data-loss exposure on the
false-positive path.

Tests: 7 new C3 regression tests covering each path (transient blip,
permanent loss, probe-ok-but-retry-fails, retry-different-error,
no-gateways, probe-throws, unrelated-errors). Existing 7-test
flush-scheduler-oplog-reset.test.ts suite unchanged and still green.
443 unit test files (8134 tests) pass. tsc clean.

Refs: #333 (C3). Stacked on top of #347 (C2) which is stacked on #346 (C1).
@vrogojin merged commit 720224c into main on May 30, 2026
3 checks passed
@vrogojin deleted the fix/issue-333-c3-orbitdb-auto-drop-guard branch on May 30, 2026 at 15:52