fix(profile/flush-scheduler): probe before destructive oplog reset (Audit #333 C3) by vrogojin · Pull Request #361 · unicity-sphere/sphere-sdk

vrogojin · 2026-05-30T15:44:47Z

Re-filed from closed PR #348 — closed by GitHub when its base branch (fix/issue-333-c2-migration-flush-before-cleanup) was auto-deleted on #347's merge. The commit content + tests are unchanged; only the PR is new.

Closes the C3 critical finding of audit #333 — auto-db.drop() on a transient block-load error. Before this PR, FlushScheduler.addBundleWithOplogAutoReset (profile/profile-token-storage/flush-scheduler.ts:1722) transitioned straight from "extractLostHeadCid matched the error" to resetCorruptedLog() (which calls db.drop()). The matcher cannot distinguish a permanently-corrupt head (Helia GC ran) from a transiently-unreachable one (gateway blip / propagation lag) — a momentary fetch failure wiped all OUTBOX/SENT/disposition/finalization entries not yet captured in a pinned bundle.

Fix

Before the destructive reset, probe configured IPFS gateways for the lost head CID with exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline, 5 s per-attempt HEAD timeout — matching the Issue #239 shutdown-gate convention).
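A minimal sketch of a deadline-bounded probe with exponential backoff, for illustration only. The PR names `verifyCidAccessibleWithRetry` but does not show its signature, so `probeCidWithBackoff`, `HeadProbe`, `ProbeOptions`, and the `baseDelayMs` parameter are all hypothetical. Only the 30 s deadline and the 5 s per-attempt timeout come from the description above.

```typescript
// Hypothetical sketch: every name and signature here is an assumption,
// not the sphere-sdk API.
type ProbeResult = { ok: boolean; attempts: number };

// One HEAD attempt against one gateway; resolves true when the CID is served.
type HeadProbe = (gateway: string, cid: string, timeoutMs: number) => Promise<boolean>;

interface ProbeOptions {
  deadlineMs: number;       // overall budget (30 s in the PR)
  attemptTimeoutMs: number; // per-attempt HEAD timeout (5 s in the PR)
  baseDelayMs: number;      // assumed initial backoff delay
  now?: () => number;       // injectable clock, useful in tests
  sleep?: (ms: number) => Promise<void>;
}

async function probeCidWithBackoff(
  gateways: string[],
  cid: string,
  head: HeadProbe,
  opts: ProbeOptions,
): Promise<ProbeResult> {
  const now = opts.now ?? Date.now;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const start = now();
  let delay = opts.baseDelayMs;
  let attempts = 0;

  while (true) {
    const remaining = opts.deadlineMs - (now() - start);
    if (remaining <= 0) return { ok: false, attempts };

    attempts++;
    const timeout = Math.min(opts.attemptTimeoutMs, remaining);
    // Ask every gateway in parallel; a rejected attempt counts as a miss.
    const results = await Promise.all(
      gateways.map((g) => head(g, cid, timeout).catch(() => false)),
    );
    if (results.some(Boolean)) return { ok: true, attempts };

    const left = opts.deadlineMs - (now() - start);
    if (left <= 0) return { ok: false, attempts };
    await sleep(Math.min(delay, left)); // never sleep past the deadline
    delay *= 2;                         // exponential backoff
  }
}
```

Injecting `head`, `now`, and `sleep` keeps the sketch free of network and timer dependencies; a real caller would presumably pass a HEAD request with an abort timeout.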

| Probe outcome | Post-probe addBundle retry | Action |
| --- | --- | --- |
| `ok: true` | succeeds | SKIP reset (transient miss recovered) |
| `ok: true` | fails with same auto-reset signature | reset with the freshest `lostHeadCid` |
| `ok: true` | fails with DIFFERENT error class | re-throw the new error (no reset) |
| `ok: false` | n/a | reset (existing behaviour) |
| probe itself throws | n/a | reset (cannot prove recoverability) |
| no gateways configured | n/a | reset (pre-fix behaviour; no recovery surface) |

The effectiveLostHeadCid local rebinding ensures the reset reason, marker, and event data carry the freshest unreachable CID when the probe-retry surfaces a different head.
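The decision table and the `effectiveLostHeadCid` rebinding can be sketched roughly as follows. This is not the code from flush-scheduler.ts: `addBundleWithProbe` and the `Deps` shape are hypothetical, and the signatures given to `extractLostHeadCid` and `resetCorruptedLog` are assumptions (the PR only names them). Whatever the real implementation does after the reset is omitted.

```typescript
// Hypothetical sketch of the probe-before-reset flow. Dependency shapes are assumed.
interface Deps {
  addBundle: () => Promise<void>;
  // Returns the lost head CID when the error matches the auto-reset signature, else null.
  extractLostHeadCid: (err: unknown) => string | null;
  // null models "no gateways configured".
  probe: ((cid: string) => Promise<{ ok: boolean }>) | null;
  resetCorruptedLog: (lostHeadCid: string) => Promise<void>;
}

async function addBundleWithProbe(deps: Deps): Promise<"added" | "recovered" | "reset"> {
  try {
    await deps.addBundle();
    return "added";
  } catch (err) {
    const lostHeadCid = deps.extractLostHeadCid(err);
    if (lostHeadCid === null) throw err; // unrelated error: no probe, no reset

    let effectiveLostHeadCid = lostHeadCid;

    if (deps.probe !== null) {
      let reachable = false;
      try {
        reachable = (await deps.probe(lostHeadCid)).ok;
      } catch {
        reachable = false; // probe threw: cannot prove recoverability
      }

      if (reachable) {
        try {
          await deps.addBundle(); // retry exactly once
          return "recovered";     // transient miss: reset skipped
        } catch (retryErr) {
          const retryCid = deps.extractLostHeadCid(retryErr);
          if (retryCid === null) throw retryErr; // different error class: re-throw
          effectiveLostHeadCid = retryCid;       // carry the freshest unreachable CID
        }
      }
    }

    await deps.resetCorruptedLog(effectiveLostHeadCid);
    return "reset";
  }
}
```

Each row of the table corresponds to one exit from this function, which is also how the seven regression tests in the test plan are partitioned.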

Cost analysis

Worst case: +30 s before destructive teardown on a genuine corruption.
Best case (transient blip): ~200 ms — single successful gateway HEAD, immediate addBundle retry, no reset, zero data loss.
Trade: 30 s of latency on the bad path against permanent loss of operational state on the false-positive path.
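To make the worst-case figure concrete, the following sketch lists when each probe attempt would start if every attempt runs to its full timeout before backing off. The 500 ms base delay and the doubling factor are assumptions chosen for illustration; the PR states only the 30 s deadline and the 5 s per-attempt timeout. `worstCaseAttemptStarts` is a hypothetical helper.

```typescript
// Hypothetical helper: start times (ms) of probe attempts when every attempt
// times out. Base delay and doubling are assumed, not taken from the PR.
function worstCaseAttemptStarts(
  deadlineMs: number,
  attemptTimeoutMs: number,
  baseDelayMs: number,
): number[] {
  const starts: number[] = [];
  let t = 0;
  let delay = baseDelayMs;
  while (t < deadlineMs) {
    starts.push(t);
    t += attemptTimeoutMs + delay; // attempt runs to its timeout, then backs off
    delay *= 2;
  }
  return starts;
}

console.log(worstCaseAttemptStarts(30_000, 5_000, 500));
// → [ 0, 5500, 11500, 18500, 27500 ]
```

Under these assumed parameters the 30 s budget covers five attempts, the last of which is cut short by the deadline.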

Test plan

- 7 new C3 regression tests in tests/unit/profile/flush-scheduler-c3-oplog-reset-probe.test.ts:
  - Transient blip → no reset
  - Permanent loss → reset
  - Probe ok but retry fails same signature → reset with freshest CID
  - Probe ok but retry fails DIFFERENT signature → re-throw, no reset
  - No gateways configured → preserve pre-fix behaviour
  - Probe itself throws → fall through to reset
  - Unrelated errors → no probe, no reset
- All 7 existing flush-scheduler-oplog-reset.test.ts tests pass unchanged
- tsc --noEmit clean

Audit traceability

Issue: Audit: integration/all-fixes — money-path subsystems (UXF / Profile / transfer pipeline) #333 (C3)
Audit recommendation followed: "gate the destructive reset behind a retry/backoff that confirms the head is genuinely unrecoverable". The companion "and/or snapshot before drop" is left as a deferred follow-up.

Stack note

This is the third PR in the audit #333 stack. Stack order:

- fix(profile/storage): plug plaintext-seed window in encrypt() (Audit #333 C1) #346 (C1) ✅ merged
- fix(profile/migration): force durable flush before cleanup (Audit #333 C2) #347 (C2) ✅ merged
- this (C3)
- pending:
  - fix(payments/transfer): same-process source lock for conservative-sender (Audit #333 H1) #349 (H1)
  - fix(uxf): bind manifest tokenId to root content (Audit #333 H2) #351 (H2)
  - fix(uxf): surface mergePkg skipped tokens + add strict mode (Audit #333 H3) #352 (H3)
  - fix(payments/transfer): re-derive requestId binding gate (Audit #333 H4) #353 (H4)
  - fix(payments/transfer): unlock source tokens on failed-permanent (Audit #333 H5) #354 (H5)
  - fix(profile): add recompute-content verifier to ManifestCas (Audit #333 H7) #355 (H7)
  - test(payments/transfer): close V6-RECOVER test-coverage gap + nightly soak CI (Audit #333 follow-up) #358 (V6-RECOVER test gap)

…oplog reset

FlushScheduler.addBundleWithOplogAutoReset transitioned straight from "extractLostHeadCid matched the error" to resetCorruptedLog() (which calls db.drop()). The matcher cannot distinguish a permanently-corrupt head (Helia GC ran on a memory-blockstore wallet) from a transiently-unreachable one (gateway blip, peer offline, propagation lag). A momentary fetch failure thus wiped all OUTBOX/SENT/disposition/finalization entries not yet captured in a pinned bundle: permanent data loss on a recoverable error.

Fix:
- Probe configured IPFS gateways for the lost head CID with exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline, 5 s per-attempt HEAD timeout; matches the Issue #239 shutdown gate convention for testnet propagation) BEFORE the destructive reset.
- On a successful probe, retry addBundle ONCE. If the retry succeeds the reset is SKIPPED and no operational state is lost. This is the primary recovery path for the gateway-blip / propagation-lag class of failure.
- On probe failure (no gateway served the CID within the deadline) OR a post-probe retry that still hits the same auto-reset signature (local Helia is the bottleneck, not the network), fall through to the existing reset path. The freshest lostHeadCid is propagated into the resetCorruptedLog reason payload and event data.
- On a post-probe retry that hits a DIFFERENT error class (e.g., POINTER_MONOTONICITY_VIOLATION), re-throw the new error, because resetting the log would not help.
- With no gateways configured, the probe is skipped entirely and the pre-fix behaviour is preserved (no recovery surface, destructive reset is the only forward path).
- When the probe itself throws (e.g., gateway URL validation failed before any HEAD ran), fall through to reset. We cannot prove the head is recoverable, so the safe-on-error default is the existing behaviour.

Worst-case cost: +30 s before destructive teardown on a genuine corruption. Small price relative to the data-loss exposure on the false-positive path.

Tests: 7 new C3 regression tests covering each path (transient blip, permanent loss, probe-ok-but-retry-fails, retry-different-error, no-gateways, probe-throws, unrelated-errors). Existing 7-test flush-scheduler-oplog-reset.test.ts suite unchanged and still green. 443 unit test files (8134 tests) pass. tsc clean.

Refs: #333 (C3). Stacked on top of #347 (C2) which is stacked on #346 (C1).

vrogojin merged commit 720224c into main May 30, 2026
3 checks passed

vrogojin deleted the fix/issue-333-c3-orbitdb-auto-drop-guard branch May 30, 2026 15:52

This was referenced May 30, 2026

fix(uxf,payments): bundleCid determinism — lock envelope timestamp across attempts #362

Merged

Audit: integration/all-fixes — money-path subsystems (UXF / Profile / transfer pipeline) #333

Closed

vrogojin merged 1 commit into main from fix/issue-333-c3-orbitdb-auto-drop-guard
