fix(profile/flush-scheduler): probe before destructive oplog reset (Audit #333 C3) #361
Merged
…oplog reset
FlushScheduler.addBundleWithOplogAutoReset transitioned straight from
"extractLostHeadCid matched the error" to resetCorruptedLog() (which
calls db.drop()). The matcher cannot distinguish a permanently-corrupt
head (Helia GC ran on a memory-blockstore wallet) from a transiently-
unreachable one (gateway blip, peer offline, propagation lag). A
momentary fetch failure thus wiped all OUTBOX/SENT/disposition/
finalization entries not yet captured in a pinned bundle — permanent
data loss on a recoverable error.
Fix:
- Probe configured IPFS gateways for the lost head CID with
exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline,
5 s per-attempt HEAD timeout — matches the Issue #239 shutdown gate
convention for testnet propagation) BEFORE the destructive reset.
- On a successful probe, retry addBundle ONCE. If the retry
succeeds the reset is SKIPPED — no operational state is lost. This
is the primary recovery path for the gateway-blip / propagation-
lag class of failure.
- On probe failure (no gateway served the CID within the deadline)
OR a post-probe retry that still hits the same auto-reset signature
(local Helia is the bottleneck, not the network), fall through to
the existing reset path. The freshest lostHeadCid is propagated
into the resetCorruptedLog reason payload and event data.
- On a post-probe retry that hits a DIFFERENT error class (e.g.,
POINTER_MONOTONICITY_VIOLATION), re-throw the new error — resetting
the log would not help.
- With no gateways configured, the probe is skipped entirely and the
pre-fix behaviour is preserved (no recovery surface, destructive
reset is the only forward path).
- When the probe itself throws (e.g., gateway URL validation failed
before any HEAD ran), fall through to reset — we cannot prove the
head is recoverable, so the safe-on-error default is the existing
behaviour.
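The decision flow in the list above can be sketched roughly as follows. This is an illustrative sketch, not the code from flush-scheduler.ts: the `Deps` shape, the `tryAdd` helper, the `Outcome` labels, and `gatewayCount` are hypothetical, and what happens after the reset (for example whether addBundle is attempted again) is not modelled. Only `extractLostHeadCid`, `resetCorruptedLog`, `addBundle`, and the `effectiveLostHeadCid` rebinding are names taken from the PR text.

```typescript
// Hypothetical dependency bundle so the flow can be exercised in isolation.
interface Deps {
  addBundle: () => Promise<void>;
  // Returns the lost head CID when the error matches the auto-reset signature, else null.
  extractLostHeadCid: (err: unknown) => string | null;
  // Stands in for the gateway probe; may throw (e.g. gateway URL validation).
  probe: (cid: string) => Promise<{ ok: boolean }>;
  resetCorruptedLog: (lostHeadCid: string) => Promise<void>;
  gatewayCount: number;
}

type Outcome = "added" | "recovered-without-reset" | "reset";

// Runs addBundle and reports a failure as a value instead of throwing.
async function tryAdd(
  addBundle: () => Promise<void>,
): Promise<{ failed: boolean; error?: unknown }> {
  try {
    await addBundle();
    return { failed: false };
  } catch (error) {
    return { failed: true, error };
  }
}

async function addBundleWithProbe(deps: Deps): Promise<Outcome> {
  const first = await tryAdd(deps.addBundle);
  if (!first.failed) return "added";

  const lostHeadCid = deps.extractLostHeadCid(first.error);
  if (lostHeadCid === null) throw first.error; // unrelated error: propagate untouched

  let effectiveLostHeadCid: string = lostHeadCid;

  if (deps.gatewayCount > 0) {
    let probeOk = false;
    try {
      probeOk = (await deps.probe(lostHeadCid)).ok;
    } catch {
      probeOk = false; // probe threw: recoverability unproven, fall through to reset
    }

    if (probeOk) {
      const retry = await tryAdd(deps.addBundle); // retried exactly once
      if (!retry.failed) return "recovered-without-reset";

      const retryCid = deps.extractLostHeadCid(retry.error);
      if (retryCid === null) throw retry.error; // different error class: reset would not help
      effectiveLostHeadCid = retryCid; // carry the freshest unreachable head into the reset
    }
  }

  await deps.resetCorruptedLog(effectiveLostHeadCid);
  return "reset";
}
```

Returning a labelled outcome rather than void is a choice made here for testability; the real method may signal these paths differently (for example through events).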
Worst-case cost: +30 s before destructive teardown on a genuine
corruption. Small price relative to the data-loss exposure on the
false-positive path.
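To see why the extra wait is capped at the probe deadline, the following sketch enumerates worst-case attempt windows under exponential backoff. The 30 s deadline and 5 s per-attempt timeout come from the text above; the base delay, the doubling factor, the clamping of the final attempt, and the function name `probeAttemptWindows` are assumptions for illustration and may not match verifyCidAccessibleWithRetry.

```typescript
interface AttemptWindow {
  startMs: number;   // offset from the start of the probe
  timeoutMs: number; // time this attempt is allowed to run
}

// Hypothetical helper: lists worst-case attempt windows, assuming every
// attempt runs to its full timeout and the backoff delay doubles each round.
function probeAttemptWindows(
  deadlineMs: number,
  perAttemptTimeoutMs: number,
  baseDelayMs: number, // assumed value; not stated in the PR
): AttemptWindow[] {
  const windows: AttemptWindow[] = [];
  let startMs = 0;
  let delayMs = baseDelayMs;
  while (startMs < deadlineMs) {
    // Assumption: the last attempt is clamped so it cannot outlive the deadline.
    const timeoutMs = Math.min(perAttemptTimeoutMs, deadlineMs - startMs);
    windows.push({ startMs, timeoutMs });
    startMs += timeoutMs + delayMs;
    delayMs *= 2;
  }
  return windows;
}

// Example with the PR's deadline and timeout and an assumed 500 ms base delay.
const windows = probeAttemptWindows(30_000, 5_000, 500);
const worstCaseMs = Math.max(...windows.map((w) => w.startMs + w.timeoutMs));
console.log(windows.length, worstCaseMs);
```

Under these assumptions no attempt window ends after the deadline, so a genuinely corrupt head delays the reset by at most the 30 s budget.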
Tests: 7 new C3 regression tests covering each path (transient blip,
permanent loss, probe-ok-but-retry-fails, retry-different-error,
no-gateways, probe-throws, unrelated-errors). Existing 7-test
flush-scheduler-oplog-reset.test.ts suite unchanged and still green.
443 unit test files (8134 tests) pass. tsc clean.
Refs: #333 (C3). Stacked on top of #347 (C2) which is stacked on #346 (C1).
This was referenced May 30, 2026
Closes the C3 critical finding of audit #333: auto-db.drop() on a transient block-load error. Before this PR, FlushScheduler.addBundleWithOplogAutoReset (profile/profile-token-storage/flush-scheduler.ts:1722) transitioned straight from "extractLostHeadCid matched the error" to resetCorruptedLog() (which calls db.drop()). The matcher cannot distinguish a permanently-corrupt head (Helia GC ran) from a transiently-unreachable one (gateway blip / propagation lag), so a momentary fetch failure wiped all OUTBOX/SENT/disposition/finalization entries not yet captured in a pinned bundle.
Fix
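Before the destructive reset, probe configured IPFS gateways for the lost head CID with exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline, 5 s per-attempt HEAD timeout, matching the Issue #239 shutdown-gate convention). The outcome depends on the probe result and the single addBundle retry:
- Probe ok: true, retry succeeds: the reset is skipped and no operational state is lost.
- Probe ok: true, retry hits the same auto-reset signature: fall through to the reset, carrying the freshest lostHeadCid.
- Probe ok: true, retry hits a different error class: re-throw the new error.
- Probe ok: false: fall through to the existing reset path.
The effectiveLostHeadCid local rebinding ensures the reset reason, marker, and event data carry the freshest unreachable CID when the probe-retry surfaces a different head.
Cost analysis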
Test plan
- tests/unit/profile/flush-scheduler-c3-oplog-reset-probe.test.ts
- flush-scheduler-oplog-reset.test.ts tests pass unchanged
- tsc --noEmit clean
Audit traceability
Stack note
This is the third PR in the audit #333 stack. Stack order: