fix(profile/flush-scheduler): probe before destructive oplog reset (Audit #333 C3) by vrogojin · Pull Request #348 · unicity-sphere/sphere-sdk

vrogojin · 2026-05-30T09:42:53Z

Stacked PR. Built atop #347 (C2), which is stacked on #346 (C1). Use the stacked-diff view for the C3-only delta.

Summary

Closes the C3 critical finding of audit #333 — the auto-db.drop() on a transient block-load error. Before this PR, FlushScheduler.addBundleWithOplogAutoReset (profile/profile-token-storage/flush-scheduler.ts:1722) transitioned straight from "extractLostHeadCid matched the error" to resetCorruptedLog() (which calls db.drop()). The matcher cannot distinguish a permanently-corrupt head (Helia GC ran on a memory-blockstore wallet) from a transiently-unreachable one (gateway blip, peer offline, propagation lag). A momentary fetch failure thus wiped all OUTBOX/SENT/disposition/finalization entries not yet captured in a pinned bundle — permanent data loss on a recoverable error.

Fix

Before the destructive reset, probe configured IPFS gateways for the lost head CID with exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline, 5 s per-attempt HEAD timeout — matching the Issue #239 shutdown-gate convention for testnet propagation):

Probe outcome	Post-probe addBundle retry	Action
`ok: true`	succeeds	SKIP reset (transient miss recovered)
`ok: true`	fails with same auto-reset signature	reset with the freshest `lostHeadCid`
`ok: true`	fails with DIFFERENT error class	re-throw the new error (no reset)
`ok: false`	—	reset (existing behaviour)
probe itself throws	—	reset (cannot prove recoverability)
no gateways configured	—	reset (pre-fix behaviour; no recovery surface)

The effectiveLostHeadCid local rebinding ensures the reset reason, marker, and event data carry the freshest unreachable CID when the probe-retry surfaces a different head.

Cost analysis

Worst case: +30 s before destructive teardown on a genuine corruption (no gateway serves the CID, retry exhausted).
Best case (transient blip): ~200 ms — single successful gateway HEAD, immediate addBundle retry, no reset, zero data loss.
Net trade: trading 30 s of latency on the bad path against permanent loss of operational state on the false-positive path. The false-positive path is what the audit specifically called out.

Test plan

7 new C3 regression tests in tests/unit/profile/flush-scheduler-c3-oplog-reset-probe.test.ts:
- Transient blip — probe succeeds, retry succeeds → no reset, no events
- Permanent loss — probe fails → reset happens with the original lostHeadCid
- Probe ok but retry fails with same signature — reset happens with the freshest lostHeadCid
- Probe ok but retry fails with DIFFERENT error — re-throw the new error, no reset
- No gateways configured — preserves pre-fix behaviour (straight to reset)
- Probe itself throws — fall through to reset (cannot prove recoverability)
- Unrelated errors — no probe, no reset (pass-through to caller)
Existing 7-test flush-scheduler-oplog-reset.test.ts suite unchanged and still green (uses empty ipfsGateways, so the new probe branch is skipped — exact pre-fix path)
All 443 unit test files (8134 tests) pass
tsc --noEmit clean
eslint clean on touched files

Audit traceability

Issue: Audit: integration/all-fixes — money-path subsystems (UXF / Profile / transfer pipeline) #333 (C3)
Files / lines in audit: profile/orbitdb-adapter.ts:1382-1420, profile/profile-token-storage/flush-scheduler.ts:1722-1847
Audit recommendation followed: "gate the destructive reset behind a retry/backoff that confirms the head is genuinely unrecoverable". The companion "and/or snapshot before drop" is left as a deferred follow-up — significantly larger change requiring a redesign of the recovery pipeline, separate from this critical-fix scope.

Release-gate status

With #346, #347, and this PR landed, the audit's release-gate critical tier (C1+C2+C3) is closed. High-tier items H1–H8 follow as a continued stack.

Next in the stack

H1 (conservative-sender source lock) follows, branched off this PR.

…nservative-sender conservative-sender.ts had zero locking primitives. instant-sender.ts declared a module-local sourceLocks map (Wave 5 #171), so conservative sends and instant-vs-conservative cross-pairs did NOT serialize on shared source tokens: two concurrent sends could both pass selection, both commit on-chain, and only the aggregator caught the duplicate-spend after a source was burned. Fix has three parts: - Extracted the per-source lock registry from instant-sender.ts to a new `modules/payments/transfer/source-locks.ts`. The Wave 4-7 semantics are preserved verbatim (lex-sorted acquisition for deadlock prevention, 60 s liveness floor with force-release, Set dedup, fail-closed __resetSourceLocksForTesting guard). The shared module exports `acquireSourceLocks` and `__resetSourceLocksForTesting`. - instant-sender.ts re-exports `__resetSourceLocksForTesting` so the existing tests/unit/payments/transfer/instant-sender.test.ts import path continues to work. - conservative-sender.ts now acquires per-source locks after source selection completes (Step 2.5, mirroring the instant pattern) and releases them in the outer `finally`. Adds a TEST-ONLY __sourceLockMaxHoldMs override on ConservativeSenderDeps, symmetric with InstantSenderDeps. Cross-pair invariant: because both senders now share the SAME process-global map, an in-flight instant send blocks a concurrent conservative send sharing a source token, and vice versa. Tests: - 9 new direct-module tests on source-locks.ts covering acquire serialization, deadlock prevention via lex-sort, dedup, liveness floor (force-release), back-compat re-export identity, fail-closed guard preserved. - 4 new integration tests on conservative-sender proving the pipeline acquires + releases the lock: serialization on shared source, release on success, release on failure (try/finally), disjoint sources proceed concurrently. - All 36 existing instant-sender tests pass unchanged (the extracted module is byte-equivalent to the original Wave 7 code modulo the new callerLabel parameter, which defaults to 'sender' when omitted). 8147 unit tests pass (2 pre-existing skipped). tsc clean. Refs: #333 (H1). Stacked on #348 (C3) → #347 (C2) → #346 (C1).

…s content.tokenId The manifest `tokenId → rootHash` map was only regex-shape-checked at the import boundary (uxf/json.ts:430, uxf/ipld.ts:555) and never asserted equal to the genesis tokenId encoded in the referenced token-root element. Pool elements still self-verified (content-hash recomputed), so a hostile sender could craft `{ "tokenId=A": <hash-of-root-minting- tokenId-B> }` and pass every existing gate. Downstream consumers that trust the manifest key (dedup, ownership filtering, balance computation) would mis-identify the token — an identity-confusion primitive that the audit specifically called out. Fix layered defense-in-depth: - uxf/verify.ts — primary structural-integrity gate. When iterating manifest entries during the DAG walk, assert that the root element exists with type ELEMENT_TYPE_TOKEN_ROOT AND that element.content.tokenId === manifestKey. Mismatches surface as MANIFEST_TOKENID_MISMATCH errors; wrong element types surface as MANIFEST_TYPE_MISMATCH. modules/payments/transfer/bundle-verifier already calls pkg.verify() so the receive pipeline (the audit's primary concern) is covered. - uxf/json.ts packageFromJson — fail-fast at the JSON import boundary with VERIFICATION_FAILED. Belt-and-braces for consumers that bypass verify() (e.g., direct UxfPackage.fromJson() use without a downstream bundle-verifier round). - uxf/ipld.ts importFromCar — same fail-fast at the CAR import boundary. Mirrors the json-path check for the CAR-bound delivery flow. The export side already refuses to write packages with mismatched manifest keys (the verify() result feeds production callers' export paths), so this fix focuses on hostile-receive scenarios. Tests: 8 new H2 regression tests covering verify() positive + MANIFEST_TOKENID_MISMATCH + MANIFEST_TYPE_MISMATCH, the json import boundary positive + spoofed-key + wrong-type rejections, and the CAR import boundary spoofed-key + positive flow. All 443 existing uxf tests pass unchanged. 8155 total unit tests pass. Refs: #333 (H2). Stacked on #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…mode Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any tokenId whose per-token resolver threw. A single malformed sibling element inside a Rule 4 synthetic-rebuild made a legitimately-received token vanish from the merged manifest with only a `logger.warn`. On the receive path this manifested as token loss from view: the caller had no machine-readable signal that anything had gone wrong. Fix: - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each `MergeSkip` records the tokenId, the original error, the target's prior root (if any), and the source's incoming root that failed to incorporate. Callers can react: replay, retry, surface to UI, or quarantine. - `opts.strict: true` aggregates skipped tokens into a `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply phase — target state is unchanged on the throw. Use on receive- path JOINs where any silent drop is unacceptable; leave default-off for opportunistic peer JOINs where partial coverage is fine. - `opts.onSkip` fires once per skipped tokenId for callers that want telemetry visibility without strict-mode failure. Throws inside onSkip are caught and logged — observability must not change merge semantics. - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown error also carries a non-typed `skipped` field with the structured skip list so callers needing machine-readable details can read it via `(err as { skipped }).skipped`. - Back-compat at the internal `mergePkg` boundary: legacy positional `(target, source, verifiedProofs)` callers keep working via a Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule, profile-token-storage-provider, profile/consolidation, flush-scheduler) ignore the returned `{ skipped }` and continue to use merge as void — behavior unchanged for them. Tests: 8 new H3 regression tests covering empty-skipped baseline, captured-skip on resolver throw, MergeSkip metadata fields, strict-mode hard fail, strict-mode atomicity invariant (target unchanged on throw), strict-mode happy path, onSkip callback fires once per skip, and onSkip throw does not change merge semantics. All 451 existing uxf tests pass unchanged. 8163 total unit tests pass. Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…e in disposition-engine modules/payments/transfer/disposition-engine.ts:629 called `verifyProof (proof, trustBase, requestId)` with the bundle-supplied requestId and trusted that the un-audited hydrateChain adapter had derived it canonically. If the adapter erred or a malicious sender hand-crafted the bundle with a proof anchored to a DIFFERENT transaction's requestId, the proof would still verify (it IS a genuine on-chain proof) but it would be incorrectly attributed to this transaction — a proof-binding forgery. Fix: - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult` interface on disposition-engine.ts. The hook canonically re-derives RequestId from (authenticator.publicKey, sourceState) and compares to the bundle's requestId; production wiring calls RequestId.create and compares toJSON. - Optional `assertRequestIdBinding?` dep added to DispositionEngineInput. When provided, the engine calls it BEFORE every verifyProof invocation: * `{ ok: true }` → proof verification proceeds. * `{ ok: false }` → cryptoInvalid('proof-invalid') — same §5.3 [C](3) class as a clean PATH_INVALID at the verifier, because the (proof, requestId) pair does not bind to this tx. * throw → structuralInvalid('proof-throw'). The binding gate fires BEFORE verifyProof so a forged binding short-circuits the (potentially expensive) on-chain proof check. - Optional shape preserves back-compat with the 66 existing engine tests (which do not set the hook) and with all production callers that have not yet been updated. - Plumbed through legacy-shape-adapter.ts: new optional `assertRequestIdBinding?` field on LegacyShapeAdapterInput is forwarded to DispositionEngineInput in buildEngineInput. Existing callers that don't supply the hook degrade to the pre-fix behaviour; new callers pass the production-grade RequestId.create comparer. Tests: 7 new H4 regression tests covering: - Back-compat (hook absent → proof verification proceeds) - Hook returns ok=true (binding asserted, proof verified) - Hook receives both bundle requestId and authenticator unchanged - Hook returns ok=false (cryptoInvalid('proof-invalid'), verifyProof NOT invoked) - Hook throws (structuralInvalid('proof-throw')) - Binding gate fires BEFORE verifyProof (ordering invariant) - Null-proof chain entries do NOT trigger the binding gate All 66 existing disposition-engine tests pass unchanged. All 1491 transfer tests pass. 8170 total unit tests pass. tsc clean. Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…permanent modules/payments/transfer/finalization-worker-sender.ts transitioned the outbox entry to `failed-permanent` on any hard-fail and never touched the source tokens that the instant-sender had marked `transferring`/`pending` at submit time. With orphan auto-recovery default-OFF, the spender-side balance was permanently locked as unspendable. Fix: - Added optional `sourceTokenIds?: ReadonlyArray<string>` to `UxfTransferOutboxEntry`. The instant-sender now records the selected source set at outbox-record construction time via `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics on `undefined` for back-compat with entries written before H5. - Added optional `recoverFailedPermanentSources?(sources, outboxId)` hook to `FinalizationWorkerSenderOptions`. When wired, the worker invokes it once at the `failed-permanent` transition with the entry's recorded source tokenIds. Production wiring flips each source back to a spendable state (typically `confirmed`) gated on a same-source-no-other-live-outbox check; the audit's gap was that the worker provided no signal at all. - The hook is best-effort: throws are caught and emitted as a `transfer:failed` event for triage, but the `failed-permanent` transition itself stands. This preserves the terminal-state contract — a recovery hook failure must not re-introduce the locked-pending bug it exists to fix. - Optional shape preserves back-compat with the 90+ existing finalization-worker-sender tests (which do not set the hook). Tests: 5 new H5 regression tests covering: - Hook fires with the recorded source ids on failed-permanent - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped) - Hook absent: worker still transitions cleanly - Hook throws: failed-permanent stands, transfer:failed emitted - Hook fires AFTER the outbox transition (ordering invariant) The "hook does NOT fire on success/in-progress" guarantee is structurally enforced by code placement (inside the `if (totalFailure > 0)` branch); a behavioural test would require duplicating the full §6.1 fake-adapter wiring for a property the code-review already shows. All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean. Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…nifestCas ManifestCas.update's precondition check compared `observed.rootHash` (a string read from the entry) to `prev.contentHash` (the caller's expected value). Both sides are arbitrary writer-supplied labels — a CAS pass meant "the labels agree", not "the content under the label is the content the caller meant". An attacker writing an entry with a forged `rootHash` field that doesn't match the actual pool content slipped through unchanged. Fix: - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas constructor's `opts.verifyEntryRoot`. When wired, the hook is invoked AFTER the label CAS passes and BEFORE the write. The hook fetches the content the entry's `rootHash` references (production: the UXF pool element), recomputes the hash, and asserts it matches the declaration. - New `'integrity-failed'` discriminator on `ManifestCasResult.reason` with structured `integrityDetail` for triage. The result also carries `observed.contentHash` so callers can re-read state. - `ManifestCidRewriteCasError.casReason` widened to surface `'integrity-failed'`. The worker's outer retry loop already treats any throw from manifest-cid-rewrite as a hard CAS failure that bubbles up; the new reason rides through unchanged but preserves structure for operator triage. - Skipped on the `prev === null` (asserts-no-entry) path because there is no observed entry to verify. Skipped on label-CAS-mismatch because the swap has already been rejected. - Optional shape preserves back-compat: all 9 existing manifest-cas tests and the 100% of production callers that do not (yet) supply the verifier continue to behave identically. Production wiring is a follow-up that does not block this PR. Tests: 6 new H7 regression tests covering: - Back-compat (verifier omitted → label-only CAS as before) - Happy path (verifier called AFTER label CAS, BEFORE write) - Integrity failure (returns 'integrity-failed', no write happens, storage unchanged, integrityDetail propagated) - Skip on prev=null (no observed entry to verify) - Skip on label-CAS-mismatch (short-circuit before verifier) - 'integrity-failed' has its own discriminator distinct from cas-mismatch / not-found / concurrent-modification All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean. Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…ER test-coverage gap The soak (manual-test-full-recovery.sh) routinely surfaces `[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient- address mismatch (HD-index recovery exhausted) (no retry): VerificationError: Recipient address mismatch` at §C → §D. The existing unit tests cover helpers in isolation (PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts #269), but neither proves the composition: "given a sibling HD index in the recipient's inventory, tryRecover finds it AND the SDK's verifyRecipient agrees on the derived address". This left CI blind to the regression mode where tryRecover returns a false-positive candidate that Token.update later rejects. This commit closes the gap at three layers: L1. Real-SDK integration test (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts): - Replaces the file-level vi.mock for SigningService / UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with REAL SDK imports. - Builds two real HD-derived signing services. - Constructs the sender-targeted DIRECT address via real UnmaskedPredicate.create → reference.toAddress (the EXACT path the SDK's verifyRecipient uses). - Drives `tryRecoverSigningServiceForRecipient` against real TokenId / TokenType / Uint8Array salt. - 4 tests: happy path, no-match negative, deriveAddressInfo throws, and (L2 composition) tryRecover + predicate construction round-trip equals the transferTx target address. L3. CI workflow (.github/workflows/soak-nightly.yml): - Nightly cron + workflow_dispatch trigger. - Probes testnet aggregator + Nostr relay; skip-not-fail when either is unreachable (third-party outages cannot block releases). - Checks out sphere-cli alongside sphere-sdk, links the local build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*. - Captures soak.log + workspace as artifacts (30-day retention on failure, 7-day on success). - Surfaces metrics in the workflow summary: V6-RECOVER count, Stranded receive count, POINTER_MONOTONICITY_VIOLATION count, bcast_pub>0 count, exit code. What's tested by what (audit roll-up): - SDK address-derivation regression in tryRecover → L1 (real SDK). - tryRecover ↔ predicate-construction composition → L1 composition test (proves verifyRecipient cannot throw against the recovered state). - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak. - Helper-level edge cases (mock-friendly) → existing PaymentsModule.recipient-address-mismatch-recovery.test.ts. - Error-classification on hard fail → existing #269 test in PaymentsModule.proof-polling-persistence.test.ts. Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean. Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack). Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

profile/migration.ts walked persist → sanity → cleanup without forcing the TXF bundle to land durably: - stepPersistToOrbitDb (:585) accepted save()'s cid: 'debounced' return as success without driving the flush. - stepSanityCheck (:643) read back via load() which returns in-memory pendingData (source: 'cache') — passing even if nothing was pinned. - stepCleanup (:779) then deleted legacy KV keys and unpinned the legacy CID. A crash (or later flush failure) between persist and the debounced flush landing lost both the legacy KV state AND the now-unpinned legacy CID. Token-DB loss was only partially mitigated by the #330 legacy-token-DB fallback if wired; legacy KV and the unpinned CID were not. Fix: - stepPersistToOrbitDb now calls profileTokenStorage.awaitNextFlush(0) immediately after save(). awaitNextFlush drives flushScheduler.forceFlushSerialized() and throws on TIMEOUT or POINTER_MONOTONICITY_VIOLATION — converting a silent crash-window data loss into a recoverable MIGRATION_FAILED. timeoutMs=0 disables the wall-clock deadline (the 4-iteration cap still applies) so large migrations are not artificially capped. - A token-storage provider without awaitNextFlush() is refused outright with MIGRATION_FAILED. The pre-fix silent-loss path was the only alternative and it is unsafe. - stepSanityCheck rejects loadResult.source === 'cache' with a hard error. After awaitNextFlush, pendingData is null on the real provider and load() walks active bundles in OrbitDB returning source 'remote'. A 'cache' result here means the durability gate was bypassed — belt-and-braces over the awaitNextFlush guarantee. Existing migration test mocks updated to simulate the real flush contract (save → pendingData/source:'cache', awaitNextFlush → flushed/ source:'remote'). 9 new C2 regression tests cover: - awaitNextFlush is called exactly once after save() (happy path) - timeoutMs=0 is passed (no artificial deadline) - TIMEOUT, POINTER_MONOTONICITY_VIOLATION, and generic flush errors all become MIGRATION_FAILED before cleanup runs - Providers omitting awaitNextFlush are refused - Null-txfData path skips the flush entirely - sanity check rejects post-persist source='cache' reads - End-to-end failure-path invariant: any flush/sanity error leaves legacy KV intact and the legacy CID reclaimable All 442 unit test files (8127 tests) pass. tsc clean. eslint clean on touched files. Refs: #333 (C2). Stacked on top of #346 (C1).

…oplog reset FlushScheduler.addBundleWithOplogAutoReset transitioned straight from "extractLostHeadCid matched the error" to resetCorruptedLog() (which calls db.drop()). The matcher cannot distinguish a permanently-corrupt head (Helia GC ran on a memory-blockstore wallet) from a transiently- unreachable one (gateway blip, peer offline, propagation lag). A momentary fetch failure thus wiped all OUTBOX/SENT/disposition/ finalization entries not yet captured in a pinned bundle — permanent data loss on a recoverable error. Fix: - Probe configured IPFS gateways for the lost head CID with exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline, 5 s per-attempt HEAD timeout — matches the Issue #239 shutdown gate convention for testnet propagation) BEFORE the destructive reset. - On a successful probe, retry addBundle ONCE. If the retry succeeds the reset is SKIPPED — no operational state is lost. This is the primary recovery path for the gateway-blip / propagation- lag class of failure. - On probe failure (no gateway served the CID within the deadline) OR a post-probe retry that still hits the same auto-reset signature (local Helia is the bottleneck, not the network), fall through to the existing reset path. The freshest lostHeadCid is propagated into the resetCorruptedLog reason payload and event data. - On a post-probe retry that hits a DIFFERENT error class (e.g., POINTER_MONOTONICITY_VIOLATION), re-throw the new error — resetting the log would not help. - With no gateways configured, the probe is skipped entirely and the pre-fix behaviour is preserved (no recovery surface, destructive reset is the only forward path). - When the probe itself throws (e.g., gateway URL validation failed before any HEAD ran), fall through to reset — we cannot prove the head is recoverable, so the safe-on-error default is the existing behaviour. Worst-case cost: +30 s before destructive teardown on a genuine corruption. Small price relative to the data-loss exposure on the false-positive path. Tests: 7 new C3 regression tests covering each path (transient blip, permanent loss, probe-ok-but-retry-fails, retry-different-error, no-gateways, probe-throws, unrelated-errors). Existing 7-test flush-scheduler-oplog-reset.test.ts suite unchanged and still green. 443 unit test files (8134 tests) pass. tsc clean. Refs: #333 (C3). Stacked on top of #347 (C2) which is stacked on #346 (C1).