fix(uxf): surface mergePkg skipped tokens + add strict mode (Audit #333 H3) by vrogojin · Pull Request #352 · unicity-sphere/sphere-sdk

vrogojin · 2026-05-30T12:20:06Z

Stacked PR. Built atop #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

Summary

Closes the H3 high finding of audit #333 — UxfPackage.mergePkg silently dropped any tokenId whose per-token resolver threw. A single malformed sibling element inside a Rule 4 synthetic rebuild made a legitimately-received token vanish from the merged manifest with only a logger.warn. On the receive path this was token loss from view with no machine-readable signal.

Fix

The pre-fix design deliberately did per-token-atomic merging ("one poisoned entry must not deny the user their good tokens"). H3 preserves that property but adds observability:

UxfPackage.merge() now returns { skipped: MergeSkip[] }. Each MergeSkip records { tokenId, error, targetExisting, sourceIncoming } so callers can replay, retry, surface to UI, or quarantine.
opts.strict: true aggregates skipped tokens into a UxfError('MERGE_PARTIAL_FAILURE') thrown before the atomic apply phase — target state is unchanged on the throw. Recipient receive-path JOINs should use strict; opportunistic peer JOINs leave it off.
opts.onSkip fires once per skipped tokenId for callers that want telemetry without strict-mode failure. Throws inside onSkip are caught and logged — observability must not change merge semantics.
New MERGE_PARTIAL_FAILURE error code on UxfErrorCode. The thrown error also carries a non-typed skipped field with the structured skip list.
Back-compat at the internal mergePkg boundary: legacy positional (target, source, verifiedProofs) callers keep working via a Set-vs-opts shape probe. The 4 public callers in tree (PaymentsModule, profile-token-storage-provider, profile/consolidation, flush-scheduler) ignore the returned { skipped } and behave exactly as before.

Test plan

8 new H3 regression tests in tests/unit/uxf/h3-mergepkg-skipped-tokens.test.ts:
- Empty skipped baseline (no resolver fires)
- Per-token throw captured in skipped (was silently dropped pre-fix)
- MergeSkip metadata fields populated correctly (targetExisting, sourceIncoming)
- Strict mode throws MERGE_PARTIAL_FAILURE with structured skipped list
- Strict-mode atomicity: target unchanged on throw (TOKEN_C not acquired when TOKEN_A's resolver fails)
- Strict mode happy path (no resolver fires → no throw, empty skipped)
- onSkip callback fires once per skip with correct payload
- onSkip throws do not change merge semantics (merge still completes, skipped reported)
All 451 existing tests/unit/uxf/ tests pass unchanged
All 8163 unit tests pass (2 pre-existing skipped)
tsc --noEmit clean

Audit traceability

Issue: Audit: integration/all-fixes — money-path subsystems (UXF / Profile / transfer pipeline) #333 (H3)
File / line in audit: uxf/UxfPackage.ts:~800
Audit recommendation followed: "Should be a hard failure or an explicit per-token error result." Implemented BOTH — strict: true for hard failure mode, and the returned skipped array as the default explicit-result mode.

Next in the stack

H4 (disposition engine must re-derive requestId) follows, branched off this PR.

…e in disposition-engine modules/payments/transfer/disposition-engine.ts:629 called `verifyProof (proof, trustBase, requestId)` with the bundle-supplied requestId and trusted that the un-audited hydrateChain adapter had derived it canonically. If the adapter erred or a malicious sender hand-crafted the bundle with a proof anchored to a DIFFERENT transaction's requestId, the proof would still verify (it IS a genuine on-chain proof) but it would be incorrectly attributed to this transaction — a proof-binding forgery. Fix: - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult` interface on disposition-engine.ts. The hook canonically re-derives RequestId from (authenticator.publicKey, sourceState) and compares to the bundle's requestId; production wiring calls RequestId.create and compares toJSON. - Optional `assertRequestIdBinding?` dep added to DispositionEngineInput. When provided, the engine calls it BEFORE every verifyProof invocation: * `{ ok: true }` → proof verification proceeds. * `{ ok: false }` → cryptoInvalid('proof-invalid') — same §5.3 [C](3) class as a clean PATH_INVALID at the verifier, because the (proof, requestId) pair does not bind to this tx. * throw → structuralInvalid('proof-throw'). The binding gate fires BEFORE verifyProof so a forged binding short-circuits the (potentially expensive) on-chain proof check. - Optional shape preserves back-compat with the 66 existing engine tests (which do not set the hook) and with all production callers that have not yet been updated. - Plumbed through legacy-shape-adapter.ts: new optional `assertRequestIdBinding?` field on LegacyShapeAdapterInput is forwarded to DispositionEngineInput in buildEngineInput. Existing callers that don't supply the hook degrade to the pre-fix behaviour; new callers pass the production-grade RequestId.create comparer. Tests: 7 new H4 regression tests covering: - Back-compat (hook absent → proof verification proceeds) - Hook returns ok=true (binding asserted, proof verified) - Hook receives both bundle requestId and authenticator unchanged - Hook returns ok=false (cryptoInvalid('proof-invalid'), verifyProof NOT invoked) - Hook throws (structuralInvalid('proof-throw')) - Binding gate fires BEFORE verifyProof (ordering invariant) - Null-proof chain entries do NOT trigger the binding gate All 66 existing disposition-engine tests pass unchanged. All 1491 transfer tests pass. 8170 total unit tests pass. tsc clean. Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…permanent modules/payments/transfer/finalization-worker-sender.ts transitioned the outbox entry to `failed-permanent` on any hard-fail and never touched the source tokens that the instant-sender had marked `transferring`/`pending` at submit time. With orphan auto-recovery default-OFF, the spender-side balance was permanently locked as unspendable. Fix: - Added optional `sourceTokenIds?: ReadonlyArray<string>` to `UxfTransferOutboxEntry`. The instant-sender now records the selected source set at outbox-record construction time via `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics on `undefined` for back-compat with entries written before H5. - Added optional `recoverFailedPermanentSources?(sources, outboxId)` hook to `FinalizationWorkerSenderOptions`. When wired, the worker invokes it once at the `failed-permanent` transition with the entry's recorded source tokenIds. Production wiring flips each source back to a spendable state (typically `confirmed`) gated on a same-source-no-other-live-outbox check; the audit's gap was that the worker provided no signal at all. - The hook is best-effort: throws are caught and emitted as a `transfer:failed` event for triage, but the `failed-permanent` transition itself stands. This preserves the terminal-state contract — a recovery hook failure must not re-introduce the locked-pending bug it exists to fix. - Optional shape preserves back-compat with the 90+ existing finalization-worker-sender tests (which do not set the hook). Tests: 5 new H5 regression tests covering: - Hook fires with the recorded source ids on failed-permanent - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped) - Hook absent: worker still transitions cleanly - Hook throws: failed-permanent stands, transfer:failed emitted - Hook fires AFTER the outbox transition (ordering invariant) The "hook does NOT fire on success/in-progress" guarantee is structurally enforced by code placement (inside the `if (totalFailure > 0)` branch); a behavioural test would require duplicating the full §6.1 fake-adapter wiring for a property the code-review already shows. All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean. Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…nifestCas ManifestCas.update's precondition check compared `observed.rootHash` (a string read from the entry) to `prev.contentHash` (the caller's expected value). Both sides are arbitrary writer-supplied labels — a CAS pass meant "the labels agree", not "the content under the label is the content the caller meant". An attacker writing an entry with a forged `rootHash` field that doesn't match the actual pool content slipped through unchanged. Fix: - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas constructor's `opts.verifyEntryRoot`. When wired, the hook is invoked AFTER the label CAS passes and BEFORE the write. The hook fetches the content the entry's `rootHash` references (production: the UXF pool element), recomputes the hash, and asserts it matches the declaration. - New `'integrity-failed'` discriminator on `ManifestCasResult.reason` with structured `integrityDetail` for triage. The result also carries `observed.contentHash` so callers can re-read state. - `ManifestCidRewriteCasError.casReason` widened to surface `'integrity-failed'`. The worker's outer retry loop already treats any throw from manifest-cid-rewrite as a hard CAS failure that bubbles up; the new reason rides through unchanged but preserves structure for operator triage. - Skipped on the `prev === null` (asserts-no-entry) path because there is no observed entry to verify. Skipped on label-CAS-mismatch because the swap has already been rejected. - Optional shape preserves back-compat: all 9 existing manifest-cas tests and the 100% of production callers that do not (yet) supply the verifier continue to behave identically. Production wiring is a follow-up that does not block this PR. Tests: 6 new H7 regression tests covering: - Back-compat (verifier omitted → label-only CAS as before) - Happy path (verifier called AFTER label CAS, BEFORE write) - Integrity failure (returns 'integrity-failed', no write happens, storage unchanged, integrityDetail propagated) - Skip on prev=null (no observed entry to verify) - Skip on label-CAS-mismatch (short-circuit before verifier) - 'integrity-failed' has its own discriminator distinct from cas-mismatch / not-found / concurrent-modification All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean. Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…ER test-coverage gap The soak (manual-test-full-recovery.sh) routinely surfaces `[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient- address mismatch (HD-index recovery exhausted) (no retry): VerificationError: Recipient address mismatch` at §C → §D. The existing unit tests cover helpers in isolation (PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts #269), but neither proves the composition: "given a sibling HD index in the recipient's inventory, tryRecover finds it AND the SDK's verifyRecipient agrees on the derived address". This left CI blind to the regression mode where tryRecover returns a false-positive candidate that Token.update later rejects. This commit closes the gap at three layers: L1. Real-SDK integration test (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts): - Replaces the file-level vi.mock for SigningService / UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with REAL SDK imports. - Builds two real HD-derived signing services. - Constructs the sender-targeted DIRECT address via real UnmaskedPredicate.create → reference.toAddress (the EXACT path the SDK's verifyRecipient uses). - Drives `tryRecoverSigningServiceForRecipient` against real TokenId / TokenType / Uint8Array salt. - 4 tests: happy path, no-match negative, deriveAddressInfo throws, and (L2 composition) tryRecover + predicate construction round-trip equals the transferTx target address. L3. CI workflow (.github/workflows/soak-nightly.yml): - Nightly cron + workflow_dispatch trigger. - Probes testnet aggregator + Nostr relay; skip-not-fail when either is unreachable (third-party outages cannot block releases). - Checks out sphere-cli alongside sphere-sdk, links the local build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*. - Captures soak.log + workspace as artifacts (30-day retention on failure, 7-day on success). - Surfaces metrics in the workflow summary: V6-RECOVER count, Stranded receive count, POINTER_MONOTONICITY_VIOLATION count, bcast_pub>0 count, exit code. What's tested by what (audit roll-up): - SDK address-derivation regression in tryRecover → L1 (real SDK). - tryRecover ↔ predicate-construction composition → L1 composition test (proves verifyRecipient cannot throw against the recovered state). - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak. - Helper-level edge cases (mock-friendly) → existing PaymentsModule.recipient-address-mismatch-recovery.test.ts. - Error-classification on hard fail → existing #269 test in PaymentsModule.proof-polling-persistence.test.ts. Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean. Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack). Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).