fix(payments/transfer): same-process source lock for conservative-sender (Audit #333 H1) by vrogojin · Pull Request #349 · unicity-sphere/sphere-sdk

vrogojin · 2026-05-30T10:26:39Z

Stacked PR. Built atop #348 (C3) which is stacked on #347 (C2) on #346 (C1). The release-gate tier (C1+C2+C3) lands first; this PR begins the High tier.

Summary

Closes the H1 high finding of audit #333 — conservative-sender.ts had zero locking primitives. instant-sender.ts declared a process-global sourceLocks map (Wave 5 #171), but conservative sends and instant-vs-conservative cross-pairs did NOT serialize on shared source tokens: two concurrent sends could both pass selection, both commit on-chain, and only the aggregator caught the duplicate-spend after a source was burned.

Audit caveat confirmed before fixing

The audit flagged: "whether PaymentsModule's PerTokenMutex/spend-queue (PaymentsModule.ts:2505) serializes the conservative path at a higher layer. If it does, severity drops."

I traced this: the PerTokenMutex at line 2505 lives inside the ingest pool (receive side, not send side). The reservationLedger + spendQueue (line 1737) DO serialize the legacy TXF send path (lines 5769, 6368), but the UXF conservative path (dispatchUxfConservativeSend → sendConservativeUxf, line 5613) bypasses both and entered the sender body with no per-source lock. The audit's claim is therefore live on the active code path, not subsumed by a higher-layer guard.

Fix

Three coordinated changes:

Extracted the per-source lock registry from instant-sender.ts to a new modules/payments/transfer/source-locks.ts. The Wave 4-7 semantics are preserved verbatim — lex-sorted acquisition for deadlock prevention, 60 s liveness floor with force-release, Set dedup, fail-closed __resetSourceLocksForTesting guard. Adds a callerLabel parameter so the force-release warning identifies which sender wedged the lock.
instant-sender.ts re-exports __resetSourceLocksForTesting so the existing 36 instant-sender tests' import path keeps working without a single change.
conservative-sender.ts now acquires per-source locks after source selection completes (Step 2.5, mirroring the instant pattern) and releases them in the outer finally. Adds a TEST-ONLY __sourceLockMaxHoldMs override on ConservativeSenderDeps, symmetric with InstantSenderDeps.

Cross-pair invariant

Because both senders now share the SAME process-global map, an in-flight instant send blocks a concurrent conservative send sharing a source token, and vice versa. The audit's specific case is closed.

Test plan

9 new direct-module tests on tests/unit/payments/transfer/source-locks-h1-shared.test.ts:
- Two concurrent acquires on the SAME tokenId serialize
- Disjoint tokenIds proceed concurrently
- Lex-sorted acquisition prevents deadlock on overlapping sets
- Dedup within a single acquire call
- Cross-sender lock sharing (the H1 cross-pair invariant)
- __resetSourceLocksForTesting from instant-sender is === the symbol from source-locks (re-export identity)
- Reset hook clears in-flight locks regardless of acquisition origin
- Fail-closed guard preserved (production environment refuses reset)
- Force-release liveness floor
4 new integration tests on tests/unit/payments/transfer/conservative-sender-h1-source-lock.test.ts:
- Two conservative sends sharing a source — second cannot reach commitSources until first releases
- Lock released on success (subsequent send proceeds without waiting)
- Lock released on failure (try/finally invariant — subsequent send proceeds even after the first throws)
- Disjoint sources proceed concurrently (locking is not over-broad)
All 36 existing instant-sender.test.ts tests pass unchanged
All 1567 unit tests in tests/unit/payments/ pass
All 445 unit test files (8147 tests) pass (2 pre-existing skipped)
tsc --noEmit clean
eslint clean on touched files

Audit traceability

Issue: Audit: integration/all-fixes — money-path subsystems (UXF / Profile / transfer pipeline) #333 (H1)
Files / lines in audit: modules/payments/transfer/conservative-sender.ts (no locks), modules/payments/transfer/instant-sender.ts:463 (sourceLocks was module-local)
Audit recommendation followed: extract the lock map so the asymmetry between the two senders is closed; preserve the Wave 5 feat(#166): orphan-spending auto-recovery hook (P2 #1) #171 semantics verbatim.

Next in the stack

H2 (manifest tokenId ↔ genesis binding) follows, branched off this PR.

…s content.tokenId The manifest `tokenId → rootHash` map was only regex-shape-checked at the import boundary (uxf/json.ts:430, uxf/ipld.ts:555) and never asserted equal to the genesis tokenId encoded in the referenced token-root element. Pool elements still self-verified (content-hash recomputed), so a hostile sender could craft `{ "tokenId=A": <hash-of-root-minting- tokenId-B> }` and pass every existing gate. Downstream consumers that trust the manifest key (dedup, ownership filtering, balance computation) would mis-identify the token — an identity-confusion primitive that the audit specifically called out. Fix layered defense-in-depth: - uxf/verify.ts — primary structural-integrity gate. When iterating manifest entries during the DAG walk, assert that the root element exists with type ELEMENT_TYPE_TOKEN_ROOT AND that element.content.tokenId === manifestKey. Mismatches surface as MANIFEST_TOKENID_MISMATCH errors; wrong element types surface as MANIFEST_TYPE_MISMATCH. modules/payments/transfer/bundle-verifier already calls pkg.verify() so the receive pipeline (the audit's primary concern) is covered. - uxf/json.ts packageFromJson — fail-fast at the JSON import boundary with VERIFICATION_FAILED. Belt-and-braces for consumers that bypass verify() (e.g., direct UxfPackage.fromJson() use without a downstream bundle-verifier round). - uxf/ipld.ts importFromCar — same fail-fast at the CAR import boundary. Mirrors the json-path check for the CAR-bound delivery flow. The export side already refuses to write packages with mismatched manifest keys (the verify() result feeds production callers' export paths), so this fix focuses on hostile-receive scenarios. Tests: 8 new H2 regression tests covering verify() positive + MANIFEST_TOKENID_MISMATCH + MANIFEST_TYPE_MISMATCH, the json import boundary positive + spoofed-key + wrong-type rejections, and the CAR import boundary spoofed-key + positive flow. All 443 existing uxf tests pass unchanged. 8155 total unit tests pass. Refs: #333 (H2). Stacked on #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…mode Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any tokenId whose per-token resolver threw. A single malformed sibling element inside a Rule 4 synthetic-rebuild made a legitimately-received token vanish from the merged manifest with only a `logger.warn`. On the receive path this manifested as token loss from view: the caller had no machine-readable signal that anything had gone wrong. Fix: - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each `MergeSkip` records the tokenId, the original error, the target's prior root (if any), and the source's incoming root that failed to incorporate. Callers can react: replay, retry, surface to UI, or quarantine. - `opts.strict: true` aggregates skipped tokens into a `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply phase — target state is unchanged on the throw. Use on receive- path JOINs where any silent drop is unacceptable; leave default-off for opportunistic peer JOINs where partial coverage is fine. - `opts.onSkip` fires once per skipped tokenId for callers that want telemetry visibility without strict-mode failure. Throws inside onSkip are caught and logged — observability must not change merge semantics. - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown error also carries a non-typed `skipped` field with the structured skip list so callers needing machine-readable details can read it via `(err as { skipped }).skipped`. - Back-compat at the internal `mergePkg` boundary: legacy positional `(target, source, verifiedProofs)` callers keep working via a Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule, profile-token-storage-provider, profile/consolidation, flush-scheduler) ignore the returned `{ skipped }` and continue to use merge as void — behavior unchanged for them. Tests: 8 new H3 regression tests covering empty-skipped baseline, captured-skip on resolver throw, MergeSkip metadata fields, strict-mode hard fail, strict-mode atomicity invariant (target unchanged on throw), strict-mode happy path, onSkip callback fires once per skip, and onSkip throw does not change merge semantics. All 451 existing uxf tests pass unchanged. 8163 total unit tests pass. Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…e in disposition-engine modules/payments/transfer/disposition-engine.ts:629 called `verifyProof (proof, trustBase, requestId)` with the bundle-supplied requestId and trusted that the un-audited hydrateChain adapter had derived it canonically. If the adapter erred or a malicious sender hand-crafted the bundle with a proof anchored to a DIFFERENT transaction's requestId, the proof would still verify (it IS a genuine on-chain proof) but it would be incorrectly attributed to this transaction — a proof-binding forgery. Fix: - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult` interface on disposition-engine.ts. The hook canonically re-derives RequestId from (authenticator.publicKey, sourceState) and compares to the bundle's requestId; production wiring calls RequestId.create and compares toJSON. - Optional `assertRequestIdBinding?` dep added to DispositionEngineInput. When provided, the engine calls it BEFORE every verifyProof invocation: * `{ ok: true }` → proof verification proceeds. * `{ ok: false }` → cryptoInvalid('proof-invalid') — same §5.3 [C](3) class as a clean PATH_INVALID at the verifier, because the (proof, requestId) pair does not bind to this tx. * throw → structuralInvalid('proof-throw'). The binding gate fires BEFORE verifyProof so a forged binding short-circuits the (potentially expensive) on-chain proof check. - Optional shape preserves back-compat with the 66 existing engine tests (which do not set the hook) and with all production callers that have not yet been updated. - Plumbed through legacy-shape-adapter.ts: new optional `assertRequestIdBinding?` field on LegacyShapeAdapterInput is forwarded to DispositionEngineInput in buildEngineInput. Existing callers that don't supply the hook degrade to the pre-fix behaviour; new callers pass the production-grade RequestId.create comparer. Tests: 7 new H4 regression tests covering: - Back-compat (hook absent → proof verification proceeds) - Hook returns ok=true (binding asserted, proof verified) - Hook receives both bundle requestId and authenticator unchanged - Hook returns ok=false (cryptoInvalid('proof-invalid'), verifyProof NOT invoked) - Hook throws (structuralInvalid('proof-throw')) - Binding gate fires BEFORE verifyProof (ordering invariant) - Null-proof chain entries do NOT trigger the binding gate All 66 existing disposition-engine tests pass unchanged. All 1491 transfer tests pass. 8170 total unit tests pass. tsc clean. Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…permanent modules/payments/transfer/finalization-worker-sender.ts transitioned the outbox entry to `failed-permanent` on any hard-fail and never touched the source tokens that the instant-sender had marked `transferring`/`pending` at submit time. With orphan auto-recovery default-OFF, the spender-side balance was permanently locked as unspendable. Fix: - Added optional `sourceTokenIds?: ReadonlyArray<string>` to `UxfTransferOutboxEntry`. The instant-sender now records the selected source set at outbox-record construction time via `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics on `undefined` for back-compat with entries written before H5. - Added optional `recoverFailedPermanentSources?(sources, outboxId)` hook to `FinalizationWorkerSenderOptions`. When wired, the worker invokes it once at the `failed-permanent` transition with the entry's recorded source tokenIds. Production wiring flips each source back to a spendable state (typically `confirmed`) gated on a same-source-no-other-live-outbox check; the audit's gap was that the worker provided no signal at all. - The hook is best-effort: throws are caught and emitted as a `transfer:failed` event for triage, but the `failed-permanent` transition itself stands. This preserves the terminal-state contract — a recovery hook failure must not re-introduce the locked-pending bug it exists to fix. - Optional shape preserves back-compat with the 90+ existing finalization-worker-sender tests (which do not set the hook). Tests: 5 new H5 regression tests covering: - Hook fires with the recorded source ids on failed-permanent - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped) - Hook absent: worker still transitions cleanly - Hook throws: failed-permanent stands, transfer:failed emitted - Hook fires AFTER the outbox transition (ordering invariant) The "hook does NOT fire on success/in-progress" guarantee is structurally enforced by code placement (inside the `if (totalFailure > 0)` branch); a behavioural test would require duplicating the full §6.1 fake-adapter wiring for a property the code-review already shows. All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean. Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…nifestCas ManifestCas.update's precondition check compared `observed.rootHash` (a string read from the entry) to `prev.contentHash` (the caller's expected value). Both sides are arbitrary writer-supplied labels — a CAS pass meant "the labels agree", not "the content under the label is the content the caller meant". An attacker writing an entry with a forged `rootHash` field that doesn't match the actual pool content slipped through unchanged. Fix: - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas constructor's `opts.verifyEntryRoot`. When wired, the hook is invoked AFTER the label CAS passes and BEFORE the write. The hook fetches the content the entry's `rootHash` references (production: the UXF pool element), recomputes the hash, and asserts it matches the declaration. - New `'integrity-failed'` discriminator on `ManifestCasResult.reason` with structured `integrityDetail` for triage. The result also carries `observed.contentHash` so callers can re-read state. - `ManifestCidRewriteCasError.casReason` widened to surface `'integrity-failed'`. The worker's outer retry loop already treats any throw from manifest-cid-rewrite as a hard CAS failure that bubbles up; the new reason rides through unchanged but preserves structure for operator triage. - Skipped on the `prev === null` (asserts-no-entry) path because there is no observed entry to verify. Skipped on label-CAS-mismatch because the swap has already been rejected. - Optional shape preserves back-compat: all 9 existing manifest-cas tests and the 100% of production callers that do not (yet) supply the verifier continue to behave identically. Production wiring is a follow-up that does not block this PR. Tests: 6 new H7 regression tests covering: - Back-compat (verifier omitted → label-only CAS as before) - Happy path (verifier called AFTER label CAS, BEFORE write) - Integrity failure (returns 'integrity-failed', no write happens, storage unchanged, integrityDetail propagated) - Skip on prev=null (no observed entry to verify) - Skip on label-CAS-mismatch (short-circuit before verifier) - 'integrity-failed' has its own discriminator distinct from cas-mismatch / not-found / concurrent-modification All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean. Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).

…ER test-coverage gap The soak (manual-test-full-recovery.sh) routinely surfaces `[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient- address mismatch (HD-index recovery exhausted) (no retry): VerificationError: Recipient address mismatch` at §C → §D. The existing unit tests cover helpers in isolation (PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts #269), but neither proves the composition: "given a sibling HD index in the recipient's inventory, tryRecover finds it AND the SDK's verifyRecipient agrees on the derived address". This left CI blind to the regression mode where tryRecover returns a false-positive candidate that Token.update later rejects. This commit closes the gap at three layers: L1. Real-SDK integration test (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts): - Replaces the file-level vi.mock for SigningService / UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with REAL SDK imports. - Builds two real HD-derived signing services. - Constructs the sender-targeted DIRECT address via real UnmaskedPredicate.create → reference.toAddress (the EXACT path the SDK's verifyRecipient uses). - Drives `tryRecoverSigningServiceForRecipient` against real TokenId / TokenType / Uint8Array salt. - 4 tests: happy path, no-match negative, deriveAddressInfo throws, and (L2 composition) tryRecover + predicate construction round-trip equals the transferTx target address. L3. CI workflow (.github/workflows/soak-nightly.yml): - Nightly cron + workflow_dispatch trigger. - Probes testnet aggregator + Nostr relay; skip-not-fail when either is unreachable (third-party outages cannot block releases). - Checks out sphere-cli alongside sphere-sdk, links the local build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*. - Captures soak.log + workspace as artifacts (30-day retention on failure, 7-day on success). - Surfaces metrics in the workflow summary: V6-RECOVER count, Stranded receive count, POINTER_MONOTONICITY_VIOLATION count, bcast_pub>0 count, exit code. What's tested by what (audit roll-up): - SDK address-derivation regression in tryRecover → L1 (real SDK). - tryRecover ↔ predicate-construction composition → L1 composition test (proves verifyRecipient cannot throw against the recovered state). - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak. - Helper-level edge cases (mock-friendly) → existing PaymentsModule.recipient-address-mismatch-recovery.test.ts. - Error-classification on hard fail → existing #269 test in PaymentsModule.proof-polling-persistence.test.ts. Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean. Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack). Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).