Skip to content

fix(profile/migration): force durable flush before cleanup (Audit #333 C2)#347

Merged
vrogojin merged 1 commit into
mainfrom
fix/issue-333-c2-migration-flush-before-cleanup
May 30, 2026
Merged

fix(profile/migration): force durable flush before cleanup (Audit #333 C2)#347
vrogojin merged 1 commit into
mainfrom
fix/issue-333-c2-migration-flush-before-cleanup

Conversation

@vrogojin
Copy link
Copy Markdown
Contributor

Stacked PR. Built atop #346 (C1). Review C1 first or use the stacked-diff view for the incremental delta.

Summary

Closes the C2 critical finding of audit #333 — the migration cleanup-before-flush window in ProfileMigration. Before this PR, the 6-step migration walked persist → sanity → cleanup without forcing the TXF bundle to land durably:

  • stepPersistToOrbitDb (migration.ts:585) accepted save()'s cid: 'debounced' return as success without driving the flush.
  • stepSanityCheck (:643) read back via load() which returns in-memory pendingData (source: 'cache') — passing even if nothing was pinned.
  • stepCleanup (:779) then deleted legacy KV keys and unpinned the legacy CID.

A crash (or later flush failure) between persist and the debounced flush landing lost both the legacy KV state and the now-unpinned legacy CID. Token-DB loss was only partially mitigated by the #330 legacy-token-DB fallback if wired; legacy KV and the unpinned CID were not.

Fix

  1. stepPersistToOrbitDb calls profileTokenStorage.awaitNextFlush(0) after save(). awaitNextFlush drives flushScheduler.forceFlushSerialized() and throws on TIMEOUT or POINTER_MONOTONICITY_VIOLATION — converting a silent crash-window data loss into a recoverable MIGRATION_FAILED. timeoutMs=0 disables the wall-clock deadline (the 4-iteration cap inside awaitNextFlush still applies) so large migrations are not artificially capped.
  2. Providers without awaitNextFlush are refused outright. The pre-fix silent-loss path was the only alternative and it is unsafe.
  3. stepSanityCheck rejects loadResult.source === 'cache' with a hard error. After awaitNextFlush, pendingData is null on the real provider and load() walks active bundles in OrbitDB returning source 'remote'. A 'cache' result here means the durability gate was bypassed — belt-and-braces over the awaitNextFlush guarantee.

Why this is safe in production

  • The real ProfileTokenStorageProvider implements awaitNextFlush at profile/profile-token-storage-provider.ts:1054. The audit's own audit recommendation maps directly onto this method.
  • sphere.telco's runProfileMigration wires attachIdentityToProfileProviderstokenStorage.initialize() before any migration step. By the time migrate() runs, the flush scheduler is live and awaitNextFlush is callable.
  • Existing migration tests updated to simulate the real flush contract (savesource:'cache', awaitNextFlushsource:'remote'). The buggy-provider test now hits source:'remote' so the existing token-count-mismatch path is preserved.

Test plan

  • 9 new C2 regression tests in tests/unit/profile/migration-c2-flush-before-cleanup.test.ts:
    • awaitNextFlush is called exactly once after save() (happy path)
    • timeoutMs=0 is passed (no artificial deadline)
    • TIMEOUT, POINTER_MONOTONICITY_VIOLATION, and generic flush errors all become MIGRATION_FAILED before cleanup runs
    • Providers omitting awaitNextFlush are refused
    • Null-txfData path skips the flush entirely
    • Sanity check rejects post-persist source: 'cache' reads
    • End-to-end failure-path invariant: any flush/sanity error leaves legacy KV intact and the legacy CID reclaimable
  • All 442 unit test files (8127 tests) pass
  • tsc --noEmit clean
  • eslint clean on touched files

Audit traceability

Next in the stack

C3 (auto-drop on transient block-load) follows, branched off this PR.

@vrogojin vrogojin force-pushed the fix/issue-333-c1-plaintext-seed-window branch from c76576a to 1b63ae1 Compare May 30, 2026 14:39
@vrogojin vrogojin force-pushed the fix/issue-333-c2-migration-flush-before-cleanup branch from abc5a05 to 0ad16be Compare May 30, 2026 14:39
vrogojin added a commit that referenced this pull request May 30, 2026
…oplog reset

FlushScheduler.addBundleWithOplogAutoReset transitioned straight from
"extractLostHeadCid matched the error" to resetCorruptedLog() (which
calls db.drop()). The matcher cannot distinguish a permanently-corrupt
head (Helia GC ran on a memory-blockstore wallet) from a transiently-
unreachable one (gateway blip, peer offline, propagation lag). A
momentary fetch failure thus wiped all OUTBOX/SENT/disposition/
finalization entries not yet captured in a pinned bundle — permanent
data loss on a recoverable error.

Fix:

  - Probe configured IPFS gateways for the lost head CID with
    exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline,
    5 s per-attempt HEAD timeout — matches the Issue #239 shutdown gate
    convention for testnet propagation) BEFORE the destructive reset.

  - On a successful probe, retry addBundle ONCE. If the retry
    succeeds the reset is SKIPPED — no operational state is lost. This
    is the primary recovery path for the gateway-blip / propagation-
    lag class of failure.

  - On probe failure (no gateway served the CID within the deadline)
    OR a post-probe retry that still hits the same auto-reset signature
    (local Helia is the bottleneck, not the network), fall through to
    the existing reset path. The freshest lostHeadCid is propagated
    into the resetCorruptedLog reason payload and event data.

  - On a post-probe retry that hits a DIFFERENT error class (e.g.,
    POINTER_MONOTONICITY_VIOLATION), re-throw the new error — resetting
    the log would not help.

  - With no gateways configured, the probe is skipped entirely and the
    pre-fix behaviour is preserved (no recovery surface, destructive
    reset is the only forward path).

  - When the probe itself throws (e.g., gateway URL validation failed
    before any HEAD ran), fall through to reset — we cannot prove the
    head is recoverable, so the safe-on-error default is the existing
    behaviour.

Worst-case cost: +30 s before destructive teardown on a genuine
corruption. Small price relative to the data-loss exposure on the
false-positive path.

Tests: 7 new C3 regression tests covering each path (transient blip,
permanent loss, probe-ok-but-retry-fails, retry-different-error,
no-gateways, probe-throws, unrelated-errors). Existing 7-test
flush-scheduler-oplog-reset.test.ts suite unchanged and still green.
443 unit test files (8134 tests) pass. tsc clean.

Refs: #333 (C3). Stacked on top of #347 (C2) which is stacked on #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nservative-sender

conservative-sender.ts had zero locking primitives. instant-sender.ts
declared a module-local sourceLocks map (Wave 5 #171), so conservative
sends and instant-vs-conservative cross-pairs did NOT serialize on
shared source tokens: two concurrent sends could both pass selection,
both commit on-chain, and only the aggregator caught the duplicate-spend
after a source was burned.

Fix has three parts:

  - Extracted the per-source lock registry from instant-sender.ts to a
    new `modules/payments/transfer/source-locks.ts`. The Wave 4-7
    semantics are preserved verbatim (lex-sorted acquisition for
    deadlock prevention, 60 s liveness floor with force-release, Set
    dedup, fail-closed __resetSourceLocksForTesting guard). The shared
    module exports `acquireSourceLocks` and `__resetSourceLocksForTesting`.

  - instant-sender.ts re-exports `__resetSourceLocksForTesting` so the
    existing tests/unit/payments/transfer/instant-sender.test.ts import
    path continues to work.

  - conservative-sender.ts now acquires per-source locks after source
    selection completes (Step 2.5, mirroring the instant pattern) and
    releases them in the outer `finally`. Adds a TEST-ONLY
    __sourceLockMaxHoldMs override on ConservativeSenderDeps, symmetric
    with InstantSenderDeps.

Cross-pair invariant: because both senders now share the SAME
process-global map, an in-flight instant send blocks a concurrent
conservative send sharing a source token, and vice versa.

Tests:

  - 9 new direct-module tests on source-locks.ts covering acquire
    serialization, deadlock prevention via lex-sort, dedup, liveness
    floor (force-release), back-compat re-export identity, fail-closed
    guard preserved.

  - 4 new integration tests on conservative-sender proving the
    pipeline acquires + releases the lock: serialization on shared
    source, release on success, release on failure (try/finally),
    disjoint sources proceed concurrently.

  - All 36 existing instant-sender tests pass unchanged (the extracted
    module is byte-equivalent to the original Wave 7 code modulo the
    new callerLabel parameter, which defaults to 'sender' when omitted).

8147 unit tests pass (2 pre-existing skipped). tsc clean.

Refs: #333 (H1). Stacked on #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…s content.tokenId

The manifest `tokenId → rootHash` map was only regex-shape-checked at the
import boundary (uxf/json.ts:430, uxf/ipld.ts:555) and never asserted
equal to the genesis tokenId encoded in the referenced token-root
element. Pool elements still self-verified (content-hash recomputed),
so a hostile sender could craft `{ "tokenId=A": <hash-of-root-minting-
tokenId-B> }` and pass every existing gate. Downstream consumers that
trust the manifest key (dedup, ownership filtering, balance
computation) would mis-identify the token — an identity-confusion
primitive that the audit specifically called out.

Fix layered defense-in-depth:

  - uxf/verify.ts — primary structural-integrity gate. When iterating
    manifest entries during the DAG walk, assert that the root
    element exists with type ELEMENT_TYPE_TOKEN_ROOT AND that
    element.content.tokenId === manifestKey. Mismatches surface as
    MANIFEST_TOKENID_MISMATCH errors; wrong element types surface as
    MANIFEST_TYPE_MISMATCH. modules/payments/transfer/bundle-verifier
    already calls pkg.verify() so the receive pipeline (the audit's
    primary concern) is covered.

  - uxf/json.ts packageFromJson — fail-fast at the JSON import
    boundary with VERIFICATION_FAILED. Belt-and-braces for consumers
    that bypass verify() (e.g., direct UxfPackage.fromJson() use
    without a downstream bundle-verifier round).

  - uxf/ipld.ts importFromCar — same fail-fast at the CAR import
    boundary. Mirrors the json-path check for the CAR-bound delivery
    flow.

The export side already refuses to write packages with mismatched
manifest keys (the verify() result feeds production callers' export
paths), so this fix focuses on hostile-receive scenarios.

Tests: 8 new H2 regression tests covering verify() positive +
MANIFEST_TOKENID_MISMATCH + MANIFEST_TYPE_MISMATCH, the json import
boundary positive + spoofed-key + wrong-type rejections, and the
CAR import boundary spoofed-key + positive flow. All 443 existing
uxf tests pass unchanged. 8155 total unit tests pass.

Refs: #333 (H2). Stacked on #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…mode

Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any
tokenId whose per-token resolver threw. A single malformed sibling
element inside a Rule 4 synthetic-rebuild made a legitimately-received
token vanish from the merged manifest with only a `logger.warn`. On the
receive path this manifested as token loss from view: the caller had no
machine-readable signal that anything had gone wrong.

Fix:

  - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each
    `MergeSkip` records the tokenId, the original error, the target's
    prior root (if any), and the source's incoming root that failed
    to incorporate. Callers can react: replay, retry, surface to UI,
    or quarantine.

  - `opts.strict: true` aggregates skipped tokens into a
    `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply
    phase — target state is unchanged on the throw. Use on receive-
    path JOINs where any silent drop is unacceptable; leave default-off
    for opportunistic peer JOINs where partial coverage is fine.

  - `opts.onSkip` fires once per skipped tokenId for callers that want
    telemetry visibility without strict-mode failure. Throws inside
    onSkip are caught and logged — observability must not change merge
    semantics.

  - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown
    error also carries a non-typed `skipped` field with the structured
    skip list so callers needing machine-readable details can read it
    via `(err as { skipped }).skipped`.

  - Back-compat at the internal `mergePkg` boundary: legacy positional
    `(target, source, verifiedProofs)` callers keep working via a
    Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule,
    profile-token-storage-provider, profile/consolidation,
    flush-scheduler) ignore the returned `{ skipped }` and continue to
    use merge as void — behavior unchanged for them.

Tests: 8 new H3 regression tests covering empty-skipped baseline,
captured-skip on resolver throw, MergeSkip metadata fields,
strict-mode hard fail, strict-mode atomicity invariant (target
unchanged on throw), strict-mode happy path, onSkip callback fires
once per skip, and onSkip throw does not change merge semantics.
All 451 existing uxf tests pass unchanged. 8163 total unit tests pass.

Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…e in disposition-engine

modules/payments/transfer/disposition-engine.ts:629 called `verifyProof
(proof, trustBase, requestId)` with the bundle-supplied requestId and
trusted that the un-audited hydrateChain adapter had derived it
canonically. If the adapter erred or a malicious sender hand-crafted
the bundle with a proof anchored to a DIFFERENT transaction's
requestId, the proof would still verify (it IS a genuine on-chain
proof) but it would be incorrectly attributed to this transaction —
a proof-binding forgery.

Fix:

  - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult`
    interface on disposition-engine.ts. The hook canonically re-derives
    RequestId from (authenticator.publicKey, sourceState) and compares
    to the bundle's requestId; production wiring calls RequestId.create
    and compares toJSON.

  - Optional `assertRequestIdBinding?` dep added to
    DispositionEngineInput. When provided, the engine calls it BEFORE
    every verifyProof invocation:
      * `{ ok: true }`  → proof verification proceeds.
      * `{ ok: false }` → cryptoInvalid('proof-invalid') — same
        §5.3 [C](3) class as a clean PATH_INVALID at the verifier,
        because the (proof, requestId) pair does not bind to this tx.
      * throw → structuralInvalid('proof-throw').
    The binding gate fires BEFORE verifyProof so a forged binding
    short-circuits the (potentially expensive) on-chain proof check.

  - Optional shape preserves back-compat with the 66 existing engine
    tests (which do not set the hook) and with all production callers
    that have not yet been updated.

  - Plumbed through legacy-shape-adapter.ts: new optional
    `assertRequestIdBinding?` field on LegacyShapeAdapterInput is
    forwarded to DispositionEngineInput in buildEngineInput. Existing
    callers that don't supply the hook degrade to the pre-fix
    behaviour; new callers pass the production-grade RequestId.create
    comparer.

Tests: 7 new H4 regression tests covering:
  - Back-compat (hook absent → proof verification proceeds)
  - Hook returns ok=true (binding asserted, proof verified)
  - Hook receives both bundle requestId and authenticator unchanged
  - Hook returns ok=false (cryptoInvalid('proof-invalid'),
    verifyProof NOT invoked)
  - Hook throws (structuralInvalid('proof-throw'))
  - Binding gate fires BEFORE verifyProof (ordering invariant)
  - Null-proof chain entries do NOT trigger the binding gate

All 66 existing disposition-engine tests pass unchanged. All 1491
transfer tests pass. 8170 total unit tests pass. tsc clean.

Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3)
→ #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
@vrogojin vrogojin force-pushed the fix/issue-333-c1-plaintext-seed-window branch from 1b63ae1 to 0349e37 Compare May 30, 2026 15:14
@vrogojin vrogojin force-pushed the fix/issue-333-c2-migration-flush-before-cleanup branch from 0ad16be to 31668ef Compare May 30, 2026 15:14
vrogojin added a commit that referenced this pull request May 30, 2026
…oplog reset

FlushScheduler.addBundleWithOplogAutoReset transitioned straight from
"extractLostHeadCid matched the error" to resetCorruptedLog() (which
calls db.drop()). The matcher cannot distinguish a permanently-corrupt
head (Helia GC ran on a memory-blockstore wallet) from a transiently-
unreachable one (gateway blip, peer offline, propagation lag). A
momentary fetch failure thus wiped all OUTBOX/SENT/disposition/
finalization entries not yet captured in a pinned bundle — permanent
data loss on a recoverable error.

Fix:

  - Probe configured IPFS gateways for the lost head CID with
    exponential backoff (verifyCidAccessibleWithRetry, 30 s deadline,
    5 s per-attempt HEAD timeout — matches the Issue #239 shutdown gate
    convention for testnet propagation) BEFORE the destructive reset.

  - On a successful probe, retry addBundle ONCE. If the retry
    succeeds the reset is SKIPPED — no operational state is lost. This
    is the primary recovery path for the gateway-blip / propagation-
    lag class of failure.

  - On probe failure (no gateway served the CID within the deadline)
    OR a post-probe retry that still hits the same auto-reset signature
    (local Helia is the bottleneck, not the network), fall through to
    the existing reset path. The freshest lostHeadCid is propagated
    into the resetCorruptedLog reason payload and event data.

  - On a post-probe retry that hits a DIFFERENT error class (e.g.,
    POINTER_MONOTONICITY_VIOLATION), re-throw the new error — resetting
    the log would not help.

  - With no gateways configured, the probe is skipped entirely and the
    pre-fix behaviour is preserved (no recovery surface, destructive
    reset is the only forward path).

  - When the probe itself throws (e.g., gateway URL validation failed
    before any HEAD ran), fall through to reset — we cannot prove the
    head is recoverable, so the safe-on-error default is the existing
    behaviour.

Worst-case cost: +30 s before destructive teardown on a genuine
corruption. Small price relative to the data-loss exposure on the
false-positive path.

Tests: 7 new C3 regression tests covering each path (transient blip,
permanent loss, probe-ok-but-retry-fails, retry-different-error,
no-gateways, probe-throws, unrelated-errors). Existing 7-test
flush-scheduler-oplog-reset.test.ts suite unchanged and still green.
443 unit test files (8134 tests) pass. tsc clean.

Refs: #333 (C3). Stacked on top of #347 (C2) which is stacked on #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nservative-sender

conservative-sender.ts had zero locking primitives. instant-sender.ts
declared a module-local sourceLocks map (Wave 5 #171), so conservative
sends and instant-vs-conservative cross-pairs did NOT serialize on
shared source tokens: two concurrent sends could both pass selection,
both commit on-chain, and only the aggregator caught the duplicate-spend
after a source was burned.

Fix has three parts:

  - Extracted the per-source lock registry from instant-sender.ts to a
    new `modules/payments/transfer/source-locks.ts`. The Wave 4-7
    semantics are preserved verbatim (lex-sorted acquisition for
    deadlock prevention, 60 s liveness floor with force-release, Set
    dedup, fail-closed __resetSourceLocksForTesting guard). The shared
    module exports `acquireSourceLocks` and `__resetSourceLocksForTesting`.

  - instant-sender.ts re-exports `__resetSourceLocksForTesting` so the
    existing tests/unit/payments/transfer/instant-sender.test.ts import
    path continues to work.

  - conservative-sender.ts now acquires per-source locks after source
    selection completes (Step 2.5, mirroring the instant pattern) and
    releases them in the outer `finally`. Adds a TEST-ONLY
    __sourceLockMaxHoldMs override on ConservativeSenderDeps, symmetric
    with InstantSenderDeps.

Cross-pair invariant: because both senders now share the SAME
process-global map, an in-flight instant send blocks a concurrent
conservative send sharing a source token, and vice versa.

Tests:

  - 9 new direct-module tests on source-locks.ts covering acquire
    serialization, deadlock prevention via lex-sort, dedup, liveness
    floor (force-release), back-compat re-export identity, fail-closed
    guard preserved.

  - 4 new integration tests on conservative-sender proving the
    pipeline acquires + releases the lock: serialization on shared
    source, release on success, release on failure (try/finally),
    disjoint sources proceed concurrently.

  - All 36 existing instant-sender tests pass unchanged (the extracted
    module is byte-equivalent to the original Wave 7 code modulo the
    new callerLabel parameter, which defaults to 'sender' when omitted).

8147 unit tests pass (2 pre-existing skipped). tsc clean.

Refs: #333 (H1). Stacked on #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…s content.tokenId

The manifest `tokenId → rootHash` map was only regex-shape-checked at the
import boundary (uxf/json.ts:430, uxf/ipld.ts:555) and never asserted
equal to the genesis tokenId encoded in the referenced token-root
element. Pool elements still self-verified (content-hash recomputed),
so a hostile sender could craft `{ "tokenId=A": <hash-of-root-minting-
tokenId-B> }` and pass every existing gate. Downstream consumers that
trust the manifest key (dedup, ownership filtering, balance
computation) would mis-identify the token — an identity-confusion
primitive that the audit specifically called out.

Fix layered defense-in-depth:

  - uxf/verify.ts — primary structural-integrity gate. When iterating
    manifest entries during the DAG walk, assert that the root
    element exists with type ELEMENT_TYPE_TOKEN_ROOT AND that
    element.content.tokenId === manifestKey. Mismatches surface as
    MANIFEST_TOKENID_MISMATCH errors; wrong element types surface as
    MANIFEST_TYPE_MISMATCH. modules/payments/transfer/bundle-verifier
    already calls pkg.verify() so the receive pipeline (the audit's
    primary concern) is covered.

  - uxf/json.ts packageFromJson — fail-fast at the JSON import
    boundary with VERIFICATION_FAILED. Belt-and-braces for consumers
    that bypass verify() (e.g., direct UxfPackage.fromJson() use
    without a downstream bundle-verifier round).

  - uxf/ipld.ts importFromCar — same fail-fast at the CAR import
    boundary. Mirrors the json-path check for the CAR-bound delivery
    flow.

The export side already refuses to write packages with mismatched
manifest keys (the verify() result feeds production callers' export
paths), so this fix focuses on hostile-receive scenarios.

Tests: 8 new H2 regression tests covering verify() positive +
MANIFEST_TOKENID_MISMATCH + MANIFEST_TYPE_MISMATCH, the json import
boundary positive + spoofed-key + wrong-type rejections, and the
CAR import boundary spoofed-key + positive flow. All 443 existing
uxf tests pass unchanged. 8155 total unit tests pass.

Refs: #333 (H2). Stacked on #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…mode

Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any
tokenId whose per-token resolver threw. A single malformed sibling
element inside a Rule 4 synthetic-rebuild made a legitimately-received
token vanish from the merged manifest with only a `logger.warn`. On the
receive path this manifested as token loss from view: the caller had no
machine-readable signal that anything had gone wrong.

Fix:

  - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each
    `MergeSkip` records the tokenId, the original error, the target's
    prior root (if any), and the source's incoming root that failed
    to incorporate. Callers can react: replay, retry, surface to UI,
    or quarantine.

  - `opts.strict: true` aggregates skipped tokens into a
    `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply
    phase — target state is unchanged on the throw. Use on receive-
    path JOINs where any silent drop is unacceptable; leave default-off
    for opportunistic peer JOINs where partial coverage is fine.

  - `opts.onSkip` fires once per skipped tokenId for callers that want
    telemetry visibility without strict-mode failure. Throws inside
    onSkip are caught and logged — observability must not change merge
    semantics.

  - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown
    error also carries a non-typed `skipped` field with the structured
    skip list so callers needing machine-readable details can read it
    via `(err as { skipped }).skipped`.

  - Back-compat at the internal `mergePkg` boundary: legacy positional
    `(target, source, verifiedProofs)` callers keep working via a
    Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule,
    profile-token-storage-provider, profile/consolidation,
    flush-scheduler) ignore the returned `{ skipped }` and continue to
    use merge as void — behavior unchanged for them.

Tests: 8 new H3 regression tests covering empty-skipped baseline,
captured-skip on resolver throw, MergeSkip metadata fields,
strict-mode hard fail, strict-mode atomicity invariant (target
unchanged on throw), strict-mode happy path, onSkip callback fires
once per skip, and onSkip throw does not change merge semantics.
All 451 existing uxf tests pass unchanged. 8163 total unit tests pass.

Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…e in disposition-engine

modules/payments/transfer/disposition-engine.ts:629 called `verifyProof
(proof, trustBase, requestId)` with the bundle-supplied requestId and
trusted that the un-audited hydrateChain adapter had derived it
canonically. If the adapter erred or a malicious sender hand-crafted
the bundle with a proof anchored to a DIFFERENT transaction's
requestId, the proof would still verify (it IS a genuine on-chain
proof) but it would be incorrectly attributed to this transaction —
a proof-binding forgery.

Fix:

  - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult`
    interface on disposition-engine.ts. The hook canonically re-derives
    RequestId from (authenticator.publicKey, sourceState) and compares
    to the bundle's requestId; production wiring calls RequestId.create
    and compares toJSON.

  - Optional `assertRequestIdBinding?` dep added to
    DispositionEngineInput. When provided, the engine calls it BEFORE
    every verifyProof invocation:
      * `{ ok: true }`  → proof verification proceeds.
      * `{ ok: false }` → cryptoInvalid('proof-invalid') — same
        §5.3 [C](3) class as a clean PATH_INVALID at the verifier,
        because the (proof, requestId) pair does not bind to this tx.
      * throw → structuralInvalid('proof-throw').
    The binding gate fires BEFORE verifyProof so a forged binding
    short-circuits the (potentially expensive) on-chain proof check.

  - Optional shape preserves back-compat with the 66 existing engine
    tests (which do not set the hook) and with all production callers
    that have not yet been updated.

  - Plumbed through legacy-shape-adapter.ts: new optional
    `assertRequestIdBinding?` field on LegacyShapeAdapterInput is
    forwarded to DispositionEngineInput in buildEngineInput. Existing
    callers that don't supply the hook degrade to the pre-fix
    behaviour; new callers pass the production-grade RequestId.create
    comparer.

Tests: 7 new H4 regression tests covering:
  - Back-compat (hook absent → proof verification proceeds)
  - Hook returns ok=true (binding asserted, proof verified)
  - Hook receives both bundle requestId and authenticator unchanged
  - Hook returns ok=false (cryptoInvalid('proof-invalid'),
    verifyProof NOT invoked)
  - Hook throws (structuralInvalid('proof-throw'))
  - Binding gate fires BEFORE verifyProof (ordering invariant)
  - Null-proof chain entries do NOT trigger the binding gate

All 66 existing disposition-engine tests pass unchanged. All 1491
transfer tests pass. 8170 total unit tests pass. tsc clean.

Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3)
→ #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nservative-sender

conservative-sender.ts had zero locking primitives. instant-sender.ts
declared a module-local sourceLocks map (Wave 5 #171), so conservative
sends and instant-vs-conservative cross-pairs did NOT serialize on
shared source tokens: two concurrent sends could both pass selection,
both commit on-chain, and only the aggregator caught the duplicate-spend
after a source was burned.

Fix has three parts:

  - Extracted the per-source lock registry from instant-sender.ts to a
    new `modules/payments/transfer/source-locks.ts`. The Wave 4-7
    semantics are preserved verbatim (lex-sorted acquisition for
    deadlock prevention, 60 s liveness floor with force-release, Set
    dedup, fail-closed __resetSourceLocksForTesting guard). The shared
    module exports `acquireSourceLocks` and `__resetSourceLocksForTesting`.

  - instant-sender.ts re-exports `__resetSourceLocksForTesting` so the
    existing tests/unit/payments/transfer/instant-sender.test.ts import
    path continues to work.

  - conservative-sender.ts now acquires per-source locks after source
    selection completes (Step 2.5, mirroring the instant pattern) and
    releases them in the outer `finally`. Adds a TEST-ONLY
    __sourceLockMaxHoldMs override on ConservativeSenderDeps, symmetric
    with InstantSenderDeps.

Cross-pair invariant: because both senders now share the SAME
process-global map, an in-flight instant send blocks a concurrent
conservative send sharing a source token, and vice versa.

Tests:

  - 9 new direct-module tests on source-locks.ts covering acquire
    serialization, deadlock prevention via lex-sort, dedup, liveness
    floor (force-release), back-compat re-export identity, fail-closed
    guard preserved.

  - 4 new integration tests on conservative-sender proving the
    pipeline acquires + releases the lock: serialization on shared
    source, release on success, release on failure (try/finally),
    disjoint sources proceed concurrently.

  - All 36 existing instant-sender tests pass unchanged (the extracted
    module is byte-equivalent to the original Wave 7 code modulo the
    new callerLabel parameter, which defaults to 'sender' when omitted).

8147 unit tests pass (2 pre-existing skipped). tsc clean.

Refs: #333 (H1). Stacked on #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…s content.tokenId

The manifest `tokenId → rootHash` map was only regex-shape-checked at the
import boundary (uxf/json.ts:430, uxf/ipld.ts:555) and never asserted
equal to the genesis tokenId encoded in the referenced token-root
element. Pool elements still self-verified (content-hash recomputed),
so a hostile sender could craft `{ "tokenId=A": <hash-of-root-minting-
tokenId-B> }` and pass every existing gate. Downstream consumers that
trust the manifest key (dedup, ownership filtering, balance
computation) would mis-identify the token — an identity-confusion
primitive that the audit specifically called out.

Fix layered defense-in-depth:

  - uxf/verify.ts — primary structural-integrity gate. When iterating
    manifest entries during the DAG walk, assert that the root
    element exists with type ELEMENT_TYPE_TOKEN_ROOT AND that
    element.content.tokenId === manifestKey. Mismatches surface as
    MANIFEST_TOKENID_MISMATCH errors; wrong element types surface as
    MANIFEST_TYPE_MISMATCH. modules/payments/transfer/bundle-verifier
    already calls pkg.verify() so the receive pipeline (the audit's
    primary concern) is covered.

  - uxf/json.ts packageFromJson — fail-fast at the JSON import
    boundary with VERIFICATION_FAILED. Belt-and-braces for consumers
    that bypass verify() (e.g., direct UxfPackage.fromJson() use
    without a downstream bundle-verifier round).

  - uxf/ipld.ts importFromCar — same fail-fast at the CAR import
    boundary. Mirrors the json-path check for the CAR-bound delivery
    flow.

The export side already refuses to write packages with mismatched
manifest keys (the verify() result feeds production callers' export
paths), so this fix focuses on hostile-receive scenarios.

Tests: 8 new H2 regression tests covering verify() positive +
MANIFEST_TOKENID_MISMATCH + MANIFEST_TYPE_MISMATCH, the json import
boundary positive + spoofed-key + wrong-type rejections, and the
CAR import boundary spoofed-key + positive flow. All 443 existing
uxf tests pass unchanged. 8155 total unit tests pass.

Refs: #333 (H2). Stacked on #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…mode

Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any
tokenId whose per-token resolver threw. A single malformed sibling
element inside a Rule 4 synthetic-rebuild made a legitimately-received
token vanish from the merged manifest with only a `logger.warn`. On the
receive path this manifested as token loss from view: the caller had no
machine-readable signal that anything had gone wrong.

Fix:

  - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each
    `MergeSkip` records the tokenId, the original error, the target's
    prior root (if any), and the source's incoming root that failed
    to incorporate. Callers can react: replay, retry, surface to UI,
    or quarantine.

  - `opts.strict: true` aggregates skipped tokens into a
    `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply
    phase — target state is unchanged on the throw. Use on receive-
    path JOINs where any silent drop is unacceptable; leave default-off
    for opportunistic peer JOINs where partial coverage is fine.

  - `opts.onSkip` fires once per skipped tokenId for callers that want
    telemetry visibility without strict-mode failure. Throws inside
    onSkip are caught and logged — observability must not change merge
    semantics.

  - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown
    error also carries a non-typed `skipped` field with the structured
    skip list so callers needing machine-readable details can read it
    via `(err as { skipped }).skipped`.

  - Back-compat at the internal `mergePkg` boundary: legacy positional
    `(target, source, verifiedProofs)` callers keep working via a
    Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule,
    profile-token-storage-provider, profile/consolidation,
    flush-scheduler) ignore the returned `{ skipped }` and continue to
    use merge as void — behavior unchanged for them.

Tests: 8 new H3 regression tests covering empty-skipped baseline,
captured-skip on resolver throw, MergeSkip metadata fields,
strict-mode hard fail, strict-mode atomicity invariant (target
unchanged on throw), strict-mode happy path, onSkip callback fires
once per skip, and onSkip throw does not change merge semantics.
All 451 existing uxf tests pass unchanged. 8163 total unit tests pass.

Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…e in disposition-engine

modules/payments/transfer/disposition-engine.ts:629 called `verifyProof
(proof, trustBase, requestId)` with the bundle-supplied requestId and
trusted that the un-audited hydrateChain adapter had derived it
canonically. If the adapter erred or a malicious sender hand-crafted
the bundle with a proof anchored to a DIFFERENT transaction's
requestId, the proof would still verify (it IS a genuine on-chain
proof) but it would be incorrectly attributed to this transaction —
a proof-binding forgery.

Fix:

  - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult`
    interface on disposition-engine.ts. The hook canonically re-derives
    RequestId from (authenticator.publicKey, sourceState) and compares
    to the bundle's requestId; production wiring calls RequestId.create
    and compares toJSON.

  - Optional `assertRequestIdBinding?` dep added to
    DispositionEngineInput. When provided, the engine calls it BEFORE
    every verifyProof invocation:
      * `{ ok: true }`  → proof verification proceeds.
      * `{ ok: false }` → cryptoInvalid('proof-invalid') — same
        §5.3 [C](3) class as a clean PATH_INVALID at the verifier,
        because the (proof, requestId) pair does not bind to this tx.
      * throw → structuralInvalid('proof-throw').
    The binding gate fires BEFORE verifyProof so a forged binding
    short-circuits the (potentially expensive) on-chain proof check.

  - Optional shape preserves back-compat with the 66 existing engine
    tests (which do not set the hook) and with all production callers
    that have not yet been updated.

  - Plumbed through legacy-shape-adapter.ts: new optional
    `assertRequestIdBinding?` field on LegacyShapeAdapterInput is
    forwarded to DispositionEngineInput in buildEngineInput. Existing
    callers that don't supply the hook degrade to the pre-fix
    behaviour; new callers pass the production-grade RequestId.create
    comparer.

Tests: 7 new H4 regression tests covering:
  - Back-compat (hook absent → proof verification proceeds)
  - Hook returns ok=true (binding asserted, proof verified)
  - Hook receives both bundle requestId and authenticator unchanged
  - Hook returns ok=false (cryptoInvalid('proof-invalid'),
    verifyProof NOT invoked)
  - Hook throws (structuralInvalid('proof-throw'))
  - Binding gate fires BEFORE verifyProof (ordering invariant)
  - Null-proof chain entries do NOT trigger the binding gate

All 66 existing disposition-engine tests pass unchanged. All 1491
transfer tests pass. 8170 total unit tests pass. tsc clean.

Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3)
→ #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…s content.tokenId

The manifest `tokenId → rootHash` map was only regex-shape-checked at the
import boundary (uxf/json.ts:430, uxf/ipld.ts:555) and never asserted
equal to the genesis tokenId encoded in the referenced token-root
element. Pool elements still self-verified (content-hash recomputed),
so a hostile sender could craft `{ "tokenId=A": <hash-of-root-minting-
tokenId-B> }` and pass every existing gate. Downstream consumers that
trust the manifest key (dedup, ownership filtering, balance
computation) would mis-identify the token — an identity-confusion
primitive that the audit specifically called out.

Fix layered defense-in-depth:

  - uxf/verify.ts — primary structural-integrity gate. When iterating
    manifest entries during the DAG walk, assert that the root
    element exists with type ELEMENT_TYPE_TOKEN_ROOT AND that
    element.content.tokenId === manifestKey. Mismatches surface as
    MANIFEST_TOKENID_MISMATCH errors; wrong element types surface as
    MANIFEST_TYPE_MISMATCH. modules/payments/transfer/bundle-verifier
    already calls pkg.verify() so the receive pipeline (the audit's
    primary concern) is covered.

  - uxf/json.ts packageFromJson — fail-fast at the JSON import
    boundary with VERIFICATION_FAILED. Belt-and-braces for consumers
    that bypass verify() (e.g., direct UxfPackage.fromJson() use
    without a downstream bundle-verifier round).

  - uxf/ipld.ts importFromCar — same fail-fast at the CAR import
    boundary. Mirrors the json-path check for the CAR-bound delivery
    flow.

The export side already refuses to write packages with mismatched
manifest keys (the verify() result feeds production callers' export
paths), so this fix focuses on hostile-receive scenarios.

Tests: 8 new H2 regression tests covering verify() positive +
MANIFEST_TOKENID_MISMATCH + MANIFEST_TYPE_MISMATCH, the json import
boundary positive + spoofed-key + wrong-type rejections, and the
CAR import boundary spoofed-key + positive flow. All 443 existing
uxf tests pass unchanged. 8155 total unit tests pass.

Refs: #333 (H2). Stacked on #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…mode

Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any
tokenId whose per-token resolver threw. A single malformed sibling
element inside a Rule 4 synthetic-rebuild made a legitimately-received
token vanish from the merged manifest with only a `logger.warn`. On the
receive path this manifested as token loss from view: the caller had no
machine-readable signal that anything had gone wrong.

Fix:

  - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each
    `MergeSkip` records the tokenId, the original error, the target's
    prior root (if any), and the source's incoming root that failed
    to incorporate. Callers can react: replay, retry, surface to UI,
    or quarantine.

  - `opts.strict: true` aggregates skipped tokens into a
    `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply
    phase — target state is unchanged on the throw. Use on receive-
    path JOINs where any silent drop is unacceptable; leave default-off
    for opportunistic peer JOINs where partial coverage is fine.

  - `opts.onSkip` fires once per skipped tokenId for callers that want
    telemetry visibility without strict-mode failure. Throws inside
    onSkip are caught and logged — observability must not change merge
    semantics.

  - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown
    error also carries a non-typed `skipped` field with the structured
    skip list so callers needing machine-readable details can read it
    via `(err as { skipped }).skipped`.

  - Back-compat at the internal `mergePkg` boundary: legacy positional
    `(target, source, verifiedProofs)` callers keep working via a
    Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule,
    profile-token-storage-provider, profile/consolidation,
    flush-scheduler) ignore the returned `{ skipped }` and continue to
    use merge as void — behavior unchanged for them.

Tests: 8 new H3 regression tests covering empty-skipped baseline,
captured-skip on resolver throw, MergeSkip metadata fields,
strict-mode hard fail, strict-mode atomicity invariant (target
unchanged on throw), strict-mode happy path, onSkip callback fires
once per skip, and onSkip throw does not change merge semantics.
All 451 existing uxf tests pass unchanged. 8163 total unit tests pass.

Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…e in disposition-engine

modules/payments/transfer/disposition-engine.ts:629 called `verifyProof
(proof, trustBase, requestId)` with the bundle-supplied requestId and
trusted that the un-audited hydrateChain adapter had derived it
canonically. If the adapter erred or a malicious sender hand-crafted
the bundle with a proof anchored to a DIFFERENT transaction's
requestId, the proof would still verify (it IS a genuine on-chain
proof) but it would be incorrectly attributed to this transaction —
a proof-binding forgery.

Fix:

  - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult`
    interface on disposition-engine.ts. The hook canonically re-derives
    RequestId from (authenticator.publicKey, sourceState) and compares
    to the bundle's requestId; production wiring calls RequestId.create
    and compares toJSON.

  - Optional `assertRequestIdBinding?` dep added to
    DispositionEngineInput. When provided, the engine calls it BEFORE
    every verifyProof invocation:
      * `{ ok: true }`  → proof verification proceeds.
      * `{ ok: false }` → cryptoInvalid('proof-invalid') — same
        §5.3 [C](3) class as a clean PATH_INVALID at the verifier,
        because the (proof, requestId) pair does not bind to this tx.
      * throw → structuralInvalid('proof-throw').
    The binding gate fires BEFORE verifyProof so a forged binding
    short-circuits the (potentially expensive) on-chain proof check.

  - Optional shape preserves back-compat with the 66 existing engine
    tests (which do not set the hook) and with all production callers
    that have not yet been updated.

  - Plumbed through legacy-shape-adapter.ts: new optional
    `assertRequestIdBinding?` field on LegacyShapeAdapterInput is
    forwarded to DispositionEngineInput in buildEngineInput. Existing
    callers that don't supply the hook degrade to the pre-fix
    behaviour; new callers pass the production-grade RequestId.create
    comparer.

Tests: 7 new H4 regression tests covering:
  - Back-compat (hook absent → proof verification proceeds)
  - Hook returns ok=true (binding asserted, proof verified)
  - Hook receives both bundle requestId and authenticator unchanged
  - Hook returns ok=false (cryptoInvalid('proof-invalid'),
    verifyProof NOT invoked)
  - Hook throws (structuralInvalid('proof-throw'))
  - Binding gate fires BEFORE verifyProof (ordering invariant)
  - Null-proof chain entries do NOT trigger the binding gate

All 66 existing disposition-engine tests pass unchanged. All 1491
transfer tests pass. 8170 total unit tests pass. tsc clean.

Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3)
→ #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…mode

Before this fix, `UxfPackage.merge` (→ `mergePkg`) silently dropped any
tokenId whose per-token resolver threw. A single malformed sibling
element inside a Rule 4 synthetic-rebuild made a legitimately-received
token vanish from the merged manifest with only a `logger.warn`. On the
receive path this manifested as token loss from view: the caller had no
machine-readable signal that anything had gone wrong.

Fix:

  - `UxfPackage.merge()` now returns `{ skipped: MergeSkip[] }`. Each
    `MergeSkip` records the tokenId, the original error, the target's
    prior root (if any), and the source's incoming root that failed
    to incorporate. Callers can react: replay, retry, surface to UI,
    or quarantine.

  - `opts.strict: true` aggregates skipped tokens into a
    `UxfError('MERGE_PARTIAL_FAILURE')` thrown BEFORE the atomic apply
    phase — target state is unchanged on the throw. Use on receive-
    path JOINs where any silent drop is unacceptable; leave default-off
    for opportunistic peer JOINs where partial coverage is fine.

  - `opts.onSkip` fires once per skipped tokenId for callers that want
    telemetry visibility without strict-mode failure. Throws inside
    onSkip are caught and logged — observability must not change merge
    semantics.

  - New error code `MERGE_PARTIAL_FAILURE` on UxfErrorCode. The thrown
    error also carries a non-typed `skipped` field with the structured
    skip list so callers needing machine-readable details can read it
    via `(err as { skipped }).skipped`.

  - Back-compat at the internal `mergePkg` boundary: legacy positional
    `(target, source, verifiedProofs)` callers keep working via a
    Set-vs-opts shape probe. Public callers (4 in tree: PaymentsModule,
    profile-token-storage-provider, profile/consolidation,
    flush-scheduler) ignore the returned `{ skipped }` and continue to
    use merge as void — behavior unchanged for them.

Tests: 8 new H3 regression tests covering empty-skipped baseline,
captured-skip on resolver throw, MergeSkip metadata fields,
strict-mode hard fail, strict-mode atomicity invariant (target
unchanged on throw), strict-mode happy path, onSkip callback fires
once per skip, and onSkip throw does not change merge semantics.
All 451 existing uxf tests pass unchanged. 8163 total unit tests pass.

Refs: #333 (H3). Stacked on #351 (H2) → #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…e in disposition-engine

modules/payments/transfer/disposition-engine.ts:629 called `verifyProof
(proof, trustBase, requestId)` with the bundle-supplied requestId and
trusted that the un-audited hydrateChain adapter had derived it
canonically. If the adapter erred or a malicious sender hand-crafted
the bundle with a proof anchored to a DIFFERENT transaction's
requestId, the proof would still verify (it IS a genuine on-chain
proof) but it would be incorrectly attributed to this transaction —
a proof-binding forgery.

Fix:

  - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult`
    interface on disposition-engine.ts. The hook canonically re-derives
    RequestId from (authenticator.publicKey, sourceState) and compares
    to the bundle's requestId; production wiring calls RequestId.create
    and compares toJSON.

  - Optional `assertRequestIdBinding?` dep added to
    DispositionEngineInput. When provided, the engine calls it BEFORE
    every verifyProof invocation:
      * `{ ok: true }`  → proof verification proceeds.
      * `{ ok: false }` → cryptoInvalid('proof-invalid') — same
        §5.3 [C](3) class as a clean PATH_INVALID at the verifier,
        because the (proof, requestId) pair does not bind to this tx.
      * throw → structuralInvalid('proof-throw').
    The binding gate fires BEFORE verifyProof so a forged binding
    short-circuits the (potentially expensive) on-chain proof check.

  - Optional shape preserves back-compat with the 66 existing engine
    tests (which do not set the hook) and with all production callers
    that have not yet been updated.

  - Plumbed through legacy-shape-adapter.ts: new optional
    `assertRequestIdBinding?` field on LegacyShapeAdapterInput is
    forwarded to DispositionEngineInput in buildEngineInput. Existing
    callers that don't supply the hook degrade to the pre-fix
    behaviour; new callers pass the production-grade RequestId.create
    comparer.

Tests: 7 new H4 regression tests covering:
  - Back-compat (hook absent → proof verification proceeds)
  - Hook returns ok=true (binding asserted, proof verified)
  - Hook receives both bundle requestId and authenticator unchanged
  - Hook returns ok=false (cryptoInvalid('proof-invalid'),
    verifyProof NOT invoked)
  - Hook throws (structuralInvalid('proof-throw'))
  - Binding gate fires BEFORE verifyProof (ordering invariant)
  - Null-proof chain entries do NOT trigger the binding gate

All 66 existing disposition-engine tests pass unchanged. All 1491
transfer tests pass. 8170 total unit tests pass. tsc clean.

Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3)
→ #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…e in disposition-engine

modules/payments/transfer/disposition-engine.ts:629 called `verifyProof
(proof, trustBase, requestId)` with the bundle-supplied requestId and
trusted that the un-audited hydrateChain adapter had derived it
canonically. If the adapter erred or a malicious sender hand-crafted
the bundle with a proof anchored to a DIFFERENT transaction's
requestId, the proof would still verify (it IS a genuine on-chain
proof) but it would be incorrectly attributed to this transaction —
a proof-binding forgery.

Fix:

  - New `AssertRequestIdBindingFn` type and `AssertRequestIdBindingResult`
    interface on disposition-engine.ts. The hook canonically re-derives
    RequestId from (authenticator.publicKey, sourceState) and compares
    to the bundle's requestId; production wiring calls RequestId.create
    and compares toJSON.

  - Optional `assertRequestIdBinding?` dep added to
    DispositionEngineInput. When provided, the engine calls it BEFORE
    every verifyProof invocation:
      * `{ ok: true }`  → proof verification proceeds.
      * `{ ok: false }` → cryptoInvalid('proof-invalid') — same
        §5.3 [C](3) class as a clean PATH_INVALID at the verifier,
        because the (proof, requestId) pair does not bind to this tx.
      * throw → structuralInvalid('proof-throw').
    The binding gate fires BEFORE verifyProof so a forged binding
    short-circuits the (potentially expensive) on-chain proof check.

  - Optional shape preserves back-compat with the 66 existing engine
    tests (which do not set the hook) and with all production callers
    that have not yet been updated.

  - Plumbed through legacy-shape-adapter.ts: new optional
    `assertRequestIdBinding?` field on LegacyShapeAdapterInput is
    forwarded to DispositionEngineInput in buildEngineInput. Existing
    callers that don't supply the hook degrade to the pre-fix
    behaviour; new callers pass the production-grade RequestId.create
    comparer.

Tests: 7 new H4 regression tests covering:
  - Back-compat (hook absent → proof verification proceeds)
  - Hook returns ok=true (binding asserted, proof verified)
  - Hook receives both bundle requestId and authenticator unchanged
  - Hook returns ok=false (cryptoInvalid('proof-invalid'),
    verifyProof NOT invoked)
  - Hook throws (structuralInvalid('proof-throw'))
  - Binding gate fires BEFORE verifyProof (ordering invariant)
  - Null-proof chain entries do NOT trigger the binding gate

All 66 existing disposition-engine tests pass unchanged. All 1491
transfer tests pass. 8170 total unit tests pass. tsc clean.

Refs: #333 (H4). Stacked on #352 (H3) → #351 (H2) → #349 (H1) → #348 (C3)
→ #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…permanent

modules/payments/transfer/finalization-worker-sender.ts transitioned
the outbox entry to `failed-permanent` on any hard-fail and never
touched the source tokens that the instant-sender had marked
`transferring`/`pending` at submit time. With orphan auto-recovery
default-OFF, the spender-side balance was permanently locked as
unspendable.

Fix:

  - Added optional `sourceTokenIds?: ReadonlyArray<string>` to
    `UxfTransferOutboxEntry`. The instant-sender now records the
    selected source set at outbox-record construction time via
    `OutboxBuildArgs.sourceTokenIds`. Optional with `[]` semantics
    on `undefined` for back-compat with entries written before H5.

  - Added optional `recoverFailedPermanentSources?(sources, outboxId)`
    hook to `FinalizationWorkerSenderOptions`. When wired, the worker
    invokes it once at the `failed-permanent` transition with the
    entry's recorded source tokenIds. Production wiring flips each
    source back to a spendable state (typically `confirmed`) gated
    on a same-source-no-other-live-outbox check; the audit's gap
    was that the worker provided no signal at all.

  - The hook is best-effort: throws are caught and emitted as a
    `transfer:failed` event for triage, but the `failed-permanent`
    transition itself stands. This preserves the terminal-state
    contract — a recovery hook failure must not re-introduce the
    locked-pending bug it exists to fix.

  - Optional shape preserves back-compat with the 90+ existing
    finalization-worker-sender tests (which do not set the hook).

Tests: 5 new H5 regression tests covering:
  - Hook fires with the recorded source ids on failed-permanent
  - Pre-H5 entries (no sourceTokenIds field) get `[]` (not skipped)
  - Hook absent: worker still transitions cleanly
  - Hook throws: failed-permanent stands, transfer:failed emitted
  - Hook fires AFTER the outbox transition (ordering invariant)

The "hook does NOT fire on success/in-progress" guarantee is
structurally enforced by code placement (inside the
`if (totalFailure > 0)` branch); a behavioural test would require
duplicating the full §6.1 fake-adapter wiring for a property the
code-review already shows.

All 1498 transfer tests pass. 8175 total unit tests pass. tsc clean.

Refs: #333 (H5). Stacked on #353 (H4) → #352 (H3) → #351 (H2) → #349 (H1)
→ #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…nifestCas

ManifestCas.update's precondition check compared `observed.rootHash`
(a string read from the entry) to `prev.contentHash` (the caller's
expected value). Both sides are arbitrary writer-supplied labels — a
CAS pass meant "the labels agree", not "the content under the label
is the content the caller meant". An attacker writing an entry with
a forged `rootHash` field that doesn't match the actual pool content
slipped through unchanged.

Fix:

  - New optional `VerifyEntryRootFn` hook accepted via the ManifestCas
    constructor's `opts.verifyEntryRoot`. When wired, the hook is
    invoked AFTER the label CAS passes and BEFORE the write. The hook
    fetches the content the entry's `rootHash` references (production:
    the UXF pool element), recomputes the hash, and asserts it matches
    the declaration.

  - New `'integrity-failed'` discriminator on `ManifestCasResult.reason`
    with structured `integrityDetail` for triage. The result also
    carries `observed.contentHash` so callers can re-read state.

  - `ManifestCidRewriteCasError.casReason` widened to surface
    `'integrity-failed'`. The worker's outer retry loop already
    treats any throw from manifest-cid-rewrite as a hard CAS failure
    that bubbles up; the new reason rides through unchanged but
    preserves structure for operator triage.

  - Skipped on the `prev === null` (asserts-no-entry) path because
    there is no observed entry to verify. Skipped on label-CAS-mismatch
    because the swap has already been rejected.

  - Optional shape preserves back-compat: all 9 existing manifest-cas
    tests and the 100% of production callers that do not (yet) supply
    the verifier continue to behave identically. Production wiring is
    a follow-up that does not block this PR.

Tests: 6 new H7 regression tests covering:
  - Back-compat (verifier omitted → label-only CAS as before)
  - Happy path (verifier called AFTER label CAS, BEFORE write)
  - Integrity failure (returns 'integrity-failed', no write happens,
    storage unchanged, integrityDetail propagated)
  - Skip on prev=null (no observed entry to verify)
  - Skip on label-CAS-mismatch (short-circuit before verifier)
  - 'integrity-failed' has its own discriminator distinct from
    cas-mismatch / not-found / concurrent-modification

All 1512 transfer tests pass. 8181 total unit tests pass. tsc clean.

Refs: #333 (H7). Stacked on #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
vrogojin added a commit that referenced this pull request May 30, 2026
…ER test-coverage gap

The soak (manual-test-full-recovery.sh) routinely surfaces
`[ERROR] [V6-RECOVER] Stranded receive <hex> hit permanent recipient-
address mismatch (HD-index recovery exhausted) (no retry):
VerificationError: Recipient address mismatch` at §C → §D. The
existing unit tests cover helpers in isolation
(PaymentsModule.recipient-address-mismatch-recovery.test.ts) and the
error CLASSIFIER (PaymentsModule.proof-polling-persistence.test.ts
#269), but neither proves the composition: "given a sibling HD index
in the recipient's inventory, tryRecover finds it AND the SDK's
verifyRecipient agrees on the derived address". This left CI blind to
the regression mode where tryRecover returns a false-positive
candidate that Token.update later rejects.

This commit closes the gap at three layers:

  L1. Real-SDK integration test
      (tests/integration/payments/v6-recover-real-sdk-recovery.test.ts):
      - Replaces the file-level vi.mock for SigningService /
        UnmaskedPredicate / TokenId / TokenType / HashAlgorithm with
        REAL SDK imports.
      - Builds two real HD-derived signing services.
      - Constructs the sender-targeted DIRECT address via real
        UnmaskedPredicate.create → reference.toAddress (the EXACT path
        the SDK's verifyRecipient uses).
      - Drives `tryRecoverSigningServiceForRecipient` against real
        TokenId / TokenType / Uint8Array salt.
      - 4 tests: happy path, no-match negative, deriveAddressInfo throws,
        and (L2 composition) tryRecover + predicate construction
        round-trip equals the transferTx target address.

  L3. CI workflow (.github/workflows/soak-nightly.yml):
      - Nightly cron + workflow_dispatch trigger.
      - Probes testnet aggregator + Nostr relay; skip-not-fail when
        either is unreachable (third-party outages cannot block
        releases).
      - Checks out sphere-cli alongside sphere-sdk, links the local
        build, runs manual-test-full-recovery.sh with SPHERE_DEBUG=*.
      - Captures soak.log + workspace as artifacts (30-day retention
        on failure, 7-day on success).
      - Surfaces metrics in the workflow summary: V6-RECOVER count,
        Stranded receive count, POINTER_MONOTONICITY_VIOLATION count,
        bcast_pub>0 count, exit code.

What's tested by what (audit roll-up):
  - SDK address-derivation regression in tryRecover → L1 (real SDK).
  - tryRecover ↔ predicate-construction composition → L1 composition
    test (proves verifyRecipient cannot throw against the recovered
    state).
  - Cross-process / Nostr / IPFS / multi-daemon regressions → L3 soak.
  - Helper-level edge cases (mock-friendly) → existing
    PaymentsModule.recipient-address-mismatch-recovery.test.ts.
  - Error-classification on hard fail → existing #269 test in
    PaymentsModule.proof-polling-persistence.test.ts.

Tests pass: 4/4 new integration tests + all 8185 unit tests. tsc clean.

Refs: #333 (V6-RECOVER test-coverage gap, follow-up to H1-H7 stack).
Stacked on #355 (H7) → #354 (H5) → #353 (H4) → #352 (H3) → #351 (H2)
→ #349 (H1) → #348 (C3) → #347 (C2) → #346 (C1).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant