Skip to content

perf(profile): remaining recovery-path bottlenecks after #363 — pointer-layer over-invocation, applySnapshot during clear, kubo conn cap, Rule-4 design gate #364

@vrogojin

Description

@vrogojin

Context

Follow-up to #363 (post-mortem of #360). The branch chore/perf-instrumentation (c5962f0 + four follow-up commits, unpushed) added measurement coverage and an SPHERE_SKIP_RULE4=1 prototype that demonstrated a −33.6% soak wall-clock improvement (−209 s) and −46% on §D.4 recovery (229 s → 123 s) by bypassing the O(N²) UXF-level pairwise verification loop in ProfileTokenStorageProvider.load().

Full report (497 lines, all raw artifacts preserved locally): /tmp/soak-metrics/issue-363/REPORT.md.

This issue captures the remaining open items the #363 data surfaced — none of them blocked the headline finding, but each is a measurable architectural smell worth its own fix.

Headline numbers (from the #363 SKIP_RULE4 A/B)

section baseline SPHERE_SKIP_RULE4=1 delta
§D.4 recovery 229 s 123 s −46%
Total soak wall 622 s 413 s −33.6%
profile.load.rule4PairwiseMs 222.5 s 0 (bypassed) −222.5 s
profile.applySnapshot fires 80 18 −77%
pointerLayer.recoverLatest fires 372 461 +24% (workflow completes faster ⇒ more cycles per wall second)

Hypotheses already rejected by #363 data

Hypothesis VALIDATED by #363 data

  • "Convert UXF → TXF in memory, verify at TXF level not UXF level." The Rule-4 pairwise loop at profile/profile-token-storage-provider.ts:1521–1533 is 35.8% of the 622 s soak. The graceful-degradation catch path already proves the codebase tolerates verifiedProofs = undefined; making that the default for snapshot-sourced loads is a 5-line change.

Snapshot-path baseline (validates the design expectation)

When the apply-snapshot path is allowed to operate alone (the user's stated design: scan-pointer → fetch-blob → done), each step is fast:

sub-step calls avg ms max ms
applySnapshotCb.fetchRootMs (single-shot block fetch by CID) 18 6.0 24
applySnapshotCb.parseSnapshotMs (CBOR decode of root block) 18 5.5 20
applySnapshotCb.dispatchMs (OrbitDB rebuild from parsed snapshot) 18 385 954
applySnapshotCb.totalMs 18 397 981

So when the path is exercised correctly, full apply-snapshot is ~400 ms, not "dozens of seconds" — confirming the maintainer's design statement.


Open items for THIS issue

1. pointerLayer.recoverLatest over-invocation (P1)

In the SKIP_RULE4 run, §D.4 ran for 123 s and observed 108 recoverLatest calls + 110 findLatestValidVersion calls for a single mnemonic recovery. The design (per the user, and per CLAUDE.md's L4 model) expects exactly one recoverLatest per recovery + exactly one applySnapshot.

Per-second rate: 108/123 = 0.88 recoveries/sec. Each recoverLatest averages 250 ms (4.5 s max), totalling 27.1 s = 22% of §D.4 wall in the SKIP run.

Acceptance criteria:

  • Add a counter for "recoveries served from cache vs cold".
  • Instrument the calling code path (likely lifecycle-manager.ts or sync-loop) to determine why pointer-scan re-runs every cycle.
  • Either cache the recovered pointer head for the duration of the recovery window, or short-circuit subsequent calls inside recoverLatest when the head was just resolved.
  • Verify with a fresh A/B soak that §D.4 drops below 80 s (target: design-correct ≈ 10 s, realistic-floor ≈ 50 s).

2. profile.applySnapshot fires during sphere clear (P2)

In the baseline soak, §D.3 ("sphere clear on all wallets", 41 s wall) recorded 23 profile.applySnapshot invocations out of 80 total for the soak. A clear command applying snapshots is structurally wrong — clear is supposed to delete state, not seed it.

Acceptance criteria:

  • Trace the call graph that fires applySnapshot during clear.
  • Either guard setApplySnapshotCallback's closure against isShuttingDown || hasShutdown (it's already partially guarded — check applySnapshotIfWired and the closure body) or guard clear() against firing snapshot-recovery cycles.
  • Soak with the same env as before; §D.3 should drop below 15 s and applySnapshot.fired calls during §D.3 should be 0.

3. Attribute the remaining 47 s of §D.4 (P2)

After SKIP_RULE4, §D.4 is 123 s wall. Instrumented total accounts for 76 s (62%); the remaining 47 s (38%) is in code that has no counter yet. Top suspects:

  • Token deserialization in serialization/txf-serializer.ts (called per token from parseTokensFromBundle).
  • Post-recovery flush (flushScheduler.flushToIpfs already instrumented — confirm it's not double-counted).
  • OrbitDB internal append/index pipeline during dispatch (the IPLog write inside db.put, which the orbitdb.putEntry counter wraps but may not cover all paths).
  • Helia FsBlockstore writes during snapshot dispatch (no counter yet).

Acceptance criteria:

  • Add per-step counters inside profile-snapshot-dispatcher.runProfileSnapshotJoin (decode, per-writer dispatch, per-key apply).
  • Add a counter around parseTokensFromBundle / extractTokensFromTxfData.
  • Re-soak; instrumented coverage of §D.4 should reach ≥85%.

4. Convert SPHERE_SKIP_RULE4 prototype into a design-correct gate (P1)

The current env flag bypasses Rule-4 unconditionally. The correct design (per #363 conclusion) is: skip Rule-4 pairwise enrichment when the load source is a single trusted snapshot blob (i.e., snapshot-driven recovery), but keep it for genuine cross-device JOIN (multiple device contributions where Rule 4 is the conflict-resolution mechanism).

Acceptance criteria:

  • In ProfileTokenStorageProvider.load(), detect whether loadedBundles originate from a single snapshot blob vs from multiple independent device bundles. Likely signal: presence of a snapshot pointer CID + bundles all referenced from that snapshot.
  • When snapshot-sourced, set verifiedProofs = undefined (matches the existing fallback path on verifier failure).
  • When cross-device-sourced (e.g., live sync merge), keep the pairwise loop.
  • Document the design choice in code + REPORT.md addendum.
  • Remove SPHERE_SKIP_RULE4 env flag once the gate is in.

5. Kubo connection-cap A/B (P3 — operational, no code change)

Current host kubo container: Swarm.ConnMgr.HighWater=128 / LowWater=96. Maintainer's hypothesis: too many unrelated peers degrade kubo, slowing the IPFS-pin step inside flushScheduler.flushToIpfs and ipfs.pinCar.

Acceptance criteria:

  • Reduce to HighWater=32 / LowWater=24 via:
    docker exec ipfs-kubo ipfs config --json Swarm.ConnMgr.HighWater 32
    docker exec ipfs-kubo ipfs config --json Swarm.ConnMgr.LowWater 24
    docker restart ipfs-kubo
  • Re-run the soak with the same SPHERE_PERF instrumentation.
  • Compare flushScheduler.flushToIpfs.totalMs, ipfs.pinCar.totalMs, orbitdb.putEntry.{bundle,other} against the baseline.
  • Decide if the cap should be applied permanently in the host's kubo config or as a documented recommendation.

6. Fix the §D.5 snapshot-diff contamination from SPHERE_PERF=1 (P3 — test harness)

Both #363 SKIP and baseline runs reach §D.5 ("Assertions") and fail because normalize_snapshot() in manual-test-full-recovery.sh:180 strips single-line timestamped logs but the perf-counter snapshot is a multi-line block; continuation lines ( 'counter': { count: ... }) survive the filter and create spurious diffs.

Three fix options (any one is sufficient):

  • (a) Extend normalize_snapshot() to strip the entire [perf-counters] snapshot: { ... } block.
  • (b) Route logger.info('perf', …) to stderr (or to a side-file) instead of stdout.
  • (c) Add a --no-perf-in-snapshots mode to the test that disables auto-dump for sphere status invocations specifically.

Acceptance criteria: A soak run with SPHERE_PERF=1 produces passed=1/1 for manual-test-full-recovery.sh end-to-end.


Files touched in chore/perf-instrumentation (unpushed)

core/perf-counters.ts                                       (new)
profile/orbitdb-adapter.ts                                  (counters)
profile/profile-token-storage/bundle-index.ts               (counters)
profile/profile-token-storage/flush-scheduler.ts            (counters)
profile/ipfs-client.ts                                      (counters)
profile/factory.ts                                          (applySnapshotCb counters)
profile/profile-token-storage-provider.ts                   (load + applySnapshot counters + SKIP_RULE4 env gate)
profile/profile-snapshot-cache.ts                           (read/write counters)
profile/aggregator-pointer/ProfilePointerLayer.ts           (recoverLatest, discover, probe, classify)
profile/aggregator-pointer/discover-algorithm.ts            (findLatestValidVersion)
oracle/UnicityAggregatorProvider.ts                         (aggregator-call counters)
uxf/UxfPackage.ts                                           (computeVerifiedProofs counters)
tests/unit/core/perf-counters.test.ts                       (15 tests, pre-existing)
.tmp/soak-264.sh                                            (grep -c arithmetic fix)

All counters are gated by SPHERE_PERF=1 (zero overhead when off). Build: npm run build produces dist/profile/{index,node,browser}.js with all strings inlined.

Reproducing the A/B

# Baseline (Rule-4 enabled)
ITERATIONS=1 ARTIFACTS_ROOT=/tmp/soak-baseline/soak-runs \\
  SPHERE_PERF=1 SPHERE_DEBUG=perf=info \\
  bash /home/vrogojin/uxf/.tmp/soak-264.sh

# Treatment (Rule-4 bypassed)
ITERATIONS=1 ARTIFACTS_ROOT=/tmp/soak-skip/soak-runs \\
  SPHERE_PERF=1 SPHERE_DEBUG=perf=info SPHERE_SKIP_RULE4=1 \\
  bash /home/vrogojin/uxf/.tmp/soak-264.sh

# Aggregate
node /tmp/soak-metrics/issue-363/aggregate-perf.mjs /tmp/soak-baseline/soak-runs
node /tmp/soak-metrics/issue-363/aggregate-perf.mjs /tmp/soak-skip/soak-runs

Refs

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions