Skip to content

perf(profile): post-mortem of #360 — static-analysis fixes regressed; instrumentation added for a real profile-driven investigation #363

@vrogojin

Description

@vrogojin

Status

This issue replaces #360.

#360 has been closed as a failed attempt. The 7-finding stack on fix/issue-360-profile-sync-perf (commits b8ba5fca8094e8) was implemented in good faith and then A/B-soak-measured against main. The measurement was unambiguous: the "perf fixes" make things worse and introduce a correctness regression.

This issue exists so the next investigator can pick up cleanly with:

  1. The negative result from perf(profile): sync amplification + O(B²) JOIN + 5× OpLog walks — extreme CPU + apparent memory leak on edge devices #360 in one place, so we don't redo the same dead-end work.
  2. The hypothesis the maintainer wants explored next (batching writes).
  3. The measurement infrastructure that has been added to make the next attempt profile-driven rather than static-analysis-driven.

What was measured

A/B soak on 2026-05-30 using .tmp/soak-264.sh ITERATIONS=1, with SPHERE_DEBUG=* and a 5-second top-style sampler running alongside.

Both runs reach B ≈ 70 active bundles — the same regime #360 cited as worst-case. So the regression cannot be dismissed as "the soak doesn't exercise the targeted scale".

Section Baseline (main) Treatment (#360 fixes) Δ
§1 Peer1 setup 55 s 58 s +3 s
§A.1 Peer2-alice 31 s 31 s 0
§C.1b deliver invoice 6 s 16 s +10 s (2.67×)
§C.2 Bob pays 62 s 148 s +86 s (2.39×)
§C.4 cross-device sync 43 s 75 s +32 s (1.74×)
§D.1 pre-clear snapshots 85 s 127 s +42 s (1.49×)
§D.3 sphere clear 47 s 50 s +3 s
§D.4 IPFS-only recovery 219 s 238 s +19 s
§D.5 Assertions PASS FAIL rc=1
§E ledger 22 s not reached
TOTAL 631 s 805 s +174 s (1.28× slower)

Resource (trimmed to active soak window, 30 s start-skip):

Metric Baseline Treatment
avg RSS 1.21 GB 1.40 GB (+16 %)
avg CPU% 147 % 163 % (+11 %)
max RSS 2.04 GB 4.74 GB*
max CPU% 333 % 1024 %*

*Treatment max RSS / CPU peak is a parallel tsup DTS build of the main baseline worktree that ran during the first ~30 s of the treatment window — not the SDK itself. Discount when comparing peaks.

Correctness regression — §D.5

Treatment failed §D.5 with a real cross-device divergence on Bob's UCT balance after recovery:

ASSERT FAIL (bob-peer1-vs-peer2-after)
-UCT: 99.999999999989 (1 token)    ← peer1, recovered from mnemonic + IPFS
+UCT: 199.999999999989 (2 tokens)  ← peer2, original device

Baseline ran the same flow with identical inputs and all 6 assertions passed.


What went wrong in #360 (post-mortem)

The original issue body looked like a rigorous perf analysis. It is not. It is static analysis of source code with cost models derived by multiplying source-code counts. There is no flamegraph, no node --prof output, no perf record, no allocation sampler, no actual CPU% trace from the symptom window. The "525k Map allocations per load" and "6.5 minutes synchronous network IO" numbers are computations on the structure of the code, not on what the code does at runtime.

Implementing those findings produced the measured regression because:

  1. The pathologies called out were not the actual limiter on the workload. The dominant sections (§C.2 send, §D.1 pre-clear, §D.4 recovery) are network-bound (Nostr round-trips, aggregator HEAD-verify, IPFS gateway propagation), not CPU-bound. Local CPU optimisation cannot shorten a network-bound wait — but it can lengthen it if the optimisation triggers rate-limiting.

  2. Parallel fan-out (Findings DMs not working #2, Feature/nametag enhancements #5, cap=8) likely triggers testnet rate-limiting at B ≈ 70. The pre-fix sequential dispatch may have been pessimal-looking but rate-limit-safe. Going from 1 in-flight → 8 in-flight against a shared aggregator + Kubo gateway turns "slow but steady" into "throttled and worse".

  3. Finding Sphere SDK #1's identity-filter on OrbitDB 'update' events is the prime suspect for the §D.5 divergence. The fix suppresses handleReplication callback for update entries authored by our own identity, on the assumption that "only peer writes warrant a re-sync". But the same update event is also the local state-machine's signal that its own write was durably applied. Filtering it out races the receiver against the snapshot publish, and at edge of timing windows the recipient loses a token.

  4. Net code change is +1500 / −200 across 30 files. We added more code than we removed. Most "simplifications" were rearrangements (WeakMap layer, AggregateError branch, prefix-collision check, extracted helper). Two genuine wins (single concurrency constant, folded size-probe-into-hash-compute) do not move the needle on a network-bound workload.

The takeaway: #360 skipped the measurement step that should precede any perf claim. We should not repeat that.


What the maintainer wants explored next

on every write (one CID) we apparently doing the whole round-trip, while we could simply batch all these writes into a single shot

This is the hypothesis the next investigation should test. Concretely:

  • Every OrbitDbAdapter.putEntry call (which is invoked many times per logical operation — addBundle, OUTBOX state transitions, SENT-ledger appends, tombstone writes, profile-pointer publishes, etc.) currently triggers its own dag-cbor encode → IPFS pin → HEAD-verify → optional Nostr publish round-trip.
  • A logical operation like "Bob pays Alice" emits ~10–50 such writes in close succession.
  • If each one is a full HTTP HEAD-verify round-trip against the gateway (flush-durability deadline 30 s), the wall-clock cost per logical op is N_writes × verify_RTT, not 1 × verify_RTT.
  • Batching writes within a logical operation (or within a debounce window) into a single HEAD-verify could collapse this from O(N) round-trips to O(1).

What "batching" means concretely needs profile data first — see the next section.


Measurement infrastructure (added on chore/perf-instrumentation)

A small counters module has been added so the next investigator does not start from scratch:

  • profile/internal/perf-counters.tsincr(name), observe(name, ms), time(name, fn), dumpAndReset(). Gated by SPHERE_PERF=1 (no overhead when off). When on, dumps a per-counter snapshot every 5 s via the existing logger.info('perf', ...) channel.
  • Wired at the hot paths the perf(profile): sync amplification + O(B²) JOIN + 5× OpLog walks — extreme CPU + apparent memory leak on edge devices #360 findings claimed but never measured:
    • OrbitDbAdapter.putEntry — count + total time. Resolves the per-write-round-trip question directly.
    • OrbitDbAdapter.onReplication callback — fires/sec.
    • BundleIndex.listActiveBundles — calls + walk time.
    • fetchCarFromIpfs — call count, blocks-per-call, total bytes, wall-clock per call.
    • pinCarBlocksToIpfs — block count, total bytes, wall-clock per call.
    • flushToIpfs — call count, blocks, wall-clock.
    • UxfPackage.computeVerifiedProofsAcross — total proofs verified, verifier calls, wall-clock.

To use:

SPHERE_PERF=1 SPHERE_DEBUG=perf=info bash .tmp/soak-264.sh

Every 5 s the daemon emits a counter snapshot. After the soak you can post-process the log to build flame-shaped tables.

This is not a fix. It's the missing measurement step that #360 should have started with.


Suggested next steps

  1. Run a soak with SPHERE_PERF=1. Identify the top-5 counters by total time. Do not propose fixes until that table exists.
  2. From the counter data, decide whether the batching hypothesis is borne out — is putEntry total-time a meaningful fraction of wall-clock? Is fetchCarFromIpfs time dominated by HEAD-verify or by block-fetch? Does flushToIpfs time scale linearly with bundle count?
  3. Propose at most one targeted intervention. Re-run the soak. Compare. Land only if A/B is unambiguously better on both wall-clock and assertion pass-rate.

Artefacts

A/B soak artefacts preserved (NOT in git — local fs only):

  • /tmp/soak-metrics/treatment/fix/issue-360-profile-sync-perf run (805 s, rc=1, §D.5 FAIL).
  • /tmp/soak-metrics/baseline/main run (631 s, rc=0, all assertions PASS).

Both contain soak-runs/run-1/script.log, totals.tsv, samples.tsv, summary-metrics.json, plus the KEEP=1 workspace tree with daemons' on-disk state for any deeper triage.

The dead-end branch is preserved with a DO_NOT_USE_DEAD_END.md marker at the repo root.

cc maintainer for triage / labelling.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions