perf(per-miner): cache reverse cid→seq index so recover doesn't re-scan the allotment per submit #298
Closed
nghetienhiep wants to merge 1 commit into
recover_tier_seq_for re-scanned the full allotment (~10k HMAC-SHA256 per call) on every lookup. Since cathedralai#296 calls it on the submit path whenever the assignment row is replica-lagged, and that path runs inside the submit gate slot, a high replica-miss rate makes the O(allotment) re-scan dominate submit latency and spike CPU under load.

instance_id is a deterministic HMAC, so the cid->seq map is stable; build it once per (hotkey, epoch, tier) and cache it (lru_cache, bounded by CATHEDRAL_PERMINER_RECOVER_INDEX_CACHE, default 64) for amortized O(1) lookups. Behaviour is unchanged (still identity-bound: a foreign or bogus challenge_id still resolves to None).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
#296 made PM submit tolerate assignment-row replica lag by falling back to recover_tier_seq_for(...) when _lookup_perminer_assignment misses. That recovery re-scans the miner's full allotment, recomputing instance_id (HMAC-SHA256) for every seq: up to ~allotment_for(tier) (10k by default) HMACs per call.

Because that fallback runs on the submit hot path inside the _submit_slot gate, and fires precisely when the assignment row is replica-lagged (which can be a large share of submits under load), the O(allotment) re-scan can dominate submit latency and spike CPU, holding gate slots longer and lowering submit throughput exactly when the gate is saturated.

Change
instance_id is a deterministic HMAC, so the cid → seq map for a given (hotkey, epoch, tier) is stable. Build it once and cache it (lru_cache, bounded by CATHEDRAL_PERMINER_RECOVER_INDEX_CACHE, default 64 maps), so recovery becomes amortized O(1).

Behaviour is unchanged: still identity-bound (a foreign or bogus challenge_id resolves to None); recover_seq_for and the #296 replica-lag path are unaffected.