feat(cache): multi-slot LRU MRU partial block cache #1149

Draft
blightbow wants to merge 19 commits into jundot:main from blightbow:feat/mru-partial-block-cache
@blightbow
Contributor

@blightbow blightbow commented May 9, 2026

This PR proposes an in-memory cache for the trailing slice of prefill that cannot be committed to the disk cache. It captures the sub-block tail of a completed prefill (anything past the last full block_size boundary) so that a resubmitted prompt skips the avoidable partial recompute that block-aligned caching otherwise forces.
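To make the "trailing slice" concrete, here is a minimal sketch of where the block-aligned boundary falls. The helper name `split_at_block_boundary` is illustrative and not part of this PR, and plain lists stand in for token arrays.

```python
def split_at_block_boundary(tokens, block_size):
    """Split a prompt into its block-aligned prefix and the sub-block tail."""
    aligned = (len(tokens) // block_size) * block_size
    return tokens[:aligned], tokens[aligned:]


# Two full 256-token blocks plus a 139-token tail.
tokens = list(range(651))
full, tail = split_at_block_boundary(tokens, 256)
print(len(full), len(tail))  # 512 139
```

Only the first part can be persisted as full blocks; the second part is what this cache keeps in memory.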

The cache is a bounded LRU dict keyed by parent_hash, with default capacity 4 entries (configurable via --mru-partial-max-entries). Entries are written at the end of store_cache for any prefill that produces a trailing partial, and consumed during cache-hit admission via apply_mru_partial. On a successful splice the entry is promoted to the LRU tail; on capacity overflow the oldest entry is evicted. On any apply-time mismatch (different tail tokens, layer-count mismatch, splice failure), only the mismatched entry is evicted. Sibling entries for other prefixes are preserved.
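The container mechanics can be sketched with `collections.OrderedDict`. This is a simplified illustration under assumed names (`PartialLRU`, `stash`, `on_apply`); the real entries hold KV tensors and the real methods live on the prefix cache class.

```python
from collections import OrderedDict


class PartialLRU:
    """Bounded LRU keyed by parent_hash; the newest entry sits at the tail."""

    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def stash(self, parent_hash, entry):
        if self.max_entries <= 0:
            return  # capacity 0 disables stashing
        # Replace any same-key entry and move it to the tail.
        self._entries.pop(parent_hash, None)
        self._entries[parent_hash] = entry
        # Evict from the head (oldest) while over capacity.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def on_apply(self, parent_hash, ok):
        if parent_hash not in self._entries:
            return
        if ok:
            self._entries.move_to_end(parent_hash)  # promote on success
        else:
            self._entries.pop(parent_hash, None)    # evict only this key

    def keys(self):
        return list(self._entries)


cache = PartialLRU(max_entries=2)
cache.stash("a", "tail-a")
cache.stash("b", "tail-b")
cache.on_apply("a", ok=True)   # "a" becomes most recently used
cache.stash("c", "tail-c")     # over capacity: "b" is the oldest and is evicted
cache.on_apply("c", ok=False)  # mismatch evicts "c" only
print(cache.keys())            # ['a']
```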

This is a memory-only structure, kept distinct from the disk and hot caches. No attempt is made to persist partials; the eviction discipline below ensures they cannot accumulate beyond capacity.

Scope

The optimization helps across the full range of single-prompt repeat and concurrent-prompt scenarios:

  • Single-user iteration (swipes, regenerations, A/B testing with the same prompt) gets the "skip the partial recompute" benefit on every repeat.
  • Multi-user or multi-conversation workloads up to mru_partial_max_entries distinct active prefixes get the same benefit per prefix. LRU eviction handles workloads beyond capacity gracefully: the most recently used prefixes stay warm, older ones fall back to a full recompute on next hit (which is the pre-cache baseline, so the worst case is "no benefit," not "regression").

The eviction gate on every mismatch protects correctness regardless of multiplicity; the operational footprint is bounded (see below).

Safety

The cache surface is small but the failure modes are sharp. The implementation:

  • Refuses to stash on hybrid models. Any layer with a non-sliceable cache type (RotatingKVCache, ArraysCache, BatchRotatingKVCache, and so on) disables stashing entirely. Splicing into only the sliceable layers would create per-layer offset skew at decode and silently corrupt generation on Gemma 3, Mistral, and any future hybrid. Gated via an explicit KNOWN_SLICEABLE_CACHE_TYPES whitelist in omlx/cache/type_registry.py. Notably, the registry's supports_block_slicing flag is not trustworthy here: DefaultCacheHandler falls back to KVCache semantics and would report several unregistered batch/pool types as sliceable.
  • Splices transactionally. Phase 1 builds replacement keys/values/offset for every layer with no mutation; phase 2 commits. A failure mid-loop rolls everything back and evicts the entry. No half-mutated cache state can reach a forward pass.
  • Evicts on every miss kind. Token-prefix mismatch, length mismatch, layer-count mismatch, splice failure: each pops only the matching key, leaving siblings alone. BlockAwarePrefixCache.clear() wipes the whole dict (cache-corruption recovery path).
  • Freed-paged-block guard. If the parent paged block has been freed between stash and apply, the apply path returns no-op rather than falling through to a None-keyed lookup, which would falsely match a short-prompt entry against an unrelated request whose parent is gone.
  • Refuses on ambiguous cache layout. Multi-turn requests can produce cache buffers whose length doesn't unambiguously identify whether they're global-indexed or local-indexed; the stash path refuses rather than guess.
  • Functions without disk writes. The gate keys on PagedSSDCacheManager instance presence, not on whether SSD writes are happening. Under hot_cache_only=True (set via settings or OMLX_HOT_CACHE_ONLY env), the disk writer thread is disabled but reconstruct still works via the hot tier's short-circuit; the MRU cache correctly remains active in that mode.
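The two-phase splice from the list above can be sketched as follows. Python lists stand in for `mx.array` tensors, and `Layer`, `list_concat`, and `splice_transactional` are hypothetical names for illustration, not the PR's code.

```python
class Layer:
    """Stand-in for one layer's KV cache; lists play the role of tensors."""

    def __init__(self, keys, values):
        self.keys = list(keys)
        self.values = list(values)
        self.offset = len(self.keys)


def list_concat(a, b):
    return a + b


def splice_transactional(layers, partial, concat=list_concat):
    """Append per-layer (keys, values) tails; return tokens applied (0 on failure)."""
    if len(partial) != len(layers):
        return 0  # layer-count mismatch: refuse before touching anything
    staged = []
    try:
        # Phase 1: build every replacement without mutating any layer.
        for layer, (pk, pv) in zip(layers, partial):
            staged.append(
                (concat(layer.keys, pk), concat(layer.values, pv), layer.offset + len(pk))
            )
    except Exception:
        return 0  # nothing was written, so nothing needs undoing
    # Phase 2: commit. Only plain attribute assignments happen here.
    for layer, (k, v, off) in zip(layers, staged):
        layer.keys, layer.values, layer.offset = k, v, off
    return len(partial[0][0]) if partial else 0
```

In this sketch the caller would evict the entry whenever the function returns 0 for a non-empty partial.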

Test coverage in tests/test_prefix_cache.py::TestMRUPartialBlockCache and ::TestMRUPartialMultiSlot pins each of these, including a transactional-rollback test that mocks mx.concatenate to fail mid-loop and verifies no layer was mutated, a structural invariant test that pins kv_data to mx.array storage so future "optimize to CPU copy" regressions get caught, and parameterized LRU mechanics covering capacity eviction, apply-success promotion, and sibling preservation under mismatches.

Operational note: deferred Metal cache clear

The post-completion deferred clear (#435, #557) is suppressed for one extra _DEFERRED_CLEAR_DELAY window when any MRU entry is warm at the deadline. Warm partials are a strong predictor that the same prompt will return immediately and would benefit from the still-resident lazy KV tensors. The suppression is bounded at one per deferral epoch: a fresh budget arms only when a completion finds _deferred_clear_at set to None, and the next deadline fires regardless of MRU state. This avoids the failure mode where hot-prompt repeats refresh the budget faster than _DEFERRED_CLEAR_DELAY and the pool-bloat mitigation (#411) is silently defeated.
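A rough model of the epoch rule, assuming a simplified scheduler that exposes only a deadline and a one-shot budget. The class and method names are hypothetical, and pushing the deadline by exactly one delay from the old deadline is an assumption of this sketch.

```python
class DeferredClear:
    def __init__(self, delay):
        self.delay = delay
        self.clear_at = None                # None means no deferral epoch is open
        self.suppression_available = False

    def on_completion(self, now):
        if self.clear_at is None:
            # A new epoch starts: arm the one-shot suppression budget.
            self.suppression_available = True
        # Completions inside an open epoch still extend the deadline,
        # but they never re-arm the budget.
        self.clear_at = now + self.delay

    def tick(self, now, has_warm_partial):
        """Return True when the Metal cache clear should fire."""
        if self.clear_at is None or now < self.clear_at:
            return False
        if self.suppression_available and has_warm_partial:
            self.suppression_available = False  # spend the budget
            self.clear_at += self.delay         # push the deadline once
            return False
        self.clear_at = None                    # close the epoch
        return True


dc = DeferredClear(delay=10)
dc.on_completion(now=0)                        # epoch opens, budget armed
print(dc.tick(now=10, has_warm_partial=True))  # False: suppressed once
dc.on_completion(now=12)                       # extends deadline, budget stays spent
print(dc.tick(now=22, has_warm_partial=True))  # True: fires despite warm MRU
```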

Memory accounting

Each entry holds real mx.array allocations (via mx.copy) and counts automatically against mx.get_active_memory(). Every runtime memory enforcement and telemetry path in this codebase (process enforcer, scheduler limit checks, periodic-clear threshold) reads from there. Worst case is one block_size of KV per entry, held alive between completions and admissions.

Under hot_cache_only=True, the hot cache and the MRU dict share the same in-memory KV headroom envelope. Operators running that mode at high --mru-partial-max-entries should size --hot-cache-max-size accordingly: the two settings are co-tenants of the same budget, not independent dials. The default of 4 is conservative enough that this rarely matters in practice.

Configuration

--mru-partial-max-entries N (default 4, matching the dflash max_entries precedent in #1120) sets the maximum simultaneous entries. 0 disables the feature entirely (silent fallback to "no MRU" behavior, mirroring the --hot-cache-max-size 0 convention). Also available via mru_partial_max_entries in CacheSettings. Operators with high-concurrency workloads can opt up; operators with memory pressure can opt down or off.
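For illustration only, a standalone `argparse` sketch of the flag semantics described above. How the project actually registers the flag is not shown here, and the rejection of negative values is an assumption of this sketch.

```python
import argparse


def non_negative_int(text):
    """Parse an integer and reject negatives (assumed policy, not confirmed)."""
    value = int(text)
    if value < 0:
        raise argparse.ArgumentTypeError("must be >= 0")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--mru-partial-max-entries", type=non_negative_int, default=4)

args = parser.parse_args(["--mru-partial-max-entries", "0"])
mru_enabled = args.mru_partial_max_entries > 0  # 0 turns the feature off
print(mru_enabled)  # False
```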

Admin endpoint symmetry

The /api/ssd-cache/clear admin endpoint now also wipes MRU partials for each loaded scheduler, via a new BlockAwarePrefixCache.clear_mru_partials() method. Without this, partials would chain from paged-block hashes whose KV bytes were just flushed by the endpoint, violating the operator's "drop all warm caches" intent. The same clear_mru_partials() hook is the seam intended for the future /api/hot-cache/clear endpoint introduced in #1183: a one-line addition at the same loop once that PR lands.

Observability

The MRU partial cache plugs into the same observability surface #1183 established for the prefix and hot/disk tiers, so operators tune --mru-partial-max-entries from the same dashboard they use for memory and disk hit rates.

New counters on PrefixCacheStats: mru_partial_stashes (entry writes, including same-key replacements), mru_partial_hits (successful splices), mru_partial_evictions (capacity-overflow + apply-miss + admin-clear wipes; the cache-corruption clear() path intentionally zeros all counters via reset_stats() rather than incrementing here), mru_partial_tokens_saved (the direct compute-saved measure: prefill tokens that did not have to be re-run). New gauges: mru_partial_entries and mru_partial_max_entries. Derived mru_partial_hit_rate = hits / stashes — the "stash payoff" ratio — surfaces both windowed and cumulative in CacheRateTracker.

The dashboard mirrors the hot-cache pattern: a header gauge ("MRU tails N/M entries"), rate-strip cells for the hit rate and tokens-saved counter, and a per-model column. All of it is gated on mru_partial_max_entries > 0, so configurations that disable the feature don't see the surface.

Base

This branch is based on #1183 to use the cache-tier observability and hot-cache architecture as the integration substrate. Targeting merge after #1183.

@blightbow blightbow marked this pull request as draft May 13, 2026 20:54
@blightbow
Contributor Author

Converting to a draft PR. I was attempting to rebase this on unmerged PR #1183 to implement MRU in the warm memory cache. During the rebase I identified test regressions in #1183 (TestRuntimeCacheObservability) that need addressing. This PR is now a draft pending resolution in #1183.

ivaniguarans and others added 10 commits May 13, 2026 17:09
Server-side snapshot differencing via CacheRateTracker: stores the last
90 snapshots of cumulative cache counters (10s intervals, 15 min window)
and computes rates via start/end differencing. Zero hot-path changes —
snapshots are lazy, driven by dashboard polling cadence.
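Snapshot differencing of this kind can be sketched with a bounded deque of cumulative counters. The class below is a hypothetical simplification, not the CacheRateTracker implementation; the counter pair (hits, lookups) is assumed for the example.

```python
from collections import deque


class RateWindow:
    """Keep the last N snapshots of cumulative counters and difference the ends."""

    def __init__(self, max_snapshots=90):
        self._snaps = deque(maxlen=max_snapshots)  # oldest snapshot drops off automatically

    def record(self, hits, lookups):
        self._snaps.append((hits, lookups))

    def windowed_hit_rate(self):
        if len(self._snaps) < 2:
            return None  # need a start and an end to difference
        (h0, n0), (h1, n1) = self._snaps[0], self._snaps[-1]
        delta_lookups = n1 - n0
        if delta_lookups <= 0:
            return None  # no traffic in the window, or counters were reset
        return (h1 - h0) / delta_lookups


window = RateWindow()
window.record(hits=10, lookups=20)
window.record(hits=40, lookups=60)
print(window.windowed_hit_rate())  # 0.75
```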

New metrics exposed in /api/stats under cache_observability:
- prefix_hit_rate (cumulative + windowed)
- eviction count, ssd_hot_rate
- per-model and weighted aggregate across models

Dashboard: new "Cache Breakdown" card below Average Speed showing hit
rate, evictions, and hot cache hits. Session-only (hidden in All-Time
view since counters reset on model reload).

The new disk_max aggregation reads get_ssd_cache_max_size_bytes which
the existing tests left unconfigured, producing a MagicMock that raises
on max(MagicMock, int).  Also add the missing max_size_bytes key to the
expected model payloads.

The paged SSD cache only persists full block_size blocks; the trailing
sub-block tail (e.g. 139 of 256 tokens) is otherwise re-prefilled on
every repeat request.  For Kimi K2.5-class models this adds ~1.5-2s
of avoidable TTFT per submission of an identical prompt.

Add a single-slot in-memory stash for that tail.  After every
store_cache that produces a trailing partial we keep its KV state under
the parent block's hash.  The next admission whose remaining_tokens
start with the stashed tokens splices the partial onto the reconstructed
cache and shrinks remaining_tokens by the partial's length, eliminating
the tail prefill.
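The admission-side match reduces to a prefix comparison. A minimal sketch, with `match_partial` as a hypothetical helper name and plain lists standing in for token arrays:

```python
def match_partial(remaining_tokens, stashed_tokens):
    """Return (tokens_applied, new_remaining) for a stashed tail."""
    n = len(stashed_tokens)
    if n == 0 or remaining_tokens[:n] != stashed_tokens:
        return 0, remaining_tokens  # miss: the caller would evict the entry
    return n, remaining_tokens[n:]  # hit: the tail no longer needs prefill


print(match_partial([5, 6, 7, 8], [5, 6]))  # (2, [7, 8])
print(match_partial([5, 9, 7, 8], [5, 6]))  # (0, [5, 9, 7, 8])
```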

This is a from-scratch rewrite of the archived feat/mru-partial-block-
cache branch (now archive/mru-partial-block-cache-v1).  The original
landed three structural bugs that the test suite never exercised:

  1. The duck-typed splice gate (hasattr(cache_obj, 'keys') and
     hasattr(cache_obj, 'offset')) misclassified RotatingKVCache as
     sliceable.  RotatingKVCache HAS those attributes, so the gate
     would concatenate the full rotating-window state onto the new
     request's cache, blowing past max_size and leaving _idx stale.
     Hybrid models (Gemma 3, Mistral, anything with sliding window)
     would have been silently corrupted on every repeat.

  2. The store-side extraction passed is_last_block=True, which makes
     _extract_block_tensor_slice return the *full state* (not a token
     slice) for non-sliceable layers.  Wrong intent for partial
     extraction; compounded bug 1 above.

  3. The splice's try/except wrapped the whole layer loop, so a
     concatenate failure on layer N>0 left layers 0..N-1 already
     mutated (offset += n_partial, keys/values overwritten) while
     the caller was told zero tokens were applied.  Half-mutated
     caches are silent generation corruption.

Companion bug in the original deferred-clear suppression: the
suppression had no upper bound, so a hot-prompt workload (each repeat
stashes a fresh MRU before the prior is consumed) could defer the
Metal cache clear forever, defeating the pool-bloat mitigation
(jundot#411).

Safety properties of the rewrite:

  - Hybrid refusal.  Stash and apply both gate on uniform layer
    sliceability via CacheTypeRegistry.get_handler_by_class_name(...)
    .supports_block_slicing.  If any layer is non-sliceable
    (RotatingKVCache, ArraysCache, etc.) the slot is left empty.
    Splicing only the sliceable layers in a hybrid would create
    per-layer offset skew at decode -- undefined behaviour at the
    model level -- so refusal is the only correct policy.

  - Transactional splice.  apply_mru_partial runs in two phases.
    Phase 1 materialises the replacement keys/values for every layer
    without touching the cache; phase 2 commits the writes.  A
    concatenate failure during phase 1 returns (cache, remaining, 0)
    with no layer mutated.  The slot is evicted on failure so a
    consistently-failing partial does not get re-attempted.

  - Eviction on every miss kind.  Parent-hash mismatch, token
    mismatch, length mismatch, layer-count mismatch, splice failure
    all clear the slot.  A stale or mistargeted partial cannot
    survive into a future apply.

  - Bounded deferred-clear suppression.  Each completion's
    _cleanup_finished arms a one-shot
    _mru_clear_suppression_available budget alongside the existing
    _deferred_clear_at target.  At the deadline, if the budget is
    intact and the cache reports has_mru_partial(), the deadline is
    pushed out by one more _DEFERRED_CLEAR_DELAY window and the
    budget is spent.  The next deadline fires regardless, bounding
    total deferral at 2x _DEFERRED_CLEAR_DELAY (~10-40 ms today).
    Patched against _deferred_clear_at, the post-jundot#557 gate -- the
    original was patching the obsolete _deferred_clear_steps path.

Tests (25 new, all passing):

  TestMRUPartialBlockCache (19) -- init state, stash semantics,
    no-stash on block alignment, slot replacement on subsequent
    store, parent-hash linkage, hybrid refusal (KVCache +
    RotatingKVCache and pure rotating), real round-trip via
    store_cache through apply for exact and prefix matches, every
    eviction reason, no-op on empty remaining, layer-count mismatch
    eviction, transactional rollback under a mocked mx.concatenate
    failure on layer 1 (asserts no layer's offset/shape changed),
    multi-turn correctness with existing_tokens > 0 and distinct
    fill values to verify the right slice was captured.

  TestHasMRUPartial (1) -- the public accessor used by the
    scheduler reflects slot transitions.

  TestMRUDeferredClearSuppression (5) -- budget armed by completion,
    clear fires at deadline without MRU, suppressed once with MRU
    (deadline pushed by exactly one DELAY), clear fires at second
    deadline even if MRU still warm, fires immediately after MRU
    eviction, fresh completion refreshes spent budget.

Suite results on this commit:
  189 passed (tests/test_prefix_cache.py + tests/test_scheduler.py)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adversarial review of the prior commit (a642e04) caught five concrete
holes the rewrite introduced or carried over from its predecessor.
Each is fixed below with a regression test that fails on the prior
commit and passes here.

C1 — gate replaced registry lookup with explicit whitelist.

  _all_layers_sliceable consulted CacheTypeRegistry, whose
  get_handler_by_class_name falls through to DefaultCacheHandler for
  any class name without a registered handler.  DefaultCacheHandler
  inherits from KVCacheHandler and reports
  supports_block_slicing=True.  Several real non-sliceable types are
  mapped in _class_name_map but have no registered handler:

    - BatchRotatingKVCache (BATCH_ROTATING_KVCACHE enum, no handler)
    - BatchPoolingCache, PoolingCache (registered only when the
      deepseek_v4 patch is applied)

  The registry would silently classify these as sliceable, recreating
  exactly the silent-corruption hazard the rewrite was supposed to
  close, just from a different angle.  The fix consults the existing
  KNOWN_SLICEABLE_CACHE_TYPES whitelist (the same list the rest of
  the scheduler trusts for snapshot-skip and partial-extraction
  decisions), promoted from a private alias in scheduler.py to a
  public constant on omlx.cache.type_registry so both modules share
  one source of truth.

  Test: test_refuse_stash_when_layer_falls_through_to_default_handler
  asserts the registry would have lied about BatchRotatingKVCache and
  the new gate refuses it.
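The difference between the two gates can be shown with a toy registry. Everything below is a stand-in: the whitelist contents, the handler classes, and the function names are assumptions for illustration and do not reproduce omlx's registry.

```python
# Illustrative whitelist contents only; the real constant may list more types.
KNOWN_SLICEABLE = frozenset({"KVCache"})


class DefaultHandler:
    supports_block_slicing = True   # fallback behaves like a plain KV cache


class RotatingHandler:
    supports_block_slicing = False


REGISTERED = {"RotatingKVCache": RotatingHandler}


def registry_says_sliceable(class_name):
    # Unregistered names fall through to the default handler.
    handler = REGISTERED.get(class_name, DefaultHandler)
    return handler.supports_block_slicing


def all_layers_sliceable(layer_class_names):
    # Explicit whitelist: unknown names are refused rather than assumed safe.
    return bool(layer_class_names) and all(
        name in KNOWN_SLICEABLE for name in layer_class_names
    )


print(registry_says_sliceable("BatchRotatingKVCache"))           # True (the hazard)
print(all_layers_sliceable(["KVCache", "BatchRotatingKVCache"]))  # False
```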

C2 — clear() wipes the MRU slot.

  BlockAwarePrefixCache.clear() reset _request_tables, _prefix_index,
  and the paged cache, but left _mru_partial alive.  Scheduler.reset()
  and Scheduler._recover_from_cache_error() both route through
  clear() — meaning a stale partial would survive exactly the
  cache-corruption recovery path that exists *because* something was
  wrong.  After such a recovery, a future request that happens to
  reproduce the same prompt prefix would get its compute_block_hash
  matching the partial's parent_hash and the splice would fire
  against a freshly-reconstructed cache.

  Test: test_clear_wipes_mru_partial.

C3 — suppression budget arms only on transition from None.

  The prior commit's docstring claimed "total deferral bounded at 2x
  _DEFERRED_CLEAR_DELAY."  False under hot-prompt repeats: every
  completion landing while a deferral was pending re-armed
  _mru_clear_suppression_available = True.  Workloads whose
  completions arrive faster than _DEFERRED_CLEAR_DELAY (the very
  workload this feature targets) keep refreshing the budget after
  it's spent, deferring the clear forever and defeating the
  pool-bloat mitigation (jundot#411).

  Fix: arm the budget only when starting a new deferral epoch
  (_deferred_clear_at transitions from None).  Subsequent completions
  in the same epoch may still extend the deadline (the jundot#557
  invariant) but do not refresh the budget.  One suppression per
  epoch, enforced.

  Test:
  test_completion_within_open_epoch_does_not_refresh_budget
  drives two sequential completions, simulates the budget being
  spent between them, and asserts the second completion extends the
  deadline but leaves the budget at False.  The renamed
  test_new_epoch_completion_arms_budget pins the converse.

H2 — global-vs-local indices are now classified, not heuristic'd.

  cache_uses_global_indices = (existing_tokens > 0 and cache_seq_len
  >= existing_tokens + 1) silently classified ambiguous lengths as
  "local."  In multi-turn requests where cache_data was extracted at
  a boundary equalling the prior turn's length, the cache is global
  but cache_seq_len falls between local_len and global_end — the old
  predicate said "local" and the partial was sliced from the prefix
  region instead of the trailing tail.  parent_hash still matched on
  the next request, and a future apply spliced wrong KV.  Silent
  generation corruption — exactly the failure class the rewrite was
  supposed to close.

  Replaced with an explicit three-way classification:

    cache_seq_len >= partial_global_end    -> global indices
    cache_seq_len == local_len             -> local indices
    otherwise                              -> refuse to stash

  Refusing the ambiguous case is strictly safer than guessing.
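Written as a function, the three-way rule looks roughly like this. The enum and function names are hypothetical; the three branches mirror the table above.

```python
from enum import Enum


class Indexing(Enum):
    GLOBAL = "global"
    LOCAL = "local"
    AMBIGUOUS = "ambiguous"


def classify_cache_indexing(cache_seq_len, local_len, partial_global_end):
    if cache_seq_len >= partial_global_end:
        return Indexing.GLOBAL
    if cache_seq_len == local_len:
        return Indexing.LOCAL
    return Indexing.AMBIGUOUS  # the stash path refuses in this case


print(classify_cache_indexing(700, 200, 700))  # Indexing.GLOBAL
print(classify_cache_indexing(200, 200, 700))  # Indexing.LOCAL
print(classify_cache_indexing(500, 200, 700))  # Indexing.AMBIGUOUS
```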

  Test: test_refuse_stash_on_ambiguous_cache_layout drives the
  boundary directly with cache_seq_len strictly between local_len
  and global_end and asserts no stash.

H3 — stash gated on paged_ssd_cache presence.

  In paged-SSD-only configurations (the only configuration this
  class supports for production reconstruction), reconstruct_cache
  returns None when paged_ssd_cache is None, which means
  apply_mru_partial is unreachable from the scheduler.  Without the
  gate, the stash held a multi-MB tensor reference dead in memory
  until the next store_cache overwrote it — wasted memory scaling
  with model size.

  Test: test_no_stash_when_paged_ssd_cache_is_none.

H4 — accounting divergence documented (no behaviour change).

  After a successful splice, cached_tokens is advanced by the
  partial length but shared_prefix_blocks is not (the partial is
  not a stored paged block).  The relaxed invariant
  cached_tokens >= shared_prefix_blocks * block_size is now
  documented at the scheduler call site, with a guard against the
  most likely future misuse (indexing block_table.block_ids by
  shared_prefix_blocks while bounding the loop with cached_tokens).

Per-test-class layout:

  TestMRUPartialBlockCache         18 -> 22 (+ C1, C2, H2, H3)
  TestHasMRUPartial                 1
  TestMRUDeferredClearSuppression   6 -> 7  (+ C3 invariant test;
                                              previous "fresh
                                              completion" test
                                              renamed to reflect the
                                              new contract)

Suite results on this commit:
  194 passed in 0.57s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a docstring explaining how the MRU partial slot's memory cost
flows through the existing memory enforcement machinery, and a test
that pins the invariant the implicit accounting depends on.

Why this is documentation, not behaviour:

  Tracing the budgeting paths in this codebase shows that all KV
  memory enforcement reads from mx.get_active_memory():

    - process_memory_enforcer.py:217,232 (process-level enforcer)
    - scheduler.py:1567 (prefill mid-loop limit check)
    - scheduler.py:3328 (prefill pre-flight peak check)
    - scheduler.py:3377 (generation admission guard)
    - scheduler.py:_periodic_clear_threshold_bytes (periodic clear)
    - optimizations.py:65 (telemetry)

  There is no separate up-front KV budget that the MRU could escape.
  In paged-SSD-only mode (the only mode this codebase supports),
  _calculate_max_blocks() returns a fixed 100k block-metadata count,
  not a memory budget — paged blocks live on SSD, not GPU memory.
  The estimator helpers (estimate_block_memory, estimate_prompt_kv_bytes)
  are deltas computed against the current mx.get_active_memory()
  baseline, which already includes the MRU.

  _clone_tensor (prefix_cache.py:1207-1221) uses mx.copy(tensor),
  producing real mx.array allocations.  MLX counts these in active
  memory automatically.  So the MRU slot's ~one-block-worth of KV
  (~17 MiB Kimi K2.5 / DeepSeek MLA, ~41 MiB Llama 3 70B full
  attention) is already enforced against the same limits as the
  in-flight request caches.

  No behaviour change required.  The user's "apples to apples"
  intuition holds.

What this commit does add:

  1. _MRUPartialBlock docstring gains a "Memory accounting" section
     enumerating the enforcement paths and the implicit invariant
     (kv_data holds mx.array instances).  Future maintainers reading
     the MRU code will see why no separate accounting hook exists
     and not be tempted to add one.

  2. test_kv_data_holds_mlx_arrays_for_active_memory_accounting
     asserts the invariant directly.  A "helpful" future change that
     stored CPU-side copies (np.ndarray to dodge a perceived
     GPU-memory cost) would silently escape every existing memory
     limit and only manifest as system OOM under load.  The test
     fails fast if that regression is introduced.

Suite results on this commit:
  195 passed in 0.25s (tests/test_prefix_cache.py + tests/test_scheduler.py)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sanity-checking the budgeting paths revealed a third memory-budget
layer worth mentioning in the MRU docstring: engine_pool's pre-load
admission gate (engine_pool.py:355-373) reserves a fraction of each
model's weight size as KV headroom and logs "Loading {model_id}
without KV headroom" when eviction can't free enough.

The MRU partial is one tenant of that headroom alongside the in-flight
prompt caches, but it is not separately reserved because at one
block_size of KV per cache instance (~17 MiB Kimi K2.5 / ~41 MiB
Llama 3 70B), the slot is dominated by the concurrent in-flight
caches the headroom was sized for.

Quantification across model classes:

  Kimi K2.5 (MLA, ~200 GB quant): 25% headroom = ~50 GB,
    MRU = 17 MiB → 0.03%
  Llama 3 70B (Q4, ~35 GB):       25% headroom = ~9 GB,
    MRU = 41 MiB → 0.5%
  Llama 3 8B (Q4, ~4.5 GB):       25% headroom = ~1.1 GB,
    MRU = 10 MiB → 1%
  Qwen 0.5B (~1 GB):              25% headroom = 256 MiB,
    MRU = 5 MiB → 2%

The pre-load layer's granularity (gigabytes) makes the MRU partial
invisible at every model scale.  The runtime enforcer catches any
overrun via mx.get_active_memory() regardless.  Approach unchanged;
documentation is just more complete.
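The shares can be recomputed with a few lines of arithmetic. This sketch treats the quoted GB figures as binary gigabytes, which is an assumption; the entry sizes and headroom values are the ones listed above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def share_percent(entry_mib, headroom_gib):
    """Percentage of the KV headroom taken by one MRU entry."""
    return 100.0 * (entry_mib * MIB) / (headroom_gib * GIB)


rows = [
    ("Kimi K2.5", 17, 50),
    ("Llama 3 70B", 41, 9),
    ("Llama 3 8B", 10, 1.1),
    ("Qwen 0.5B", 5, 0.25),
]
for name, entry_mib, headroom_gib in rows:
    print(f"{name}: {share_percent(entry_mib, headroom_gib):.3f}%")
```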

The 25% percentage itself is intentionally not quoted in the
docstring — it could change in engine_pool without invalidating the
MRU's accounting model.

Suite results on this commit:
  195 passed in 0.23s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three cleanups surfaced by the simplify review.  No behaviour change.

Doc bit-rot: drop file:line references from comments.

  The MRU stack picked up several docstring/comment references that
  cite specific scheduler.py line numbers and a hardcoded "100k"
  block-metadata count.  Both will rot — line numbers shift on any
  edit above them, and the 100k constant lives in
  _calculate_max_blocks() and could change without invalidating the
  MRU's accounting model.

  - _MRUPartialBlock docstring: enumerated scheduler.py:1567, 3328,
    3377, _periodic_clear_threshold_bytes.  Replaced with prose
    naming the same gates symbolically.  Dropped the "100k" magic
    number; the doc now says "fixed block-metadata count" since the
    specific number is irrelevant to the MRU's invariant.

  - H4 accounting note in scheduler.py: referenced "logging at 2828,
    2835" inside source.  Replaced with prose ("the scheduler's
    prefill-completion log lines downstream") so a future search by
    log message still finds them after motion.

Test factory collapse:

  tests/test_prefix_cache.py: _kv_layer and _rotating_layer were
  near-identical (one differed only in the cache_type/class_name
  string and a fill kwarg).  Extracted shared _layer factory taking
  class_name as a kwarg; the two existing helpers now delegate to
  it.  Keeps the call sites readable while removing the copy-paste.
  Same factory naturally extends to BatchRotatingKVCache and other
  cache types when those tests grow.

Findings deferred (separate PRs warranted):

  - Per-block loop in store_cache (prefix_cache.py:553-555) shares
    the cache_seq_len >= existing_tokens + 1 heuristic the H2 fix
    retired in _update_mru_partial.  Extracting a shared
    _classify_cache_indexing helper and routing the per-block loop
    through it would close the same hazard at its other site.
    Needs its own safety analysis (when does the
    cache_seq_len == existing_tokens boundary actually arise) and
    regression tests scoped to the per-block path; out of scope for
    the MRU branch.

  - paged_cache.allocated_blocks.get(...) is a leaky abstraction at
    9+ pre-existing sites in prefix_cache.py.  Encapsulating it
    behind a public PagedCacheManager.get_block_by_id() method is a
    wider refactor that should not piggyback on MRU work.

  - KNOWN_SLICEABLE_CACHE_TYPES → CacheType enum has a TurboQuant
    caveat (_class_name_map collapses TurboQuantKVCache and
    BatchTurboQuantKVCache to KVCACHE).  Conversion needs a
    deliberate decision on whether to lose the explicit gate
    strings.

  - Per-layer mx.concatenate dispatch in apply_mru_partial's phase 1
    could potentially be batched.  Per the prefill-perf principle,
    this needs an M3 Ultra measurement before changing — both to
    establish that the cost is significant and to confirm batched
    dispatch is actually faster on the platform.

Suite results on this commit:
  195 passed in 0.42s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to the MRU partial cache peer review.  The H3 stash gate at
prefix_cache.py:806 and the canonical reconstruct guard at :1710 both
check ``self.paged_ssd_cache is None`` to decide whether reconstruct
can possibly return non-None.  Co-locating the two predicates as a
single ``_can_reconstruct`` helper makes the lockstep explicit — a
future fetch path that bypasses PagedSSDCacheManager (alternate
backends, memory-only modes that detach from the manager) updates
exactly one predicate, not two.

Behaviour is unchanged.  The clarification is in the docstring.

The MRU stash docstring previously said the slot is cleared when
"no SSD configured."  That phrasing is misleading on
``hot_cache_only=True`` configurations (set via settings.json or
OMLX_HOT_CACHE_ONLY env): the manager IS present in that mode — only
the disk writer thread and directory init are skipped.  The reconstruct
path still works because PagedSSDCacheManager.load_block_with_metadata
short-circuits to the hot tier without ever calling mx.load.  In that
mode the MRU stash IS expected to populate, and the gate correctly
permits it.

The gate fires only when no PagedSSDCacheManager instance exists at
all — typically a test/dev scenario.  The new docstring on
``_can_reconstruct`` enumerates the predicate's semantics and the
hot_cache_only case explicitly so a reader looking at either site
arrives at the same understanding.

Test changes:

  - ``test_no_stash_when_paged_ssd_cache_is_none`` keeps its name and
    behaviour (covers the no-manager case) but its docstring now
    distinguishes that case from ``hot_cache_only=True``, where the
    MRU IS expected to populate.

  - New ``test_can_reconstruct_helper_reflects_manager_presence``
    pins the predicate's contract directly.

Suite results on this commit:
  197 passed in 0.38s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upgrade the MRU partial-block cache from a single slot to a bounded
LRU dict keyed by parent_hash.  Multiple concurrent "warm" partials
coexist up to ``mru_partial_max_entries``; LRU eviction pops the
oldest when capacity is reached.  Single-slot mode could not absorb
interleaving (multi-user / multi-conversation workloads); every
``store_cache`` overwrote the lone slot.  On M3 Ultra — where prefill
is firmly compute-bound at ~26 TFLOPS effective FP16 with no MatMul
accelerator (see ``feedback_apple_silicon_perf.md``) — leaving prefill
compute on the table for interleaved workloads is exactly the case
that should not be deferred pending metrics.

Data structure
--------------
``BlockAwarePrefixCache._mru_partials: OrderedDict[bytes | None,
_MRUPartialBlock]`` mirrors the LRU pattern from
``PagedSSDCacheManager._hot_cache``.  No internal lock — the prefix
cache relies on the scheduler's single-threaded executor model
(same as today's single slot).

Public API stays compatible.  ``has_mru_partial()`` returns
``bool(self._mru_partials)``; the scheduler's deferred-clear
suppression budget reads the same boolean predicate it did before.
``apply_mru_partial()`` and ``_update_mru_partial()`` retain their
signatures.

Eviction discipline
-------------------
- On stash: if key exists, pop and re-insert at tail; if over
  capacity, ``popitem(last=False)``.
- On apply success: ``move_to_end(key)`` — promote to LRU tail.
- On apply miss for a found key (token-prefix, layer-count, or
  splice failure): ``pop(key, None)`` — evict only that key.
- On ``clear()`` (cache-corruption recovery): wipe the dict.
- "No eligible tail" branches in ``_update_mru_partial`` no longer
  wipe the dict — they bare-return.  A local "nothing to stash this
  time" signal is unrelated to the validity of other entries.  This
  is the behavioural change that lets distinct-prefix entries coexist
  under interleaving.

Freed-paged-block guard (new)
-----------------------------
If ``block_table.block_ids`` is non-empty but the parent paged block
has been freed between stash and apply, ``apply_mru_partial`` returns
no-op rather than falling through to a ``None``-keyed dict lookup.
That fall-through would falsely match a short-prompt entry against a
request whose parent is just gone.  The race is structurally new in
multi-slot mode; single-slot tolerated it because there was only ever
one entry to match against.

Short-prompt entries (prefix < block_size, parent_hash=None) share
one slot via the ``None`` key — same multi-tenant constraint as the
single-slot design, but only for the short-prompt subset.

Capacity & plumbing
-------------------
``mru_partial_max_entries`` threads from ``CacheSettings`` →
``--mru-partial-max-entries`` CLI flag → ``SchedulerConfig`` → both
``BlockAwarePrefixCache(...)`` construction sites (main at
``scheduler.py:804`` and SpecPrefill draft at ``scheduler.py:3473``).
Default 4 matches the dflash ``max_entries`` precedent (PR jundot#1120).
``0`` disables stashing (silent fallback to "no MRU" behaviour,
mirroring the ``hot_cache_max_size="0"`` convention).

Memory worst-case at default 4: ~68 MiB MLA / ~165 MiB GQA per cache
instance.  With two cache instances (main + SpecPrefill draft),
~136 MiB / ~330 MiB total.  All inside the engine pool's 25% KV
headroom envelope.  Documented in the ``_MRUPartialBlock`` docstring
including the ``hot_cache_only=True`` coexistence note: the hot
cache and MRU dict both live in the same envelope under that mode
and should be tuned together.

Test surface
------------
Existing single-slot tests adapted via a test-only ``_get_mru_partial``
helper (production class surface stays clean; tests are decoupled
from the internal container shape).

New ``TestMRUPartialMultiSlot`` covers the multi-slot mechanics:

- ``test_distinct_prefixes_coexist_as_separate_entries``
- parameterized ``test_lru_capacity_bounds`` (evict-oldest-at-capacity
  + under-capacity-keeps-all)
- ``test_apply_success_promotes_entry_to_lru_tail``
- ``test_max_entries_zero_disables_stashing``
- ``test_clear_mru_partials_wipes_only_partials``
- ``test_apply_noop_when_parent_block_freed`` (the new guard)
- ``test_short_prompt_none_key_coexists_with_block_aligned_entry``

Existing tests adapted, with a few semantically inverted:

- ``test_stash_replaced_on_subsequent_store`` →
  ``test_same_prefix_store_replaces_entry``: same prefix → same key
  → correct LRU put behaviour (replace).
- ``test_stash_clears_when_subsequent_store_is_block_aligned`` →
  ``test_no_eligible_tail_does_not_evict_siblings``: the inverse
  behavioural change from single-slot.
- ``test_apply_evicts_on_parent_hash_mismatch`` →
  ``test_apply_noop_on_parent_hash_mismatch_preserves_sibling``:
  no-op + sibling preservation, not eviction.
- ``test_clear_wipes_mru_partial`` → ``test_clear_wipes_mru_partials``.

Suite results on this commit:
  211 passed: cache + scheduler + admin clear-symmetry tests
  9 passed:   test_settings.py::TestCacheSettings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``/api/ssd-cache/clear`` admin endpoint at
``omlx/admin/routes.py:3975`` wipes the SSD-backed paged blocks per
loaded scheduler but did not touch ``BlockAwarePrefixCache._mru_partials``.
Surviving MRU partials then chain from paged-block hashes whose KV
bytes were just flushed, violating the operator's "drop all warm
caches" intent.  Under ``hot_cache_only=True`` (where the hot tier IS
the only persistent store) the same hazard would also apply to PR
jundot#1183's forthcoming ``/api/hot-cache/clear`` endpoint when it lands;
that wiring is a one-line follow-up at the same loop using the same
method.

Wire ``block_aware_cache.clear_mru_partials()`` into the per-scheduler
loop alongside ``ssd_manager.clear()``, with the same defensive
try/except wrapper.  A failure clearing one scheduler's MRU does not
prevent siblings from being cleared.
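
The wiring can be pictured as the loop below. It is a sketch only: ``clear_all_schedulers`` and the failure counter are hypothetical, while the ``ssd_manager`` and ``block_aware_cache`` attribute names follow the text above.

```python
import logging

logger = logging.getLogger(__name__)


def clear_all_schedulers(schedulers):
    """Clear SSD blocks and MRU partials per scheduler; return failed step count."""
    failures = 0
    for sched in schedulers:
        steps = (
            sched.ssd_manager.clear,
            sched.block_aware_cache.clear_mru_partials,
        )
        for step in steps:
            try:
                step()
            except Exception:
                # One failing clear must not stop the remaining schedulers.
                failures += 1
                logger.exception("cache clear step failed; continuing")
    return failures
```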

The standalone ``clear_mru_partials()`` method (added in the previous
commit, see ``omlx/cache/prefix_cache.py``) is the public seam.  Its
own unit coverage lives in
``TestMRUPartialMultiSlot::test_clear_mru_partials_wipes_only_partials``;
this commit adds two endpoint-level tests in
``TestClearSSDCacheAlsoWipesMRUPartials`` that pin the wiring:

  - ``test_endpoint_calls_clear_mru_partials_on_each_scheduler``
    confirms both ``ssd_manager.clear()`` and
    ``block_aware_cache.clear_mru_partials()`` fire for every
    loaded scheduler.
  - ``test_mru_clear_failure_does_not_block_other_scheduler``
    pins the defensive try/except: an exception in one scheduler's
    clear path must not stop the loop.

Suite results on this commit:
  62 passed: tests/test_admin_api_key.py (1 pre-existing unrelated
             failure remains)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@blightbow force-pushed the feat/mru-partial-block-cache branch from a0f24fb to cce198f on May 13, 2026 21:48
blightbow and others added 2 commits May 13, 2026 18:13
Plug the multi-slot MRU partial cache into the same observability
surface that jundot#1183 established for the prefix and hot/disk tiers.
Without these counters, operators have no signal for whether
``--mru-partial-max-entries`` is tuned right; with them, the
dashboard answers "is the MRU paying off" with the same shape it
answers for memory and disk hit rates.

Counters (cumulative, on PrefixCacheStats)
------------------------------------------
- ``mru_partial_stashes`` — every successful entry write, including
  same-key replacements.  Same-key replacement does NOT count as an
  eviction; operator sees stash payoff via the hits/stashes ratio.
- ``mru_partial_hits`` — every successful splice via apply_mru_partial.
- ``mru_partial_evictions`` — total entries removed.  Includes
  capacity-overflow LRU evictions, apply-time mismatch pops (token,
  layer-count, splice failure), and ``clear_mru_partials()`` wipes.
  Does NOT include full ``clear()`` (cache-corruption recovery) —
  that path also calls ``reset_stats()``, so incrementing evictions
  there would be incoherent (the increment gets zeroed immediately).
  Operators tracking partial-only wipes use ``clear_mru_partials()``.
- ``mru_partial_tokens_saved`` — sum of ``n_partial`` across hits.
  The direct compute-saved measure: each unit is one token of prefill
  forward-pass that did NOT have to run.

Gauges (live state, on PrefixCacheStats)
----------------------------------------
- ``mru_partial_entries`` — current dict length.
- ``mru_partial_max_entries`` — configured capacity (operator-facing).

Plumbing
--------
Counters thread through ``Scheduler._collect_cache_counters`` into
the existing ``CacheRateTracker``.  ``observability._compute_window``
and ``_compute_cumulative`` add per-counter deltas plus a derived
``mru_partial_hit_rate`` (= hits / stashes, with the same
zero-stashes-no-NaN guard the other ratios use).  The admin
``_build_runtime_cache_observability`` emits per-model entries/
max_entries gauges and aggregates them at the payload level the
same way ``hot_cache_entries`` and ``hot_cache_max_bytes`` do.
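
A sketch of the derived rate and the windowed delta, assuming cumulative snapshots are plain dicts. The helper names are hypothetical, and returning ``0.0`` for an empty window is an assumption about what the "no-NaN guard" yields.

```python
MRU_COUNTER_KEYS = (
    "mru_partial_hits",
    "mru_partial_stashes",
    "mru_partial_tokens_saved",
)


def mru_partial_hit_rate(hits, stashes):
    """hits / stashes, with 0.0 (assumed) instead of NaN when nothing was stashed."""
    return hits / stashes if stashes > 0 else 0.0


def window_delta(prev, curr, keys=MRU_COUNTER_KEYS):
    """Per-counter deltas between two cumulative snapshots, plus the derived rate."""
    delta = {k: curr.get(k, 0) - prev.get(k, 0) for k in keys}
    delta["mru_partial_hit_rate"] = mru_partial_hit_rate(
        delta["mru_partial_hits"], delta["mru_partial_stashes"]
    )
    return delta
```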

Dashboard
---------
Mirrors jundot#1183's hot-cache surface:

- **Header gauge** "MRU tails N/M entries" next to the Memory and
  SSD gauges, visible only when ``mru_partial_max_entries > 0``.
- **Rate strip** gains "MRU Tail Hit Rate" and "MRU Tokens Saved"
  cells.  Grid expands from 4 cells (hot cache only) to 6 cells
  (both tiers).  When only one or the other tier is enabled, the
  layout stays at 4 cells with the disabled tier's cells hidden via
  ``x-show``.
- **Per-model table** gains an "MRU Tails" column showing
  ``entries / max_entries`` for each loaded model.

Test coverage (10 new cases)
----------------------------
``TestMRUPartialCounters`` (8 cases) — initial zeros, stash-counter
bumps, same-key-replacement-is-not-eviction, capacity-overflow
eviction count, apply-success bumps hits+tokens_saved, apply-miss
eviction, ``clear_mru_partials()`` bulk eviction count, ``clear()``
zeroing every counter, ``reset_stats()`` zeroing cumulative
counters while preserving live entries.

``TestCacheRateTrackerRates`` (3 new cases) — mru_partial_hit_rate
windowed + cumulative, zero-stashes no-NaN guard, tokens_saved
delta accumulation.

``TestRuntimeCacheObservability`` updated to reflect the two new
per-model payload keys (``mru_partial_entries``,
``mru_partial_max_entries``).

Suite results on this commit:
  250 passed: cache + scheduler + observability + admin + settings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, Alpine getters

Five small simplifications surfaced by a three-agent code review on
the MRU partial cache commit stack.  No behaviour change.

**Hoist test helpers to module level.**  Three MRU test classes
(``TestMRUPartialBlockCache``, ``TestMRUPartialMultiSlot``,
``TestMRUPartialCounters``) each redefined ``_layer``, ``_kv_layer``,
``_rotating_layer``, ``_make_reconstructed_cache``, ``_stash_with_prefix``,
and ``_cache`` (factory).  Hoist to module-level helpers in
``tests/test_prefix_cache.py`` (alongside the existing
``_get_mru_partial`` accessor); the duplicate methods come out, ~120
lines of repetition collapses, call sites switch from
``self._kv_layer(...)`` to ``_kv_layer(...)``.  The factory for
custom capacity is renamed ``_make_mru_cache(paged_cache, mock_ssd,
max_entries, num_layers)``.  Per-class fixtures (``mx``,
``paged_cache``, ``mock_ssd``) stay class-local to avoid leaking
fixture names into unrelated test classes in the same module.

**Extract ``_evict_miss`` helper in ``apply_mru_partial``.**  Five
arms of ``self._mru_partials.pop(last_hash, None); self._mru_partial_evictions
+= 1; return cache, remaining_tokens, 0`` collapse into a single
inner function.  Each call site is now one line, the eviction-counter
bookkeeping lives in one place, and the rollback contract is harder
to break by accident.
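
The shape of that refactor, as a toy: one inner function owns the pop, the counter bump, and the "nothing applied" return tuple. Everything except the ``_evict_miss`` name and the ``(cache, remaining_tokens, 0)`` return is an assumption, including the list-of-lists cache and the ``stats`` dict.

```python
def apply_partial(partials, stats, last_hash, cache, remaining_tokens):
    """Toy apply path: `cache` is a list of per-layer lists."""

    def _evict_miss():
        # Single place for pop + eviction counter + rollback return value.
        partials.pop(last_hash, None)
        stats["evictions"] += 1
        return cache, remaining_tokens, 0

    entry = partials.get(last_hash)
    if entry is None:
        return cache, remaining_tokens, 0  # key not found: no-op, no eviction
    n = len(entry["tokens"])
    if remaining_tokens[:n] != entry["tokens"]:
        return _evict_miss()  # tail tokens differ
    if len(entry["layers"]) != len(cache):
        return _evict_miss()  # layer-count mismatch
    spliced = [base + tail for base, tail in zip(cache, entry["layers"])]
    return spliced, remaining_tokens[n:], n
```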

**Dashboard Alpine getters.**  Three getters added to the dashboard
root: ``mruEnabled``, ``hotCacheEnabled``, ``cacheRatesGridCols``.
The previous expressions repeated ``stats.runtime_cache?.mru_partial_max_entries
> 0`` in 8 places and a three-arm ternary chain in the rate-strip
grid-class binding.  The HTML now reads
``x-show="mruEnabled && stats.runtime_cache?.models?.length > 0"``
and ``:class="cacheRatesGridCols"`` at the relevant sites.

**``_make_counters`` driven by a key tuple.**  Previously took 16
explicit kwargs and re-listed every key in the returned dict body.
Now: a module-level ``_COUNTER_KEYS`` tuple is the single source of
truth; the helper builds a zero-initialised dict and applies
``**overrides``.  Unknown keys raise (catches the typos the explicit
signature used to catch).  Adding a new observability counter is
now one tuple entry instead of three coordinated changes (signature,
dict body, and any call sites that wanted the default).

**Prune ``_MRUPartialBlock`` docstring rot-prone bits.**  Dropped
specific MiB numbers (~17/41/68-165 MiB) and the PR# reference to
jundot#1120 from the memory-accounting section.  The numbers were
informative when written but would have aged past the model and
config landscape they assumed.  Kept the invariant statement and the
test reference; removed the calibration table.

**Findings reviewed and skipped:**

- Aggregating ``mru_partial_max_entries`` across loaded models was
  flagged as "wrong arithmetic" — actually correct, matches the
  deliberate hot-cache convention from jundot#1183 (each model has its own
  budget, the dashboard gauge shows fleet fill = sum of entries /
  sum of capacities).
- ``_get_cache_seq_len`` per-block redundancy in ``store_cache`` —
  pre-existing pattern, not regressed by this stack, defer.
- Phase-1 ``mx.concatenate`` × N layers × 2 dispatch shape —
  defer pending M3 Ultra measurement (per
  ``feedback_apple_silicon_perf.md`` memory: never estimate prefill
  costs).
- ``paged_cache.allocated_blocks.get(...)`` direct dict access at
  9+ pre-existing sites — wider refactor, out of scope.
- ``_all_layers_sliceable`` vs ``_prompt_cache_needs_snapshots``
  co-location — different inputs (class-name strings vs live cache
  objects), unification would be scope-creep.

Suite results on this commit:
  250 passed: cache + scheduler + observability + admin + settings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@blightbow changed the title from "feat(cache): MRU partial block cache" to "feat(cache): multi-slot LRU MRU partial block cache" on May 13, 2026
@blightbow
Contributor Author

blightbow commented May 13, 2026

The PR has been rescoped to function alongside the hot cache in addition to the SSD cache. To avoid commit churn, I rebased on ivaniguarans's #1183 and used it to surface dashboard metrics for the LRU cache.

Doing some additional local testing, then marking this as ready for review. (gated by merge of #1183)

blightbow and others added 7 commits May 14, 2026 18:41
The admin dashboard reads MRU partial cache state via
``Scheduler.get_ssd_cache_stats`` -> ``BlockAwarePrefixCache
.get_stats_dict``, which silently dropped every MRU field added
by the observability-counters commit.  With
``mru_partial_max_entries`` aggregating to 0 in the admin
payload, the dashboard's ``mruEnabled`` gate stayed false and
hid every panel — header gauge, rate-strip cells, and the
per-model "MRU Tails" column — even when operators had the
feature configured.

The counter delta path was unaffected:
``Scheduler._collect_cache_counters`` reads the
``PrefixCacheStats`` dataclass via ``get_stats()``, so
``cache_rates.cumulative`` already carried
``mru_partial_hit_rate`` and ``mru_partial_tokens_saved``.  The
dashboard simply refused to render those cells because the
gauge gate was false.

Adds six fields to ``get_stats_dict()``:

- ``mru_partial_stashes``       (counter)
- ``mru_partial_hits``          (counter)
- ``mru_partial_evictions``     (counter)
- ``mru_partial_tokens_saved``  (counter)
- ``mru_partial_entries``       (gauge, len(_mru_partials))
- ``mru_partial_max_entries``   (gauge, configured capacity)

Regression test
---------------
Adds ``TestMRUPartialCounters
::test_get_stats_dict_mirrors_dataclass_after_round_trip``,
following the Pattern B mandate the class docstring already
established (real ``store_cache`` round-trip rather than
hand-built ``_MRUPartialBlock`` state).  Exercises three
stashes plus one successful apply against ``max_entries=2`` so
every counter and gauge moves off zero, then asserts each MRU
field on ``get_stats_dict()`` matches the corresponding field
on the ``PrefixCacheStats`` dataclass.
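
The mirror check can be sketched as a field-by-field comparison. ``PartialStats`` is a stand-in for the MRU slice of ``PrefixCacheStats`` and ``missing_or_mismatched`` is a hypothetical helper; the real test asserts against a live cache after a ``store_cache`` round-trip rather than hand-built values.

```python
from dataclasses import dataclass, fields


@dataclass
class PartialStats:
    """Stand-in for the six MRU fields listed above."""

    mru_partial_stashes: int = 0
    mru_partial_hits: int = 0
    mru_partial_evictions: int = 0
    mru_partial_tokens_saved: int = 0
    mru_partial_entries: int = 0
    mru_partial_max_entries: int = 0


def missing_or_mismatched(stats, stats_dict):
    """List every dataclass field the dict drops or reports differently."""
    problems = []
    for f in fields(stats):
        if f.name not in stats_dict:
            problems.append(f"{f.name} missing")
        elif stats_dict[f.name] != getattr(stats, f.name):
            problems.append(f"{f.name} mismatch")
    return problems
```

Iterating the dataclass fields, rather than a hand-written key list, means a newly added counter is checked against the dict surface without editing the test.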

Sibling MRU counter tests already covered each counter
individually via ``get_stats()`` (the dataclass), which is why
the missing-keys regression slipped past them — none asserted
on the dict surface.  This test closes that loop and was
verified to fail cleanly without the fix
(``mru_partial_stashes missing from get_stats_dict()``).

Suite: 139 passed (cache + observability).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MRU partial-block stash safety gate refuses any layer set
containing a non-sliceable cache type (``RotatingKVCache``,
``PoolingCache``, ``ArraysCache``, ``CacheList``, etc.) — splicing
a partial into a sliceable subset only would cause per-layer
offset skew at decode (silent generation corruption).  For
affected models the entire MRU feature is structurally
unavailable, but the dashboard previously rendered a misleading
"0/N entries" gauge that left operators puzzling over an apparent
config bug.  Concrete case: DeepSeek-V4-Flash (every layer is
``CacheList(RotatingKVCache, PoolingCache, PoolingCache)``).

Detection
---------
Adds a tri-state ``mru_partial_supported`` flag to
``BlockAwarePrefixCache`` and ``PrefixCacheStats``:

- ``None``  → unknown (no introspection has resolved it yet)
- ``True``  → every observed layer is sliceable
- ``False`` → at least one non-sliceable layer observed; every
  future stash attempt is refused at the safety gate

Two detection paths feed the flag:

1. **Eager (load time):** ``_check_mru_eligibility_at_init``
   calls ``model.make_cache()`` once at construction, extracts
   type names via ``ModelCacheConfig.from_cache_list``, and
   resolves the flag immediately.  Best-effort: if make_cache is
   absent or raises, falls back to lazy detection without
   crashing.  Cache instances are dropped via ``del`` after
   inspection — no tensor buffers are allocated (those arrive
   on first prefill), and Python wrappers are GC-reclaimed.

2. **Lazy (first inference):** ``_update_mru_partial`` checks
   ``_all_layers_sliceable(layer_cache_types)`` on each call.
   First non-sliceable observation latches the flag and emits
   the warning; first sliceable observation latches True.

The warning fires exactly once per cache instance via
``_mru_partial_warn_emitted``.  Operator log message includes
the offending types and the sliceable whitelist so it's
grep-actionable:

  WARNING omlx.cache.prefix_cache: MRU tail cache disabled for
  this model: layer types ['RotatingKVCache', 'PoolingCache']
  are not in the sliceable whitelist [...]. Splicing a partial
  into a non-sliceable subset would cause per-layer offset skew
  at decode (silent generation corruption), so every stash
  attempt will be refused. The admin dashboard's per-model 'MRU
  Tails' cell will display 'N/A (see log)'.
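
The tri-state latch and the warn-once behaviour can be sketched as follows. The ``SLICEABLE`` set, the class name, and the shortened log text are assumptions for illustration; the real whitelist and message live in ``omlx/cache/prefix_cache.py``.

```python
import logging

logger = logging.getLogger(__name__)

SLICEABLE = frozenset({"KVCache"})  # hypothetical whitelist


class EligibilityLatch:
    def __init__(self):
        self.supported = None  # None = unknown, True/False once resolved
        self._warn_emitted = False

    def observe(self, layer_cache_types):
        """Update the flag from one observation of per-layer cache type names."""
        if not layer_cache_types:
            return self.supported  # nothing observed, leave the flag alone
        offenders = sorted({t for t in layer_cache_types if t not in SLICEABLE})
        if offenders:
            self.supported = False
            if not self._warn_emitted:  # warn once per cache instance
                self._warn_emitted = True
                logger.warning("MRU tail cache disabled: %s", ", ".join(offenders))
        elif self.supported is None:
            self.supported = True
        return self.supported
```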

Dashboard surface
-----------------
Per-model "MRU Tails" cell renders ``N/A (see log)`` when
``mru_partial_supported === false``; otherwise renders
``entries / max_entries`` as before.  Hover tooltip references
the server log.  Global rate-strip cells (MRU Tail Hit Rate,
MRU Tokens Saved) stay as aggregates — if any loaded model is
compatible, those cells still surface its payoff.

Tests (8 new in TestMRUPartialEligibility)
------------------------------------------
- ``supported_is_none_without_make_cache_and_no_inference``
- ``supported_latches_true_on_sliceable_observation``
- ``supported_latches_false_lazy_on_non_sliceable`` (round-trip
  via ``store_cache`` with ``_rotating_layer`` factory)
- ``warning_does_not_repeat_on_subsequent_non_sliceable``
- ``eager_check_latches_false_at_init_with_non_sliceable_make_cache``
- ``eager_check_latches_true_at_init_with_sliceable_make_cache``
- ``eager_check_skipped_when_feature_disabled``
- ``eager_check_survives_make_cache_failure``

Existing ``TestRuntimeCacheObservability::test_runtime_cache
_uses_model_scoped_ssd_stats`` updated to include the new
per-model payload key (``mru_partial_supported: None`` for
mocks that don't populate ``prefix_cache``).

Suite: 215 passed (cache + observability + admin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in 8 upstream commits, most relevantly:

- 386e16f fix(tests): repair pre-existing upstream test failures
  and import guards (jundot#1244) — restores list-shaped GitHub
  releases payload in test_admin_update_check / test_admin_auth
  fixtures.  Was committed upstream 2026-05-14 10:21, after this
  branch's previous merge of main and before the next.  Branch
  was unknowingly running with these 4 tests failing the entire
  time.
- 4fe004d feat: add Hermes Agent quick launch (jundot#1250)
- ccfba1d fix(load): VLM model loading fixes for oQ-quantized
  checkpoints (jundot#1247)
- 51907f0 fix(oq): restore MTP head attach for VLM sensitivity
- and others

Without jundot#1244 we keep inheriting the broken admin-auth /
update-check tests as branch-only baseline failures.  The fix
landed 8 hours before today's MRU work and was never picked up
because the branch hadn't merged main since.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The model-load warning added in 5848c49 read like a developer
note: it dumped the internal sliceable whitelist, explained the
splice mechanism and the offset-skew failure mode, raised a
"silent generation corruption" alarm, and narrated which
dashboard cell would change.

Rewrite it in the in-tree load-phase warning voice — "condition
+ consequence, plain words" — matching the ``mtp_enabled`` warning
in ``utils/model_loading.py`` and the L2 warning in
``engine/dflash.py``:

  MRU tail cache enabled but this model is incompatible
  (cache layers: RotatingKVCache, PoolingCache); MRU tails
  will be inactive for this model.

- Drop the whitelist dump — operators don't tune against it.
- Drop the splice-mechanism rationale — that's an engineering
  explanation, not an operator decision point.  It still lives
  in the ``_all_layers_sliceable`` docstring where developers
  read it; ``_record_mru_unsupported``'s docstring now points
  there.
- Drop the dashboard self-reference — the dashboard already
  renders 'N/A (see log)'; the log shouldn't narrate the UI.
- ``", ".join(offenders)`` instead of raw list repr, and
  ``"unknown"`` instead of ``"<unknown>"`` for the fallback.

Tests assert on the new wording ("MRU tails will be inactive",
"incompatible").

Would have folded into 5848c49 as an amend, but the merge of
main now sits between that commit and HEAD.
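
For reference, the reworded warning can be produced by a formatter along these lines. The function name is hypothetical; the message text, the ``", ".join(offenders)`` formatting, and the ``"unknown"`` fallback are taken from the description above.

```python
def format_mru_incompatible_warning(offenders):
    """Build the load-phase warning from the offending cache layer type names."""
    layers = ", ".join(offenders) if offenders else "unknown"
    return (
        "MRU tail cache enabled but this model is incompatible "
        f"(cache layers: {layers}); MRU tails will be inactive for this model."
    )
```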

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MRU partial-block cache stashed the trailing partial of the
stored sequence (``prompt + output``) keyed by that sequence's
last full block, but a repeat request resubmits the *prompt*
only and ``apply_mru_partial`` looks the entry up by the
prompt's last full block.  Output tokens shift every block
boundary past the prompt's tail, so the two keys never
coincide: ``_mru_partials.get(last_hash)`` always returned
None.  The feature produced zero hits for ordinary chat
completions — observed live as "MRU tails 3/4 entries, MRU
Tail Hit Rate 0.0%" with zero evictions (a key that is never
found is never evicted).  It only worked for reasoning models
whose prompt ends with an open ``<think>`` tag, where
``needs_think_prefix`` makes ``store_cache`` persist the prompt
alone.

Fix
---
Thread the prompt token count from the scheduler into
``store_cache`` -> ``_update_mru_partial``.  The stash now keys
off the prompt's last full block (block index
``prompt_len // block_size - 1``) and slices the prompt's
trailing partial, not the stored sequence's.  That block's hash
is identical whether the sequence is blocked as ``prompt`` or
``prompt + output`` — block hashes are content-chained and the
chain is byte-identical up to the prompt's partial tail — so
the key the stash writes is exactly the key a prompt-only
resubmission's ``apply_mru_partial`` computes.
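
The key-stability claim can be checked with a toy content-chained hash. The SHA-256 scheme below is an assumption made for illustration and is not the project's block hash; only the chaining property matters here.

```python
import hashlib


def chain_block_hashes(tokens, block_size):
    """Hash each full block from its parent hash plus its own tokens."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


prompt = list(range(10))          # 2 full blocks of 4, plus a 2-token tail
output = [100, 101, 102, 103, 104]
block_size = 4

prompt_hashes = chain_block_hashes(prompt, block_size)
stored_hashes = chain_block_hashes(prompt + output, block_size)
key_index = len(prompt) // block_size - 1

# The prompt's last full block hashes identically in both blockings.
same_key = stored_hashes[key_index] == prompt_hashes[-1]
```

The stored sequence's own last full block, by contrast, mixes prompt-tail and output tokens, so its hash is one a prompt-only resubmission never computes.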

The arithmetic runs in the existing global-coordinate frame and
accounts for ``existing_tokens > 0`` (on a resubmission
``store_cache`` appends to the fetched prefix block table and
works in ``new_tokens`` space): ``partial_start =
prompt_partial_start - existing_tokens``, with a guard for a
prompt already fully covered by cached full blocks.  Edge cases
degrade cleanly — block-aligned prompt: no stash; prompt
shorter than one block: ``None`` key (short-prompt path);
``prompt_token_count=None`` (generic ``CacheManager.store``
path): falls back to the whole stored sequence, reproducing the
pre-fix behavior for verbatim-repeat callers.
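
A worked sketch of the arithmetic, under stated assumptions: the function name and return shape are hypothetical, ``existing_tokens`` is taken to be the count of tokens already covered by cached full blocks, and the guard returns "no stash" when those blocks already reach past the prompt's partial start. The 256-token block size in the usage lines is illustrative only.

```python
def prompt_partial_slice(prompt_len, block_size, existing_tokens=0):
    """Return (key_block_index, partial_start, partial_len), or None for no stash.

    key_block_index is None for a prompt shorter than one block.
    partial_start is relative to the new tokens being appended.
    """
    n_partial = prompt_len % block_size
    if n_partial == 0:
        return None  # block-aligned prompt: no trailing partial
    prompt_partial_start = prompt_len - n_partial  # global coordinates
    if prompt_partial_start < existing_tokens:
        return None  # assumed guard: cached blocks already pass the boundary
    full_blocks = prompt_len // block_size
    key_block_index = full_blocks - 1 if full_blocks > 0 else None
    return key_block_index, prompt_partial_start - existing_tokens, n_partial


fresh = prompt_partial_slice(300, 256)                          # first submission
resubmit = prompt_partial_slice(300, 256, existing_tokens=256)  # prefix cached
aligned = prompt_partial_slice(512, 256)                        # nothing to stash
short = prompt_partial_slice(100, 256)                          # None-key path
```

With a 300-token prompt, the 44-token tail starts at global offset 256; on a resubmission where the first block is already cached, the same tail starts at offset 0 of the new tokens.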

``apply_mru_partial`` is unchanged — only the stash side was
wrong.

Draft cache
-----------
``_draft_prefix_cache`` (SpecPrefill) is constructed with
``mru_partial_max_entries=0``.  ``apply_mru_partial`` is only
ever called on the main ``block_aware_cache``, so a draft-cache
stash was dead work that never paid off.

Tests
-----
New ``TestMRUPromptBoundaryStash`` (5 cases):

- ``prompt_boundary_stash_hits_on_prompt_only_resubmit`` — the
  round-trip the original feature shipped without: store
  ``prompt + output`` with the boundary, resubmit prompt-only,
  assert a hit.
- ``whole_sequence_stash_misses_on_prompt_only_resubmit`` —
  pins the original bug via the ``prompt_token_count=None``
  path (0 hits, 0 evictions).
- ``block_aligned_prompt_does_not_stash``
- ``short_prompt_stashes_under_none_key``
- ``prompt_boundary_stash_with_existing_cached_prefix`` — the
  ``existing_tokens > 0`` resubmission path.

Diagnosis and approach were adversarially peer-reviewed; the
review caught an ``existing_tokens``-relative indexing error in
the first draft of the fix, corrected here.

Full unit suite: 4401 passed (5 pre-existing upstream baseline
failures unrelated to cache).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``a42542f`` made ``apply_mru_partial`` produce hits for the
first time — and immediately exposed a latent threading bug in
the splice path, which had been dead code since the feature
shipped.

``_extract_block_tensor_slice`` builds the partial's tensors
via ``_clone_tensor`` (``mx.copy``), which is a *lazy* op.  That
op is created on the ``omlx-store-cache`` worker thread (where
``store_cache`` runs) and bound to that thread's MLX stream.
``apply_mru_partial`` splices the partial into a live cache on
the separate ``mlx-global`` inference thread; generation's
``mx.async_eval`` then walks the compute graph back to the
worker thread's stream, which the inference thread cannot see:

  RuntimeError: There is no Stream(gpu, 4) in current thread.

The source ``_extracted_cache`` is already materialized before
the worker runs (the inference thread batches
``_collect_arrays_from_extracted_cache`` through ``mx.async_eval``
and the worker calls ``mx.synchronize()``), so the fix is just
to finalize the freshly-sliced copies: ``_update_mru_partial``
now calls ``_materialize_mru_kv`` on the extracted partial
before stashing it.  Because the inputs are already resident
this is a small memcpy of the tail KV — no recompute — and it
collapses the lazy ``mx.copy`` into concrete, stream-free data
safe to splice and evaluate from any thread.
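
A sketch of the leaf-collection half of such a helper, kept free of MLX so it runs anywhere. ``collect_array_leaves`` and the ``is_array`` predicate are hypothetical; in an MLX setting the predicate would presumably be ``lambda x: isinstance(x, mx.array)`` and the collected leaves would be passed to ``mx.eval(*leaves)`` on the worker thread before stashing.

```python
def collect_array_leaves(node, is_array):
    """Walk nested lists/tuples and return every leaf that is_array accepts."""
    if is_array(node):
        return [node]
    if isinstance(node, (list, tuple)):
        leaves = []
        for child in node:
            leaves.extend(collect_array_leaves(child, is_array))
        return leaves
    return []  # non-array leaf such as a string tag, or None
```

Handling both the plain ``(keys, values)`` shape and a tagged ``(tag, (k, v))`` shape falls out of the recursion: the tag is simply a non-array leaf and is skipped.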

``apply_mru_partial`` itself is unchanged: ``add_request`` (the
splice) and ``step`` (generation) both run on the single
``mlx-global`` worker, so the splice result never crosses
threads — only the stashed input did.

Tests
-----
New ``TestMRUPartialCrossThreadSafety``:

- ``materialize_mru_kv_handles_extract_shapes`` — the helper
  evaluates array leaves across the plain ``(keys, values)``
  and TurboQuant ``(tag, (k, v))`` shapes and tolerates the
  non-array tag and an empty list.
- ``stashed_partial_splices_across_threads`` — extract+stash on
  a worker thread, splice+evaluate on the main thread.
  Verified to fail without the fix (``no Stream(gpu, N) in
  current thread`` at the splice eval) and pass with it.  The
  test pre-materializes ``cache_data`` to mirror production,
  where the inference thread always hands the worker an
  already-evaluated extracted cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The header gauge in RUNTIME CACHE OBSERVABILITY sat alongside
the Memory and SSD gauges, but it did not belong there.  Memory
and SSD each measure one exhaustible budget shared across every
loaded model, so an aggregate fill bar is meaningful.  MRU tail
slots are allocated per-model — ``--mru-partial-max-entries``
applies to each model's own cache — so summing entries and
max-entries across models produces a number that corresponds to
no real resource.  Per-model occupancy is already shown in the
"MRU Tails" column of the per-model table, which is the correct
granularity.

- Remove the header gauge block from ``_status.html``.
- Remove the now-dead ``runtimeMruPartialPercent`` getter from
  ``dashboard.js``.
- Drop the payload-level ``mru_partial_entries`` aggregate in
  ``_build_runtime_cache_observability``.  ``mru_partial_max_
  entries`` is kept as a sum solely as the ``mruEnabled``
  feature-on gate (drives the rate strip and the per-model
  column); it is no longer surfaced as a gauge value.

Per-model ``mru_partial_entries`` / ``mru_partial_max_entries``
on each ``models[]`` entry are unchanged, as are the global
"MRU Tail Hit Rate" and "MRU Tokens Saved" rate-strip cells
(those are rates/counters, legitimately global).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jundot
Owner

jundot commented May 19, 2026

Thanks for the careful work; the transactional splice and hybrid carve-out show that you thought through the failure modes.

My take is that avoiding at most ~2K tokens of partial-tail re-prefill (block_size upper bound) feels like a narrow win for the amount of new code this brings. Default 4 slots, hybrid models excluded, plus the permanent surface of transactional rollback, deferred-clear interaction, and hot-cache memory cotenancy is a lot for an optimization that prefix cache already covers at block boundaries.

This is just my read though, curious how you see the tradeoff. If partial-tail recompute is hurting a specific workload you're targeting, that context would help me look at this differently.

@blightbow
Contributor Author

blightbow commented May 19, 2026

@jundot Prefill is the main performance weakness on Apple silicon: if there is any edge to be shaved off prefill, no matter how marginal, I attack it on principle. 😀

My goal was to solidify the approach, measure performance difference with and without, then make a judgement on worth. I know it's a lot of code that might not justify the effort, but worst case scenario others will know not to try.

The code approach has solidified. Just need to sit down and do the profiling...been distracted the last few days. I'll mark as ready to review when I have evidence.
