feat(cache): add per-model cache hit-rate observability#1183
0a18048 to 6c91508
Heads-up: three
$ git checkout 172be39 # current upstream/main
$ uv run pytest tests/test_admin_api_key.py::TestRuntimeCacheObservability -q
3 passed
$ git checkout 6c91508 # this PR's HEAD
$ uv run pytest tests/test_admin_api_key.py::TestRuntimeCacheObservability -q
3 failed
All three failures share the same signature:
omlx/admin/routes.py:3624: TypeError: '>' not supported between instances of 'int' and 'MagicMock'
disk_max = max(disk_max, m.get("max_size_bytes", 0))
Root cause is the new aggregation block at
One related observation: even after the TypeError is fixed, the test's
6c91508 to b49963b
Thanks for the detailed report — spot on. The new |
The ``/api/ssd-cache/clear`` admin endpoint at ``omlx/admin/routes.py:3975`` wipes the SSD-backed paged blocks per loaded scheduler but did not touch ``BlockAwarePrefixCache._mru_partials``. Surviving MRU partials then chain from paged-block hashes whose KV bytes were just flushed, violating the operator's "drop all warm caches" intent. Under ``hot_cache_only=True`` (where the hot tier IS the only persistent store) the same hazard would also apply to PR jundot#1183's forthcoming ``/api/hot-cache/clear`` endpoint when it lands; that wiring is a one-line follow-up at the same loop using the same method.

Wire ``block_aware_cache.clear_mru_partials()`` into the per-scheduler loop alongside ``ssd_manager.clear()``, with the same defensive try/except wrapper. A failure clearing one scheduler's MRU does not prevent siblings from being cleared.

The standalone ``clear_mru_partials()`` method (added in the previous commit, see ``omlx/cache/prefix_cache.py``) is the public seam. Its own unit coverage lives in ``TestMRUPartialMultiSlot::test_clear_mru_partials_wipes_only_partials``; this commit adds two endpoint-level tests in ``TestClearSSDCacheAlsoWipesMRUPartials`` that pin the wiring:

- ``test_endpoint_calls_clear_mru_partials_on_each_scheduler`` confirms both ``ssd_manager.clear()`` and ``block_aware_cache.clear_mru_partials()`` fire for every loaded scheduler.
- ``test_mru_clear_failure_does_not_block_other_scheduler`` pins the defensive try/except: an exception in one scheduler's clear path must not stop the loop.

Suite results on this commit:
62 passed: tests/test_admin_api_key.py (1 pre-existing unrelated failure remains)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
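The per-scheduler wiring described in this commit might look roughly like the sketch below. Only ``ssd_manager.clear()`` and ``block_aware_cache.clear_mru_partials()`` come from the commit message; the function name ``clear_all_warm_caches``, the fake scheduler objects, the returned count, and the choice of a single try/except around both calls are assumptions made for illustration.

```python
import logging
from types import SimpleNamespace
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def clear_all_warm_caches(schedulers):
    """Clear SSD blocks and MRU partials on each scheduler (hypothetical helper).

    Returns how many schedulers were cleared without raising.
    """
    cleared = 0
    for sched in schedulers:
        try:
            sched.ssd_manager.clear()
            sched.block_aware_cache.clear_mru_partials()
            cleared += 1
        except Exception:
            # A failure on one scheduler must not stop its siblings.
            logger.exception("cache clear failed for one scheduler")
    return cleared


def _fake_scheduler():
    # Stand-in for a loaded scheduler; attribute names follow the commit text.
    return SimpleNamespace(ssd_manager=MagicMock(), block_aware_cache=MagicMock())


good_a, bad, good_b = _fake_scheduler(), _fake_scheduler(), _fake_scheduler()
bad.block_aware_cache.clear_mru_partials.side_effect = RuntimeError("boom")
cleared = clear_all_warm_caches([good_a, bad, good_b])
```

In this sketch the failing middle scheduler is logged and skipped, and the scheduler after it is still cleared, which is the property the second endpoint test is said to pin.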
Plug the multi-slot MRU partial cache into the same observability surface that jundot#1183 established for the prefix and hot/disk tiers. Without these counters, operators have no signal for whether ``--mru-partial-max-entries`` is tuned right; with them, the dashboard answers "is the MRU paying off" with the same shape it answers for memory and disk hit rates.

Counters (cumulative, on PrefixCacheStats)
------------------------------------------
- ``mru_partial_stashes``: every successful entry write, including same-key replacements. Same-key replacement does NOT count as an eviction; operator sees stash payoff via the hits/stashes ratio.
- ``mru_partial_hits``: every successful splice via apply_mru_partial.
- ``mru_partial_evictions``: total entries removed. Includes capacity-overflow LRU evictions, apply-time mismatch pops (token, layer-count, splice failure), and ``clear_mru_partials()`` wipes. Does NOT include full ``clear()`` (cache-corruption recovery); that path also calls ``reset_stats()``, so incrementing evictions there would be incoherent (the increment gets zeroed immediately). Operators tracking partial-only wipes use ``clear_mru_partials()``.
- ``mru_partial_tokens_saved``: sum of ``n_partial`` across hits. The direct compute-saved measure: each unit is one token of prefill forward-pass that did NOT have to run.

Gauges (live state, on PrefixCacheStats)
----------------------------------------
- ``mru_partial_entries``: current dict length.
- ``mru_partial_max_entries``: configured capacity (operator-facing).

Plumbing
--------
Counters thread through ``Scheduler._collect_cache_counters`` into the existing ``CacheRateTracker``. ``observability._compute_window`` and ``_compute_cumulative`` add per-counter deltas plus a derived ``mru_partial_hit_rate`` (= hits / stashes, with the same zero-stashes-no-NaN guard the other ratios use).

The admin ``_build_runtime_cache_observability`` emits per-model entries/max_entries gauges and aggregates them at the payload level the same way ``hot_cache_entries`` and ``hot_cache_max_bytes`` do.

Dashboard
---------
Mirrors jundot#1183's hot-cache surface:

- **Header gauge** "MRU tails N/M entries" next to the Memory and SSD gauges, visible only when ``mru_partial_max_entries > 0``.
- **Rate strip** gains "MRU Tail Hit Rate" and "MRU Tokens Saved" cells. Grid expands from 4 cells (hot cache only) to 6 cells (both tiers). When only one or the other tier is enabled, the layout stays at 4 cells with the disabled tier's cells hidden via ``x-show``.
- **Per-model table** gains an "MRU Tails" column showing ``entries / max_entries`` for each loaded model.

Test coverage (10 new cases)
----------------------------
``TestMRUPartialCounters`` (8 cases): initial zeros, stash-counter bumps, same-key-replacement-is-not-eviction, capacity-overflow eviction count, apply-success bumps hits+tokens_saved, apply-miss eviction, ``clear_mru_partials()`` bulk eviction count, ``clear()`` zeros everything semantics, ``reset_stats()`` zeros cumulative counters but preserves live entries.

``TestCacheRateTrackerRates`` (3 new cases): mru_partial_hit_rate windowed + cumulative, zero-stashes no-NaN guard, tokens_saved delta accumulation.

``TestRuntimeCacheObservability`` updated to reflect the two new per-model payload keys (``mru_partial_entries``, ``mru_partial_max_entries``).

Suite results on this commit:
250 passed: cache + scheduler + observability + admin + settings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
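A minimal sketch of the derived ``mru_partial_hit_rate`` follows, covering both the cumulative form and the windowed form computed from two snapshots. The function names and the choice to return ``0.0`` when there were no stashes are assumptions; the commit message only says the ratio is hits / stashes and that a zero-stashes guard avoids NaN.

```python
def mru_partial_hit_rate(hits: int, stashes: int) -> float:
    """hits / stashes, with a guard so zero stashes never yields NaN or an error.

    Returning 0.0 in the guarded case is an assumption for this sketch.
    """
    if stashes <= 0:
        return 0.0
    return hits / stashes


def windowed_mru_partial_hit_rate(start: dict, end: dict) -> float:
    """Rate over a window, from two cumulative snapshots (end minus start)."""
    d_hits = end["mru_partial_hits"] - start["mru_partial_hits"]
    d_stashes = end["mru_partial_stashes"] - start["mru_partial_stashes"]
    return mru_partial_hit_rate(d_hits, d_stashes)


# Illustrative numbers only.
snap_start = {"mru_partial_hits": 10, "mru_partial_stashes": 40}
snap_end = {"mru_partial_hits": 25, "mru_partial_stashes": 60}

cumulative = mru_partial_hit_rate(snap_end["mru_partial_hits"], snap_end["mru_partial_stashes"])
windowed = windowed_mru_partial_hit_rate(snap_start, snap_end)
idle = windowed_mru_partial_hit_rate(snap_end, snap_end)  # no activity in the window
```

With these numbers the window saw 15 hits over 20 stashes, so the windowed rate differs from the cumulative one, which is the reason both are reported.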
…s, Alpine getters

Five small simplifications surfaced by a three-agent code review on the MRU partial cache commit stack. No behaviour change.

**Hoist test helpers to module level.** Three MRU test classes (``TestMRUPartialBlockCache``, ``TestMRUPartialMultiSlot``, ``TestMRUPartialCounters``) each redefined ``_layer``, ``_kv_layer``, ``_rotating_layer``, ``_make_reconstructed_cache``, ``_stash_with_prefix``, and ``_cache`` (factory). Hoist to module-level helpers in ``tests/test_prefix_cache.py`` (alongside the existing ``_get_mru_partial`` accessor); the duplicate methods come out, ~120 lines of repetition collapses, call sites switch from ``self._kv_layer(...)`` to ``_kv_layer(...)``. The factory for custom capacity is renamed ``_make_mru_cache(paged_cache, mock_ssd, max_entries, num_layers)``. Per-class fixtures (``mx``, ``paged_cache``, ``mock_ssd``) stay class-local to avoid leaking fixture names into unrelated test classes in the same module.

**Extract ``_evict_miss`` helper in ``apply_mru_partial``.** Five arms of ``self._mru_partials.pop(last_hash, None); self._mru_partial_evictions += 1; return cache, remaining_tokens, 0`` collapse into a single inner function. Each call site is now one line, the eviction-counter bookkeeping lives in one place, and the rollback contract is harder to break by accident.

**Dashboard Alpine getters.** Three getters added to the dashboard root: ``mruEnabled``, ``hotCacheEnabled``, ``cacheRatesGridCols``. The previous expressions repeated ``stats.runtime_cache?.mru_partial_max_entries > 0`` in 8 places and a three-arm ternary chain in the rate-strip grid-class binding. The HTML now reads ``x-show="mruEnabled && stats.runtime_cache?.models?.length > 0"`` and ``:class="cacheRatesGridCols"`` at the relevant sites.

**``_make_counters`` driven by a key tuple.** Previously took 16 explicit kwargs and re-listed every key in the returned dict body. Now: a module-level ``_COUNTER_KEYS`` tuple is the single source of truth; the helper builds a zero-initialised dict and applies ``**overrides``. Unknown keys raise (catches the typos the explicit signature used to catch). Adding a new observability counter is now one tuple entry instead of three coordinated changes (signature, dict body, and any call sites that wanted the default).

**Prune ``_MRUPartialBlock`` docstring rot-prone bits.** Dropped specific MiB numbers (~17/41/68-165 MiB) and the PR# reference to jundot#1120 from the memory-accounting section. The numbers were informative when written but would have aged past the model and config landscape they assumed. Kept the invariant statement and the test reference; removed the calibration table.

**Findings reviewed and skipped:**

- Aggregating ``mru_partial_max_entries`` across loaded models was flagged as "wrong arithmetic": actually correct, matches the deliberate hot-cache convention from jundot#1183 (each model has its own budget, the dashboard gauge shows fleet fill = sum of entries / sum of capacities).
- ``_get_cache_seq_len`` per-block redundancy in ``store_cache``: pre-existing pattern, not regressed by this stack, defer.
- Phase-1 ``mx.concatenate`` × N layers × 2 dispatch shape: defer pending M3 Ultra measurement (per ``feedback_apple_silicon_perf.md`` memory: never estimate prefill costs).
- ``paged_cache.allocated_blocks.get(...)`` direct dict access at 9+ pre-existing sites: wider refactor, out of scope.
- ``_all_layers_sliceable`` vs ``_prompt_cache_needs_snapshots`` co-location: different inputs (class-name strings vs live cache objects), unification would be scope-creep.

Suite results on this commit:
250 passed: cache + scheduler + observability + admin + settings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
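The key-tuple pattern described for ``_make_counters`` could be sketched as below. The names ``_COUNTER_KEYS`` and ``_make_counters`` come from the commit message; the four keys shown are a hypothetical subset, and raising ``KeyError`` is an assumption, since the message only says that unknown keys raise.

```python
# Hypothetical subset; the real tuple is said to list every tracked counter.
_COUNTER_KEYS = (
    "prefix_hits",
    "prefix_misses",
    "mru_partial_stashes",
    "mru_partial_hits",
)


def _make_counters(**overrides):
    """Build a zero-initialised counter dict, then apply keyword overrides."""
    unknown = set(overrides) - set(_COUNTER_KEYS)
    if unknown:
        # Catches the typos an explicit keyword signature would have caught.
        raise KeyError(f"unknown counter key(s): {sorted(unknown)}")
    counters = dict.fromkeys(_COUNTER_KEYS, 0)
    counters.update(overrides)
    return counters


baseline = _make_counters()
with_hits = _make_counters(prefix_hits=3, mru_partial_stashes=7)

try:
    _make_counters(prefix_hit=1)  # typo: missing trailing "s"
    typo_rejected = False
except KeyError:
    typo_rejected = True
```

Under this shape, adding a counter means appending one entry to the tuple; the helper and its callers need no other change unless they want a non-zero value.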
Server-side snapshot differencing via CacheRateTracker: stores the last 90 snapshots of cumulative cache counters (10s intervals, 15 min window) and computes rates via start/end differencing. Zero hot-path changes: snapshots are lazy, driven by dashboard polling cadence.

New metrics exposed in /api/stats under cache_observability:
- prefix_hit_rate (cumulative + windowed)
- eviction count, ssd_hot_rate
- per-model and weighted aggregate across models

Dashboard: new "Cache Breakdown" card below Average Speed showing hit rate, evictions, and hot cache hits. Session-only (hidden in All-Time view since counters reset on model reload).
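The snapshot-differencing idea can be illustrated with the sketch below. It is not the actual CacheRateTracker: the class name, method names, counter keys ("hits", "misses"), and the explicit `now` parameter are assumptions, and the locking a thread-safe tracker would need is left out for brevity. It keeps at most 90 snapshots, records a new one only if about 10 seconds have passed, and derives a windowed rate from the oldest and newest snapshots.

```python
import time
from collections import deque


class RateTrackerSketch:
    """Bounded history of cumulative counters; rates come from start/end deltas."""

    def __init__(self, max_snapshots=90, min_interval_s=10.0):
        # deque(maxlen=...) silently drops the oldest snapshot when full.
        self._snaps = deque(maxlen=max_snapshots)
        self._min_interval_s = min_interval_s

    def maybe_snapshot(self, counters, now=None):
        """Record a snapshot if the interval has elapsed; meant for the polling path."""
        now = time.monotonic() if now is None else now
        if self._snaps and now - self._snaps[-1][0] < self._min_interval_s:
            return False
        self._snaps.append((now, dict(counters)))
        return True

    def window_prefix_hit_rate(self):
        """Hit rate over the retained window; 0.0 when there is no activity."""
        if len(self._snaps) < 2:
            return 0.0
        start, end = self._snaps[0][1], self._snaps[-1][1]
        hits = end["hits"] - start["hits"]
        misses = end["misses"] - start["misses"]
        total = hits + misses
        return hits / total if total else 0.0


tracker = RateTrackerSketch()
first = tracker.maybe_snapshot({"hits": 0, "misses": 0}, now=0.0)
too_soon = tracker.maybe_snapshot({"hits": 5, "misses": 5}, now=5.0)
second = tracker.maybe_snapshot({"hits": 30, "misses": 10}, now=10.0)
rate = tracker.window_prefix_hit_rate()
```

Because the request path only increments counters and the snapshot is taken when a poll arrives, the cost of differencing stays on the dashboard side in this design.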
The new disk_max aggregation reads get_ssd_cache_max_size_bytes which the existing tests left unconfigured, producing a MagicMock that raises on max(MagicMock, int). Also add the missing max_size_bytes key to the expected model payloads.
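The failure mode can be reproduced in isolation with the sketch below. The `engine` object is a hypothetical stand-in for whatever the test mocks, and the 50 GiB value is illustrative; only the method name `get_ssd_cache_max_size_bytes` comes from the commit message. An unconfigured MagicMock attribute returns another MagicMock, whose ordering comparisons default to NotImplemented, so `max()` against an int raises TypeError.

```python
from unittest.mock import MagicMock

engine = MagicMock()  # hypothetical stand-in for the mocked object in the test

# Unconfigured: the call returns a MagicMock, which cannot be ordered against an int.
try:
    max(0, engine.get_ssd_cache_max_size_bytes())
    raised = False
except TypeError:
    raised = True

# Configured: the mock now returns a real int, so the aggregation works.
engine.get_ssd_cache_max_size_bytes.return_value = 50 * 1024**3
disk_max = max(0, engine.get_ssd_cache_max_size_bytes())
```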
b49963b to f79d450
nice, the per-tier gauges + rate breakdown beat the single cumulative %, and CacheRateTracker stays off the hot path (snapshot-driven). Tests pass. Merging. |
Summary
The dashboard shows Cache Efficiency as a single cumulative percentage — useful at a glance, but it doesn't answer the questions operators actually ask: Is the prefix cache hitting? Are hits served from memory or disk? How full is each cache tier? Are evictions building up? Is it safe to flush the hot tier without restarting?
This turns the Runtime Cache Observability card into a complete operational surface for both cache tiers — memory (hot cache) and SSD (disk) — with per-model breakdowns, capacity gauges, rate metrics, and safe clear controls.
What changed
Capacity gauges — the card header now shows memory and SSD usage bars with live fill percentages. Memory gauge appears only when hot cache is enabled and models are loaded. Both tiers have clear buttons: the existing SSD clear, and a new memory-only clear that flushes the in-memory hot tier without touching persistent SSD blocks.
Cache rate metrics — a 4-metric strip inside the card shows session-scoped rates: Prefix Hit Rate, Memory Hit Rate, Prefix Evictions, and Memory Evictions. Hidden in All-Time view since the counters are session-scoped. When filtering by model, the strip shows that model's rates; in All Models view, each rate is computed from its own aggregate numerator/denominator (prefix hit rate from total prefix hits/misses, memory hit rate from total hot hits/disk loads — no cross-weighting).
Per-model table columns — the existing SSD Files/Size columns are joined by Memory Entries/Size columns (gated on hot cache being enabled), so operators can see per-model cache footprint at a glance.
Hot cache disabled: when --hot-cache-max-size is 0 (the default), the Memory gauge, Memory Entries/Size columns, and Memory rate metrics hide automatically. Only prefix and SSD metrics are shown.
Hot cache clear endpoint: POST /api/hot-cache/clear mirrors the existing SSD clear route. Iterates loaded model schedulers via a new _iter_loaded_schedulers() helper (which also de-duplicates the getattr chain from the SSD clear route), calls clear_hot_cache() on each manager, and resets the rate tracker so window rates don't carry stale pre-clear history.
Token-level match efficiency: prefix cache accounting now tracks tokens_matched_total and tokens_requested_total alongside the existing hit/miss counts. A 2-token match on a 2048-token request is very different from a 2000-token match; prefix_match_efficiency captures that distinction. Exposed in the API response for scripting consumers.
How it works
CacheRateTracker (149 lines, omlx/cache/observability.py) stores up to 90 snapshots of cumulative cache counters at ~10-second intervals. Snapshots are lazy, taken only when the dashboard polls /api/stats, so there are zero hot-path changes beyond two counter increments in fetch_cache(). The scheduler's _collect_cache_counters reads from the public get_stats() API on PagedSSDCacheManager, avoiding private attribute access.
The tracker computes:
The SSD tier avoids double-counting by computing ssd_disk_loads = total_loads - hot_cache_hits, since loads includes both hot and disk reads. On scheduler reset, the tracker is cleared alongside prefix cache stats.
Known behavior surfaced by this PR
While validating the gauges against real multi-model workloads, we observed two pre-existing cache behaviors worth tracking separately:
Hot cache capacity is per loaded model, not process-global. Each scheduler creates its own PagedSSDCacheManager with an independent hot_cache_max_bytes budget. With 3 models loaded and --hot-cache-max-size 4GB, the dashboard correctly reports 12 GB total hot-cache capacity because the gauge sums per-model pools. The CLI help text says "Maximum in-memory hot cache size" without clarifying it applies per model; users setting 4 GB expect a 4 GB process budget, not 4 GB × N.
SSD usage can exceed the configured limit with multiple models. All model schedulers write to the same cache directory, but each manager enforces max_size independently. Under eviction pressure, the background writer queue can fill up, causing eviction deletes to be dropped ("SSD write queue full, dropping evicted block") while writes continue. We observed 68.8 GB on a 50 GB configured limit with all models idle.
This PR reports those values truthfully but does not change cache ownership or enforcement semantics; that belongs in separate issues.
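The per-model capacity arithmetic behind the gauge can be sketched as follows. The function name, the model names, and the use of whole gigabytes as the unit are assumptions for illustration; the 3 × 4 GB = 12 GB figure is the example from the description above.

```python
def total_hot_cache_capacity_gb(per_model_budget_gb):
    """Sum independent per-model hot-cache budgets, as the gauge is described to do."""
    return sum(per_model_budget_gb.values())


# Three loaded models, each with its own 4 GB budget (hypothetical model names).
budgets_gb = {"model-a": 4, "model-b": 4, "model-c": 4}
reported_gb = total_hot_cache_capacity_gb(budgets_gb)
```

The reported total grows with every additional loaded model, which is why a single configured value does not act as a process-wide cap.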
Changes
Test plan
uv run pytest tests/test_cache_observability.py tests/test_hot_cache.py -v: 37 tests covering snapshot mechanics, rate computation (steady-state, zero-activity, eviction rates, SSD hot ratio, token match efficiency), thread safety under concurrent access, clear/reset behavior, hot cache lifecycle, LRU eviction, byte accounting, and write-back to SSD
Follow-ups
Two extensions that build on this without expanding scope:
tokens_matched / hits to distinguish shallow matches from deep ones. The counters are already tracked server-side; this would be a dashboard-only addition.
Related: #905 (paged KV occupancy, a complementary surface), #1171 (hot cache byte accounting, no conflict)