feat(cache): add per-model cache hit-rate observability by ivaniguarans · Pull Request #1183 · jundot/omlx

ivaniguarans · 2026-05-11T17:10:46Z

Summary

The dashboard shows Cache Efficiency as a single cumulative percentage — useful at a glance, but it doesn't answer the questions operators actually ask: Is the prefix cache hitting? Are hits served from memory or disk? How full is each cache tier? Are evictions building up? Is it safe to flush the hot tier without restarting?

This turns the Runtime Cache Observability card into a complete operational surface for both cache tiers — memory (hot cache) and SSD (disk) — with per-model breakdowns, capacity gauges, rate metrics, and safe clear controls.

What changed

Capacity gauges — the card header now shows memory and SSD usage bars with live fill percentages. Memory gauge appears only when hot cache is enabled and models are loaded. Both tiers have clear buttons: the existing SSD clear, and a new memory-only clear that flushes the in-memory hot tier without touching persistent SSD blocks.

Cache rate metrics — a 4-metric strip inside the card shows session-scoped rates: Prefix Hit Rate, Memory Hit Rate, Prefix Evictions, and Memory Evictions. Hidden in All-Time view since the counters are session-scoped. When filtering by model, the strip shows that model's rates; in All Models view, each rate is computed from its own aggregate numerator/denominator (prefix hit rate from total prefix hits/misses, memory hit rate from total hot hits/disk loads — no cross-weighting).
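A minimal sketch of the All Models aggregation described above, assuming per-model counters arrive as plain dicts. The key names (`prefix_hits`, `prefix_misses`, `hot_hits`, `disk_loads`) and the helper `aggregate_rates` are hypothetical, and the memory hit rate is assumed to be hot hits divided by hot hits plus disk loads.

```python
from typing import Dict, List, Optional

COUNTER_NAMES = ("prefix_hits", "prefix_misses", "hot_hits", "disk_loads")


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when there was no activity."""
    return numerator / denominator if denominator > 0 else None


def aggregate_rates(models: List[Dict[str, int]]) -> Dict[str, Optional[float]]:
    """Sum raw counters first, then divide: each rate uses only its own totals."""
    total = {name: sum(m.get(name, 0) for m in models) for name in COUNTER_NAMES}
    return {
        "prefix_hit_rate": ratio(
            total["prefix_hits"], total["prefix_hits"] + total["prefix_misses"]
        ),
        "memory_hit_rate": ratio(
            total["hot_hits"], total["hot_hits"] + total["disk_loads"]
        ),
    }


# Two hypothetical models: one busy with a high hit rate, one nearly idle.
models = [
    {"prefix_hits": 90, "prefix_misses": 10, "hot_hits": 5, "disk_loads": 5},
    {"prefix_hits": 1, "prefix_misses": 9, "hot_hits": 0, "disk_loads": 0},
]
rates = aggregate_rates(models)
```

Averaging the two per-model prefix rates (0.9 and 0.1) would give 0.5, while pooling the counters gives 91/110, which weights each model by how many lookups it actually served.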

Per-model table columns — the existing SSD Files/Size columns are joined by Memory Entries/Size columns (gated on hot cache being enabled), so operators can see per-model cache footprint at a glance.

Hot cache disabled — when --hot-cache-max-size is 0 (the default), the Memory gauge, Memory Entries/Size columns, and Memory rate metrics hide automatically. Only prefix and SSD metrics are shown.

Hot cache clear endpoint — POST /api/hot-cache/clear mirrors the existing SSD clear route. Iterates loaded model schedulers via a new _iter_loaded_schedulers() helper (which also de-duplicates the getattr chain from the SSD clear route), calls clear_hot_cache() on each manager, and resets the rate tracker so window rates don't carry stale pre-clear history.
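The clear flow above might look roughly like the following sketch. Everything here is a stand-in: the attribute names (`entries`, `engine`, `scheduler`, `manager`, `tracker_snapshots`), the `FakeManager` class, and the return value of `clear_hot_cache()` are assumptions for illustration, not the actual omlx objects.

```python
from types import SimpleNamespace
from typing import Any, Iterator


def iter_loaded_schedulers(engine_pool: Any) -> Iterator[Any]:
    """Yield the scheduler of each loaded model, skipping unloaded entries."""
    for entry in getattr(engine_pool, "entries", []):
        engine = getattr(entry, "engine", None)  # None while unloaded
        scheduler = getattr(engine, "scheduler", None)
        if scheduler is not None:
            yield scheduler


class FakeManager:
    """Stand-in for a cache manager with separate hot and SSD tiers."""

    def __init__(self, hot_entries: int, ssd_files: int) -> None:
        self.hot_entries = hot_entries
        self.ssd_files = ssd_files

    def clear_hot_cache(self) -> int:
        cleared, self.hot_entries = self.hot_entries, 0  # SSD files untouched
        return cleared


def clear_hot_caches(engine_pool: Any) -> int:
    """Flush every loaded model's hot tier and drop its rate history."""
    total = 0
    for scheduler in iter_loaded_schedulers(engine_pool):
        total += scheduler.manager.clear_hot_cache()
        scheduler.tracker_snapshots.clear()  # no stale pre-clear window rates
    return total


def _entry(hot: int, ssd: int) -> SimpleNamespace:
    scheduler = SimpleNamespace(manager=FakeManager(hot, ssd), tracker_snapshots=[1, 2])
    return SimpleNamespace(engine=SimpleNamespace(scheduler=scheduler))


pool = SimpleNamespace(entries=[_entry(3, 7), SimpleNamespace(engine=None), _entry(5, 9)])
cleared = clear_hot_caches(pool)
```

Putting the getattr chain in one generator means the SSD clear route and the hot-cache clear route can share the same iteration logic.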

Token-level match efficiency — prefix cache accounting now tracks tokens_matched_total and tokens_requested_total alongside the existing hit/miss counts. A 2-token match on a 2048-token request is very different from a 2000-token match; prefix_match_efficiency captures that distinction. Exposed in the API response for scripting consumers.
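To make the distinction concrete, here is a small sketch of token-level accounting. The `PrefixMatchCounters` class and its `record` method are hypothetical. It assumes a lookup counts as a hit whenever at least one token matched, and that `prefix_match_efficiency` is matched tokens divided by requested tokens.

```python
from dataclasses import dataclass


@dataclass
class PrefixMatchCounters:
    hits: int = 0
    misses: int = 0
    tokens_matched_total: int = 0
    tokens_requested_total: int = 0

    def record(self, tokens_matched: int, tokens_requested: int) -> None:
        """Account for one lookup: the hit/miss count plus token totals."""
        if tokens_matched > 0:
            self.hits += 1
        else:
            self.misses += 1
        self.tokens_matched_total += tokens_matched
        self.tokens_requested_total += tokens_requested

    @property
    def hit_rate(self) -> float:
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0

    @property
    def prefix_match_efficiency(self) -> float:
        if self.tokens_requested_total == 0:
            return 0.0
        return self.tokens_matched_total / self.tokens_requested_total


shallow = PrefixMatchCounters()
shallow.record(tokens_matched=2, tokens_requested=2048)

deep = PrefixMatchCounters()
deep.record(tokens_matched=2000, tokens_requested=2048)
```

Both trackers report a 100% hit rate, but the efficiencies are 2/2048 and 2000/2048, which is the gap the count-based metric cannot show.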

How it works

CacheRateTracker (149 lines, omlx/cache/observability.py) stores up to 90 snapshots of cumulative cache counters at ~10-second intervals. Snapshots are lazy — taken only when the dashboard polls /api/stats — so there are zero hot-path changes beyond two counter increments in fetch_cache(). The scheduler's _collect_cache_counters reads from the public get_stats() API on PagedSSDCacheManager, avoiding private attribute access.

The tracker computes:

Windowed rates (1m/5m/15m) via start/end snapshot differencing — available in the API response for scripting consumers
Cumulative session rates — what the dashboard displays, since windowed rates go stale too fast on a single-user local server
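The snapshot-differencing idea can be sketched as follows. This is not the actual `CacheRateTracker`: the class name, method names, and counter keys (`hits`, `misses`) are assumptions, and only the 90-snapshot cap and ~10-second spacing come from the description above.

```python
import time
from collections import deque
from threading import Lock
from typing import Deque, Dict, Optional, Tuple

Snapshot = Tuple[float, Dict[str, int]]


class RateTrackerSketch:
    def __init__(self, max_snapshots: int = 90, min_interval_s: float = 10.0) -> None:
        # A bounded deque drops the oldest snapshot automatically.
        self._snapshots: Deque[Snapshot] = deque(maxlen=max_snapshots)
        self._min_interval_s = min_interval_s
        self._lock = Lock()

    def maybe_snapshot(self, counters: Dict[str, int], now: Optional[float] = None) -> bool:
        """Record cumulative counters unless the last snapshot is too recent."""
        now = time.monotonic() if now is None else now
        with self._lock:
            if self._snapshots and now - self._snapshots[-1][0] < self._min_interval_s:
                return False
            self._snapshots.append((now, dict(counters)))
            return True

    def window_hit_rate(self, window_s: float, now: Optional[float] = None) -> Optional[float]:
        """Hit rate over the last window_s seconds via start/end differencing."""
        now = time.monotonic() if now is None else now
        with self._lock:
            in_window = [s for s in self._snapshots if now - s[0] <= window_s]
        if len(in_window) < 2:
            return None
        start, end = in_window[0][1], in_window[-1][1]
        hits = end["hits"] - start["hits"]
        misses = end["misses"] - start["misses"]
        total = hits + misses
        return hits / total if total else None

    def reset(self) -> None:
        with self._lock:
            self._snapshots.clear()


tracker = RateTrackerSketch()
tracker.maybe_snapshot({"hits": 0, "misses": 0}, now=0.0)
skipped = tracker.maybe_snapshot({"hits": 5, "misses": 5}, now=5.0)  # too soon
tracker.maybe_snapshot({"hits": 30, "misses": 10}, now=60.0)
rate_1m = tracker.window_hit_rate(60.0, now=60.0)
```

Because snapshots are taken only when a caller asks for stats, the request path pays nothing beyond incrementing the cumulative counters.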

The SSD tier avoids double-counting by computing ssd_disk_loads = total_loads - hot_cache_hits, since loads includes both hot and disk reads. On scheduler reset, the tracker is cleared alongside prefix cache stats.
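As a worked example of the subtraction above, assuming cumulative counters are plain integers. The helper name `split_ssd_loads`, the clamp at zero, and the derived `memory_hit_rate` key are illustrative additions, not taken from the PR.

```python
from typing import Dict, Optional


def split_ssd_loads(total_loads: int, hot_cache_hits: int) -> Dict[str, Optional[float]]:
    """Separate true disk reads from hot-tier hits; total_loads counts both."""
    # Clamping at zero is a defensive assumption for racy counter reads.
    ssd_disk_loads = max(total_loads - hot_cache_hits, 0)
    served = hot_cache_hits + ssd_disk_loads
    return {
        "ssd_disk_loads": ssd_disk_loads,
        "memory_hit_rate": hot_cache_hits / served if served else None,
    }


stats = split_ssd_loads(total_loads=100, hot_cache_hits=80)
```

With 100 total loads of which 80 were hot-cache hits, only 20 loads actually touched disk; counting all 100 as disk loads would tally the 80 memory hits twice.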

Known behavior surfaced by this PR

While validating the gauges against real multi-model workloads, we observed two pre-existing cache behaviors worth tracking separately:

Hot cache capacity is per loaded model, not process-global. Each scheduler creates its own PagedSSDCacheManager with an independent hot_cache_max_bytes budget. With 3 models loaded and --hot-cache-max-size 4GB, the dashboard correctly reports 12 GB total hot-cache capacity because the gauge sums per-model pools. The CLI help text says "Maximum in-memory hot cache size" without clarifying it applies per model — users setting 4 GB expect a 4 GB process budget, not 4 GB × N.
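The arithmetic behind the 12 GB figure, as a short sketch. The per-model usage values are made up for illustration, and 4GB is treated as 4 GiB here.

```python
GIB = 1024 ** 3

hot_cache_max_bytes = 4 * GIB  # --hot-cache-max-size 4GB, applied per model

# Hypothetical per-model hot-cache usage in bytes.
hot_cache_size_bytes = {
    "Qwen3.6-27B-oQ6-mtp": 1 * GIB,
    "Qwen3.6-27B-oQ8-mtp": 2 * GIB,
    "Qwen3.6-35B-A3B-oQ4-mtp": 3 * GIB,
}

# The gauge sums per-model pools: capacity is budget times loaded models.
total_capacity = hot_cache_max_bytes * len(hot_cache_size_bytes)
total_used = sum(hot_cache_size_bytes.values())
fill_percent = 100 * total_used / total_capacity
```

Here the gauge would read 6 GiB of 12 GiB (50%), even though the operator passed a 4 GB limit on the command line.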

SSD usage can exceed the configured limit with multiple models. All model schedulers write to the same cache directory, but each manager enforces max_size independently. Under eviction pressure, the background writer queue can fill up, causing eviction deletes to be dropped ("SSD write queue full, dropping evicted block") while writes continue. We observed 68.8 GB on a 50 GB configured limit with all models idle.

This PR reports those values truthfully but does not change cache ownership or enforcement semantics — that belongs in separate issues.

Changes

omlx/cache/observability.py          | 149 new    — CacheRateTracker module
tests/test_cache_observability.py    | 202 new    — 15 tests (snapshots, rates, threads, clear)
omlx/cache/prefix_cache.py          |  13 +      — token-level match counters in fetch_cache()
omlx/cache/stats.py                 |   4 +      — counter fields in PrefixCacheStats
omlx/cache/paged_ssd_cache.py       |  14 +      — clear_hot_cache() method
omlx/scheduler.py                   |  42 +/-    — counter collection via public API + tracker integration
omlx/admin/routes.py                | 180 +/-    — _iter_loaded_schedulers(), hot-cache clear endpoint, cache aggregation
omlx/admin/static/js/dashboard.js   |  64 +      — cacheObsCumulative aggregation, gauge getters, clear handler
omlx/admin/templates/_status.html   |  77 +      — gauges, rate strip, per-model memory columns
omlx/admin/static/css/dashboard.css |   4 +      — gauge track dark mode

Test plan

uv run pytest tests/test_cache_observability.py tests/test_hot_cache.py -v — 37 tests covering snapshot mechanics, rate computation (steady-state, zero-activity, eviction rates, SSD hot ratio, token match efficiency), thread safety under concurrent access, clear/reset behavior, hot cache lifecycle, LRU eviction, byte accounting, and write-back to SSD
Live-tested on three models simultaneously (Qwen3.6-27B-oQ6-mtp, Qwen3.6-27B-oQ8-mtp, Qwen3.6-35B-A3B-oQ4-mtp) — verified per-model filtering, aggregate rate computation, gauge accuracy, and both clear endpoints
Verified Session/All-Time toggle hides the rate strip correctly while keeping gauges visible
Verified memory gauge and columns hide when hot cache is disabled
Verified hot cache clear flushes entries and resets gauges without affecting SSD
36 pre-existing test failures in unrelated files (server endpoints, grammar, VLM, MTP, oQ, admin update check) — none in our changed files

Follow-ups

Two extensions that build on this without expanding scope:

Average prefix match depth — tokens_matched / hits to distinguish shallow matches from deep ones. The counters are already tracked server-side; this would be a dashboard-only addition.
Windowed rate selector — the backend already computes 1m/5m/15m windowed rates. A dropdown in the dashboard would let operators see recent trends instead of only cumulative session values.
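The average match depth proposed in the first follow-up is a single division over counters that already exist. A sketch, with a hypothetical helper name and an assumed 0.0 result when there are no hits:

```python
def average_match_depth(tokens_matched_total: int, hits: int) -> float:
    """Mean number of matched tokens per prefix-cache hit (0.0 with no hits)."""
    return tokens_matched_total / hits if hits > 0 else 0.0


# Same hit count, very different depth.
shallow = average_match_depth(tokens_matched_total=40, hits=20)
deep = average_match_depth(tokens_matched_total=40_000, hits=20)
```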

Related: #905 (paged KV occupancy — complementary surface), #1171 (hot cache byte accounting — no conflict)

blightbow · 2026-05-13T21:04:21Z

Heads-up: three TestRuntimeCacheObservability tests pass on main but fail on this PR's HEAD. This surfaced when attempting to rebase my own PR #1149 onto 6c91508.

$ git checkout 172be39  # current upstream/main
$ uv run pytest tests/test_admin_api_key.py::TestRuntimeCacheObservability -q
3 passed

$ git checkout 6c91508  # this PR's HEAD
$ uv run pytest tests/test_admin_api_key.py::TestRuntimeCacheObservability -q
3 failed

All three failures share the same signature:

 omlx/admin/routes.py:3624: TypeError: '>' not supported between instances of 'int' and 'MagicMock'
    disk_max = max(disk_max, m.get("max_size_bytes", 0))

Root cause is the new aggregation block at routes.py:3617-3625 (Aggregate hot-cache and disk-max across models). It initializes disk_max = payload["disk_max_bytes"], whose value flows from global_settings.cache.get_ssd_cache_max_size_bytes(...). The test mocks global_settings with MagicMock() and only configures a couple of explicit return values; get_ssd_cache_max_size_bytes is an auto-created, unconfigured attribute, so calling it returns a MagicMock. The per-model m.get("max_size_bytes", 0) is correctly an int (the dict construction at routes.py:3591 already does the int(... or 0) coercion), but max(MagicMock, int) raises.
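The failure mode can be reproduced outside the test suite with a few lines. This is a standalone sketch, not the repository's test: the argument passed to the getter and the 50 GiB return value are placeholders.

```python
from unittest.mock import MagicMock

# A bare MagicMock auto-creates attributes; calling one returns another MagicMock.
global_settings = MagicMock()
disk_max = global_settings.cache.get_ssd_cache_max_size_bytes("placeholder")

try:
    max(disk_max, 0)  # MagicMock's ordering methods return NotImplemented
    error = None
except TypeError as exc:
    error = str(exc)

# Configuring an explicit return value makes the comparison well-defined.
global_settings.cache.get_ssd_cache_max_size_bytes.return_value = 50 * 1024 ** 3
fixed = max(global_settings.cache.get_ssd_cache_max_size_bytes("placeholder"), 0)
```

An alternative on the implementation side would be to coerce the value with `int(...)` before aggregating, though that would also raise on a bare mock unless the test configures it.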

One related observation: even after the TypeError is fixed, the test's assert payload["models"] == [...] expected-dict at tests/test_admin_api_key.py:739-768 doesn't include the new max_size_bytes key that routes.py:3591 writes, so the assertion will still fail on key-set mismatch. The hot_cache_max_bytes / hot_cache_size_bytes / hot_cache_entries keys are present in the expected dict; max_size_bytes is the lone omission. Worth one pass to make sure the expected payload tracks what the implementation produces.

ivaniguarans · 2026-05-13T21:16:28Z

Thanks for the detailed report — spot on. The new disk_max aggregation block reads get_ssd_cache_max_size_bytes which those tests left unconfigured, so MagicMock leaked into max(). Fixed that and added the missing max_size_bytes key to the expected payloads. Also rebased onto current main. Should be green now — you should be unblocked for #1149.

The ``/api/ssd-cache/clear`` admin endpoint at ``omlx/admin/routes.py:3975`` wipes the SSD-backed paged blocks per loaded scheduler but did not touch ``BlockAwarePrefixCache._mru_partials``. Surviving MRU partials then chain from paged-block hashes whose KV bytes were just flushed, violating the operator's "drop all warm caches" intent. Under ``hot_cache_only=True`` (where the hot tier IS the only persistent store) the same hazard would also apply to PR jundot#1183's forthcoming ``/api/hot-cache/clear`` endpoint when it lands; that wiring is a one-line follow-up at the same loop using the same method.

Wire ``block_aware_cache.clear_mru_partials()`` into the per-scheduler loop alongside ``ssd_manager.clear()``, with the same defensive try/except wrapper. A failure clearing one scheduler's MRU does not prevent siblings from being cleared.

The standalone ``clear_mru_partials()`` method (added in the previous commit, see ``omlx/cache/prefix_cache.py``) is the public seam. Its own unit coverage lives in ``TestMRUPartialMultiSlot::test_clear_mru_partials_wipes_only_partials``; this commit adds two endpoint-level tests in ``TestClearSSDCacheAlsoWipesMRUPartials`` that pin the wiring:

- ``test_endpoint_calls_clear_mru_partials_on_each_scheduler`` confirms both ``ssd_manager.clear()`` and ``block_aware_cache.clear_mru_partials()`` fire for every loaded scheduler.
- ``test_mru_clear_failure_does_not_block_other_scheduler`` pins the defensive try/except: an exception in one scheduler's clear path must not stop the loop.

Suite results on this commit: 62 passed: tests/test_admin_api_key.py (1 pre-existing unrelated failure remains)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
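A sketch of the failure-isolation pattern this commit message describes, using mocks in place of real schedulers. The function name `clear_ssd_and_mru`, the returned failure count, and the way `ssd_manager` and `block_aware_cache` hang off each scheduler are assumptions for illustration.

```python
import logging
from typing import Any, Iterable
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def clear_ssd_and_mru(schedulers: Iterable[Any]) -> int:
    """Clear SSD blocks and MRU partials per scheduler; return the failure count."""
    failures = 0
    for scheduler in schedulers:
        try:
            scheduler.ssd_manager.clear()
        except Exception:
            failures += 1
            logger.exception("SSD cache clear failed; continuing")
        try:
            scheduler.block_aware_cache.clear_mru_partials()
        except Exception:
            failures += 1
            logger.exception("MRU partial clear failed; continuing")
    return failures


# The first scheduler's MRU clear raises; the second must still be cleared.
first, second = MagicMock(), MagicMock()
first.block_aware_cache.clear_mru_partials.side_effect = RuntimeError("boom")
failures = clear_ssd_and_mru([first, second])
```

Wrapping each call separately keeps an MRU failure from masking a successful SSD clear on the same scheduler, at the cost of a slightly longer loop body.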

Plug the multi-slot MRU partial cache into the same observability surface that jundot#1183 established for the prefix and hot/disk tiers. Without these counters, operators have no signal for whether ``--mru-partial-max-entries`` is tuned right; with them, the dashboard answers "is the MRU paying off" with the same shape it answers for memory and disk hit rates.

Counters (cumulative, on PrefixCacheStats)
------------------------------------------

- ``mru_partial_stashes``: every successful entry write, including same-key replacements. Same-key replacement does NOT count as an eviction; operator sees stash payoff via the hits/stashes ratio.
- ``mru_partial_hits``: every successful splice via apply_mru_partial.
- ``mru_partial_evictions``: total entries removed. Includes capacity-overflow LRU evictions, apply-time mismatch pops (token, layer-count, splice failure), and ``clear_mru_partials()`` wipes. Does NOT include full ``clear()`` (cache-corruption recovery); that path also calls ``reset_stats()``, so incrementing evictions there would be incoherent (the increment gets zeroed immediately). Operators tracking partial-only wipes use ``clear_mru_partials()``.
- ``mru_partial_tokens_saved``: sum of ``n_partial`` across hits. The direct compute-saved measure: each unit is one token of prefill forward-pass that did NOT have to run.

Gauges (live state, on PrefixCacheStats)
----------------------------------------

- ``mru_partial_entries``: current dict length.
- ``mru_partial_max_entries``: configured capacity (operator-facing).

Plumbing
--------

Counters thread through ``Scheduler._collect_cache_counters`` into the existing ``CacheRateTracker``. ``observability._compute_window`` and ``_compute_cumulative`` add per-counter deltas plus a derived ``mru_partial_hit_rate`` (= hits / stashes, with the same zero-stashes-no-NaN guard the other ratios use). The admin ``_build_runtime_cache_observability`` emits per-model entries/max_entries gauges and aggregates them at the payload level the same way ``hot_cache_entries`` and ``hot_cache_max_bytes`` do.

Dashboard
---------

Mirrors jundot#1183's hot-cache surface:

- **Header gauge** "MRU tails N/M entries" next to the Memory and SSD gauges, visible only when ``mru_partial_max_entries > 0``.
- **Rate strip** gains "MRU Tail Hit Rate" and "MRU Tokens Saved" cells. Grid expands from 4 cells (hot cache only) to 6 cells (both tiers). When only one or the other tier is enabled, the layout stays at 4 cells with the disabled tier's cells hidden via ``x-show``.
- **Per-model table** gains an "MRU Tails" column showing ``entries / max_entries`` for each loaded model.

Test coverage (10 new cases)
----------------------------

``TestMRUPartialCounters`` (8 cases): initial zeros, stash-counter bumps, same-key-replacement-is-not-eviction, capacity-overflow eviction count, apply-success bumps hits+tokens_saved, apply-miss eviction, ``clear_mru_partials()`` bulk eviction count, ``clear()`` zeros everything semantics, ``reset_stats()`` zeros cumulative counters but preserves live entries.

``TestCacheRateTrackerRates`` (3 new cases): mru_partial_hit_rate windowed + cumulative, zero-stashes no-NaN guard, tokens_saved delta accumulation.

``TestRuntimeCacheObservability`` updated to reflect the two new per-model payload keys (``mru_partial_entries``, ``mru_partial_max_entries``).

Suite results on this commit: 250 passed: cache + scheduler + observability + admin + settings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
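The derived ``mru_partial_hit_rate`` and its zero-stashes guard might be computed along these lines. The function name `mru_window` is hypothetical, and returning 0.0 when no stashes occurred is an assumption; the message above only says the guard avoids NaN.

```python
from typing import Dict

MRU_KEYS = (
    "mru_partial_stashes",
    "mru_partial_hits",
    "mru_partial_evictions",
    "mru_partial_tokens_saved",
)


def mru_window(start: Dict[str, int], end: Dict[str, int]) -> Dict[str, float]:
    """Difference two cumulative snapshots and derive the hit rate."""
    delta: Dict[str, float] = {key: end[key] - start[key] for key in MRU_KEYS}
    stashes = delta["mru_partial_stashes"]
    # Guard: a window with no stashes reports 0.0 instead of dividing by zero.
    delta["mru_partial_hit_rate"] = (
        delta["mru_partial_hits"] / stashes if stashes > 0 else 0.0
    )
    return delta


start = dict(zip(MRU_KEYS, (10, 4, 1, 100)))
end = dict(zip(MRU_KEYS, (30, 19, 3, 850)))

active = mru_window(start, end)  # 20 stashes, 15 hits in the window
idle = mru_window(end, end)      # nothing happened between snapshots
```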

…s, Alpine getters

Five small simplifications surfaced by a three-agent code review on the MRU partial cache commit stack. No behaviour change.

**Hoist test helpers to module level.** Three MRU test classes (``TestMRUPartialBlockCache``, ``TestMRUPartialMultiSlot``, ``TestMRUPartialCounters``) each redefined ``_layer``, ``_kv_layer``, ``_rotating_layer``, ``_make_reconstructed_cache``, ``_stash_with_prefix``, and ``_cache`` (factory). Hoist to module-level helpers in ``tests/test_prefix_cache.py`` (alongside the existing ``_get_mru_partial`` accessor); the duplicate methods come out, ~120 lines of repetition collapses, call sites switch from ``self._kv_layer(...)`` to ``_kv_layer(...)``. The factory for custom capacity is renamed ``_make_mru_cache(paged_cache, mock_ssd, max_entries, num_layers)``. Per-class fixtures (``mx``, ``paged_cache``, ``mock_ssd``) stay class-local to avoid leaking fixture names into unrelated test classes in the same module.

**Extract ``_evict_miss`` helper in ``apply_mru_partial``.** Five arms of ``self._mru_partials.pop(last_hash, None); self._mru_partial_evictions += 1; return cache, remaining_tokens, 0`` collapse into a single inner function. Each call site is now one line, the eviction-counter bookkeeping lives in one place, and the rollback contract is harder to break by accident.

**Dashboard Alpine getters.** Three getters added to the dashboard root: ``mruEnabled``, ``hotCacheEnabled``, ``cacheRatesGridCols``. The previous expressions repeated ``stats.runtime_cache?.mru_partial_max_entries > 0`` in 8 places and a three-arm ternary chain in the rate-strip grid-class binding. The HTML now reads ``x-show="mruEnabled && stats.runtime_cache?.models?.length > 0"`` and ``:class="cacheRatesGridCols"`` at the relevant sites.

**``_make_counters`` driven by a key tuple.** Previously took 16 explicit kwargs and re-listed every key in the returned dict body. Now: a module-level ``_COUNTER_KEYS`` tuple is the single source of truth; the helper builds a zero-initialised dict and applies ``**overrides``. Unknown keys raise (catches the typos the explicit signature used to catch). Adding a new observability counter is now one tuple entry instead of three coordinated changes (signature, dict body, and any call sites that wanted the default).

**Prune ``_MRUPartialBlock`` docstring rot-prone bits.** Dropped specific MiB numbers (~17/41/68-165 MiB) and the PR# reference to jundot#1120 from the memory-accounting section. The numbers were informative when written but would have aged past the model and config landscape they assumed. Kept the invariant statement and the test reference; removed the calibration table.

**Findings reviewed and skipped:**

- Aggregating ``mru_partial_max_entries`` across loaded models was flagged as "wrong arithmetic"; it is actually correct and matches the deliberate hot-cache convention from jundot#1183 (each model has its own budget, the dashboard gauge shows fleet fill = sum of entries / sum of capacities).
- ``_get_cache_seq_len`` per-block redundancy in ``store_cache``: pre-existing pattern, not regressed by this stack, defer.
- Phase-1 ``mx.concatenate`` × N layers × 2 dispatch shape: defer pending M3 Ultra measurement (per ``feedback_apple_silicon_perf.md`` memory: never estimate prefill costs).
- ``paged_cache.allocated_blocks.get(...)`` direct dict access at 9+ pre-existing sites: wider refactor, out of scope.
- ``_all_layers_sliceable`` vs ``_prompt_cache_needs_snapshots`` co-location: different inputs (class-name strings vs live cache objects), unification would be scope-creep.

Suite results on this commit: 250 passed: cache + scheduler + observability + admin + settings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Server-side snapshot differencing via CacheRateTracker: stores the last 90 snapshots of cumulative cache counters (10s intervals, 15 min window) and computes rates via start/end differencing. Zero hot-path changes: snapshots are lazy, driven by dashboard polling cadence.

New metrics exposed in /api/stats under cache_observability:

- prefix_hit_rate (cumulative + windowed)
- eviction count, ssd_hot_rate
- per-model and weighted aggregate across models

Dashboard: new "Cache Breakdown" card below Average Speed showing hit rate, evictions, and hot cache hits. Session-only (hidden in All-Time view since counters reset on model reload).

The new disk_max aggregation reads get_ssd_cache_max_size_bytes which the existing tests left unconfigured, producing a MagicMock that raises on max(MagicMock, int). Also add the missing max_size_bytes key to the expected model payloads.

jundot · 2026-05-19T05:29:47Z

nice, the per-tier gauges + rate breakdown beat the single cumulative %, and CacheRateTracker stays off the hot path (snapshot-driven). Tests pass. Merging.

ivaniguarans force-pushed the feat/cache-hit-rate-observability branch 4 times, most recently from 0a18048 to 6c91508 · May 12, 2026 19:22

ivaniguarans mentioned this pull request · May 13, 2026: feat(scheduler): split store_cache_main_prep into sub-phase timers #1243 (Merged)

blightbow mentioned this pull request · May 13, 2026: feat(cache): multi-slot LRU MRU partial block cache #1149 (Draft)

ivaniguarans force-pushed the feat/cache-hit-rate-observability branch from 6c91508 to b49963b · May 13, 2026 21:10

ivaniguarans added 2 commits · May 18, 2026 14:24

ivaniguarans force-pushed the feat/cache-hit-rate-observability branch from b49963b to f79d450 · May 18, 2026 18:25

jundot merged commit 8d0c08a into jundot:main May 19, 2026

ivaniguarans deleted the feat/cache-hit-rate-observability branch May 19, 2026 10:35

jundot merged 2 commits into jundot:main from ivaniguarans:feat/cache-hit-rate-observability
