feat(cache): add per-model cache hit-rate observability #1183

Merged
jundot merged 2 commits into jundot:main from ivaniguarans:feat/cache-hit-rate-observability on May 19, 2026

Conversation

@ivaniguarans
Contributor

@ivaniguarans ivaniguarans commented May 11, 2026

Summary

The dashboard shows Cache Efficiency as a single cumulative percentage — useful at a glance, but it doesn't answer the questions operators actually ask: Is the prefix cache hitting? Are hits served from memory or disk? How full is each cache tier? Are evictions building up? Is it safe to flush the hot tier without restarting?

This turns the Runtime Cache Observability card into a complete operational surface for both cache tiers — memory (hot cache) and SSD (disk) — with per-model breakdowns, capacity gauges, rate metrics, and safe clear controls.

What changed

Capacity gauges — the card header now shows memory and SSD usage bars with live fill percentages. Memory gauge appears only when hot cache is enabled and models are loaded. Both tiers have clear buttons: the existing SSD clear, and a new memory-only clear that flushes the in-memory hot tier without touching persistent SSD blocks.

[Screenshot: 1-new-card]

Cache rate metrics — a 4-metric strip inside the card shows session-scoped rates: Prefix Hit Rate, Memory Hit Rate, Prefix Evictions, and Memory Evictions. Hidden in All-Time view since the counters are session-scoped. When filtering by model, the strip shows that model's rates; in All Models view, each rate is computed from its own aggregate numerator/denominator (prefix hit rate from total prefix hits/misses, memory hit rate from total hot hits/disk loads — no cross-weighting).
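A minimal sketch of the "no cross-weighting" aggregation described above, assuming each rate is hits over hits plus the competing outcome. The `ModelCounters` class and its field names are hypothetical, chosen for illustration, and are not the identifiers used in the PR.

```python
from dataclasses import dataclass


@dataclass
class ModelCounters:
    # Hypothetical per-model cumulative counters (names are illustrative).
    prefix_hits: int = 0
    prefix_misses: int = 0
    hot_hits: int = 0
    disk_loads: int = 0


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there was no activity."""
    return numerator / denominator if denominator > 0 else 0.0


def aggregate_rates(models: list[ModelCounters]) -> dict[str, float]:
    """Sum each counter across models first, then divide once per rate."""
    prefix_hits = sum(m.prefix_hits for m in models)
    prefix_misses = sum(m.prefix_misses for m in models)
    hot_hits = sum(m.hot_hits for m in models)
    disk_loads = sum(m.disk_loads for m in models)
    return {
        "prefix_hit_rate": ratio(prefix_hits, prefix_hits + prefix_misses),
        "memory_hit_rate": ratio(hot_hits, hot_hits + disk_loads),
    }


busy = ModelCounters(prefix_hits=90, prefix_misses=10, hot_hits=30, disk_loads=10)
cold = ModelCounters(prefix_hits=10, prefix_misses=90)
rates = aggregate_rates([busy, cold])
```

Because the sums are taken before dividing, a model with no hot-cache traffic (`cold` here) does not drag the memory hit rate toward zero the way averaging per-model rates would.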

[Screenshots: 2-per-model-filter, 3-all-time-filter]

Per-model table columns — the existing SSD Files/Size columns are joined by Memory Entries/Size columns (gated on hot cache being enabled), so operators can see per-model cache footprint at a glance.

[Screenshot: 4-single-model-full-view]

Hot cache disabled — when --hot-cache-max-size is 0 (the default), the Memory gauge, Memory Entries/Size columns, and Memory rate metrics hide automatically. Only prefix and SSD metrics are shown.

[Screenshot: 6-no-hot-cache]

Hot cache clear endpoint: POST /api/hot-cache/clear mirrors the existing SSD clear route. It iterates loaded model schedulers via a new _iter_loaded_schedulers() helper (which also de-duplicates the getattr chain from the SSD clear route), calls clear_hot_cache() on each manager, and resets the rate tracker so window rates don't carry stale pre-clear history.
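The clear flow can be sketched without the web framework. Only `clear_hot_cache()` and the idea of an iterate-loaded-schedulers helper come from the description above; the attribute names `scheduler`, `ssd_manager`, and `rate_tracker`, and the shape of `engines`, are assumptions made for this illustration.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def iter_loaded_schedulers(engines):
    """Yield (model_id, scheduler) for every engine with a scheduler loaded.

    `engines` is a hypothetical mapping of model id -> engine object.
    """
    for model_id, engine in engines.items():
        scheduler = getattr(engine, "scheduler", None)
        if scheduler is not None:
            yield model_id, scheduler


def clear_hot_caches(engines):
    """Flush the in-memory tier only, then reset rate history per scheduler."""
    cleared = []
    for model_id, scheduler in iter_loaded_schedulers(engines):
        manager = getattr(scheduler, "ssd_manager", None)
        if manager is None:
            continue
        manager.clear_hot_cache()       # hot tier only; SSD blocks untouched
        scheduler.rate_tracker.clear()  # drop stale pre-clear snapshots
        cleared.append(model_id)
    return cleared


# Usage with stand-in objects: one loaded model, one unloaded.
loaded = MagicMock()
engines = {
    "model-a": SimpleNamespace(scheduler=loaded),
    "model-b": SimpleNamespace(scheduler=None),
}
cleared = clear_hot_caches(engines)
```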

[Screenshot: 5-clear-cache]

Token-level match efficiency — prefix cache accounting now tracks tokens_matched_total and tokens_requested_total alongside the existing hit/miss counts. A 2-token match on a 2048-token request is very different from a 2000-token match; prefix_match_efficiency captures that distinction. Exposed in the API response for scripting consumers.
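To make the 2-token versus 2000-token distinction concrete, here is a small sketch that assumes prefix_match_efficiency is simply matched tokens divided by requested tokens. The function body is an illustration of that assumption, not the PR's code.

```python
def prefix_match_efficiency(tokens_matched_total: int,
                            tokens_requested_total: int) -> float:
    """Fraction of requested prompt tokens that were served from the prefix cache.

    Assumed definition: matched / requested, with 0.0 when nothing was requested.
    """
    if tokens_requested_total <= 0:
        return 0.0
    return tokens_matched_total / tokens_requested_total


# Both requests count as one "hit", so the hit rate alone cannot tell them apart.
shallow = prefix_match_efficiency(2, 2048)
deep = prefix_match_efficiency(2000, 2048)
```

Under this definition the shallow match scores roughly 0.001 and the deep match roughly 0.98, even though each contributes the same single hit to the hit/miss counters.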

How it works

CacheRateTracker (149 lines, omlx/cache/observability.py) stores up to 90 snapshots of cumulative cache counters at ~10-second intervals. Snapshots are lazy — taken only when the dashboard polls /api/stats — so there are zero hot-path changes beyond two counter increments in fetch_cache(). The scheduler's _collect_cache_counters reads from the public get_stats() API on PagedSSDCacheManager, avoiding private attribute access.

The tracker computes:

  • Windowed rates (1m/5m/15m) via start/end snapshot differencing — available in the API response for scripting consumers
  • Cumulative session rates — what the dashboard displays, since windowed rates go stale too fast on a single-user local server

The SSD tier avoids double-counting by computing ssd_disk_loads = total_loads - hot_cache_hits, since loads includes both hot and disk reads. On scheduler reset, the tracker is cleared alongside prefix cache stats.
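As a worked example of the double-counting fix, the sketch below splits total loads into hot hits and true disk reads. The subtraction is taken from the text; the clamp to zero and the derived hot ratio are assumptions added for illustration.

```python
def split_ssd_loads(total_loads: int, hot_cache_hits: int) -> tuple[int, float]:
    """Return (ssd_disk_loads, hot_ratio).

    total_loads counts every block load, whether it was served from the hot
    tier or read from disk, so disk reads are the remainder.
    """
    # Clamp is an assumption: it guards against counters sampled mid-update.
    ssd_disk_loads = max(total_loads - hot_cache_hits, 0)
    served = hot_cache_hits + ssd_disk_loads
    hot_ratio = hot_cache_hits / served if served > 0 else 0.0
    return ssd_disk_loads, hot_ratio


disk_loads, hot_ratio = split_ssd_loads(total_loads=100, hot_cache_hits=75)
```

With 100 total loads and 75 hot hits, only 25 loads touched the disk, so treating all 100 as disk reads would overstate SSD traffic by a factor of four.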

Known behavior surfaced by this PR

While validating the gauges against real multi-model workloads, we observed two pre-existing cache behaviors worth tracking separately:

Hot cache capacity is per loaded model, not process-global. Each scheduler creates its own PagedSSDCacheManager with an independent hot_cache_max_bytes budget. With 3 models loaded and --hot-cache-max-size 4GB, the dashboard correctly reports 12 GB total hot-cache capacity because the gauge sums per-model pools. The CLI help text says "Maximum in-memory hot cache size" without clarifying it applies per model — users setting 4 GB expect a 4 GB process budget, not 4 GB × N.
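The arithmetic behind the 12 GB figure, as a short sketch. The function name and the `(size_bytes, max_bytes)` tuple shape are hypothetical, and the per-model usage numbers are made up for the example.

```python
GIB = 1024 ** 3


def aggregate_hot_cache(models: list[tuple[int, int]]) -> tuple[int, int, float]:
    """Sum per-model (size_bytes, max_bytes) pools into one gauge.

    Returns (used_bytes, capacity_bytes, fill_fraction).
    """
    used = sum(size for size, _ in models)
    capacity = sum(max_bytes for _, max_bytes in models)
    fill = used / capacity if capacity > 0 else 0.0
    return used, capacity, fill


# Three loaded models, each with its own 4 GiB budget.
pools = [(1 * GIB, 4 * GIB), (2 * GIB, 4 * GIB), (3 * GIB, 4 * GIB)]
used, capacity, fill = aggregate_hot_cache(pools)
```

Here the gauge reports 6 GiB used of 12 GiB, which is correct for three independent pools but three times the budget a user might expect from a single 4 GB flag value.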

SSD usage can exceed the configured limit with multiple models. All model schedulers write to the same cache directory, but each manager enforces max_size independently. Under eviction pressure, the background writer queue can fill up, causing eviction deletes to be dropped ("SSD write queue full, dropping evicted block") while writes continue. We observed 68.8 GB on a 50 GB configured limit with all models idle.

This PR reports those values truthfully but does not change cache ownership or enforcement semantics — that belongs in separate issues.

Changes

omlx/cache/observability.py          | 149 new    — CacheRateTracker module
tests/test_cache_observability.py    | 202 new    — 15 tests (snapshots, rates, threads, clear)
omlx/cache/prefix_cache.py          |  13 +      — token-level match counters in fetch_cache()
omlx/cache/stats.py                 |   4 +      — counter fields in PrefixCacheStats
omlx/cache/paged_ssd_cache.py       |  14 +      — clear_hot_cache() method
omlx/scheduler.py                   |  42 +/-    — counter collection via public API + tracker integration
omlx/admin/routes.py                | 180 +/-    — _iter_loaded_schedulers(), hot-cache clear endpoint, cache aggregation
omlx/admin/static/js/dashboard.js   |  64 +      — cacheObsCumulative aggregation, gauge getters, clear handler
omlx/admin/templates/_status.html   |  77 +      — gauges, rate strip, per-model memory columns
omlx/admin/static/css/dashboard.css |   4 +      — gauge track dark mode

Test plan

  • uv run pytest tests/test_cache_observability.py tests/test_hot_cache.py -v — 37 tests covering snapshot mechanics, rate computation (steady-state, zero-activity, eviction rates, SSD hot ratio, token match efficiency), thread safety under concurrent access, clear/reset behavior, hot cache lifecycle, LRU eviction, byte accounting, and write-back to SSD
  • Live-tested on three models simultaneously (Qwen3.6-27B-oQ6-mtp, Qwen3.6-27B-oQ8-mtp, Qwen3.6-35B-A3B-oQ4-mtp) — verified per-model filtering, aggregate rate computation, gauge accuracy, and both clear endpoints
  • Verified Session/All-Time toggle hides the rate strip correctly while keeping gauges visible
  • Verified memory gauge and columns hide when hot cache is disabled
  • Verified hot cache clear flushes entries and resets gauges without affecting SSD
  • 36 pre-existing test failures in unrelated files (server endpoints, grammar, VLM, MTP, oQ, admin update check) — none in our changed files

Follow-ups

Two extensions that build on this without expanding scope:

  1. Average prefix match depth: tokens_matched / hits, to distinguish shallow matches from deep ones. The counters are already tracked server-side; this would be a dashboard-only addition.
  2. Windowed rate selector — the backend already computes 1m/5m/15m windowed rates. A dropdown in the dashboard would let operators see recent trends instead of only cumulative session values.

Related: #905 (paged KV occupancy — complementary surface), #1171 (hot cache byte accounting — no conflict)

@blightbow
Contributor

Heads-up: three TestRuntimeCacheObservability tests pass on main but fail on this PR's HEAD. This surfaced when attempting to rebase my own PR #1149 onto 6c91508.

$ git checkout 172be39  # current upstream/main
$ uv run pytest tests/test_admin_api_key.py::TestRuntimeCacheObservability -q
3 passed

$ git checkout 6c91508  # this PR's HEAD
$ uv run pytest tests/test_admin_api_key.py::TestRuntimeCacheObservability -q
3 failed

All three failures share the same signature:

 omlx/admin/routes.py:3624: TypeError: '>' not supported between instances of 'int' and 'MagicMock'
    disk_max = max(disk_max, m.get("max_size_bytes", 0))

Root cause is the new aggregation block at routes.py:3617-3625 (Aggregate hot-cache and disk-max across models). It initializes disk_max = payload["disk_max_bytes"], whose value flows from global_settings.cache.get_ssd_cache_max_size_bytes(...). The test mocks global_settings with MagicMock() and only configures a couple of explicit return values; get_ssd_cache_max_size_bytes is left unconfigured, so the auto-created child mock returns another MagicMock. The per-model m.get("max_size_bytes", 0) is correctly an int (the dict construction at routes.py:3591 already does the int(... or 0) coercion), but max(MagicMock, int) raises.
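The failure mode reproduces in isolation with nothing but `unittest.mock`. The string argument passed to the mocked method and the 50 GiB return value are placeholders, and configuring `return_value` is one possible test-side fix, not necessarily the one applied in this PR.

```python
from unittest.mock import MagicMock

# A bare MagicMock auto-creates attributes, so an unconfigured method call
# returns another MagicMock rather than a number.
global_settings = MagicMock()
disk_max = global_settings.cache.get_ssd_cache_max_size_bytes("placeholder")

error = None
try:
    # max() evaluates `0 > disk_max`; MagicMock's default comparison
    # methods return NotImplemented, so Python raises TypeError.
    max(disk_max, 0)
except TypeError as exc:
    error = str(exc)

# One possible fix in the test: give the mocked method a concrete int.
global_settings.cache.get_ssd_cache_max_size_bytes.return_value = 50 * 1024 ** 3
fixed = max(global_settings.cache.get_ssd_cache_max_size_bytes("placeholder"), 0)
```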

One related observation: even after the TypeError is fixed, the test's assert payload["models"] == [...] expected-dict at tests/test_admin_api_key.py:739-768 doesn't include the new max_size_bytes key that routes.py:3591 writes, so the assertion will still fail on key-set mismatch. The hot_cache_max_bytes / hot_cache_size_bytes / hot_cache_entries keys are present in the expected dict; max_size_bytes is the lone omission. Worth one pass to make sure the expected payload tracks what the implementation produces.

@ivaniguarans ivaniguarans force-pushed the feat/cache-hit-rate-observability branch from 6c91508 to b49963b Compare May 13, 2026 21:10
@ivaniguarans
Contributor Author

Thanks for the detailed report — spot on. The new disk_max aggregation block reads get_ssd_cache_max_size_bytes which those tests left unconfigured, so MagicMock leaked into max(). Fixed that and added the missing max_size_bytes key to the expected payloads. Also rebased onto current main. Should be green now — you should be unblocked for #1149.

blightbow added a commit to blightbow/omlx that referenced this pull request May 13, 2026
The ``/api/ssd-cache/clear`` admin endpoint at
``omlx/admin/routes.py:3975`` wipes the SSD-backed paged blocks per
loaded scheduler but did not touch ``BlockAwarePrefixCache._mru_partials``.
Surviving MRU partials then chain from paged-block hashes whose KV
bytes were just flushed, violating the operator's "drop all warm
caches" intent.  Under ``hot_cache_only=True`` (where the hot tier IS
the only persistent store) the same hazard would also apply to PR
jundot#1183's forthcoming ``/api/hot-cache/clear`` endpoint when it lands;
that wiring is a one-line follow-up at the same loop using the same
method.

Wire ``block_aware_cache.clear_mru_partials()`` into the per-scheduler
loop alongside ``ssd_manager.clear()``, with the same defensive
try/except wrapper.  A failure clearing one scheduler's MRU does not
prevent siblings from being cleared.
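A sketch of the defensive per-scheduler loop described above. The method names `ssd_manager.clear()` and `block_aware_cache.clear_mru_partials()` come from this commit message; the function name, the `schedulers` mapping, the logger name, and the returned failure list are assumptions for illustration.

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger("cache_clear_sketch")


def clear_ssd_and_mru(schedulers: dict) -> list[str]:
    """Clear SSD blocks and MRU partials for every scheduler.

    An exception from one scheduler is logged and recorded, and the loop
    moves on to the next scheduler instead of aborting.
    """
    failed = []
    for model_id, scheduler in schedulers.items():
        try:
            scheduler.ssd_manager.clear()
            scheduler.block_aware_cache.clear_mru_partials()
        except Exception:
            logger.exception("cache clear failed for %s", model_id)
            failed.append(model_id)
    return failed


# Usage with stand-in schedulers: the first one fails, the second still clears.
bad, good = MagicMock(), MagicMock()
bad.block_aware_cache.clear_mru_partials.side_effect = RuntimeError("boom")
failed = clear_ssd_and_mru({"bad": bad, "good": good})
```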

The standalone ``clear_mru_partials()`` method (added in the previous
commit, see ``omlx/cache/prefix_cache.py``) is the public seam.  Its
own unit coverage lives in
``TestMRUPartialMultiSlot::test_clear_mru_partials_wipes_only_partials``;
this commit adds two endpoint-level tests in
``TestClearSSDCacheAlsoWipesMRUPartials`` that pin the wiring:

  - ``test_endpoint_calls_clear_mru_partials_on_each_scheduler``
    confirms both ``ssd_manager.clear()`` and
    ``block_aware_cache.clear_mru_partials()`` fire for every
    loaded scheduler.
  - ``test_mru_clear_failure_does_not_block_other_scheduler``
    pins the defensive try/except: an exception in one scheduler's
    clear path must not stop the loop.

Suite results on this commit:
  62 passed: tests/test_admin_api_key.py (1 pre-existing unrelated
             failure remains)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
blightbow added a commit to blightbow/omlx that referenced this pull request May 13, 2026
Plug the multi-slot MRU partial cache into the same observability
surface that jundot#1183 established for the prefix and hot/disk tiers.
Without these counters, operators have no signal for whether
``--mru-partial-max-entries`` is tuned right; with them, the
dashboard answers "is the MRU paying off" with the same shape it
answers for memory and disk hit rates.

Counters (cumulative, on PrefixCacheStats)
------------------------------------------
- ``mru_partial_stashes`` — every successful entry write, including
  same-key replacements.  Same-key replacement does NOT count as an
  eviction; operator sees stash payoff via the hits/stashes ratio.
- ``mru_partial_hits`` — every successful splice via apply_mru_partial.
- ``mru_partial_evictions`` — total entries removed.  Includes
  capacity-overflow LRU evictions, apply-time mismatch pops (token,
  layer-count, splice failure), and ``clear_mru_partials()`` wipes.
  Does NOT include full ``clear()`` (cache-corruption recovery) —
  that path also calls ``reset_stats()``, so incrementing evictions
  there would be incoherent (the increment gets zeroed immediately).
  Operators tracking partial-only wipes use ``clear_mru_partials()``.
- ``mru_partial_tokens_saved`` — sum of ``n_partial`` across hits.
  The direct compute-saved measure: each unit is one token of prefill
  forward-pass that did NOT have to run.

Gauges (live state, on PrefixCacheStats)
----------------------------------------
- ``mru_partial_entries`` — current dict length.
- ``mru_partial_max_entries`` — configured capacity (operator-facing).

Plumbing
--------
Counters thread through ``Scheduler._collect_cache_counters`` into
the existing ``CacheRateTracker``.  ``observability._compute_window``
and ``_compute_cumulative`` add per-counter deltas plus a derived
``mru_partial_hit_rate`` (= hits / stashes, with the same
zero-stashes-no-NaN guard the other ratios use).  The admin
``_build_runtime_cache_observability`` emits per-model entries/
max_entries gauges and aggregates them at the payload level the
same way ``hot_cache_entries`` and ``hot_cache_max_bytes`` do.

Dashboard
---------
Mirrors jundot#1183's hot-cache surface:

- **Header gauge** "MRU tails N/M entries" next to the Memory and
  SSD gauges, visible only when ``mru_partial_max_entries > 0``.
- **Rate strip** gains "MRU Tail Hit Rate" and "MRU Tokens Saved"
  cells.  Grid expands from 4 cells (hot cache only) to 6 cells
  (both tiers).  When only one or the other tier is enabled, the
  layout stays at 4 cells with the disabled tier's cells hidden via
  ``x-show``.
- **Per-model table** gains an "MRU Tails" column showing
  ``entries / max_entries`` for each loaded model.

Test coverage (10 new cases)
----------------------------
``TestMRUPartialCounters`` (8 cases) — initial zeros, stash-counter
bumps, same-key-replacement-is-not-eviction, capacity-overflow
eviction count, apply-success bumps hits+tokens_saved, apply-miss
eviction, ``clear_mru_partials()`` bulk eviction count, ``clear()``
zeros everything semantics, ``reset_stats()`` zeros cumulative
counters but preserves live entries.

``TestCacheRateTrackerRates`` (3 new cases) — mru_partial_hit_rate
windowed + cumulative, zero-stashes no-NaN guard, tokens_saved
delta accumulation.

``TestRuntimeCacheObservability`` updated to reflect the two new
per-model payload keys (``mru_partial_entries``,
``mru_partial_max_entries``).

Suite results on this commit:
  250 passed: cache + scheduler + observability + admin + settings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
blightbow added a commit to blightbow/omlx that referenced this pull request May 13, 2026
…s, Alpine getters

Five small simplifications surfaced by a three-agent code review on
the MRU partial cache commit stack.  No behaviour change.

**Hoist test helpers to module level.**  Three MRU test classes
(``TestMRUPartialBlockCache``, ``TestMRUPartialMultiSlot``,
``TestMRUPartialCounters``) each redefined ``_layer``, ``_kv_layer``,
``_rotating_layer``, ``_make_reconstructed_cache``, ``_stash_with_prefix``,
and ``_cache`` (factory).  Hoist to module-level helpers in
``tests/test_prefix_cache.py`` (alongside the existing
``_get_mru_partial`` accessor); the duplicate methods come out, ~120
lines of repetition collapses, call sites switch from
``self._kv_layer(...)`` to ``_kv_layer(...)``.  The factory for
custom capacity is renamed ``_make_mru_cache(paged_cache, mock_ssd,
max_entries, num_layers)``.  Per-class fixtures (``mx``,
``paged_cache``, ``mock_ssd``) stay class-local to avoid leaking
fixture names into unrelated test classes in the same module.

**Extract ``_evict_miss`` helper in ``apply_mru_partial``.**  Five
arms of ``self._mru_partials.pop(last_hash, None); self._mru_partial_evictions
+= 1; return cache, remaining_tokens, 0`` collapse into a single
inner function.  Each call site is now one line, the eviction-counter
bookkeeping lives in one place, and the rollback contract is harder
to break by accident.

**Dashboard Alpine getters.**  Three getters added to the dashboard
root: ``mruEnabled``, ``hotCacheEnabled``, ``cacheRatesGridCols``.
The previous expressions repeated ``stats.runtime_cache?.mru_partial_max_entries
> 0`` in 8 places and a three-arm ternary chain in the rate-strip
grid-class binding.  The HTML now reads
``x-show="mruEnabled && stats.runtime_cache?.models?.length > 0"``
and ``:class="cacheRatesGridCols"`` at the relevant sites.

**``_make_counters`` driven by a key tuple.**  Previously took 16
explicit kwargs and re-listed every key in the returned dict body.
Now: a module-level ``_COUNTER_KEYS`` tuple is the single source of
truth; the helper builds a zero-initialised dict and applies
``**overrides``.  Unknown keys raise (catches the typos the explicit
signature used to catch).  Adding a new observability counter is
now one tuple entry instead of three coordinated changes (signature,
dict body, and any call sites that wanted the default).

**Prune ``_MRUPartialBlock`` docstring rot-prone bits.**  Dropped
specific MiB numbers (~17/41/68-165 MiB) and the PR# reference to
jundot#1120 from the memory-accounting section.  The numbers were
informative when written but would have aged past the model and
config landscape they assumed.  Kept the invariant statement and the
test reference; removed the calibration table.

**Findings reviewed and skipped:**

- Aggregating ``mru_partial_max_entries`` across loaded models was
  flagged as "wrong arithmetic" — actually correct, matches the
  deliberate hot-cache convention from jundot#1183 (each model has its own
  budget, the dashboard gauge shows fleet fill = sum of entries /
  sum of capacities).
- ``_get_cache_seq_len`` per-block redundancy in ``store_cache`` —
  pre-existing pattern, not regressed by this stack, defer.
- Phase-1 ``mx.concatenate`` × N layers × 2 dispatch shape —
  defer pending M3 Ultra measurement (per
  ``feedback_apple_silicon_perf.md`` memory: never estimate prefill
  costs).
- ``paged_cache.allocated_blocks.get(...)`` direct dict access at
  9+ pre-existing sites — wider refactor, out of scope.
- ``_all_layers_sliceable`` vs ``_prompt_cache_needs_snapshots``
  co-location — different inputs (class-name strings vs live cache
  objects), unification would be scope-creep.

Suite results on this commit:
  250 passed: cache + scheduler + observability + admin + settings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-side snapshot differencing via CacheRateTracker: stores the last
90 snapshots of cumulative cache counters (10s intervals, 15 min window)
and computes rates via start/end differencing. Zero hot-path changes —
snapshots are lazy, driven by dashboard polling cadence.

New metrics exposed in /api/stats under cache_observability:
- prefix_hit_rate (cumulative + windowed)
- eviction count, ssd_hot_rate
- per-model and weighted aggregate across models

Dashboard: new "Cache Breakdown" card below Average Speed showing hit
rate, evictions, and hot cache hits. Session-only (hidden in All-Time
view since counters reset on model reload).

The new disk_max aggregation reads get_ssd_cache_max_size_bytes which
the existing tests left unconfigured, producing a MagicMock that raises
on max(MagicMock, int).  Also add the missing max_size_bytes key to the
expected model payloads.
@ivaniguarans ivaniguarans force-pushed the feat/cache-hit-rate-observability branch from b49963b to f79d450 Compare May 18, 2026 18:25
@jundot
Owner

jundot commented May 19, 2026

nice, the per-tier gauges + rate breakdown beat the single cumulative %, and CacheRateTracker stays off the hot path (snapshot-driven). Tests pass. Merging.

@jundot jundot merged commit 8d0c08a into jundot:main May 19, 2026
@ivaniguarans ivaniguarans deleted the feat/cache-hit-rate-observability branch May 19, 2026 10:35