
feat(cache): add ssd_write_drops counter for write-queue saturation #1406

Merged: jundot merged 1 commit into jundot:main from ivaniguarans:feat/ssd-write-drops-counter on May 26, 2026

@ivaniguarans

Contributor

Summary

Adds a runtime counter ssd_write_drops to PagedSSDCacheManager that increments at every site where the background writer queue is saturated and a cache write is dropped or skipped. The new metric pairs with the existing read-path counters (hits, loads, hot_cache_hits, etc.) in PagedSSDCacheStats, giving operators a first write-path health signal. WARN logs at these sites already exist; this PR makes the drops aggregable so they can be graphed, alerted on, or surfaced through the runtime cache observability surface added in #1183.

Purely additive: no change to drop behaviour, no change to any cache code path, no new dependency.

Changes

  • omlx/cache/stats.py: new ssd_write_drops: int = 0 field on PagedSSDCacheStats, grouped with the other operation counters (saves, loads, errors); added to reset() for parity with the other runtime counters.
  • omlx/cache/paged_ssd_cache.py: "ssd_write_drops": 0 initialised in _stats; incremented at all three queue-saturation drop sites in this module — the hot-cache eviction path in _enqueue_ssd_write, the cold-store preflight _write_queue.full() guard, and the cold-store late except queue.Full fallback. Passed through in get_stats() and get_stats_for_model(). get_stats_dict() already spreads **self._stats, so no manual update needed there.
  • tests/test_hot_cache.py: new TestSSDWriteDrops class with five tests — dataclass default + reset(), wiring round-trip through both stats accessors, and one test per drop site. The drop-site tests use unittest.mock.patch.object on the queue methods for deterministic firing rather than depending on writer-thread timing.
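The counting pattern in the list above can be sketched as follows. This is a minimal illustration, not the omlx code: `SketchStats`, `SketchCache`, its `save` method, and the queue size are hypothetical stand-ins, and only the names `ssd_write_drops`, `_write_queue`, `_stats`, `get_stats()`, and `reset()` come from this PR. It shows a preflight `full()` guard and a late `queue.Full` fallback incrementing the same counter.

```python
import queue
from dataclasses import dataclass


@dataclass
class SketchStats:
    """Hypothetical stand-in for PagedSSDCacheStats (only two fields shown)."""
    saves: int = 0
    ssd_write_drops: int = 0

    def reset(self) -> None:
        # Runtime counters return to zero together.
        self.saves = 0
        self.ssd_write_drops = 0


class SketchCache:
    """Hypothetical stand-in for the write path of PagedSSDCacheManager."""

    def __init__(self, max_pending: int = 2) -> None:
        self._write_queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._stats = {"saves": 0, "ssd_write_drops": 0}

    def save(self, block: bytes) -> bool:
        # Preflight guard: the common case when the writer has fallen behind.
        if self._write_queue.full():
            self._stats["ssd_write_drops"] += 1
            return False
        try:
            self._write_queue.put_nowait(block)
        except queue.Full:
            # Late fallback: the queue filled between the check and the put.
            self._stats["ssd_write_drops"] += 1
            return False
        self._stats["saves"] += 1
        return True

    def get_stats(self) -> SketchStats:
        # Pass the raw counters through to the stats dataclass.
        return SketchStats(**self._stats)


cache = SketchCache(max_pending=2)
results = [cache.save(b"x") for _ in range(5)]  # no writer drains the queue
stats = cache.get_stats()
```

Because nothing drains the queue in this sketch, every write after the first two is counted as a drop, which is the situation the counter is meant to make visible.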

The bulk-eviction unlink path is intentionally excluded — that fallback runs an inline unlink() when the unlink queue saturates, but the file already exists on disk so no cache write is dropped. Counting it would conflate cleanup-queue backpressure with actual write loss.
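To illustrate why that path is left uncounted, here is a rough sketch under stated assumptions: `schedule_unlink` and the `unlink_queue` argument are hypothetical names, not functions from omlx. When the cleanup queue is full the file is deleted inline, and `ssd_write_drops` is left alone because no cache write was lost.

```python
import queue
import tempfile
from pathlib import Path


def schedule_unlink(path: Path, unlink_queue: queue.Queue, stats: dict) -> str:
    """Hypothetical helper: hand a file to a background deleter if possible."""
    try:
        unlink_queue.put_nowait(path)
        return "queued"
    except queue.Full:
        # Cleanup backpressure only: the file was already written to disk,
        # so nothing cached is lost. Delete inline; `stats` is deliberately
        # never modified here.
        path.unlink(missing_ok=True)
        return "inline"


stats = {"ssd_write_drops": 0}
unlink_queue: queue.Queue = queue.Queue(maxsize=1)
with tempfile.TemporaryDirectory() as tmp:
    first, second = Path(tmp, "a.bin"), Path(tmp, "b.bin")
    first.write_bytes(b"a")
    second.write_bytes(b"b")
    outcomes = [schedule_unlink(p, unlink_queue, stats) for p in (first, second)]
    second_still_exists = second.exists()
```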

Test plan

  • uv run pytest tests/ -x — full suite passes
  • uv run pytest tests/test_hot_cache.py::TestSSDWriteDrops -v — 5/5 pass
  • Optional manual verification: load a small model (e.g. Qwen3-0.6B-4bit or Qwen3-Coder-Next-6bit per the project's cache micro-benchmarking convention), reduce _MAX_PENDING_WRITES to force queue saturation under a write-heavy workload, confirm ssd_write_drops appears and increments in /admin/api/stats.
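The deterministic drop-site tests mentioned under Changes might look roughly like the sketch below. `TinyWriter` and the two `run_*` helpers are hypothetical and stand in for the real manager and test class; the only technique taken from the PR is using `unittest.mock.patch.object` on the queue methods so that the drop branch fires without relying on writer-thread timing.

```python
import queue
from unittest.mock import patch


class TinyWriter:
    """Hypothetical stand-in; not the real PagedSSDCacheManager."""

    def __init__(self) -> None:
        self._write_queue: queue.Queue = queue.Queue(maxsize=4)
        self.ssd_write_drops = 0

    def save(self, block: bytes) -> bool:
        if self._write_queue.full():
            self.ssd_write_drops += 1
            return False
        try:
            self._write_queue.put_nowait(block)
        except queue.Full:
            self.ssd_write_drops += 1
            return False
        return True


def run_preflight_case() -> TinyWriter:
    writer = TinyWriter()
    # Force the preflight guard to see a saturated queue.
    with patch.object(writer._write_queue, "full", return_value=True):
        assert writer.save(b"x") is False
    return writer


def run_late_fallback_case() -> TinyWriter:
    writer = TinyWriter()
    # full() reports free space but put_nowait raises, which forces the late
    # except-queue.Full branch without racing a real writer thread.
    with patch.object(writer._write_queue, "full", return_value=False), \
         patch.object(writer._write_queue, "put_nowait", side_effect=queue.Full):
        assert writer.save(b"x") is False
    return writer


preflight = run_preflight_case()
late = run_late_fallback_case()
```

Patching the queue methods on the instance keeps each case independent: one drop site fires per test, so a counter that is incremented at the wrong site or twice would show up as a mismatch.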

Adds a runtime counter that increments at every site where
PagedSSDCacheManager skips or drops a cache write because the
background writer queue is saturated. Pairs the new metric with
the existing read-path metrics in PagedSSDCacheStats.

Three increment sites in paged_ssd_cache.py:
- _enqueue_ssd_write: hot-cache eviction -> queue.Full
- save_block: preflight _write_queue.full() guard (common case)
- save_block: late except queue.Full fallback

Excludes the bulk-eviction unlink fallback (separate signal -
file already exists on disk, no cache write is dropped).

Tests cover all three sites (mirrors test_queue_full_cleans_pending_buffer
for the hot-cache path; unittest.mock.patch.object for the deterministic
late-exception cold-store path).
@jundot

jundot commented May 26, 2026

Owner

Thanks for this. Verified the 3 drop sites cover real write loss, and the new field shows up in admin stats without extra wiring. Merging.

jundot merged commit 1b666af into jundot:main on May 26, 2026
jonpspri pushed a commit to jonpspri/omlx that referenced this pull request Jun 12, 2026
…undot#1406)
