feat(cache): add ssd_write_drops counter for write-queue saturation#1406
Merged
Merged
Conversation
Adds a runtime counter that increments at every site where PagedSSDCacheManager skips or drops a cache write because the background writer queue is saturated. Pairs the new metric with the existing read-path metrics in PagedSSDCacheStats. Three increment sites in paged_ssd_cache.py: - _enqueue_ssd_write: hot-cache eviction -> queue.Full - save_block: preflight _write_queue.full() guard (common case) - save_block: late except queue.Full fallback Excludes the bulk-eviction unlink fallback (separate signal - file already exists on disk, no cache write is dropped). Tests cover all three sites (mirrors test_queue_full_cleans_pending_buffer for the hot-cache path; unittest.mock.patch.object for the deterministic late-exception cold-store path).
Owner
|
Thanks for this. Verified the 3 drop sites cover real write loss, and the new field shows up in admin stats without extra wiring. Merging. |
jonpspri
pushed a commit
to jonpspri/omlx
that referenced
this pull request
Jun 12, 2026
…undot#1406) Adds a runtime counter that increments at every site where PagedSSDCacheManager skips or drops a cache write because the background writer queue is saturated. Pairs the new metric with the existing read-path metrics in PagedSSDCacheStats. Three increment sites in paged_ssd_cache.py: - _enqueue_ssd_write: hot-cache eviction -> queue.Full - save_block: preflight _write_queue.full() guard (common case) - save_block: late except queue.Full fallback Excludes the bulk-eviction unlink fallback (separate signal - file already exists on disk, no cache write is dropped). Tests cover all three sites (mirrors test_queue_full_cleans_pending_buffer for the hot-cache path; unittest.mock.patch.object for the deterministic late-exception cold-store path).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a runtime counter
ssd_write_dropstoPagedSSDCacheManagerthat increments at every site where the background writer queue is saturated and a cache write is dropped or skipped. The new metric pairs with the existing read-path counters (hits,loads,hot_cache_hits, etc.) inPagedSSDCacheStats, giving operators a first write-path health signal. WARN logs at these sites already exist; this PR makes the drops aggregable so they can be graphed, alerted on, or surfaced through the runtime cache observability surface added in #1183.Purely additive: no change to drop behaviour, no change to any cache code path, no new dependency.
Changes
omlx/cache/stats.py: newssd_write_drops: int = 0field onPagedSSDCacheStats, grouped with the other operation counters (saves,loads,errors); added toreset()for parity with the other runtime counters.omlx/cache/paged_ssd_cache.py:"ssd_write_drops": 0initialised in_stats; incremented at all three queue-saturation drop sites in this module — the hot-cache eviction path in_enqueue_ssd_write, the cold-store preflight_write_queue.full()guard, and the cold-store lateexcept queue.Fullfallback. Passed through inget_stats()andget_stats_for_model().get_stats_dict()already spreads**self._stats, so no manual update needed there.tests/test_hot_cache.py: newTestSSDWriteDropsclass with five tests — dataclass default +reset(), wiring round-trip through both stats accessors, and one test per drop site. The drop-site tests useunittest.mock.patch.objecton the queue methods for deterministic firing rather than depending on writer-thread timing.The bulk-eviction unlink path is intentionally excluded — that fallback runs an inline
unlink()when the unlink queue saturates, but the file already exists on disk so no cache write is dropped. Counting it would conflate cleanup-queue backpressure with actual write loss.Test plan
uv run pytest tests/ -x— full suite passesuv run pytest tests/test_hot_cache.py::TestSSDWriteDrops -v— 5/5 passQwen3-0.6B-4bitorQwen3-Coder-Next-6bitper the project's cache micro-benchmarking convention), reduce_MAX_PENDING_WRITESto force queue saturation under a write-heavy workload, confirmssd_write_dropsappears and increments in/admin/api/stats.