fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity by cfbraun · Pull Request #1451 · jundot/omlx

cfbraun · 2026-05-27T07:32:28Z

Summary

_enforce_size_limit_for_new_block enqueues evicted file unlinks as ("unlink", path) items onto _write_queue — the same bounded queue that carries pending writes. Combined with the pre-eviction _write_queue.full() short-circuit at the top of save_block, this creates a deadlock under sustained save pressure:

Writer is saturated → _write_queue is full.
save_block's pre-eviction check sees full() → returns False immediately, before calling _enforce_size_limit_for_new_block.
Eviction never runs → cache stays at the size cap.
Every subsequent save drops; ssd_write_drops climbs forever while _total_size sits pinned at the cap.

Fix: inline the unlinks on the eviction-calling thread instead. Eviction typically removes a single block per save (evict_until_size stops as soon as total_size <= target), so this is one syscall per save in steady state. The deferred-unlink justification ("avoid blocking the inference thread with N file delete syscalls") doesn't materialise under normal load, and inlining removes the bounded-queue contention entirely.

Bounded inline burst. The ENOSPC-recovery path invalidates the 30 s disk-usage cache, which can shrink _get_effective_max_size sharply — evict_until_size would then return hundreds of entries at once and the inline-unlink loop would stall the inference thread on a syscall storm. Cap the burst at _MAX_INLINE_UNLINKS_PER_SAVE = 32 and reinsert the deferred metadata into the index so subsequent saves drain the remainder.

Also adds an evict_unlink_failures stats counter (eviction now decrements the index before the on-disk unlink; if Path.unlink raises OSError, surfacing the counter lets operators see when on-disk size has drifted above what the index reports).

The dead writer-thread ("unlink", file_path) dispatch branch is removed since no path enqueues such tuples anymore.

Test plan

pytest tests/test_paged_ssd_cache.py::TestInlineLRUUnlinks — 4 passed
pytest tests/test_paged_ssd_cache.py tests/test_hot_cache.py — 125 passed (4 new + 121 existing)

``_enforce_size_limit_for_new_block`` enqueues evicted file unlinks as ``("unlink", path)`` items onto ``_write_queue`` — the same bounded queue that carries pending writes. Combined with the pre-eviction ``_write_queue.full()`` short-circuit at the top of ``save_block``, this creates a deadlock under sustained save pressure: 1. Writer is saturated → ``_write_queue`` is full. 2. ``save_block``'s pre-eviction check sees ``full()`` → returns False immediately, BEFORE calling ``_enforce_size_limit_for_new_block``. 3. Eviction never runs → cache stays at the size cap. 4. Every subsequent save drops; ``ssd_write_drops`` climbs forever while ``_total_size`` sits pinned at the cap. Inline the unlinks on the eviction-calling thread instead. Eviction typically removes a single block per save (``evict_until_size`` stops as soon as ``total_size <= target``), so this is one syscall per save in steady state. The deferred-unlink justification ("avoid blocking the inference thread with N file delete syscalls") doesn't materialise under normal load, and inlining removes the bounded-queue contention entirely. Bounded inline burst. The ENOSPC-recovery path invalidates the 30 s disk-usage cache, which can shrink ``_get_effective_max_size`` sharply on the next save — ``evict_until_size`` would then return hundreds of entries at once and the inline-unlink loop would stall the inference thread on a syscall storm. Cap the burst at ``_MAX_INLINE_UNLINKS_PER_SAVE = 32`` and reinsert the deferred metadata into the index so subsequent saves drain the remainder. Bounds per-call latency at the cost of taking multiple saves to fully reconverge. ``evict_unlink_failures`` stats counter. Eviction now decrements the index before the on-disk unlink; if ``Path.unlink`` raises ``OSError``, the previous "log a warning and move on" pattern silently lost the signal. Surfacing the counter lets operators see that the on-disk size has drifted above what the index reports. Tests (tests/test_paged_ssd_cache.py::TestInlineLRUUnlinks): - test_eviction_does_not_enqueue_unlink_tasks: sentinel-patch on ``put_nowait`` asserts no ``("unlink", ...)`` items ever enter the queue. - test_eviction_frees_capacity_under_pressure: with the writer busy, eviction still keeps ``_index.total_size`` near the configured cap. - test_inline_eviction_burst_is_capped: forced mass-eviction removes at most ``_MAX_INLINE_UNLINKS_PER_SAVE`` entries; the rest reinsert so subsequent saves can drain. - test_unlink_failure_increments_counter: a patched ``OSError`` from ``Path.unlink`` increments ``evict_unlink_failures``. The dead writer-thread ``("unlink", file_path)`` dispatch branch is removed since no path enqueues such tuples anymore. 87 existing paged_ssd_cache tests + 34 hot_cache tests + 4 new tests pass.

jundot · 2026-06-06T16:36:53Z

Thanks for tracking this down. The root cause makes sense: eviction was sharing the bounded write queue with pending writes, so once the queue saturated, cache pressure could not drain cleanly.

This looks good to me, and I am going to merge it. I verified tests/test_paged_ssd_cache.py and tests/test_hot_cache.py locally, and I will fold two small follow-ups into main after merge: preserving LRU order by sorting _lru by last_access after capped inline eviction reinserts deferred entries, and threading evict_unlink_failures through PagedSSDCacheStats so it shows up in the runtime/admin stats path.

Resolves conflicts in favor of the merged upstream shape for: - jundot#1628 (max_context_window_policy): upstream merged the reworked separate-nullable-field design, not the original "reuse max_context_window" design that still sat on this branch. Drop the 1_000_000 default and adopt the 32768 fallback + optional policy cap jundot ultimately wanted. - jundot#1451 (inline LRU unlinks): upstream's slice+reinsert variant of the burst cap landed (with observability counters from 021162b). Drop the local push-cap-into-evict_until_size variant. - Test suites realigned to the merged shapes (TestInlineLRUUnlinks, policy-cap test names, 32768 fallback assertions).

cfbraun force-pushed the pr/inline-lru-unlinks branch 2 times, most recently from a9792b3 to 6577237 Compare May 27, 2026 09:30

jundot mentioned this pull request May 28, 2026

fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated #1471

Closed

5 tasks

cfbraun force-pushed the pr/inline-lru-unlinks branch 11 times, most recently from 0a8a1d0 to 1357c6a Compare June 3, 2026 11:08

cfbraun force-pushed the pr/inline-lru-unlinks branch 2 times, most recently from 869206a to 22e86ec Compare June 5, 2026 16:13

cfbraun force-pushed the pr/inline-lru-unlinks branch from 22e86ec to af0537d Compare June 6, 2026 08:48

jundot merged commit 09a0558 into jundot:main Jun 6, 2026
4 checks passed

jundot added a commit that referenced this pull request Jun 6, 2026

fix(ssd-cache): preserve capped eviction observability (#1451)

021162b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity#1451

fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity#1451
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/inline-lru-unlinks

cfbraun commented May 27, 2026

Uh oh!

jundot commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cfbraun commented May 27, 2026

Summary

Test plan

Uh oh!

jundot commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants