Skip to content

fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity#1451

Merged
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/inline-lru-unlinks
Jun 6, 2026
Merged

fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity#1451
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/inline-lru-unlinks

Conversation

@cfbraun
Copy link
Copy Markdown
Contributor

@cfbraun cfbraun commented May 27, 2026

Summary

_enforce_size_limit_for_new_block enqueues evicted file unlinks as ("unlink", path) items onto _write_queue — the same bounded queue that carries pending writes. Combined with the pre-eviction _write_queue.full() short-circuit at the top of save_block, this creates a deadlock under sustained save pressure:

  1. Writer is saturated → _write_queue is full.
  2. save_block's pre-eviction check sees full() → returns False immediately, before calling _enforce_size_limit_for_new_block.
  3. Eviction never runs → cache stays at the size cap.
  4. Every subsequent save drops; ssd_write_drops climbs forever while _total_size sits pinned at the cap.

Fix: inline the unlinks on the eviction-calling thread instead. Eviction typically removes a single block per save (evict_until_size stops as soon as total_size <= target), so this is one syscall per save in steady state. The deferred-unlink justification ("avoid blocking the inference thread with N file delete syscalls") doesn't materialise under normal load, and inlining removes the bounded-queue contention entirely.

Bounded inline burst. The ENOSPC-recovery path invalidates the 30 s disk-usage cache, which can shrink _get_effective_max_size sharply — evict_until_size would then return hundreds of entries at once and the inline-unlink loop would stall the inference thread on a syscall storm. Cap the burst at _MAX_INLINE_UNLINKS_PER_SAVE = 32 and reinsert the deferred metadata into the index so subsequent saves drain the remainder.

Also adds an evict_unlink_failures stats counter (eviction now decrements the index before the on-disk unlink; if Path.unlink raises OSError, surfacing the counter lets operators see when on-disk size has drifted above what the index reports).

The dead writer-thread ("unlink", file_path) dispatch branch is removed since no path enqueues such tuples anymore.

Test plan

  • pytest tests/test_paged_ssd_cache.py::TestInlineLRUUnlinks — 4 passed
  • pytest tests/test_paged_ssd_cache.py tests/test_hot_cache.py — 125 passed (4 new + 121 existing)

@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch 2 times, most recently from a9792b3 to 6577237 Compare May 27, 2026 09:30
@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch 11 times, most recently from 0a8a1d0 to 1357c6a Compare June 3, 2026 11:08
@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch 2 times, most recently from 869206a to 22e86ec Compare June 5, 2026 16:13
``_enforce_size_limit_for_new_block`` enqueues evicted file unlinks as
``("unlink", path)`` items onto ``_write_queue`` — the same bounded
queue that carries pending writes. Combined with the pre-eviction
``_write_queue.full()`` short-circuit at the top of ``save_block``, this
creates a deadlock under sustained save pressure:

  1. Writer is saturated → ``_write_queue`` is full.
  2. ``save_block``'s pre-eviction check sees ``full()`` → returns False
     immediately, BEFORE calling ``_enforce_size_limit_for_new_block``.
  3. Eviction never runs → cache stays at the size cap.
  4. Every subsequent save drops; ``ssd_write_drops`` climbs forever
     while ``_total_size`` sits pinned at the cap.

Inline the unlinks on the eviction-calling thread instead. Eviction
typically removes a single block per save (``evict_until_size`` stops
as soon as ``total_size <= target``), so this is one syscall per save
in steady state. The deferred-unlink justification ("avoid blocking
the inference thread with N file delete syscalls") doesn't materialise
under normal load, and inlining removes the bounded-queue contention
entirely.

Bounded inline burst. The ENOSPC-recovery path invalidates the 30 s
disk-usage cache, which can shrink ``_get_effective_max_size``
sharply on the next save — ``evict_until_size`` would then return
hundreds of entries at once and the inline-unlink loop would stall
the inference thread on a syscall storm. Cap the burst at
``_MAX_INLINE_UNLINKS_PER_SAVE = 32`` and reinsert the deferred
metadata into the index so subsequent saves drain the remainder.
Bounds per-call latency at the cost of taking multiple saves to fully
reconverge.

``evict_unlink_failures`` stats counter. Eviction now decrements the
index before the on-disk unlink; if ``Path.unlink`` raises ``OSError``,
the previous "log a warning and move on" pattern silently lost the
signal. Surfacing the counter lets operators see that the on-disk
size has drifted above what the index reports.

Tests (tests/test_paged_ssd_cache.py::TestInlineLRUUnlinks):
- test_eviction_does_not_enqueue_unlink_tasks: sentinel-patch on
  ``put_nowait`` asserts no ``("unlink", ...)`` items ever enter the
  queue.
- test_eviction_frees_capacity_under_pressure: with the writer busy,
  eviction still keeps ``_index.total_size`` near the configured cap.
- test_inline_eviction_burst_is_capped: forced mass-eviction removes
  at most ``_MAX_INLINE_UNLINKS_PER_SAVE`` entries; the rest reinsert
  so subsequent saves can drain.
- test_unlink_failure_increments_counter: a patched ``OSError`` from
  ``Path.unlink`` increments ``evict_unlink_failures``.

The dead writer-thread ``("unlink", file_path)`` dispatch branch is
removed since no path enqueues such tuples anymore.

87 existing paged_ssd_cache tests + 34 hot_cache tests + 4 new tests
pass.
@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch from 22e86ec to af0537d Compare June 6, 2026 08:48
@jundot
Copy link
Copy Markdown
Owner

jundot commented Jun 6, 2026

Thanks for tracking this down. The root cause makes sense: eviction was sharing the bounded write queue with pending writes, so once the queue saturated, cache pressure could not drain cleanly.

This looks good to me, and I am going to merge it. I verified tests/test_paged_ssd_cache.py and tests/test_hot_cache.py locally, and I will fold two small follow-ups into main after merge: preserving LRU order by sorting _lru by last_access after capped inline eviction reinserts deferred entries, and threading evict_unlink_failures through PagedSSDCacheStats so it shows up in the runtime/admin stats path.

@jundot jundot merged commit 09a0558 into jundot:main Jun 6, 2026
4 checks passed
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 7, 2026
Resolves conflicts in favor of the merged upstream shape for:

- jundot#1628 (max_context_window_policy): upstream merged the reworked
  separate-nullable-field design, not the original "reuse
  max_context_window" design that still sat on this branch. Drop the
  1_000_000 default and adopt the 32768 fallback + optional policy cap
  jundot ultimately wanted.
- jundot#1451 (inline LRU unlinks): upstream's slice+reinsert variant of
  the burst cap landed (with observability counters from 021162b).
  Drop the local push-cap-into-evict_until_size variant.
- Test suites realigned to the merged shapes (TestInlineLRUUnlinks,
  policy-cap test names, 32768 fallback assertions).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants