fix(engine_pool): skip the settle wait when other engines are serving by JimStenstrom · Pull Request #1785 · jundot/omlx

JimStenstrom · 2026-06-10T05:01:29Z

Addresses the freed=-51GB / eviction-stall portion of #1774 (it does not close the issue — it removes the eviction-time amplification under concurrent load). Complements the fixes already in 0.4.4.dev1, none of which touch the settle barrier itself.

Summary

_unload_engine()'s settle barrier verifies an unload by polling mx.get_active_memory() and comparing the delta against the model's estimated size. That gauge is process-global: while another engine allocates (prefill/KV growth), the delta no longer measures this unload — under concurrent load it reads negative, which is exactly the freed=-51GB in Memory enforcer fails under concurrent load in 0.4.3 — process grows to 116 GB on a ~27 GB model pair, eviction reports freed=-51GB and breaches the configured ceiling → host OOM #1774's report.
The barrier then burns all 10 settle rounds plus 3 emergency-reclaim rounds — ~8s of sleeps and 13 gc.collect() + mx.synchronize() + mx.clear_cache() cycles (beyond the initial release pass) serialized onto the MLX executor against live decode — while the memory enforcer holds the pool lock across the eviction, so admission is blocked too. Under pressure the eviction itself becomes a stall-and-spike, consistent with the reporter's "routine soft-pressure eviction … spiked to 115.9 GB in ~17 s".
Fix: bail out of the settle wait when any other loaded entry is serving (active requests, or a held in-use lease from Engine pool evicts an acquired-but-not-yet-active engine → RuntimeError: Engine not started (acquire→use race; sibling of #1595) #1667). One INFO line, no retry rounds, no emergency reclaim. With the pool otherwise idle the barrier is byte-for-byte unchanged, so the original stuck-unload protection (0.3.5-rc1: engine_pool settle_tolerance fixed at 2 GB is too tight for large (>40 GB) models #768) is fully preserved.

Why this is safe

Accounting is unchanged: the bail-out path accounts the unload exactly as the existing timed-out path already does.
Unreleased memory stays visible: pre-load admission reads max(mx.get_active_memory(), get_phys_footprint(), _current_model_memory) (the v0.4.0 regression: Pre-load eviction fails when loading a second large model #1623 guard), so the live gauge dominates while anything is still resident — and _unload_engine ends with _wake_process_memory_enforcer(), so the enforcer re-polls immediately rather than waiting for the next tick. Skipping the emergency reclaim defers recovery to those two paths instead of losing it.
Conservative failure direction: if an engine doesn't expose has_active_requests() (AttributeError, matching _find_lru_victim's idiom), the entry simply doesn't count as serving and the barrier falls back to today's full wait.
Policy: this continues the direction of the 0.4.3 enforcer change and the fix(dflash): enforce prefill memory guard on DFlash primary path #1770 follow-up — control paths shouldn't fight the process-global MLX gauge while other work is in flight.

Trade-off (stated)

Under sustained concurrent pressure the enforcer's eviction loop now paces faster (no ~8s barrier stall between victims), so it can evict an additional idle victim where the old code would have stalled first. The old path reached the same evictions after timing out — just slower, with admission blocked and the executor saturated by sync/clear rounds in the meantime. If a pacing pause is still wanted there, a small enforcer-side delay would be the right knob; happy to follow up. Same offer for a settle_indeterminate counter in the pool stats (the #1406 pattern) if INFO-level logging feels too quiet for this condition.

Changes

omlx/engine_pool.py only:

New _other_entries_serving(model_id) — true when any other loaded entry has in_use > 0 or has_active_requests(). Iterates a snapshot of _entries since the admin unload routes call _unload_engine without the pool lock.
Settle loop: when a round fails the freed check and another entry is serving, log the indeterminate sample at INFO and break; the emergency-reclaim rounds are skipped on that path (the in-code comment documents the recovery story above).
Every eviction flavor funnels through _unload_engine (enforcer soft/hard pressure, pre-load admission eviction, prefill-headroom eviction, TTL expiry, admin unload), so this single change covers all callers.

Tests

tests/test_engine_pool.py::TestMemorySettleBarrier:

test_settle_bails_out_under_concurrent_activity — second entry serving + rising global gauge: the barrier exits after one sample (no 0.5s retries, no timeout warning, no emergency reclaim, exactly the one initial executor sync/clear cycle), and the unload still completes and is accounted.
test_settle_still_waits_when_pool_otherwise_idle — same rising gauge with no other entry serving: full barrier behavior preserved (10 retries + emergency reclaim, 14 executor sync/clear cycles).
test_other_entries_serving_in_use_lease_counts — the Engine pool evicts an acquired-but-not-yet-active engine → RuntimeError: Engine not started (acquire→use race; sibling of #1595) #1667 acquire-vs-use lease counts as serving.

.venv/bin/python -m pytest tests/test_engine_pool.py tests/test_process_memory_enforcer.py -q

201 passed

Also verified the new tests fail when the fix is reverted: restoring main's engine_pool.py fails test_settle_bails_out_under_concurrent_activity (the barrier burns every round) and test_other_entries_serving_in_use_lease_counts.

The settle barrier in _unload_engine() verifies an unload actually released its Metal buffers by polling mx.get_active_memory() and comparing the delta against the model's estimated size. That gauge is process-global: while another engine allocates (prefill/KV growth) the delta no longer measures this unload — under load it can read negative. The barrier then burns all 10 settle rounds plus 3 emergency-reclaim rounds, each running gc.collect() and mx.synchronize()/mx.clear_cache() on the MLX executor, serializing against live decode for ~8s — while the ProcessMemoryEnforcer holds the pool lock across the eviction. Observed in the wild as a soft-pressure eviction reporting freed=-51GB while the process climbed past its ceiling. Bail out of the settle wait when any other loaded entry is serving (active requests or a held in-use lease): log the indeterminate sample at INFO, skip the emergency reclaim (its sync/clear rounds would stall the very engines that made the measurement indeterminate), and account the unload as today. Idle-pool behavior is unchanged, preserving the original stuck-unload protection. Adds test_settle_bails_out_under_concurrent_activity, test_settle_still_waits_when_pool_otherwise_idle, and an _other_entries_serving in-use-lease unit test. Refs: 1774

jundot · 2026-06-10T05:47:41Z

Thanks for the follow-up. I checked that this only shortcuts the settle wait when another engine is actually serving; idle loaded engines still keep the existing full settle path. The local test run passed with 201 tests, so this looks good to me and I'm going to merge it.

…laim routing jundot#1785's two new TestMemorySettleBarrier tests asserted on omlx.engine_pool.mx.synchronize/clear_cache call counts, but this fork routes unload reclaim through _reclaim_mlx_cache -> scheduler. _sync_and_clear_cache (Metal-panic safety, ebb0b4c) rather than bare engine_pool.mx calls — so those counters stayed 0. Count reclaim cycles on _reclaim_mlx_cache instead (intent preserved: 1 cycle on concurrent-activity bail-out, 14 on idle-pool timeout+emergency).

jundot merged commit c15f19b into jundot:main Jun 10, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine_pool): skip the settle wait when other engines are serving#1785

fix(engine_pool): skip the settle wait when other engines are serving#1785
jundot merged 1 commit into
jundot:mainfrom
JimStenstrom:fix/1774-settle-barrier-concurrency

JimStenstrom commented Jun 10, 2026

Uh oh!

jundot commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JimStenstrom commented Jun 10, 2026

Summary

Why this is safe

Trade-off (stated)

Changes

Tests

Uh oh!

jundot commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants