fix(engine_pool): skip the settle wait when other engines are serving #1785
Merged
jundot merged 1 commit on Jun 10, 2026
The settle barrier in _unload_engine() verifies that an unload actually released its Metal buffers by polling mx.get_active_memory() and comparing the delta against the model's estimated size. That gauge is process-global: while another engine allocates (prefill/KV growth), the delta no longer measures this unload, and under load it can read negative. The barrier then burns all 10 settle rounds plus 3 emergency-reclaim rounds, each running gc.collect() and mx.synchronize()/mx.clear_cache() on the MLX executor, serializing against live decode for ~8s, all while the ProcessMemoryEnforcer holds the pool lock across the eviction. Observed in the wild as a soft-pressure eviction reporting freed=-51GB while the process climbed past its ceiling.

Bail out of the settle wait when any other loaded entry is serving (active requests or a held in-use lease): log the indeterminate sample at INFO, skip the emergency reclaim (its sync/clear rounds would stall the very engines that made the measurement indeterminate), and account the unload as today. Idle-pool behavior is unchanged, preserving the original stuck-unload protection.

Adds test_settle_bails_out_under_concurrent_activity, test_settle_still_waits_when_pool_otherwise_idle, and an _other_entries_serving in-use-lease unit test.

Refs: #1774
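To make the arithmetic concrete, here is a minimal sketch of the barrier logic described above. It is not the omlx implementation: `settle_after_unload`, `rising_gauge`, `run`, and all the GB figures are hypothetical, the gauge is a fake that climbs on every read (as if a neighbouring engine were allocating), and the real code's sleeps between rounds are omitted. It also assumes the serving check happens right after the first sample.

```python
from typing import Callable, List, Tuple

SETTLE_ROUNDS = 10     # settle rounds named in the description
EMERGENCY_ROUNDS = 3   # emergency-reclaim rounds named in the description


def settle_after_unload(
    read_gauge: Callable[[], int],
    before_gb: int,
    expected_gb: int,
    tolerance_gb: int,
    others_serving: Callable[[], bool],
    reclaim: Callable[[], None],
) -> Tuple[int, bool]:
    """Hypothetical stand-in for the barrier; returns (freed_gb, settled)."""

    def sample() -> Tuple[int, bool]:
        freed = before_gb - read_gauge()
        return freed, freed >= expected_gb - tolerance_gb

    reclaim()  # the one initial release pass
    freed, ok = sample()
    if ok:
        return freed, True
    if others_serving():
        # Process-global gauge plus a live neighbour: the sample is
        # indeterminate, so stop here instead of retrying.
        return freed, False
    # Idle pool: keep the full wait (settle rounds, then emergency rounds).
    for _ in range(SETTLE_ROUNDS + EMERGENCY_ROUNDS):
        reclaim()
        freed, ok = sample()
        if ok:
            return freed, True
    return freed, False


def rising_gauge(first_gb: int, step_gb: int) -> Callable[[], int]:
    """Fake gauge that climbs on every read."""
    state = {"next": first_gb}

    def read() -> int:
        value = state["next"]
        state["next"] += step_gb
        return value

    return read


def run(busy: bool) -> Tuple[int, int]:
    cycles: List[int] = []
    freed, _settled = settle_after_unload(
        read_gauge=rising_gauge(first_gb=70, step_gb=1),
        before_gb=60,      # gauge before the unload
        expected_gb=20,    # model's estimated size
        tolerance_gb=2,
        others_serving=lambda: busy,
        reclaim=lambda: cycles.append(1),
    )
    return freed, len(cycles)


busy_freed, busy_cycles = run(busy=True)
idle_freed, idle_cycles = run(busy=False)
print(busy_freed, busy_cycles)  # -10 1
print(idle_freed, idle_cycles)  # -23 14
```

In both runs the "freed" figure is negative because the neighbour's growth outweighs the unload. The only difference is how many reclaim cycles are spent finding that out: 1 with the bail-out, 14 without it.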
Owner
Thanks for the follow-up. I checked that this only shortcuts the settle wait when another engine is actually serving; idle loaded engines still keep the existing full settle path. The local test run passed with 201 tests, so this looks good to me and I'm going to merge it.
khsd6327 added a commit to khsd6327/omlx that referenced this pull request on Jun 10, 2026
…laim routing

jundot#1785's two new TestMemorySettleBarrier tests asserted on omlx.engine_pool.mx.synchronize/clear_cache call counts, but this fork routes unload reclaim through _reclaim_mlx_cache -> scheduler._sync_and_clear_cache (Metal-panic safety, ebb0b4c) rather than bare engine_pool.mx calls, so those counters stayed 0. Count reclaim cycles on _reclaim_mlx_cache instead (intent preserved: 1 cycle on concurrent-activity bail-out, 14 on idle-pool timeout+emergency).
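The counting approach in that commit note can be illustrated with the standard library's unittest.mock. This is only a sketch: `FakePool` and its `unload` method are hypothetical stand-ins, and only the `_reclaim_mlx_cache` name and the expected counts (1 and 14) come from the note above.

```python
from unittest import mock


class FakePool:
    """Hypothetical pool whose unload path reclaims through one wrapper."""

    def _reclaim_mlx_cache(self) -> None:
        """Stand-in for the fork's reclaim wrapper; does nothing here."""

    def unload(self, others_serving: bool) -> None:
        self._reclaim_mlx_cache()          # initial release pass
        if others_serving:
            return                         # bail-out: no retry rounds
        for _ in range(10 + 3):            # settle rounds + emergency rounds
            self._reclaim_mlx_cache()


pool = FakePool()
# Patch the wrapper itself, so the count does not depend on which
# lower-level calls the wrapper happens to make.
with mock.patch.object(pool, "_reclaim_mlx_cache") as reclaim:
    pool.unload(others_serving=True)
    busy_cycles = reclaim.call_count
    reclaim.reset_mock()
    pool.unload(others_serving=False)
    idle_cycles = reclaim.call_count

print(busy_cycles, idle_cycles)  # 1 14
```

Counting at the wrapper keeps the assertion meaningful even when the wrapper's internals differ between the upstream repository and a fork.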
Addresses the freed=-51GB / eviction-stall portion of #1774 (it does not close the issue; it removes the eviction-time amplification under concurrent load). Complements the fixes already in 0.4.4.dev1, none of which touch the settle barrier itself.

Summary
- _unload_engine()'s settle barrier verifies an unload by polling mx.get_active_memory() and comparing the delta against the model's estimated size. That gauge is process-global: while another engine allocates (prefill/KV growth), the delta no longer measures this unload. Under concurrent load it reads negative, which is exactly the freed=-51GB in #1774's report.
- The barrier then burns all 10 settle rounds plus 3 emergency-reclaim rounds: gc.collect() + mx.synchronize() + mx.clear_cache() cycles (beyond the initial release pass) serialized onto the MLX executor against live decode, while the memory enforcer holds the pool lock across the eviction, so admission is blocked too. Under pressure the eviction itself becomes a stall-and-spike, consistent with the reporter's "routine soft-pressure eviction … spiked to 115.9 GB in ~17 s".
- The fix bails out of the settle wait when any other loaded entry is serving (active requests or a held in-use lease, the #1667 acquire→use race). One INFO line, no retry rounds, no emergency reclaim. With the pool otherwise idle the barrier is byte-for-byte unchanged, so the original stuck-unload protection (#768) is fully preserved.

Why this is safe
- Memory usage is read as max(mx.get_active_memory(), get_phys_footprint(), _current_model_memory) (the #1623 guard), so the live gauge dominates while anything is still resident. _unload_engine also ends with _wake_process_memory_enforcer(), so the enforcer re-polls immediately rather than waiting for the next tick. Skipping the emergency reclaim defers recovery to those two paths instead of losing it.
- If an entry's engine lacks has_active_requests() (AttributeError, matching _find_lru_victim's idiom), the entry simply doesn't count as serving and the barrier falls back to today's full wait.

Trade-off (stated)
Under sustained concurrent pressure the enforcer's eviction loop now paces faster (no ~8s barrier stall between victims), so it can evict an additional idle victim where the old code would have stalled first. The old path reached the same evictions after timing out, just slower, with admission blocked and the executor saturated by sync/clear rounds in the meantime. If a pacing pause is still wanted there, a small enforcer-side delay would be the right knob; happy to follow up. Same offer for a settle_indeterminate counter in the pool stats (the #1406 pattern) if INFO-level logging feels too quiet for this condition.

Changes
omlx/engine_pool.py only:

- _other_entries_serving(model_id): true when any other loaded entry has in_use > 0 or has_active_requests(). Iterates a snapshot of _entries since the admin unload routes call _unload_engine without the pool lock.
- All unload paths go through _unload_engine (enforcer soft/hard pressure, pre-load admission eviction, prefill-headroom eviction, TTL expiry, admin unload), so this single change covers all callers.

Tests
tests/test_engine_pool.py::TestMemorySettleBarrier:

- test_settle_bails_out_under_concurrent_activity: second entry serving + rising global gauge. The barrier exits after one sample (no 0.5s retries, no timeout warning, no emergency reclaim, exactly the one initial executor sync/clear cycle), and the unload still completes and is accounted.
- test_settle_still_waits_when_pool_otherwise_idle: same rising gauge with no other entry serving. Full barrier behavior is preserved (10 retries + emergency reclaim, 14 executor sync/clear cycles).
- test_other_entries_serving_in_use_lease_counts: the #1667 acquire-vs-use lease counts as serving.

Also verified the new tests fail when the fix is reverted: restoring main's engine_pool.py fails test_settle_bails_out_under_concurrent_activity (the barrier burns every round) and test_other_entries_serving_in_use_lease_counts.
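For readers who want a picture of the helper described under Changes, here is a rough sketch of how a predicate like _other_entries_serving could behave. It is not the omlx code: the `Entry` dataclass, the three engine classes, and the minimal `Pool` are hypothetical, and treating `engine is None` as "not loaded" is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Entry:
    engine: Optional[Any] = None  # None means "not loaded" in this sketch
    in_use: int = 0               # number of held leases


class IdleEngine:
    def has_active_requests(self) -> bool:
        return False


class BusyEngine:
    def has_active_requests(self) -> bool:
        return True


class BareEngine:
    """Engine type that does not define has_active_requests()."""


class Pool:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def _other_entries_serving(self, model_id: str) -> bool:
        # Iterate a snapshot so a concurrent mutation of _entries
        # cannot break the loop when no lock is held.
        for other_id, entry in list(self._entries.items()):
            if other_id == model_id or entry.engine is None:
                continue
            if entry.in_use > 0:          # held lease counts as serving
                return True
            try:
                if entry.engine.has_active_requests():
                    return True
            except AttributeError:
                continue                  # no such method: not serving
        return False


pool = Pool()
pool._entries["victim"] = Entry(engine=BusyEngine())
pool._entries["other"] = Entry(engine=IdleEngine())
idle_result = pool._other_entries_serving("victim")    # False

pool._entries["other"].in_use = 1
lease_result = pool._other_entries_serving("victim")   # True

pool._entries["other"] = Entry(engine=BusyEngine())
active_result = pool._other_entries_serving("victim")  # True

pool._entries["other"] = Entry(engine=BareEngine())
bare_result = pool._other_entries_serving("victim")    # False
```

The victim's own activity is deliberately ignored, and an engine without has_active_requests() falls through to "not serving", which matches the stated fallback to the full wait.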