Skip to content

fix(engine_pool): skip the settle wait when other engines are serving#1785

Merged
jundot merged 1 commit into
jundot:mainfrom
JimStenstrom:fix/1774-settle-barrier-concurrency
Jun 10, 2026
Merged

fix(engine_pool): skip the settle wait when other engines are serving#1785
jundot merged 1 commit into
jundot:mainfrom
JimStenstrom:fix/1774-settle-barrier-concurrency

Conversation

@JimStenstrom

Copy link
Copy Markdown
Contributor

Addresses the freed=-51GB / eviction-stall portion of #1774 (it does not close the issue — it removes the eviction-time amplification under concurrent load). Complements the fixes already in 0.4.4.dev1, none of which touch the settle barrier itself.

Summary

Why this is safe

  • Accounting is unchanged: the bail-out path accounts the unload exactly as the existing timed-out path already does.
  • Unreleased memory stays visible: pre-load admission reads max(mx.get_active_memory(), get_phys_footprint(), _current_model_memory) (the v0.4.0 regression: Pre-load eviction fails when loading a second large model #1623 guard), so the live gauge dominates while anything is still resident — and _unload_engine ends with _wake_process_memory_enforcer(), so the enforcer re-polls immediately rather than waiting for the next tick. Skipping the emergency reclaim defers recovery to those two paths instead of losing it.
  • Conservative failure direction: if an engine doesn't expose has_active_requests() (AttributeError, matching _find_lru_victim's idiom), the entry simply doesn't count as serving and the barrier falls back to today's full wait.
  • Policy: this continues the direction of the 0.4.3 enforcer change and the fix(dflash): enforce prefill memory guard on DFlash primary path #1770 follow-up — control paths shouldn't fight the process-global MLX gauge while other work is in flight.

Trade-off (stated)

Under sustained concurrent pressure the enforcer's eviction loop now paces faster (no ~8s barrier stall between victims), so it can evict an additional idle victim where the old code would have stalled first. The old path reached the same evictions after timing out — just slower, with admission blocked and the executor saturated by sync/clear rounds in the meantime. If a pacing pause is still wanted there, a small enforcer-side delay would be the right knob; happy to follow up. Same offer for a settle_indeterminate counter in the pool stats (the #1406 pattern) if INFO-level logging feels too quiet for this condition.

Changes

omlx/engine_pool.py only:

  • New _other_entries_serving(model_id) — true when any other loaded entry has in_use > 0 or has_active_requests(). Iterates a snapshot of _entries since the admin unload routes call _unload_engine without the pool lock.
  • Settle loop: when a round fails the freed check and another entry is serving, log the indeterminate sample at INFO and break; the emergency-reclaim rounds are skipped on that path (the in-code comment documents the recovery story above).
  • Every eviction flavor funnels through _unload_engine (enforcer soft/hard pressure, pre-load admission eviction, prefill-headroom eviction, TTL expiry, admin unload), so this single change covers all callers.

Tests

tests/test_engine_pool.py::TestMemorySettleBarrier:

  • test_settle_bails_out_under_concurrent_activity — second entry serving + rising global gauge: the barrier exits after one sample (no 0.5s retries, no timeout warning, no emergency reclaim, exactly the one initial executor sync/clear cycle), and the unload still completes and is accounted.
  • test_settle_still_waits_when_pool_otherwise_idle — same rising gauge with no other entry serving: full barrier behavior preserved (10 retries + emergency reclaim, 14 executor sync/clear cycles).
  • test_other_entries_serving_in_use_lease_counts — the Engine pool evicts an acquired-but-not-yet-active engine → RuntimeError: Engine not started (acquire→use race; sibling of #1595) #1667 acquire-vs-use lease counts as serving.
.venv/bin/python -m pytest tests/test_engine_pool.py tests/test_process_memory_enforcer.py -q
201 passed

Also verified the new tests fail when the fix is reverted: restoring main's engine_pool.py fails test_settle_bails_out_under_concurrent_activity (the barrier burns every round) and test_other_entries_serving_in_use_lease_counts.

The settle barrier in _unload_engine() verifies an unload actually released
its Metal buffers by polling mx.get_active_memory() and comparing the delta
against the model's estimated size. That gauge is process-global: while
another engine allocates (prefill/KV growth) the delta no longer measures
this unload — under load it can read negative. The barrier then burns all
10 settle rounds plus 3 emergency-reclaim rounds, each running gc.collect()
and mx.synchronize()/mx.clear_cache() on the MLX executor, serializing
against live decode for ~8s — while the ProcessMemoryEnforcer holds the
pool lock across the eviction. Observed in the wild as a soft-pressure
eviction reporting freed=-51GB while the process climbed past its ceiling.

Bail out of the settle wait when any other loaded entry is serving
(active requests or a held in-use lease): log the indeterminate sample at
INFO, skip the emergency reclaim (its sync/clear rounds would stall the
very engines that made the measurement indeterminate), and account the
unload as today. Idle-pool behavior is unchanged, preserving the original
stuck-unload protection.

Adds test_settle_bails_out_under_concurrent_activity,
test_settle_still_waits_when_pool_otherwise_idle, and an
_other_entries_serving in-use-lease unit test.

Refs: 1774
@jundot

jundot commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Thanks for the follow-up. I checked that this only shortcuts the settle wait when another engine is actually serving; idle loaded engines still keep the existing full settle path. The local test run passed with 201 tests, so this looks good to me and I'm going to merge it.

@jundot jundot merged commit c15f19b into jundot:main Jun 10, 2026
4 checks passed
khsd6327 added a commit to khsd6327/omlx that referenced this pull request Jun 10, 2026
…laim routing

jundot#1785's two new TestMemorySettleBarrier tests asserted on
omlx.engine_pool.mx.synchronize/clear_cache call counts, but this fork
routes unload reclaim through _reclaim_mlx_cache -> scheduler.
_sync_and_clear_cache (Metal-panic safety, ebb0b4c) rather than bare
engine_pool.mx calls — so those counters stayed 0. Count reclaim
cycles on _reclaim_mlx_cache instead (intent preserved: 1 cycle on
concurrent-activity bail-out, 14 on idle-pool timeout+emergency).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants