Skip to content

Reduce EngineCore idle polling#552

Open
Thump604 wants to merge 1 commit into
waybarrios:mainfrom
Thump604:604/issue-508-adaptive-idle-polling
Open

Reduce EngineCore idle polling#552
Thump604 wants to merge 1 commit into
waybarrios:mainfrom
Thump604:604/issue-508-adaptive-idle-polling

Conversation

@Thump604

Copy link
Copy Markdown
Collaborator

Summary

Implements the light-tier idle polling part of #508.

  • Adds EngineConfig.idle_step_interval with a 100ms default for the empty-scheduler path.
  • Adds a request event that add_request() sets so the idle loop wakes immediately when new work arrives.
  • Keeps the active generation path unchanged: scheduler steps still run as before when requests are present.

Observed behavior

The current empty-scheduler branch sleeps on the active step_interval value, defaulting to 1ms. That keeps the serve loop waking roughly 1000 times per second even when there is no queued work.

Expected behavior

When the scheduler is empty, the loop should wait on a lower-frequency idle interval while still waking promptly on a new request.

Minimal patch shape

This is intentionally limited to EngineCore idle-loop behavior and regression tests. It does not add a CLI flag or new public API surface beyond the EngineConfig field.

Validation

  • uv run --python 3.12 --extra dev pytest tests/test_engine_core_idle_polling.py tests/test_engine_core_thread_streams.py tests/test_batching.py::TestEngineAsync::test_engine_lifecycle tests/test_batching.py::TestEngineAsync::test_engine_context_manager tests/test_batching.py::TestEngineAsync::test_stream_outputs_consumer_break_after_finished_does_not_abort -q -> 7 passed
  • uv run --python 3.12 --extra dev ruff check vllm_mlx/engine_core.py tests/test_engine_core_idle_polling.py --select E,F,W --ignore E402,E501,E731,F811,F841 -> passed
  • uv run --python 3.12 --extra dev black --check --target-version py312 vllm_mlx/engine_core.py tests/test_engine_core_idle_polling.py -> passed
  • python3 /opt/ai-runtime/bin/lint-upstream-claims --root /opt/ai-runtime/worktrees/vllm-mlx/issue-508-adaptive-idle-polling vllm_mlx/engine_core.py tests/test_engine_core_idle_polling.py -> passed
  • git diff --check -> passed

Not claimed

  • Does not implement lazy load, idle unload, /health state transitions, or memory release semantics.
  • Does not change model loading, decoding, scheduling policy, streaming semantics, or request output behavior.
  • Does not claim a measured CPU reduction; it only changes the empty-scheduler wait behavior and proves request wakeup semantics in unit tests.

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

CI all green. Clean diff (+108/-3), narrowly scoped to idle polling behavior.

What it does

Replaces the tight 1ms asyncio.sleep(step_interval) idle loop with a 100ms asyncio.Event.wait(timeout) pattern. When the scheduler has no requests, the engine loop waits up to 100ms but wakes immediately when add_request() signals new work via the event.

Assessment

Correctness: The _wait_for_idle_or_request helper correctly handles all edge cases:

  • request_event is None → falls back to asyncio.sleep (backward compat for engines constructed without event)
  • timeout <= 0 → yields with asyncio.sleep(0)
  • Normal path → asyncio.wait_for(event.wait(), timeout) with TimeoutError suppression
  • Event is cleared in finally block to avoid stale signals

Latency impact: Zero impact on active generation — the idle interval only applies when scheduler.has_requests() returns False. New requests wake the loop immediately via _set_request_event, so request latency is bounded by event loop scheduling, not the 100ms timeout.

Defensive getattr: The code uses getattr(self, "_request_event", None) at all call sites. This handles the case where an EngineCore was constructed before this change (e.g., deserialized or subclassed without __init__). Slightly defensive but not wrong.

Resource savings: Reduces idle wakeups from ~1000/s to ~10/s with instant wake on new work. Meaningful for battery/thermal on Apple Silicon when the server is idle.

Tests: Two focused tests — one verifying the idle interval is used (not the active 1ms), one verifying add_request sets the event for immediate wake.

Minor notes (non-blocking)

  1. The step_interval variable is no longer read in _engine_loop after this change (the active path uses asyncio.sleep(0), not asyncio.sleep(step_interval)). The step_interval field on EngineConfig is now effectively unused. Consider deprecating or documenting this in a follow-up.

  2. The _clear_request_event call at the top of the has_requests() branch is correct — it prevents a stale event from being set during processing, ensuring the next idle wait actually blocks.

LGTM.

@waybarrios

Copy link
Copy Markdown
Owner

The event logic looks right, I went hunting for a lost-wakeup race and didn't find one (asyncio.Event.wait() returns immediately if the event is already set, so the idle-to-active transition is safe). Two things before this can merge though.

First, the tests never exercise the new event-driven path. This test forces the fallback:

# test_engine_loop_uses_idle_interval_when_scheduler_is_empty
engine._request_event = None   # forces the plain-sleep fallback,
                               # so _wait_for_idle_or_request's event wait is never covered

and test_add_request_wakes_idle_engine_loop only checks engine._request_event.is_set() without the loop running, so nothing verifies the loop actually wakes early. Something like this would cover the real behavior:

async def test_idle_loop_wakes_on_add_request():
    engine = EngineCore(..., idle_step_interval=5.0)  # long idle sleep on purpose
    await engine.start()
    await asyncio.sleep(0.05)            # loop is now parked in the idle wait
    t0 = time.monotonic()
    await engine.add_request(request)    # should wake it immediately
    _ = await get_first_output(engine, request_id)
    assert time.monotonic() - t0 < 1.0   # would be ~5s if the wakeup were lost

Second, this changes _engine_loop, which is hot path, so it needs measured numbers per the repo's PR rules: TTFT and decode tok/s for main vs this branch on the same hardware (batch=1 and batch=8), plus the idle CPU% delta since that's the headline claim. The PR body currently opts out of measurement, which doesn't fly for this file.

One minor note while you're in there: abort_request doesn't set the event, so an abort that empties the scheduler leaves the loop sleeping for up to 100ms (it used to notice in about 1ms). Probably harmless, but worth a code comment so nobody trips on it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants