fix(engine): run generation synchronously to preserve MLX stream context#593
fix(engine): run generation synchronously to preserve MLX stream context#593Marsen-22 wants to merge 1 commit into
Conversation
The asyncio.to_thread dispatch in _run_blocking_serialized was sending generation to a worker thread with no MLX stream, causing: RuntimeError: There is no Stream(gpu, 1) in current thread at mlx_lm/generate.py:442 (mx.eval of prompt_cache state). Run generation on the main thread instead, where the default stream is already bound. The _generation_lock still serializes concurrent requests, so we only lose the 'don't block the event loop' property, which is acceptable for a single-user local server. Verified on M4 Max with vllm-mlx 0.4.0rc1 + mlx 0.29.0+. Cross-node LAN inference (WSL -> Mac:8010) returns PONG cleanly with HTTP 200.
Thump604
left a comment
There was a problem hiding this comment.
Thanks for writing this up clearly. I agree with the diagnosed failure shape:
MLX stream/thread ownership can break when generation crosses to the wrong
worker thread.
I do not think this patch is safe to merge as-is, because it fixes that symptom
by running the blocking generation function synchronously on the event-loop
thread:
async with self._generation_lock:
try:
_bind_worker_generation_streams()
return func(*args, **kwargs)That has two concrete regressions relative to the current SimpleEngine contract:
-
Long generations will block the asyncio event loop. While
func()is
running, health checks, other request admission, disconnect handling, and
cancellation cannot be serviced normally. This matters even on a local server
because the same event loop owns the HTTP control surface. -
The
on_cancelpath no longer has the same behavior. In the existing code,
the blocking work is shielded in a worker task and cancellation waits for the
worker to finish before releasing the generation lock. With the proposed
synchronous call, cancellation cannot be observed until control returns to the
event loop, so client disconnects during a long decode will not trigger the
documented cancellation behavior at the point it is needed.
This is also the same design issue Jan called out in the first review of #543.
That PR landed on the safer shape: keep MLX work on one stable owner executor
thread without putting decode on the event loop, and keep the existing worker
queue semantics for cancellation, prompt-cache/system-KV behavior, processor
retirement, and feature parity.
Recommended direction:
- Rework this around the #543 owner-thread / single-executor pattern instead of
synchronous event-loop decode. - Add a regression test proving the event loop remains responsive while
generation is in flight. - Add or preserve cancellation/disconnect coverage for
_run_blocking_serialized.
So I’m marking this changes requested. The bug report is useful, but the patch
shape would trade the stream error for event-loop starvation and weaker
cancellation semantics.
|
Closing this one. The synchronous approach was reviewed and rejected because it blocks the event loop during generation, which stalls health checks and breaks abort handling. The same root cause is addressed by #543 with a dedicated executor that preserves owner-thread affinity. Thanks for digging into it. |
Problem
/v1/chat/completionsreturns HTTP 500 withRuntimeError: There is no Stream(gpu, 1) in current threadraised frommlx_lm/generate.py:442(mx.eval([c.state for c in prompt_cache])).Reproducible on:
Root cause
engine/simple.py:SimpleEngine._run_blocking_serializeddispatchesfuncto a worker thread viaasyncio.to_thread(run_bound). The MLX default stream is bound to the main thread; the worker thread has no stream, so anymx.eval/mx.async_evalcall insidefuncfails because the model arrays carry the main thread's stream context but are being accessed from a stream-less thread.The helper
_bind_worker_generation_streams()(called insiderun_bound) attempts to fix this by binding a stream to the worker thread, but it runs after the model arrays were already created on the main thread — the stream context is captured at array-creation time, not at eval time. So the binding is a no-op for this case.Traceback
Fix
Remove the
asyncio.to_threaddispatch. Runfuncsynchronously on the main thread (where the MLX default stream is already bound). The function-levelasync with self._generation_lockstill serializes concurrent requests, so we don't lose correctness — we only lose the "don't block the event loop" property, which is acceptable for a single-user local server.The
on_cancelcallback contract (documented in the function's docstring: "Cancellation must not release the async lock before the worker thread finishes, or a follow-up request can enter MLX/Metal concurrently and corrupt the command-buffer state") is preserved with a cleantry/except asyncio.CancelledError: ... raise:block.Diff
Net: -10 lines, +1 line, syntax-clean (
ast.parseverified).Trade-offs
/v1/chat/completionsworks_generation_lock)_generation_lock)on_cancelcallback fired on cancelawait task)exceptblock)For a multi-tenant deployment, the right long-term fix is to bind a real MLX stream in the worker thread (e.g.,
mx.new_stream(mx.default_device())+mx.set_streamin atry/finally). This PR takes the simpler synchronous path which is correct for the current single-user local server use case.Verification
Verified on P-01 (M4 Max, macOS) with vllm-mlx 0.4.0rc1 + mlx 0.29.0+ + mlx-lm 0.31.0+. Cross-node test from WSL (P-04) to P-01 (
192.168.50.101:8010) over LAN also returns PONG cleanly.Test plan
mlx-community/Llama-3.2-1B-Instruct-4bit).choices[0].message.contentis non-empty._generation_lockstill serializes correctly).Checklist
ast.parseRelated work (deferred, not in this PR)
A deeper fix would land in
mlx_lm/generate.py:442to bind the MLX stream insidegenerate_stepso the bug doesn't surface in any caller (not just vllm-mlx). That's a separate PR againstml-explore/mlx-examplesorml-explore/mlx-lm.bunta@5800X:~$