fix(engine): run generation synchronously to preserve MLX stream context#593

Closed
Marsen-22 wants to merge 1 commit into waybarrios:main from Marsen-22:fix/async-to-thread-stream-context

@Marsen-22

Problem

/v1/chat/completions returns HTTP 500 with RuntimeError: There is no Stream(gpu, 1) in current thread raised from mlx_lm/generate.py:442 (mx.eval([c.state for c in prompt_cache])).

Reproducible on:

  • vllm-mlx 0.3.0 (PyPI)
  • vllm-mlx 0.4.0rc1 (PyPI)
  • waybarrios/vllm-mlx@main at time of writing

Root cause

engine/simple.py:SimpleEngine._run_blocking_serialized dispatches func to a worker thread via asyncio.to_thread(run_bound). The MLX default stream is bound to the main thread, and the worker thread has none. Any mx.eval / mx.async_eval call inside func therefore fails: the model arrays carry the main thread's stream context but are evaluated from a thread that has no stream.

The helper _bind_worker_generation_streams() (called inside run_bound) attempts to fix this by binding a stream to the worker thread, but it runs after the model arrays were already created on the main thread — the stream context is captured at array-creation time, not at eval time. So the binding is a no-op for this case.
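The thread hop itself can be reproduced without MLX. The following is a minimal sketch that uses threading.local as a stand-in for per-thread stream state; bind_stream and evaluate are hypothetical names, and the sketch models only the "worker thread has no stream" half of the diagnosis, not how MLX tracks streams internally.

```python
import asyncio
import threading

# Stand-in for per-thread stream state (an assumption for illustration;
# this is not how MLX is implemented internally).
_state = threading.local()


def bind_stream(name: str) -> None:
    """Record a 'stream' for the calling thread only."""
    _state.stream = name


def evaluate() -> str:
    """Fail, as in the traceback, when the calling thread has no stream."""
    stream = getattr(_state, "stream", None)
    if stream is None:
        raise RuntimeError("no stream in current thread")
    return f"evaluated on {stream}"


async def demo() -> tuple[str, str]:
    bind_stream("main-stream")      # bound on the event-loop thread
    on_loop = evaluate()            # same thread, so this succeeds
    try:
        # asyncio.to_thread runs evaluate() on a different thread,
        # which never saw bind_stream().
        in_worker = await asyncio.to_thread(evaluate)
    except RuntimeError as exc:
        in_worker = f"failed: {exc}"
    return on_loop, in_worker


RESULT = asyncio.run(demo())
print(RESULT)
```

The first element reports a successful evaluation on the event-loop thread; the second reports the failure raised inside the worker thread.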

Traceback

File ".../vllm_mlx/engine/simple.py:419 run_bound"
File ".../vllm_mlx/models/llm.py:387 chat"
File ".../vllm_mlx/models/llm.py:199 generate"
File ".../mlx_lm/generate.py:779 generate"
File ".../mlx_lm/generate.py:716 stream_generate"
File ".../mlx_lm/generate.py:705 <genexpr>"
File ".../mlx_lm/generate.py:442 generate_step"
  mx.eval([c.state for c in prompt_cache])
RuntimeError: There is no Stream(gpu, 1) in current thread.

Fix

Remove the asyncio.to_thread dispatch. Run func synchronously on the main thread (where the MLX default stream is already bound). The function-level async with self._generation_lock still serializes concurrent requests, so we don't lose correctness — we only lose the "don't block the event loop" property, which is acceptable for a single-user local server.

The on_cancel callback contract (documented in the function's docstring: "Cancellation must not release the async lock before the worker thread finishes, or a follow-up request can enter MLX/Metal concurrently and corrupt the command-buffer state") is preserved by a plain try / except asyncio.CancelledError block that invokes on_cancel and then re-raises.

Diff

@@ -493,14 +493,9 @@
         corrupt the command-buffer state.
         """
         async with self._generation_lock:
-
-            def run_bound():
+            try:
                 _bind_worker_generation_streams()
                 return func(*args, **kwargs)
-
-            task = asyncio.create_task(asyncio.to_thread(run_bound))
-            try:
-                return await asyncio.shield(task)
             except asyncio.CancelledError:
                 if on_cancel is not None:
                     try:
@@ -510,10 +505,6 @@
                             "Blocking worker cancellation callback failed",
                             exc_info=True,
                         )
-                try:
-                    await task
-                except BaseException:
-                    pass
                 raise

Net: -10 lines, +1 line, syntax-clean (ast.parse verified).

Trade-offs

| | Before | After |
|---|---|---|
| /v1/chat/completions works | ❌ 500 (stream error) | ✅ 200 (PONG verified) |
| Event loop blocks during inference | No (dispatched to worker) | Yes (runs on main thread) |
| Concurrent request serialization | Yes (via _generation_lock) | Yes (via _generation_lock) |
| on_cancel callback fired on cancel | Yes (after await task) | Yes (in except block) |
| Throughput (single user) | 0 (broken) | 5.9 tok/s (Llama-3.2-3B, M4 Max) |

For a multi-tenant deployment, the right long-term fix is to bind a real MLX stream in the worker thread (e.g., mx.new_stream(mx.default_device()) + mx.set_stream in a try/finally). This PR takes the simpler synchronous path which is correct for the current single-user local server use case.

Verification

# After applying this patch, restart the server and run:
curl -sS -X POST http://127.0.0.1:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","messages":[{"role":"user","content":"Say PONG and nothing else."}],"max_tokens":8,"temperature":0}'

# Expected response:
# {"id":"chatcmpl-...","object":"chat.completion",...,"choices":[{"message":{"role":"assistant","content":"PONG"},"finish_reason":"stop"}],...}

Verified on P-01 (M4 Max, macOS) with vllm-mlx 0.4.0rc1 + mlx 0.29.0+ + mlx-lm 0.31.0+. Cross-node test from WSL (P-04) to P-01 (192.168.50.101:8010) over LAN also returns PONG cleanly.

Test plan

  1. Start vllm-mlx server with a small model (e.g., mlx-community/Llama-3.2-1B-Instruct-4bit).
  2. POST a simple chat completion request.
  3. Assert HTTP 200 and choices[0].message.content is non-empty.
  4. Assert the request completes in <10s for max_tokens=8 on M-series hardware.
  5. POST a second concurrent request and assert both return 200 (proves _generation_lock still serializes correctly).
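Steps 2 to 4 could be scripted roughly as follows. This is a sketch using only the standard library; BASE_URL reuses the host and port from the Verification section, and build_payload, post_chat, and content_is_non_empty are hypothetical helper names, not part of vllm-mlx.

```python
import json
import urllib.request

# Assumed values: host and port from the Verification section,
# model name from step 1 of the test plan.
BASE_URL = "http://127.0.0.1:8010"
MODEL = "mlx-community/Llama-3.2-1B-Instruct-4bit"


def build_payload(prompt: str, max_tokens: int = 8) -> dict:
    """Build the JSON body for a chat completion request."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def post_chat(payload: dict, timeout: float = 10.0) -> tuple[int, dict]:
    """POST the payload; urlopen raises HTTPError on a 4xx/5xx response.

    The timeout is a socket timeout, so it only approximates the
    10-second budget in step 4.
    """
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read())


def content_is_non_empty(body: dict) -> bool:
    """Check that choices[0].message.content exists and is not blank."""
    choices = body.get("choices") or []
    if not choices:
        return False
    content = choices[0].get("message", {}).get("content")
    return bool(content and content.strip())
```

With a server running, `status, body = post_chat(build_payload("Say PONG and nothing else."))` followed by `assert status == 200 and content_is_non_empty(body)` would cover steps 2 and 3.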

Checklist

  • Bug reproduced and root cause identified
  • Fix implemented and tested locally
  • Cross-node (LAN) inference verified
  • Backward-compatible (public API unchanged)
  • Syntax-validated with ast.parse
  • Tests added upstream (TBD with maintainer)

Related work (deferred, not in this PR)

A deeper fix would land in mlx_lm/generate.py:442 to bind the MLX stream inside generate_step so the bug doesn't surface in any caller (not just vllm-mlx). That's a separate PR against ml-explore/mlx-examples or ml-explore/mlx-lm.

The asyncio.to_thread dispatch in _run_blocking_serialized was sending
generation to a worker thread with no MLX stream, causing:

  RuntimeError: There is no Stream(gpu, 1) in current thread

at mlx_lm/generate.py:442 (mx.eval of prompt_cache state). Run
generation on the main thread instead, where the default stream is
already bound. The _generation_lock still serializes concurrent
requests, so we only lose the 'don't block the event loop' property,
which is acceptable for a single-user local server.

Verified on M4 Max with vllm-mlx 0.4.0rc1 + mlx 0.29.0+. Cross-node
LAN inference (WSL -> Mac:8010) returns PONG cleanly with HTTP 200.

@Thump604 Thump604 left a comment

Collaborator

Thanks for writing this up clearly. I agree with the diagnosed failure shape:
MLX stream/thread ownership can break when generation crosses to the wrong
worker thread.

I do not think this patch is safe to merge as-is, because it fixes that symptom
by running the blocking generation function synchronously on the event-loop
thread:

async with self._generation_lock:
    try:
        _bind_worker_generation_streams()
        return func(*args, **kwargs)

That has two concrete regressions relative to the current SimpleEngine contract:

  1. Long generations will block the asyncio event loop. While func() is
    running, health checks, other request admission, disconnect handling, and
    cancellation cannot be serviced normally. This matters even on a local server
    because the same event loop owns the HTTP control surface.

  2. The on_cancel path no longer has the same behavior. In the existing code,
    the blocking work is shielded in a worker task and cancellation waits for the
    worker to finish before releasing the generation lock. With the proposed
    synchronous call, cancellation cannot be observed until control returns to the
    event loop, so client disconnects during a long decode will not trigger the
    documented cancellation behavior at the point it is needed.
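
The shielding behavior in point 2 can be illustrated with the standard library alone. This sketch mirrors the shape of the lines removed in the diff (create_task, asyncio.shield, then await task on cancellation); blocking_work and run_serialized are hypothetical stand-ins, not the SimpleEngine code.

```python
import asyncio
import time

events: list[str] = []


def blocking_work() -> str:
    """Stand-in for a blocking decode running on a worker thread."""
    time.sleep(0.3)
    events.append("worker finished")
    return "done"


async def run_serialized(lock: asyncio.Lock) -> str:
    async with lock:
        task = asyncio.create_task(asyncio.to_thread(blocking_work))
        try:
            # shield() lets the caller be cancelled without cancelling task.
            return await asyncio.shield(task)
        except asyncio.CancelledError:
            events.append("cancel observed")
            try:
                await task  # hold the lock until the worker is done
            except BaseException:
                pass
            raise


async def main() -> tuple[list[str], bool]:
    lock = asyncio.Lock()
    request = asyncio.create_task(run_serialized(lock))
    await asyncio.sleep(0.05)           # let the worker start
    request.cancel()                    # simulate a client disconnect
    await asyncio.sleep(0.05)
    held_after_cancel = lock.locked()   # worker still running
    try:
        await request
    except asyncio.CancelledError:
        events.append("request cancelled")
    return events, held_after_cancel


EVENTS, HELD = asyncio.run(main())
print(EVENTS, HELD)
```

The cancellation is observed while the worker is still running, yet the lock stays held until the worker finishes. A synchronous call on the event-loop thread has no point at which the cancellation could be delivered before func returns.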

This is also the same design issue Jan called out in the first review of #543.
That PR landed on the safer shape: keep MLX work on one stable owner executor
thread without putting decode on the event loop, and keep the existing worker
queue semantics for cancellation, prompt-cache/system-KV behavior, processor
retirement, and feature parity.
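
As a rough illustration of the owner-thread idea (not the #543 implementation, which is not shown in this thread), a single-worker executor keeps every call on one thread while the event loop stays free. The names _owner, _init_owner_state, and _generate are hypothetical, and the "stream" is a plain string placeholder for whatever per-thread MLX setup the real code performs.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def _init_owner_state() -> None:
    """Runs once on the owner thread; placeholder for per-thread setup."""
    _local.stream = "owner-stream"


# max_workers=1 means every submitted call runs on the same OS thread.
_owner = ThreadPoolExecutor(
    max_workers=1,
    thread_name_prefix="mlx-owner",
    initializer=_init_owner_state,
)


def _generate(prompt: str) -> tuple[str, str, str]:
    """Stand-in for blocking generation; reports which thread ran it."""
    return prompt.upper(), _local.stream, threading.current_thread().name


async def main() -> tuple[tuple[str, str, str], tuple[str, str, str]]:
    loop = asyncio.get_running_loop()
    # Both calls are awaited, so the event loop is not blocked by them.
    first = await loop.run_in_executor(_owner, _generate, "ping")
    second = await loop.run_in_executor(_owner, _generate, "pong")
    return first, second


FIRST, SECOND = asyncio.run(main())
_owner.shutdown()
print(FIRST, SECOND)
```

Both results report the same thread name and the same per-thread state, which is the property thread-affine resources need.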

Recommended direction:

  • Rework this around the #543 owner-thread / single-executor pattern instead of
    synchronous event-loop decode.
  • Add a regression test proving the event loop remains responsive while
    generation is in flight.
  • Add or preserve cancellation/disconnect coverage for _run_blocking_serialized.
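
One possible shape for the responsiveness regression test is a heartbeat task that counts how often the event loop gets to run while generation is in flight. This is a sketch under stated assumptions: fake_generation stands in for the real blocking call, and the helper names are hypothetical.

```python
import asyncio
import contextlib
import time


def fake_generation(seconds: float = 0.2) -> str:
    """Stand-in for a blocking decode."""
    time.sleep(seconds)
    return "ok"


async def decode_on_loop() -> str:
    # Proposed patch shape: blocking call directly on the event-loop thread.
    return fake_generation()


async def decode_in_worker() -> str:
    # Existing shape: blocking call dispatched to a worker thread.
    return await asyncio.to_thread(fake_generation)


async def ticks_during(run_generation) -> int:
    """Count heartbeat ticks that fire while run_generation() is in flight."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)      # give the heartbeat a chance to start
    await run_generation()
    seen = ticks
    beat.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await beat
    return seen


BLOCKED_TICKS = asyncio.run(ticks_during(decode_on_loop))
WORKER_TICKS = asyncio.run(ticks_during(decode_in_worker))
print(BLOCKED_TICKS, WORKER_TICKS)
```

A test built on this would assert that the tick count stays above some small threshold while generation runs; the synchronous variant records no ticks at all because the loop never regains control.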

So I’m marking this changes requested. The bug report is useful, but the patch
shape would trade the stream error for event-loop starvation and weaker
cancellation semantics.

@waybarrios

Owner

Closing this one. The synchronous approach was reviewed and rejected because it blocks the event loop during generation, which stalls health checks and breaks abort handling. The same root cause is addressed by #543 with a dedicated executor that preserves owner-thread affinity. Thanks for digging into it.

@waybarrios waybarrios closed this Jun 11, 2026