Keep MLLM media stream on owner thread#551
Conversation
janhilgard
left a comment
There was a problem hiding this comment.
Code Review
CI all green. Minimal diff (+61/-4), single logical change.
What it does
Fixes the condition at line 764 in SimpleEngine.stream_chat so that media-bearing MLLM requests stay on the current thread instead of being routed through _run_blocking_serialized (worker thread), which breaks MLX stream ownership (Stream(gpu, N) mismatch).
Before/After
# Before (broken for media + text_model):
if self._text_model is None and not has_media_content(messages):
# current thread ← only text-only without TextModel
# After (correct):
if has_media_content(messages) or self._text_model is None:
# current thread ← media always, or text-only without TextModelRouting analysis
The full routing chain in stream_chat is:
- L729-749:
_text_modelexists AND text-only →_stream_generate_text(TextModel fast path) - L752+, L764 (this change): MLLM path:
has_media_contentOR_text_model is None→ current thread under_generation_lock✅- else →
_run_blocking_serialized(only reachable for text-only MLLM without TextModel, which is actually dead code after check at L729-749, but harmless)
The condition change is logically correct: media requests must always stay on the owner thread for MLX stream safety. Text-only requests with a TextModel are already handled by the early return at L749.
Test
The regression test accurately models the failure: FakeMllmModel.stream_chat asserts threading.get_ident() matches the owner thread and raises the exact error from #535. The test also asserts _run_blocking_serialized is never called for media requests.
LGTM.
a1f0ee9 to
8ff2a5f
Compare
|
The routing fix itself is right, media requests do need to stay on the owner thread. The problem is that #540 landed after this PR was opened and they now collide: #540 added fail-fast admission via # this PR (simple.py, current-thread branch)
async with self._generation_lock: # raw lock: second media request queues
for chunk in self._model.stream_chat(...):
...Two consequences. First, two concurrent media requests queue on the lock instead of the second one getting the 503 import asyncio
async def main():
engine = SimpleEngine("mlx-community/Qwen2.5-VL-3B-Instruct-4bit", force_mllm=True)
await engine.start()
media_msg = [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "..."}},
{"type": "text", "text": "describe this"},
]}]
async def consume():
async for _ in engine.stream_chat(media_msg):
pass
# expected since #540: second request raises EngineBusy (503 text_generation_busy)
# actual with this PR: it silently queues behind the first
await asyncio.gather(consume(), consume())
asyncio.run(main())The fix should be small: switch that branch to |
Summary
Keep media-bearing SimpleEngine MLLM requests on the current
mlx_vlm.stream_chat()thread instead of routing them through the generic blocking worker when an auxiliary TextModel route exists.Local repro
Issue #535 reports this startup/request shape:
--mllm/ VLM auto-detection enables SimpleEngine MLLM routingMedia request → MLLM pathThere is no Stream(gpu, 1) in current thread.The focused regression added here models that exact route split:
_is_mllm=True,_text_modelpresent, media content present, andmlx_vlm.stream_chat()requiring its owner/current thread. It fails on upstream main because the media route calls_run_blocking_serialized()and movesstream_chat()to a generic worker.Upstream code path compared
vllm_mlx/engine/simple.py::SimpleEngine.stream_chat7e304840713c2abce8ea1fb40cb960a740c01ce2Change
mlx_vlm.stream_chat()path under the existing generation lock.Validation
tests/test_simple_engine_mllm_media_thread.pyuv run --python 3.12 python -m pytest tests/test_simple_engine.py tests/test_simple_engine_mllm_media_thread.py -q— 39 passeduv run --python 3.12 python -m pytest tests/test_api_utils.py -q— 93 passeduv run --python 3.12 python -m pytest tests/test_server.py -k 'mllm or stream_thread or residency_preload' -q— 5 passeduv run --python 3.12 ruff check vllm_mlx/engine/simple.py tests/test_simple_engine_mllm_media_thread.pyuv run --python 3.12 black --check vllm_mlx/engine/simple.py tests/test_simple_engine_mllm_media_thread.pygit diff --check/opt/ai-runtime/run/upstream-local-repro-preflight/20260519T225531Z-upstream-local-repro-preflight.jsonWhat this does not claim