Keep MLLM media stream on owner thread by Thump604 · Pull Request #551 · waybarrios/vllm-mlx

Thump604 · 2026-05-19T22:56:11Z

Summary

Keep media-bearing SimpleEngine MLLM requests on the current mlx_vlm.stream_chat() thread instead of routing them through the generic blocking worker when an auxiliary TextModel route exists.

Local repro

Issue #535 reports this startup/request shape:

model is a Qwen3.6 VLM-capable local quant
--mllm / VLM auto-detection enables SimpleEngine MLLM routing
text-only routing builds an auxiliary TextModel
a media-bearing chat request logs Media request → MLLM path
the request then fails with There is no Stream(gpu, 1) in current thread.

The focused regression added here models that exact route split: _is_mllm=True, _text_model present, media content present, and mlx_vlm.stream_chat() requiring its owner/current thread. It fails on upstream main because the media route calls _run_blocking_serialized() and moves stream_chat() to a generic worker.

Upstream code path compared

vllm_mlx/engine/simple.py::SimpleEngine.stream_chat
current upstream main 7e304840713c2abce8ea1fb40cb960a740c01ce2

Change

Preserve the existing TextModel fast path for text-only MLLM requests.
Keep media-bearing MLLM requests on the current mlx_vlm.stream_chat() path under the existing generation lock.
Keep the existing text-only fallback behavior when no TextModel route exists.

Validation

Red/green focused regression: tests/test_simple_engine_mllm_media_thread.py
uv run --python 3.12 python -m pytest tests/test_simple_engine.py tests/test_simple_engine_mllm_media_thread.py -q — 39 passed
uv run --python 3.12 python -m pytest tests/test_api_utils.py -q — 93 passed
uv run --python 3.12 python -m pytest tests/test_server.py -k 'mllm or stream_thread or residency_preload' -q — 5 passed
uv run --python 3.12 ruff check vllm_mlx/engine/simple.py tests/test_simple_engine_mllm_media_thread.py
uv run --python 3.12 black --check vllm_mlx/engine/simple.py tests/test_simple_engine_mllm_media_thread.py
git diff --check
Runtime upstream-local-repro preflight against clean upstream main: /opt/ai-runtime/run/upstream-local-repro-preflight/20260519T225531Z-upstream-local-repro-preflight.json

What this does not claim

This does not claim full live verification against the reporter's exact 35B quantized weights.
This does not change model detection; config.json-based VLM detection is already on current main via detect MLLM via config.json with substring matcher as fallback #520.
This does not change the separate TextModel owner-thread PR Keep VLM TextModel generation on owner thread #543.

janhilgard

Code Review

CI all green. Minimal diff (+61/-4), single logical change.

What it does

Fixes the condition at line 764 in SimpleEngine.stream_chat so that media-bearing MLLM requests stay on the current thread instead of being routed through _run_blocking_serialized (worker thread), which breaks MLX stream ownership (Stream(gpu, N) mismatch).

Before/After

# Before (broken for media + text_model):
if self._text_model is None and not has_media_content(messages):
    # current thread ← only text-only without TextModel

# After (correct):
if has_media_content(messages) or self._text_model is None:
    # current thread ← media always, or text-only without TextModel

Routing analysis

The full routing chain in stream_chat is:

L729-749: _text_model exists AND text-only → _stream_generate_text (TextModel fast path)
L752+, L764 (this change): MLLM path:
- has_media_content OR _text_model is None → current thread under _generation_lock ✅
- else → _run_blocking_serialized (only reachable for text-only MLLM without TextModel, which is actually dead code after check at L729-749, but harmless)

The condition change is logically correct: media requests must always stay on the owner thread for MLX stream safety. Text-only requests with a TextModel are already handled by the early return at L749.

Test

The regression test accurately models the failure: FakeMllmModel.stream_chat asserts threading.get_ident() matches the owner thread and raises the exact error from #535. The test also asserts _run_blocking_serialized is never called for media requests.

LGTM.

waybarrios · 2026-06-11T21:00:23Z

The routing fix itself is right, media requests do need to stay on the owner thread. The problem is that #540 landed after this PR was opened and they now collide: #540 added fail-fast admission via _acquire_generation_slot, and the current-thread branch here takes the raw lock instead:

# this PR (simple.py, current-thread branch)
async with self._generation_lock:   # raw lock: second media request queues
    for chunk in self._model.stream_chat(...):
        ...

Two consequences. First, two concurrent media requests queue on the lock instead of the second one getting the 503 text_generation_busy that #540 introduced. Second, _active_requests is never populated on this path, so when a text request fails fast while a media request holds the lock, the EngineBusy error reports active=none, which makes it useless for debugging. Repro sketch:

import asyncio

async def main():
    engine = SimpleEngine("mlx-community/Qwen2.5-VL-3B-Instruct-4bit", force_mllm=True)
    await engine.start()
    media_msg = [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "..."}},
        {"type": "text", "text": "describe this"},
    ]}]

    async def consume():
        async for _ in engine.stream_chat(media_msg):
            pass

    # expected since #540: second request raises EngineBusy (503 text_generation_busy)
    # actual with this PR: it silently queues behind the first
    await asyncio.gather(consume(), consume())

asyncio.run(main())

The fix should be small: switch that branch to async with self._acquire_generation_slot(request_id): (or add the same fail-fast check before taking the lock). It would also be good if the test asserted the fail-fast rejection for a second concurrent media request, not just thread identity. Happy to merge after that.

Thump604 assigned janhilgard May 19, 2026

Thump604 mentioned this pull request May 19, 2026

Bug: Vision capabilities/tool calling in vllm-mlx vs mlx-vlm? #535

Open

Thump604 requested a review from janhilgard May 19, 2026 22:56

janhilgard approved these changes May 20, 2026

View reviewed changes

Thump604 requested a review from waybarrios May 20, 2026 12:08

Thump604 assigned waybarrios May 20, 2026

Keep MLLM media stream on owner thread

8ff2a5f

Thump604 force-pushed the 604/issue-535-vlm-media-routing branch from a1f0ee9 to 8ff2a5f Compare May 30, 2026 21:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep MLLM media stream on owner thread#551

Keep MLLM media stream on owner thread#551
Thump604 wants to merge 1 commit into
waybarrios:mainfrom
Thump604:604/issue-535-vlm-media-routing

Thump604 commented May 19, 2026

Uh oh!

janhilgard left a comment

Uh oh!

waybarrios commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Thump604 commented May 19, 2026

Summary

Local repro

Upstream code path compared

Change

Validation

What this does not claim

Uh oh!

janhilgard left a comment

Choose a reason for hiding this comment

Code Review

What it does

Before/After

Routing analysis

Test

Uh oh!

waybarrios commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants