Skip to content

Keep MLLM media stream on owner thread#551

Open
Thump604 wants to merge 1 commit into
waybarrios:mainfrom
Thump604:604/issue-535-vlm-media-routing
Open

Keep MLLM media stream on owner thread#551
Thump604 wants to merge 1 commit into
waybarrios:mainfrom
Thump604:604/issue-535-vlm-media-routing

Conversation

@Thump604

Copy link
Copy Markdown
Collaborator

Summary

Keep media-bearing SimpleEngine MLLM requests on the current mlx_vlm.stream_chat() thread instead of routing them through the generic blocking worker when an auxiliary TextModel route exists.

Local repro

Issue #535 reports this startup/request shape:

  • model is a Qwen3.6 VLM-capable local quant
  • --mllm / VLM auto-detection enables SimpleEngine MLLM routing
  • text-only routing builds an auxiliary TextModel
  • a media-bearing chat request logs Media request → MLLM path
  • the request then fails with There is no Stream(gpu, 1) in current thread.

The focused regression added here models that exact route split: _is_mllm=True, _text_model present, media content present, and mlx_vlm.stream_chat() requiring its owner/current thread. It fails on upstream main because the media route calls _run_blocking_serialized() and moves stream_chat() to a generic worker.

Upstream code path compared

  • vllm_mlx/engine/simple.py::SimpleEngine.stream_chat
  • current upstream main 7e304840713c2abce8ea1fb40cb960a740c01ce2

Change

  • Preserve the existing TextModel fast path for text-only MLLM requests.
  • Keep media-bearing MLLM requests on the current mlx_vlm.stream_chat() path under the existing generation lock.
  • Keep the existing text-only fallback behavior when no TextModel route exists.

Validation

  • Red/green focused regression: tests/test_simple_engine_mllm_media_thread.py
  • uv run --python 3.12 python -m pytest tests/test_simple_engine.py tests/test_simple_engine_mllm_media_thread.py -q — 39 passed
  • uv run --python 3.12 python -m pytest tests/test_api_utils.py -q — 93 passed
  • uv run --python 3.12 python -m pytest tests/test_server.py -k 'mllm or stream_thread or residency_preload' -q — 5 passed
  • uv run --python 3.12 ruff check vllm_mlx/engine/simple.py tests/test_simple_engine_mllm_media_thread.py
  • uv run --python 3.12 black --check vllm_mlx/engine/simple.py tests/test_simple_engine_mllm_media_thread.py
  • git diff --check
  • Runtime upstream-local-repro preflight against clean upstream main: /opt/ai-runtime/run/upstream-local-repro-preflight/20260519T225531Z-upstream-local-repro-preflight.json

What this does not claim

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

CI all green. Minimal diff (+61/-4), single logical change.

What it does

Fixes the condition at line 764 in SimpleEngine.stream_chat so that media-bearing MLLM requests stay on the current thread instead of being routed through _run_blocking_serialized (worker thread), which breaks MLX stream ownership (Stream(gpu, N) mismatch).

Before/After

# Before (broken for media + text_model):
if self._text_model is None and not has_media_content(messages):
    # current thread ← only text-only without TextModel

# After (correct):
if has_media_content(messages) or self._text_model is None:
    # current thread ← media always, or text-only without TextModel

Routing analysis

The full routing chain in stream_chat is:

  1. L729-749: _text_model exists AND text-only → _stream_generate_text (TextModel fast path)
  2. L752+, L764 (this change): MLLM path:
    • has_media_content OR _text_model is None → current thread under _generation_lock
    • else → _run_blocking_serialized (only reachable for text-only MLLM without TextModel, which is actually dead code after check at L729-749, but harmless)

The condition change is logically correct: media requests must always stay on the owner thread for MLX stream safety. Text-only requests with a TextModel are already handled by the early return at L749.

Test

The regression test accurately models the failure: FakeMllmModel.stream_chat asserts threading.get_ident() matches the owner thread and raises the exact error from #535. The test also asserts _run_blocking_serialized is never called for media requests.

LGTM.

@Thump604 Thump604 force-pushed the 604/issue-535-vlm-media-routing branch from a1f0ee9 to 8ff2a5f Compare May 30, 2026 21:42
@waybarrios

Copy link
Copy Markdown
Owner

The routing fix itself is right, media requests do need to stay on the owner thread. The problem is that #540 landed after this PR was opened and they now collide: #540 added fail-fast admission via _acquire_generation_slot, and the current-thread branch here takes the raw lock instead:

# this PR (simple.py, current-thread branch)
async with self._generation_lock:   # raw lock: second media request queues
    for chunk in self._model.stream_chat(...):
        ...

Two consequences. First, two concurrent media requests queue on the lock instead of the second one getting the 503 text_generation_busy that #540 introduced. Second, _active_requests is never populated on this path, so when a text request fails fast while a media request holds the lock, the EngineBusy error reports active=none, which makes it useless for debugging. Repro sketch:

import asyncio

async def main():
    engine = SimpleEngine("mlx-community/Qwen2.5-VL-3B-Instruct-4bit", force_mllm=True)
    await engine.start()
    media_msg = [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "..."}},
        {"type": "text", "text": "describe this"},
    ]}]

    async def consume():
        async for _ in engine.stream_chat(media_msg):
            pass

    # expected since #540: second request raises EngineBusy (503 text_generation_busy)
    # actual with this PR: it silently queues behind the first
    await asyncio.gather(consume(), consume())

asyncio.run(main())

The fix should be small: switch that branch to async with self._acquire_generation_slot(request_id): (or add the same fail-fast check before taking the lock). It would also be good if the test asserted the fail-fast rejection for a second concurrent media request, not just thread identity. Happy to merge after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants