Skip to content

Keep VLM TextModel generation on owner thread#543

Open
Thump604 wants to merge 2 commits into
waybarrios:mainfrom
Thump604:604/vlm-text-owner-thread
Open

Keep VLM TextModel generation on owner thread#543
Thump604 wants to merge 2 commits into
waybarrios:mainfrom
Thump604:604/vlm-text-owner-thread

Conversation

@Thump604

@Thump604 Thump604 commented May 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

Keep VLM-derived TextModel generation on one stable owner worker thread, without adding a synchronous event-loop decode path.

Local Repro

With a Qwen 3.6 VLM artifact using the SimpleEngine text-only route, a tiny text generation completed successfully, then the short-lived process segfaulted during asyncio.run() shutdown.

Minimal observed sequence:

MARK after start
MARK after generation ok
MARK after stop
MARK after clear_cache
MARK after asyncio.run
exit_code=139

Narrowed local artifacts:

/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-simpleengine-start-stop.log
/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-simpleengine-tiny-generation.log
/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-built-textmodel-worker-queue.log
/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-built-textmodel-worker-owned-queue.log

The control cases showed:

  • SimpleEngine.start() / stop() without generation exited cleanly
  • direct MLXLanguageModel.stream_generate() exited cleanly
  • VLM-derived raw TextModel generation exited cleanly when built and generated on the same thread
  • VLM-derived raw TextModel generation crashed when built on one thread and generated through a different worker thread

Code Path Compared

Checked waybarrios/vllm-mlx main at c2d3aec:

vllm_mlx/engine/simple.py::SimpleEngine.start
vllm_mlx/engine/simple.py::SimpleEngine._stream_generate_text

SimpleEngine.start() built the VLM-derived TextModel on the event-loop thread. _stream_generate_text() then ran normal TextModel decode through the serialized worker path. That crossed the MLX ownership boundary for the VLM-derived TextModel.

Change

Build the VLM-derived TextModel on a dedicated single-thread executor and record that thread as the owner. TextModel generation then uses the same executor through the existing _stream_generate_text() producer path.

This keeps the existing worker-queue behavior for:

  • client cancellation / abort handling
  • system KV cache restore and prompt_cache forwarding
  • SpecPrefill fallback behavior
  • processor-retirement resume behavior
  • MTP cache layering when active

The earlier direct synchronous event-loop decode branch has been removed.

Tests

Updated the regression coverage to prove the VLM-derived TextModel is built and generated on the same owner worker thread, not on the event-loop thread.

Local verification:

.venv/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py -q
42 passed

.venv/bin/python -m pytest -q
2072 passed, 11 skipped, 23 deselected

.venv/bin/python -m ruff check vllm_mlx/engine/simple.py tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py
All checks passed

.venv/bin/python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py
3 files would be left unchanged

git diff --check
passed

Not Claimed

This PR does not change model cards, default model selection, resident routing, MTP behavior, or SpecPrefill behavior. It only keeps VLM-derived TextModel build and decode on one stable TextModel worker thread while preserving the existing SimpleEngine text-route semantics.

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The motivation is sound: VLM-derived TextModel objects have MLX thread-affinity, so crossing to a worker thread via _run_blocking_serialized can cause segfaults during asyncio shutdown. The approach of detecting the owner thread and staying on it is reasonable. However, there are several issues — some correctness bugs, some missing feature parity with the existing worker path — that should be addressed before merging.


1. Missing abort_event handling — synchronous generator blocks event loop

Severity: High

The new owner-thread path runs mlx_stream_generate() (a synchronous, potentially long-running generator) directly inside async with self._generation_lock, on the event loop thread. The existing worker path runs this in asyncio.to_thread() (via _run_blocking_serialized) precisely so the event loop stays responsive.

In the new path:

  • Each iteration of the for resp in mlx_stream_generate(...) loop is a synchronous blocking call (Metal compute). The event loop is blocked for the duration of each decode step.
  • The abort_event threading.Event is never checked. If a client disconnects mid-stream, there is no way to break out of the generation loop.
  • HTTP health checks, other concurrent requests, and even cancellation signals cannot be processed while the loop is blocked.

This may be acceptable for very short generations, but for any meaningful max_tokens (e.g. 4096), this will freeze the entire server.

Suggestion: Either (a) run the synchronous generator in a thread but ensure it stays on the owner thread (perhaps via a single-thread executor bound to the owner thread), or (b) at minimum check abort_event inside the loop (though this won't help with event-loop blocking).


2. Missing processor retirement logic

Severity: Medium

The worker-queue path (_run_all) has a significant feature: logits processor retirement. When can_retire_processors is true and processors have been retired (e.g., after a thinking phase), the existing code breaks out of the generation loop, seeds a new prompt_cache, and resumes with MTP re-enabled via _resume_after_processor_retirement().

The new owner-thread path has none of this. Requests that use retirable logits processors (which is the default for <think> reasoning models) will either:

  • Run the entire generation without MTP even after the processor retires, or
  • Behave differently from the worker path in subtle ways.

This is a functional regression for VLM-derived TextModel requests that use reasoning/thinking.


3. backbone_cache is None condition is overly restrictive

Severity: Medium

The guard condition requires backbone_cache is None:

if (
    self._text_model_owner_thread == threading.get_ident()
    and not use_specprefill
    and backbone_cache is None
):

This means the owner-thread path is only used when there is no system KV cache hit. On the first request (cache miss with system messages), backbone_cache gets set in the code above (around line 1828 in main). On subsequent requests with the same system prompt, backbone_cache is restored from the snapshot.

In practice, after the first request, nearly all requests to a model with a system prompt will have backbone_cache is not None, so the fix won't apply. The segfault problem presumably still exists when the fallback worker path is used.

If the owner-thread path can't support backbone_cache, this should at least be documented as a known limitation.


4. No prompt_cache / system KV snapshot support

Severity: Low-Medium

Related to point 3: the new path never constructs a prompt_cache from backbone_cache, never passes prompt_cache to mlx_stream_generate, and never builds/restores system KV snapshots. The worker path does all of this. This means:

  • No KV cache reuse across requests with the same system prompt
  • No MTP cache layering on top of backbone cache

5. Thread identity check may not match in all deployment configurations

Severity: Low

threading.get_ident() is used to track the owner thread. In start(), the text model is built on the event-loop thread. In _stream_generate_text(), the check self._text_model_owner_thread == threading.get_ident() assumes the async generator runs on the same event-loop thread.

For a standard single-threaded asyncio event loop this is correct. But if anyone runs with uvloop or a custom event loop policy that uses multiple threads for coroutine execution, this assumption breaks silently and the code falls through to the existing worker path (safe, but defeats the purpose). This is not a bug per se, but worth a comment.


6. Test is good but narrow

Severity: Low

The test correctly verifies that generation stays on the owner thread. However:

  • It doesn't test the fallback path (when backbone_cache is not None or use_specprefill is True).
  • It doesn't test what happens when _text_model_owner_thread is None (non-MLLM model).
  • It doesn't test abort/cancellation behavior.
  • The fake stream_generate yields only one token — a multi-token test would better exercise the accumulation and stop-sequence logic.

Minor notes

  • Code style and formatting are clean, consistent with the rest of the file.
  • The _text_model_owner_thread lifecycle management (set in start, cleared in stop and error paths) is thorough.
  • The return at the end of the new block correctly prevents falling through to the worker-queue path.

Recommendation

The core idea is correct, but the implementation introduces event-loop blocking (issue 1) and lacks feature parity with the worker path (issue 2). I'd suggest either:

  1. Minimal fix: Use a dedicated single-thread executor that is pinned to the owner thread, so Metal ops stay on the right thread without blocking the event loop.
  2. Or: Accept the blocking trade-off but document it clearly, add abort_event checking, and add processor retirement support for feature parity.

@Thump604 Thump604 requested a review from janhilgard May 17, 2026 13:04

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All issues from the previous review are addressed. The dedicated ThreadPoolExecutor(max_workers=1) approach is clean — it preserves full feature parity (abort handling, processor retirement, backbone cache, system KV cache) while guaranteeing thread ownership for the VLM-derived TextModel.

Key improvements over v1:

  • Event loop no longer blocked (generation runs on executor thread)
  • All existing _run_blocking_serialized machinery (lock, cancellation, shield) reused
  • Defensive assertion in _run_all() catches wrong-thread bugs loudly
  • Proper executor shutdown in stop()

CI 9/9 green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants