Keep VLM TextModel generation on owner thread by Thump604 · Pull Request #543 · waybarrios/vllm-mlx

Thump604 · 2026-05-17T02:58:10Z

Summary

Keep VLM-derived TextModel generation on one stable owner worker thread, without adding a synchronous event-loop decode path.

Local Repro

With a Qwen 3.6 VLM artifact using the SimpleEngine text-only route, a tiny text generation completed successfully, then the short-lived process segfaulted during asyncio.run() shutdown.

Minimal observed sequence:

MARK after start
MARK after generation ok
MARK after stop
MARK after clear_cache
MARK after asyncio.run
exit_code=139
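For readers less familiar with the convention: a shell-reported exit code above 128 usually means the process was terminated by signal `code - 128`. The sketch below does that arithmetic; the helper name `describe_exit_code` is illustrative only, and it assumes Linux or macOS signal numbering.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name when it encodes one."""
    if code > 128:
        try:
            # Shells report "killed by signal N" as 128 + N.
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return f"exit status {code}"


print(describe_exit_code(139))  # SIGSEGV on Linux and macOS (139 - 128 = 11)
```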

Narrowed local artifacts:

/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-simpleengine-start-stop.log
/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-simpleengine-tiny-generation.log
/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-built-textmodel-worker-queue.log
/opt/ai-runtime/run/qwen35-dflash/20260517T014129Z/diag-built-textmodel-worker-owned-queue.log

The control cases showed:

SimpleEngine.start() / stop() without generation exited cleanly
direct MLXLanguageModel.stream_generate() exited cleanly
VLM-derived raw TextModel generation exited cleanly when built and generated on the same thread
VLM-derived raw TextModel generation crashed when built on one thread and generated through a different worker thread

Code Path Compared

Checked waybarrios/vllm-mlx main at c2d3aec:

vllm_mlx/engine/simple.py::SimpleEngine.start
vllm_mlx/engine/simple.py::SimpleEngine._stream_generate_text

SimpleEngine.start() built the VLM-derived TextModel on the event-loop thread. _stream_generate_text() then ran normal TextModel decode through the serialized worker path. That crossed the MLX ownership boundary for the VLM-derived TextModel.
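The thread crossing described above can be reproduced without MLX. The following is a minimal sketch, assuming a stand-in class (`FakeTextModel` is hypothetical and not part of vllm-mlx) that only remembers which thread constructed it; it shows that an object built on the event-loop thread and decoded through `asyncio.to_thread()` ends up on a different thread.

```python
import asyncio
import threading


class FakeTextModel:
    """Stand-in (not the real TextModel) that remembers its constructing thread."""

    def __init__(self) -> None:
        self.built_on = threading.get_ident()

    def decode(self) -> bool:
        # True only when decode runs on the thread that built the object.
        return threading.get_ident() == self.built_on


async def build_then_decode_in_worker() -> bool:
    model = FakeTextModel()  # built on the event-loop thread
    # asyncio.to_thread() runs the call on a thread-pool thread.
    return await asyncio.to_thread(model.decode)


same_thread = asyncio.run(build_then_decode_in_worker())
print(same_thread)  # False
```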

Change

Build the VLM-derived TextModel on a dedicated single-thread executor and record that thread as the owner. TextModel generation then uses the same executor through the existing _stream_generate_text() producer path.

This keeps the existing worker-queue behavior for:

client cancellation / abort handling
system KV cache restore and prompt_cache forwarding
SpecPrefill fallback behavior
processor-retirement resume behavior
MTP cache layering when active

The earlier direct synchronous event-loop decode branch has been removed.
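As a rough illustration of the ownership pattern, and not the actual SimpleEngine code, the sketch below builds a placeholder model on a dedicated single-thread executor, records that thread as the owner, and routes every later decode through the same executor. The class `OwnerBoundTextModel` and its methods are hypothetical names chosen for this example.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Tuple


class OwnerBoundTextModel:
    """Hypothetical wrapper: build and decode on one dedicated worker thread."""

    def __init__(self) -> None:
        # max_workers=1 means every submitted call runs on the same thread.
        self._executor = ThreadPoolExecutor(
            max_workers=1, thread_name_prefix="textmodel-owner"
        )
        self.owner_thread: Optional[int] = None
        self._model: Optional[object] = None

    def _build(self) -> None:
        self._model = object()  # placeholder for the real model build
        self.owner_thread = threading.get_ident()

    def _decode(self, prompt: str) -> Tuple[str, int]:
        # Return the thread id so callers can verify where decode ran.
        return f"echo:{prompt}", threading.get_ident()

    async def start(self) -> None:
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(self._executor, self._build)

    async def generate(self, prompt: str) -> Tuple[str, int]:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self._decode, prompt)

    def stop(self) -> None:
        self._executor.shutdown(wait=True)
        self.owner_thread = None
        self._model = None


async def main() -> bool:
    model = OwnerBoundTextModel()
    await model.start()
    try:
        _, decode_thread = await model.generate("hi")
        on_owner = decode_thread == model.owner_thread
        off_loop = decode_thread != threading.get_ident()
        return on_owner and off_loop
    finally:
        model.stop()


print(asyncio.run(main()))  # True
```

The event loop only awaits futures here, so it stays free to serve other work while the owner thread decodes.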

Tests

Updated the regression coverage to prove the VLM-derived TextModel is built and generated on the same owner worker thread, not on the event-loop thread.

Local verification:

.venv/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py -q
42 passed

.venv/bin/python -m pytest -q
2072 passed, 11 skipped, 23 deselected

.venv/bin/python -m ruff check vllm_mlx/engine/simple.py tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py
All checks passed

.venv/bin/python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py tests/test_simple_engine_cancel_serialization.py
3 files would be left unchanged

git diff --check
passed

Not Claimed

This PR does not change model cards, default model selection, resident routing, MTP behavior, or SpecPrefill behavior. It only keeps VLM-derived TextModel build and decode on one stable TextModel worker thread while preserving the existing SimpleEngine text-route semantics.

janhilgard

Review Summary

The motivation is sound: VLM-derived TextModel objects have MLX thread-affinity, so crossing to a worker thread via _run_blocking_serialized can cause segfaults during asyncio shutdown. The approach of detecting the owner thread and staying on it is reasonable. However, there are several issues — some correctness bugs, some missing feature parity with the existing worker path — that should be addressed before merging.

1. Missing `abort_event` handling — synchronous generator blocks event loop

Severity: High

The new owner-thread path runs mlx_stream_generate() (a synchronous, potentially long-running generator) directly inside async with self._generation_lock, on the event loop thread. The existing worker path runs this in asyncio.to_thread() (via _run_blocking_serialized) precisely so the event loop stays responsive.

In the new path:

Each iteration of the for resp in mlx_stream_generate(...) loop is a synchronous blocking call (Metal compute). The event loop is blocked for the duration of each decode step.
The abort_event threading.Event is never checked. If a client disconnects mid-stream, there is no way to break out of the generation loop.
HTTP health checks, other concurrent requests, and even cancellation signals cannot be processed while the loop is blocked.

This may be acceptable for very short generations, but for any meaningful max_tokens (e.g. 4096), this will freeze the entire server.

Suggestion: Either (a) run the synchronous generator in a thread but ensure it stays on the owner thread (perhaps via a single-thread executor bound to the owner thread), or (b) at minimum check abort_event inside the loop (though this won't help with event-loop blocking).
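Option (a) can be sketched roughly as follows. This is an assumption-laden illustration, not the project's `_run_blocking_serialized` machinery: `fake_stream_generate` stands in for `mlx_stream_generate`, and `stream_tokens` is a hypothetical helper. The synchronous generator runs on a single-thread executor, checks `abort_event` between steps, and hands tokens to the event loop through an `asyncio.Queue`.

```python
import asyncio
import threading
from collections.abc import AsyncIterator, Iterator
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the stream


def fake_stream_generate(n_tokens: int) -> Iterator[str]:
    """Stand-in for a blocking, synchronous token generator."""
    for i in range(n_tokens):
        yield f"tok{i}"


async def stream_tokens(
    executor: ThreadPoolExecutor, abort_event: threading.Event, n_tokens: int
) -> AsyncIterator[str]:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        # Runs on the executor thread, so the event loop is never blocked.
        try:
            for tok in fake_stream_generate(n_tokens):
                if abort_event.is_set():
                    break  # client went away: stop decoding
                loop.call_soon_threadsafe(queue.put_nowait, tok)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    future = loop.run_in_executor(executor, producer)
    while True:
        item = await queue.get()
        if item is _DONE:
            break
        yield item
    await future  # surface any exception raised in the producer


async def main() -> list:
    executor = ThreadPoolExecutor(max_workers=1)
    abort_event = threading.Event()
    try:
        return [tok async for tok in stream_tokens(executor, abort_event, 3)]
    finally:
        executor.shutdown(wait=True)


tokens = asyncio.run(main())
print(tokens)  # ['tok0', 'tok1', 'tok2']
```

Setting `abort_event` from the request handler stops the producer at the next step boundary; it cannot interrupt a decode step that is already running.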

2. Missing processor retirement logic

Severity: Medium

The worker-queue path (_run_all) has a significant feature: logits processor retirement. When can_retire_processors is true and processors have been retired (e.g., after a thinking phase), the existing code breaks out of the generation loop, seeds a new prompt_cache, and resumes with MTP re-enabled via _resume_after_processor_retirement().

The new owner-thread path has none of this. Requests that use retirable logits processors (which is the default for <think> reasoning models) will either:

Run the entire generation without MTP even after the processor retires, or
Behave differently from the worker path in subtle ways.

This is a functional regression for VLM-derived TextModel requests that use reasoning/thinking.

3. `backbone_cache is None` condition is overly restrictive

Severity: Medium

The guard condition requires backbone_cache is None:

if (
    self._text_model_owner_thread == threading.get_ident()
    and not use_specprefill
    and backbone_cache is None
):

This means the owner-thread path is only used when there is no system KV cache hit. On the first request (cache miss with system messages), backbone_cache gets set in the code above (around line 1828 in main). On subsequent requests with the same system prompt, backbone_cache is restored from the snapshot.

In practice, after the first request, nearly all requests to a model with a system prompt will have backbone_cache is not None, so the fix won't apply. The segfault problem presumably still exists when the fallback worker path is used.

If the owner-thread path can't support backbone_cache, this should at least be documented as a known limitation.
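To make the effect of the guard concrete, the sketch below evaluates the same three-part condition as a plain function. The function name `takes_owner_path` and the placeholder cache value are illustrative assumptions; only the condition itself comes from the quoted diff.

```python
import threading
from typing import Optional


def takes_owner_path(
    owner_thread: Optional[int],
    use_specprefill: bool,
    backbone_cache: Optional[object],
) -> bool:
    """Mirror of the quoted guard, evaluated on the calling thread."""
    return (
        owner_thread == threading.get_ident()
        and not use_specprefill
        and backbone_cache is None
    )


owner = threading.get_ident()
print(takes_owner_path(owner, False, None))           # True: no system KV cache
print(takes_owner_path(owner, False, ["kv-layer"]))   # False: cache present
print(takes_owner_path(owner, True, None))            # False: SpecPrefill active
print(takes_owner_path(None, False, None))            # False: no owner recorded
```

Any request that carries a backbone cache takes the second branch, which is the case the review flags as the common one for deployments with a system prompt.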

4. No `prompt_cache` / system KV snapshot support

Severity: Low-Medium

Related to point 3: the new path never constructs a prompt_cache from backbone_cache, never passes prompt_cache to mlx_stream_generate, and never builds/restores system KV snapshots. The worker path does all of this. This means:

No KV cache reuse across requests with the same system prompt
No MTP cache layering on top of backbone cache

5. Thread identity check may not match in all deployment configurations

Severity: Low

threading.get_ident() is used to track the owner thread. In start(), the text model is built on the event-loop thread. In _stream_generate_text(), the check self._text_model_owner_thread == threading.get_ident() assumes the async generator runs on the same event-loop thread.

For a standard asyncio event loop, which runs all coroutines on a single thread, this is correct (uvloop also runs coroutines on one thread). But if start() and the async generator end up on different threads, for example when the engine is started on one thread and the serving loop runs on another, this assumption breaks silently and the code falls through to the existing worker path (safe, but defeats the purpose). This is not a bug per se, but worth a comment.

6. Test is good but narrow

Severity: Low

The test correctly verifies that generation stays on the owner thread. However:

It doesn't test the fallback path (when backbone_cache is not None or use_specprefill is True).
It doesn't test what happens when _text_model_owner_thread is None (non-MLLM model).
It doesn't test abort/cancellation behavior.
The fake stream_generate yields only one token — a multi-token test would better exercise the accumulation and stop-sequence logic.
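A multi-token variant of such a test might look like the sketch below. It is written with plain asserts rather than the repository's pytest fixtures, and every name in it (`fake_stream_generate`, `decode_all`, the `<end>` stop sequence) is an assumption made for illustration.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple


def fake_stream_generate(tokens: List[str], seen: List[int]) -> Iterator[str]:
    """Fake generator that records the thread used for each decode step."""
    for tok in tokens:
        seen.append(threading.get_ident())
        yield tok


def decode_all(tokens: List[str], seen: List[int], stop: str) -> str:
    """Accumulate tokens and cut the text at the first stop sequence."""
    text = ""
    for tok in fake_stream_generate(tokens, seen):
        text += tok
        if stop in text:
            return text[: text.index(stop)]
    return text


async def run_case() -> Tuple[int, int, List[int], str]:
    executor = ThreadPoolExecutor(max_workers=1)
    loop = asyncio.get_running_loop()
    seen: List[int] = []
    try:
        owner = await loop.run_in_executor(executor, threading.get_ident)
        text = await loop.run_in_executor(
            executor, decode_all, ["Hel", "lo", "<end>", "ignored"], seen, "<end>"
        )
    finally:
        executor.shutdown(wait=True)
    return owner, threading.get_ident(), seen, text


owner, loop_thread, seen, text = asyncio.run(run_case())
assert text == "Hello"            # stop sequence trimmed, later tokens unused
assert len(seen) == 3             # generation stopped at the stop sequence
assert set(seen) == {owner}       # every step ran on the owner thread
assert owner != loop_thread       # and not on the event-loop thread
```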

Minor notes

Code style and formatting are clean, consistent with the rest of the file.
The _text_model_owner_thread lifecycle management (set in start, cleared in stop and error paths) is thorough.
The return at the end of the new block correctly prevents falling through to the worker-queue path.

Recommendation

The core idea is correct, but the implementation introduces event-loop blocking (issue 1) and lacks feature parity with the worker path (issues 2 through 4). I'd suggest either:

Minimal fix: Use a dedicated single-thread executor that is pinned to the owner thread, so Metal ops stay on the right thread without blocking the event loop.
Or: Accept the blocking trade-off but document it clearly, add abort_event checking, and add processor retirement support for feature parity.

janhilgard

All issues from the previous review are addressed. The dedicated ThreadPoolExecutor(max_workers=1) approach is clean — it preserves full feature parity (abort handling, processor retirement, backbone cache, system KV cache) while guaranteeing thread ownership for the VLM-derived TextModel.

Key improvements over v1:

Event loop no longer blocked (generation runs on executor thread)
All existing _run_blocking_serialized machinery (lock, cancellation, shield) reused
Defensive assertion in _run_all() catches wrong-thread bugs loudly
Proper executor shutdown in stop()
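A defensive check of the kind mentioned above could be sketched as follows. The helper `require_owner_thread` is hypothetical and only illustrates the idea of failing loudly on a wrong-thread call; the real assertion lives in `_run_all()` and may look different.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def require_owner_thread(owner_thread: int) -> None:
    """Raise immediately if called from any thread other than the owner."""
    current = threading.get_ident()
    if current != owner_thread:
        raise RuntimeError(
            f"TextModel decode on thread {current}, but owner is {owner_thread}"
        )


executor = ThreadPoolExecutor(max_workers=1)
owner = executor.submit(threading.get_ident).result()

# Passes: the check runs on the executor's single worker thread.
executor.submit(require_owner_thread, owner).result()

# Fails loudly: the check runs on the main thread instead.
try:
    require_owner_thread(owner)
    raised = False
except RuntimeError:
    raised = True

executor.shutdown(wait=True)  # release the owner thread when done
print(raised)  # True
```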

CI 9/9 green.

Thump604 assigned janhilgard May 17, 2026

Thump604 requested a review from janhilgard May 17, 2026 02:58

janhilgard requested changes May 17, 2026

Thump604 requested a review from janhilgard May 17, 2026 13:04

janhilgard approved these changes May 17, 2026

Thump604 assigned waybarrios May 19, 2026

Thump604 requested a review from waybarrios May 19, 2026 20:11

This was referenced May 19, 2026:

Keep MLLM media stream on owner thread #551 (Open)
Bug: Vision capabilities/tool calling in vllm-mlx vs mlx-vlm? #535 (Open)

Thump604 added 2 commits May 31, 2026 15:57

fix: keep vlm text generation on owner thread

a24263a

fix: run vlm text model on stable owner worker

69b45f1

Thump604 force-pushed the 604/vlm-text-owner-thread branch from 86fae9a to 69b45f1 May 31, 2026 20:58

Thump604 mentioned this pull request Jun 6, 2026

fix(engine): run generation synchronously to preserve MLX stream context #593 (Closed, 6 tasks)

waybarrios mentioned this pull request Jun 11, 2026

fix(simple): use persistent MLX worker thread to fix thread-local stream crash #478 (Closed)

Thump604 wants to merge 2 commits into waybarrios:main from Thump604:604/vlm-text-owner-thread
