Keep VLM TextModel generation on owner thread#543
Conversation
janhilgard
left a comment
There was a problem hiding this comment.
Review Summary
The motivation is sound: VLM-derived TextModel objects have MLX thread-affinity, so crossing to a worker thread via _run_blocking_serialized can cause segfaults during asyncio shutdown. The approach of detecting the owner thread and staying on it is reasonable. However, there are several issues — some correctness bugs, some missing feature parity with the existing worker path — that should be addressed before merging.
1. Missing abort_event handling — synchronous generator blocks event loop
Severity: High
The new owner-thread path runs mlx_stream_generate() (a synchronous, potentially long-running generator) directly inside async with self._generation_lock, on the event loop thread. The existing worker path runs this in asyncio.to_thread() (via _run_blocking_serialized) precisely so the event loop stays responsive.
In the new path:
- Each iteration of the
for resp in mlx_stream_generate(...)loop is a synchronous blocking call (Metal compute). The event loop is blocked for the duration of each decode step. - The
abort_eventthreading.Event is never checked. If a client disconnects mid-stream, there is no way to break out of the generation loop. - HTTP health checks, other concurrent requests, and even cancellation signals cannot be processed while the loop is blocked.
This may be acceptable for very short generations, but for any meaningful max_tokens (e.g. 4096), this will freeze the entire server.
Suggestion: Either (a) run the synchronous generator in a thread but ensure it stays on the owner thread (perhaps via a single-thread executor bound to the owner thread), or (b) at minimum check abort_event inside the loop (though this won't help with event-loop blocking).
2. Missing processor retirement logic
Severity: Medium
The worker-queue path (_run_all) has a significant feature: logits processor retirement. When can_retire_processors is true and processors have been retired (e.g., after a thinking phase), the existing code breaks out of the generation loop, seeds a new prompt_cache, and resumes with MTP re-enabled via _resume_after_processor_retirement().
The new owner-thread path has none of this. Requests that use retirable logits processors (which is the default for <think> reasoning models) will either:
- Run the entire generation without MTP even after the processor retires, or
- Behave differently from the worker path in subtle ways.
This is a functional regression for VLM-derived TextModel requests that use reasoning/thinking.
3. backbone_cache is None condition is overly restrictive
Severity: Medium
The guard condition requires backbone_cache is None:
if (
self._text_model_owner_thread == threading.get_ident()
and not use_specprefill
and backbone_cache is None
):This means the owner-thread path is only used when there is no system KV cache hit. On the first request (cache miss with system messages), backbone_cache gets set in the code above (around line 1828 in main). On subsequent requests with the same system prompt, backbone_cache is restored from the snapshot.
In practice, after the first request, nearly all requests to a model with a system prompt will have backbone_cache is not None, so the fix won't apply. The segfault problem presumably still exists when the fallback worker path is used.
If the owner-thread path can't support backbone_cache, this should at least be documented as a known limitation.
4. No prompt_cache / system KV snapshot support
Severity: Low-Medium
Related to point 3: the new path never constructs a prompt_cache from backbone_cache, never passes prompt_cache to mlx_stream_generate, and never builds/restores system KV snapshots. The worker path does all of this. This means:
- No KV cache reuse across requests with the same system prompt
- No MTP cache layering on top of backbone cache
5. Thread identity check may not match in all deployment configurations
Severity: Low
threading.get_ident() is used to track the owner thread. In start(), the text model is built on the event-loop thread. In _stream_generate_text(), the check self._text_model_owner_thread == threading.get_ident() assumes the async generator runs on the same event-loop thread.
For a standard single-threaded asyncio event loop this is correct. But if anyone runs with uvloop or a custom event loop policy that uses multiple threads for coroutine execution, this assumption breaks silently and the code falls through to the existing worker path (safe, but defeats the purpose). This is not a bug per se, but worth a comment.
6. Test is good but narrow
Severity: Low
The test correctly verifies that generation stays on the owner thread. However:
- It doesn't test the fallback path (when
backbone_cache is not Noneoruse_specprefill is True). - It doesn't test what happens when
_text_model_owner_threadisNone(non-MLLM model). - It doesn't test abort/cancellation behavior.
- The fake
stream_generateyields only one token — a multi-token test would better exercise the accumulation and stop-sequence logic.
Minor notes
- Code style and formatting are clean, consistent with the rest of the file.
- The
_text_model_owner_threadlifecycle management (set instart, cleared instopand error paths) is thorough. - The
returnat the end of the new block correctly prevents falling through to the worker-queue path.
Recommendation
The core idea is correct, but the implementation introduces event-loop blocking (issue 1) and lacks feature parity with the worker path (issue 2). I'd suggest either:
- Minimal fix: Use a dedicated single-thread executor that is pinned to the owner thread, so Metal ops stay on the right thread without blocking the event loop.
- Or: Accept the blocking trade-off but document it clearly, add abort_event checking, and add processor retirement support for feature parity.
janhilgard
left a comment
There was a problem hiding this comment.
All issues from the previous review are addressed. The dedicated ThreadPoolExecutor(max_workers=1) approach is clean — it preserves full feature parity (abort handling, processor retirement, backbone cache, system KV cache) while guaranteeing thread ownership for the VLM-derived TextModel.
Key improvements over v1:
- Event loop no longer blocked (generation runs on executor thread)
- All existing
_run_blocking_serializedmachinery (lock, cancellation, shield) reused - Defensive assertion in
_run_all()catches wrong-thread bugs loudly - Proper executor shutdown in
stop()
CI 9/9 green.
86fae9a to
69b45f1
Compare
Summary
Keep VLM-derived TextModel generation on one stable owner worker thread, without adding a synchronous event-loop decode path.
Local Repro
With a Qwen 3.6 VLM artifact using the SimpleEngine text-only route, a tiny text generation completed successfully, then the short-lived process segfaulted during
asyncio.run()shutdown.Minimal observed sequence:
Narrowed local artifacts:
The control cases showed:
SimpleEngine.start()/stop()without generation exited cleanlyMLXLanguageModel.stream_generate()exited cleanlyCode Path Compared
Checked
waybarrios/vllm-mlxmain atc2d3aec:SimpleEngine.start()built the VLM-derived TextModel on the event-loop thread._stream_generate_text()then ran normal TextModel decode through the serialized worker path. That crossed the MLX ownership boundary for the VLM-derived TextModel.Change
Build the VLM-derived TextModel on a dedicated single-thread executor and record that thread as the owner. TextModel generation then uses the same executor through the existing
_stream_generate_text()producer path.This keeps the existing worker-queue behavior for:
The earlier direct synchronous event-loop decode branch has been removed.
Tests
Updated the regression coverage to prove the VLM-derived TextModel is built and generated on the same owner worker thread, not on the event-loop thread.
Local verification:
Not Claimed
This PR does not change model cards, default model selection, resident routing, MTP behavior, or SpecPrefill behavior. It only keeps VLM-derived TextModel build and decode on one stable TextModel worker thread while preserving the existing SimpleEngine text-route semantics.