fix(text-model-from-vlm): realize private lazy arrays before the model leaves the build thread by ursk · Pull Request #614 · waybarrios/vllm-mlx

ursk · 2026-06-12T16:03:45Z

Fixes #613.

Problem

Every Gemma 4 MLLM text-route generation on main fails with RuntimeError: There is no Stream(gpu, N) in current thread (full diagnosis, bisect, and minimal MLX repro in #613). Short version:

MLX 0.31 streams are thread-bound: lazy graphs recorded under a stream on one thread cannot be evaluated from another.
rope_utils' scaled-RoPE classes compute self._freqs lazily in __init__, and nn.Module.parameters() excludes underscore-prefixed attributes — so _freqs survives build_text_model as a lazy graph tagged to the load thread's stream.
Generation runs via asyncio.to_thread on arbitrary pool threads; the first forward through a full_attention layer (layer 5 on Gemma 4 — layers 0–4 are sliding) evaluates _freqs cross-thread and dies.
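
The three points above can be illustrated with a pure-Python toy model. This is a sketch only: it does not use MLX, and `LazyValue`, `ToyRoPE`, and `run_in_thread` are hypothetical names invented for illustration. It assumes nothing about MLX beyond the behaviour described in this PR (lazy values bound to the recording thread, and a `parameters()` filter that skips underscore-prefixed names).

```python
import threading


class LazyValue:
    """Toy stand-in for a lazy array: remembers the thread that recorded it."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._owner = threading.get_ident()  # the "stream" is the recording thread
        self._realized = False
        self._value = None

    def evaluate(self):
        if self._realized:
            return self._value  # realized values are safe to read from any thread
        if threading.get_ident() != self._owner:
            raise RuntimeError("toy model: no stream for this thread")
        self._value = self._thunk()
        self._realized = True
        return self._value


class ToyRoPE:
    """Hypothetical module holding one public and one private lazy value."""

    def __init__(self):
        self.weight = LazyValue(lambda: 1.0)
        self._freqs = LazyValue(lambda: [1.0, 0.5, 0.25])

    def parameters(self):
        # Mirrors the filter described above: underscore-prefixed names are skipped.
        return {
            name: value
            for name, value in vars(self).items()
            if isinstance(value, LazyValue) and not name.startswith("_")
        }


def run_in_thread(fn):
    """Run fn on a fresh thread and capture its result or exception."""
    result = {}

    def target():
        try:
            result["value"] = fn()
        except Exception as exc:
            result["error"] = exc

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return result


rope = ToyRoPE()
for param in rope.parameters().values():
    param.evaluate()  # realizes `weight` only; `_freqs` stays lazy

# Evaluating the private value from another thread fails in the toy model.
outcome = run_in_thread(rope._freqs.evaluate)
```

In this toy, realizing only what `parameters()` returns leaves `_freqs` bound to the build thread, which is the same shape as the failure reported in #613.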

#595 exposed this by enabling the TextModel route for Gemma 4; the route's threading hazard predates it.

Fix

Realize every module-held array — including private attributes — at the end of build_text_model, so nothing stream-tagged escapes the build thread. One mx.eval over text_model.modules(), guarded for duck-typed test doubles.
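
A rough sketch of that traversal, again in plain Python rather than MLX. `Lazy`, `ToyModule`, and `realize_all` are hypothetical, the per-value `realize()` loop stands in for the single `mx.eval` call, and how real MLX modules store their attributes is not modelled here. The `callable` check is one possible way to express the guard for test doubles that lack `modules()`.

```python
class Lazy:
    """Hypothetical stand-in for a lazy array; not an MLX type."""

    def __init__(self, thunk):
        self.thunk = thunk
        self.realized = False
        self.value = None

    def realize(self):
        if not self.realized:
            self.value = self.thunk()
            self.realized = True
        return self.value


class ToyModule:
    """Hypothetical module whose modules() yields itself and all descendants."""

    def __init__(self, *children):
        self.children = list(children)
        self._freqs = Lazy(lambda: [1.0, 0.5])  # private, lazily computed

    def modules(self):
        found = [self]
        for child in self.children:
            found.extend(child.modules())
        return found


def realize_all(model):
    """Realize every Lazy held by any module, private attributes included."""
    walker = getattr(model, "modules", None)
    if not callable(walker):
        return 0  # duck-typed test double without modules(): nothing to do
    count = 0
    for module in walker():
        for value in vars(module).values():
            if isinstance(value, Lazy):
                value.realize()
                count += 1
    return count


model = ToyModule(ToyModule(), ToyModule())
realized = realize_all(model)      # walks the root and both children
skipped = realize_all(object())    # a double with no modules() is ignored
```

Walking instance attributes directly, instead of going through `parameters()`, is what lets the private `_freqs` values be picked up.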

Verification

Regression test (test_build_text_model_realizes_private_lazy_arrays): builds a fake gemma4 TextModel holding a lazy private array and asserts it is evaluable from a different thread. Red without the fix, green with it.
Live: gemma-4-26b-a4b-it 4-bit MLX on SimpleEngine (M3 Ultra), previously 500 on every /v1/messages text request, now serves the full tool-call flow (tool_use emission, 3-turn tool-result follow-up with end_turn, streaming) cleanly. Same for the 31B dense build.
Full suite: 2181 passed, 11 skipped.

Note: this is the targeted unblock. The structural hazard (to_thread onto arbitrary pool threads + per-call stream rebinding) is discussed in #613 — a single pinned MLX worker thread would retire the whole class; we run that pattern in production downstream and can upstream it as a follow-up if there's interest.
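
For readers unfamiliar with the pinned-worker idea, here is a minimal standard-library sketch. It is not the downstream production code, which is not shown in this PR; `PinnedWorker` and `demo` are hypothetical names, and `threading.get_ident` stands in for the model build and generation calls. The assumption is simply that routing every call through a one-thread executor keeps all work on the same thread.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class PinnedWorker:
    """Runs every submitted callable on one dedicated thread."""

    def __init__(self, name="mlx-worker"):
        # max_workers=1 means the executor never uses more than one thread.
        self._executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix=name)

    async def run(self, fn, *args):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    def shutdown(self):
        self._executor.shutdown(wait=True)


async def demo():
    worker = PinnedWorker()
    try:
        # Stand-ins for one "build" call followed by several "generate" calls.
        idents = await asyncio.gather(
            *(worker.run(threading.get_ident) for _ in range(5))
        )
    finally:
        worker.shutdown()
    return set(idents)


thread_ids = asyncio.run(demo())
```

Compared with `asyncio.to_thread`, which hands work to whichever pool thread is free, this trades parallelism across calls for the guarantee that anything recorded during the build is evaluated on the thread that recorded it.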

🤖 Generated with Claude Code

…l leaves the build thread

MLX lazy graphs are tagged to the stream of the thread that recorded them, and nn.Module.parameters() excludes underscore-prefixed attributes. The scaled-RoPE _freqs built in rope_utils (Llama3RoPE, YarnRoPE, SuScaledRoPE, ProportionalRoPE) therefore survive build_text_model as lazy arrays tagged to the load thread's stream; the first generation evaluated on a different worker thread dies with 'There is no Stream(gpu, N) in current thread'.

Gemma 4 hit this on every MLLM text-route request once dispatch landed (waybarrios#595): layers 0-4 are sliding_attention (plain RoPE, no _freqs) and layer 5 is the first full_attention layer with scaled RoPE.

Realize every module-held array (including private attributes) at the end of build_text_model so nothing stream-tagged escapes the build thread.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ursk mentioned this pull request Jun 12, 2026

fix(engine): stop MLLM text route at the model's full config EOS set #610

Open

ursk wants to merge 1 commit into waybarrios:main from ursk:fix/textmodel-realize-stream-tagged-arrays
