fix(text-model-from-vlm): realize private lazy arrays before the model leaves the build thread #614

Open
ursk wants to merge 1 commit into waybarrios:main from ursk:fix/textmodel-realize-stream-tagged-arrays

Conversation

@ursk ursk commented Jun 12, 2026

Contributor

Fixes #613.

Problem

Every Gemma 4 MLLM text-route generation on main fails with RuntimeError: There is no Stream(gpu, N) in current thread (full diagnosis, bisect, and minimal MLX repro in #613). Short version:

  • MLX 0.31 streams are thread-bound: lazy graphs recorded under a stream on one thread cannot be evaluated from another.
  • rope_utils' scaled-RoPE classes compute self._freqs lazily in __init__, and nn.Module.parameters() excludes underscore-prefixed attributes — so _freqs survives build_text_model as a lazy graph tagged to the load thread's stream.
  • Generation runs via asyncio.to_thread on arbitrary pool threads; the first forward through a full_attention layer (layer 5 on Gemma 4 — layers 0–4 are sliding) evaluates _freqs cross-thread and dies.
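
The failure mode in the three points above can be illustrated without MLX. The sketch below is only an analogy: it uses `threading.local` as a stand-in for thread-bound stream state (this is not how MLX implements streams), and `bind_stream` and `evaluate` are hypothetical names. It shows that state bound on the loading thread is not visible to the pool thread that `asyncio.to_thread` picks.

```python
import asyncio
import threading

# Stand-in for per-thread stream state. An analogy only, not MLX's implementation.
_state = threading.local()


def bind_stream(name):
    """Bind a (fake) stream to the calling thread."""
    _state.stream = name


def evaluate():
    """Succeed only on a thread that has a stream bound."""
    stream = getattr(_state, "stream", None)
    if stream is None:
        thread = threading.current_thread().name
        raise RuntimeError(f"no stream bound in thread {thread}")
    return stream


async def main():
    bind_stream("gpu:0")       # bound on the event-loop thread (the "load" thread)
    same_thread = evaluate()   # works: same thread that bound the stream
    try:
        await asyncio.to_thread(evaluate)  # runs on a default-executor pool thread
        cross_thread = "ok"
    except RuntimeError as exc:
        cross_thread = f"failed: {exc}"
    return same_thread, cross_thread


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The cross-thread call fails because the pool thread never bound anything, which mirrors the shape of the `There is no Stream(gpu, N) in current thread` error.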

#595 exposed this by enabling the TextModel route for Gemma 4; the route's threading hazard predates it.

Fix

Realize every module-held array — including private attributes — at the end of build_text_model, so nothing stream-tagged escapes the build thread. One mx.eval over text_model.modules(), guarded for duck-typed test doubles.
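
A minimal sketch of the idea, not the code in this PR's diff: `iter_module_arrays` and `realize_module_arrays` are hypothetical names, and the sketch assumes arrays live either as dict items on the module (as with `mlx.nn.Module`, which subclasses `dict`) or as plain instance attributes. It only visits arrays held directly by a module, not ones nested inside lists or dicts.

```python
def iter_module_arrays(model, array_type):
    """Yield each array held directly by the model or a submodule.

    Underscore-prefixed attributes are included on purpose, since those
    are the ones parameters() skips.
    """
    # Duck-typed guard: test doubles may not implement modules().
    modules_fn = getattr(model, "modules", None)
    modules = modules_fn() if callable(modules_fn) else [model]

    seen = set()
    for module in modules:
        held = {}
        if isinstance(module, dict):
            held.update(module)                       # dict-backed storage
        held.update(getattr(module, "__dict__", {}))  # plain attributes
        for value in held.values():
            if isinstance(value, array_type) and id(value) not in seen:
                seen.add(id(value))
                yield value


def realize_module_arrays(model):
    """Evaluate every collected array on the calling (build) thread."""
    # Imported lazily so the collector above can be exercised without MLX.
    import mlx.core as mx

    arrays = list(iter_module_arrays(model, mx.array))
    if arrays:
        mx.eval(*arrays)  # one eval call over all collected arrays
    return len(arrays)
```

Calling `realize_module_arrays(text_model)` as the last step of the build would force the pending graphs while still on the thread that recorded them.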

Verification

  • Regression test (test_build_text_model_realizes_private_lazy_arrays): builds a fake gemma4 TextModel holding a lazy private array and asserts it is evaluable from a different thread. Red without the fix, green with it.
  • Live: gemma-4-26b-a4b-it 4-bit MLX on SimpleEngine (M3 Ultra), which previously returned HTTP 500 on every /v1/messages text request, now serves the full tool-call flow cleanly (tool_use emission, 3-turn tool-result follow-up with end_turn, streaming). Same for the 31B dense build.
  • Full suite: 2181 passed, 11 skipped.
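
The cross-thread assertion in the regression test could be built on a helper along these lines. This is a sketch under assumptions, not the test in the PR: `run_on_fresh_thread` is a hypothetical name. It runs a callable on a newly created thread and re-raises whatever the callable raised, so a stream error surfaces as an ordinary test failure.

```python
import threading


def run_on_fresh_thread(fn, timeout=30.0):
    """Run fn() on a new thread; return its result or re-raise its exception."""
    outcome = {}

    def target():
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # capture everything so the caller sees it
            outcome["error"] = exc

    worker = threading.Thread(target=target, name="cross-thread-check")
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        raise TimeoutError("cross-thread call did not finish in time")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

A test would pass a callable that evaluates the model's private lazy array; without the fix that call is expected to raise the `RuntimeError` quoted in the problem statement, and with the fix it should return normally.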

Note: this is the targeted unblock. The structural hazard (to_thread onto arbitrary pool threads + per-call stream rebinding) is discussed in #613 — a single pinned MLX worker thread would retire the whole class; we run that pattern in production downstream and can upstream it as a follow-up if there's interest.
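
For reference, a pinned-worker pattern can be sketched with the standard library alone. This is an illustration of the general idea, not the downstream implementation mentioned above; `PinnedWorker` is a hypothetical name. A `ThreadPoolExecutor` with `max_workers=1` keeps a single long-lived thread, so every call submitted through it runs on the same thread.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class PinnedWorker:
    """Route all blocking calls through one dedicated thread."""

    def __init__(self, name="mlx-worker"):
        # max_workers=1: the pool never creates a second thread.
        self._executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix=name)

    async def run(self, fn, *args):
        """Await fn(*args), executed on the pinned thread."""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    def shutdown(self):
        self._executor.shutdown(wait=True)


async def demo():
    worker = PinnedWorker()
    try:
        # Each call reports the identity of the thread it ran on.
        return [await worker.run(threading.get_ident) for _ in range(5)]
    finally:
        worker.shutdown()


if __name__ == "__main__":
    idents = asyncio.run(demo())
    print(len(set(idents)))  # number of distinct threads used
```

If model loading and generation both go through `worker.run(...)`, graphs are recorded and evaluated on the same thread, at the cost of serializing all work onto that thread.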

🤖 Generated with Claude Code

…l leaves the build thread

MLX lazy graphs are tagged to the stream of the thread that recorded
them, and nn.Module.parameters() excludes underscore-prefixed
attributes. The scaled-RoPE _freqs built in rope_utils (Llama3RoPE,
YarnRoPE, SuScaledRoPE, ProportionalRoPE) therefore survive
build_text_model as lazy arrays tagged to the load thread's stream;
the first generation evaluated on a different worker thread dies with
'There is no Stream(gpu, N) in current thread'. Gemma 4 hit this on
every MLLM text-route request once dispatch landed (waybarrios#595): layers 0-4
are sliding_attention (plain RoPE, no _freqs) and layer 5 is the first
full_attention layer with scaled RoPE.

Realize every module-held array (including private attributes) at the
end of build_text_model so nothing stream-tagged escapes the build
thread.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Gemma 4 text route broken on main: 'There is no Stream(gpu, N) in current thread' (lazy RoPE._freqs × thread-bound streams)
