fix(text-model-from-vlm): realize private lazy arrays before the model leaves the build thread#614
Open
ursk wants to merge 1 commit into
Open
Conversation
…l leaves the build thread MLX lazy graphs are tagged to the stream of the thread that recorded them, and nn.Module.parameters() excludes underscore-prefixed attributes. The scaled-RoPE _freqs built in rope_utils (Llama3RoPE, YarnRoPE, SuScaledRoPE, ProportionalRoPE) therefore survive build_text_model as lazy arrays tagged to the load thread's stream; the first generation evaluated on a different worker thread dies with 'There is no Stream(gpu, N) in current thread'. Gemma 4 hit this on every MLLM text-route request once dispatch landed (waybarrios#595): layers 0-4 are sliding_attention (plain RoPE, no _freqs) and layer 5 is the first full_attention layer with scaled RoPE. Realize every module-held array (including private attributes) at the end of build_text_model so nothing stream-tagged escapes the build thread. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #613.
Problem
Every Gemma 4 MLLM text-route generation on
mainfails withRuntimeError: There is no Stream(gpu, N) in current thread(full diagnosis, bisect, and minimal MLX repro in #613). Short version:rope_utils' scaled-RoPE classes computeself._freqslazily in__init__, andnn.Module.parameters()excludes underscore-prefixed attributes — so_freqssurvivesbuild_text_modelas a lazy graph tagged to the load thread's stream.asyncio.to_threadon arbitrary pool threads; the first forward through afull_attentionlayer (layer 5 on Gemma 4 — layers 0–4 are sliding) evaluates_freqscross-thread and dies.#595 exposed this by enabling the TextModel route for Gemma 4; the route's threading hazard predates it.
Fix
Realize every module-held array — including private attributes — at the end of
build_text_model, so nothing stream-tagged escapes the build thread. Onemx.evalovertext_model.modules(), guarded for duck-typed test doubles.Verification
test_build_text_model_realizes_private_lazy_arrays): builds a fake gemma4 TextModel holding a lazy private array and asserts it is evaluable from a different thread. Red without the fix, green with it./v1/messagestext request, now serves the full tool-call flow (tool_useemission, 3-turn tool-result follow-up withend_turn, streaming) cleanly. Same for the 31B dense build.Note: this is the targeted unblock. The structural hazard (
to_threadonto arbitrary pool threads + per-call stream rebinding) is discussed in #613 — a single pinned MLX worker thread would retire the whole class; we run that pattern in production downstream and can upstream it as a follow-up if there's interest.🤖 Generated with Claude Code