Skip to content

fix(engine): stop MLLM text route at the model's full config EOS set#610

Open
ursk wants to merge 5 commits into
waybarrios:mainfrom
ursk:fix/mllm-text-route-eos
Open

fix(engine): stop MLLM text route at the model's full config EOS set#610
ursk wants to merge 5 commits into
waybarrios:mainfrom
ursk:fix/mllm-text-route-eos

Conversation

@ursk

@ursk ursk commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

On the SimpleEngine MLLM text route (text-only -> mlx_lm TextModel), generation never stops at the model's chat end-of-turn token. The route passes the raw HF processor tokenizer to mlx_lm.stream_generate, which wraps it with eos_token_ids={tokenizer.eos_token_id} — dropping turn terminators that chat models declare only in the config EOS list.

Gemma 4 is the clearest case: config.json declares eos_token_id: [1, 106, 50], where <turn|>=106 terminates chat turns and <|tool_response>=50 terminates tool responses, while tokenizer.eos_token stays <eos>=1. The model emits 106, the sampler doesn't stop, the marker detokenizes into visible text, and generation free-runs until max_tokens — every text-route response ends with finish_reason: "length" and trailing garbage:

<|channel>thought
<channel|>Yes, there is a file named `asitop_powermetrics…` in /tmp.<turn|>
C'est parti ! On va s'occuper de ton projet de site web. …   ← free-running

The hard-coded Qwen3.5 <|im_end|> patch already in that block is this same bug, previously fixed per-model.

Two adjacent defects in the same path, included here because they surface in the same repro:

  • usage.prompt_tokens is 0 on the text route unless the system-KV-cache branch (requires a system message and ChatML <|im_start|> markers — never true for Gemma templates) or specprefill happens to run.
  • Reasoning markers leak into content when thinking is disabled. The allow_reasoning gate (fix(reasoning): respect enable_thinking=false from chat_template_kwargs #537) skips the reasoning parser entirely under enable_thinking=false, but Gemma 4 opens <|channel>thought regardless, so raw channel markers end up in message.content.

Fix

  1. collect_eos_token_ids() (new, utils/tokenizer.py): unions tokenizer.eos_token_id / .eos_token_ids with the config.json + generation_config.json EOS lists. The text route now wraps its tokenizer in mlx_lm.TokenizerWrapper with that set. MLLMScheduler._get_stop_tokens (the batched path) is refactored onto the same helper.

    ⚠️ Behavioral change on the batched path: the old _get_stop_tokens read only generation_config.json; the shared helper also reads config.json. Models that list extra ids (e.g. pad/bos) in the config EOS set will now stop on tokens the batched path previously ignored — that union is the point of the fix. The Gemma 4 set {1, 106, 50} is pinned through MLLMScheduler._get_stop_tokens in tests/test_collect_eos_token_ids.py.

  2. Up-front prompt tokenization in _stream_generate_text so usage.prompt_tokens is always reported; the system-KV and specprefill branches reuse the tokens instead of re-encoding.

  3. Refined allow_reasoning gate — non-streaming and streaming:

    • Non-streaming: when thinking is disabled, run extract_reasoning iff the parser's explicit start/end markers are present in the output. With explicit markers, the implicit-thinking-swallows-content hazard fix(reasoning): respect enable_thinking=false from chat_template_kwargs #537 guards against cannot occur; without them, behavior is unchanged.
    • Streaming (chat completions, Anthropic messages, Responses): with thinking disabled the parser stays off until the model emits the parser's explicit marker in the accumulated raw stream; from that point deltas are routed through the parser. Parsed reasoning is suppressed (the request disabled thinking) and only cleaned content is emitted — this also keeps the Anthropic block choreography valid when the text block is already open. Implicit-thinking parsers without explicit marker attributes never latch, so fix(reasoning): respect enable_thinking=false from chat_template_kwargs #537 behavior is unchanged.

Verification

Live against gemma-4-31b-it 4-bit MLX on SimpleEngine (M3 Ultra):

  • Canonical 3-turn tool flow (user → assistant(tool_call) → tool(long result)): before — correct sentence, then <turn|> leak, then drift to the 256-token cap, finish_reason: "length", prompt_tokens: 0; after — one-sentence answer, finish_reason: "stop", 25 completion tokens, real prompt_tokens.
  • Tool-call emission leg: finish_reason: "tool_calls" with well-formed arguments, usage populated.
  • 12 hermetic unit tests for collect_eos_token_ids incl. the batched-path Gemma 4 pin (tests/test_collect_eos_token_ids.py).
  • Marker-guard tests for the reasoning leak in both paths (tests/test_chat_template_kwargs.py): non-streaming extractor with/without markers, plus chat-completions and Anthropic streaming with Gemma 4 channel markers under enable_thinking=false. The streaming tests fail without the streaming commit.
  • Rebased onto current main (post-Fix hybrid cache snapshot aliasing #576); full suite: 2195 passed, 11 skipped.

@waybarrios

Copy link
Copy Markdown
Owner

The fixes are right and the deep review came back clean, but #576 just landed on main and touches the same _stream_generate_text region, so this is now conflicting and needs a rebase. Two things worth addressing while you're in there.

The marker-detection guard for reasoning leakage only fires in _extract_reasoning_and_tool_calls, which is the non-streaming path. The streaming path gates on _thinking_disabled directly and never sees the new guard, so Gemma 4 streaming with thinking disabled still leaks markers:

# server.py, streaming path: never calls _extract_reasoning_and_tool_calls
if not _thinking_disabled(request, chat_kwargs):
    ...

Not a regression (streaming was already broken), but the PR description claims the leak is fixed and right now that's only true for non-streaming. Either cover streaming too or scope the claim. There are also no tests for the reasoning-marker fix in either path.

Second, the MLLMScheduler refactor changes behavior slightly: the old _get_stop_tokens read only generation_config.json, while collect_eos_token_ids also reads config.json. That's the point of the fix, but models that list pad/bos ids in the config EOS set will now stop on tokens they previously ignored on the batched path. Worth a line in the description and ideally a test pinning the Gemma 4 token set.

Verify and others added 5 commits June 12, 2026 05:58
…LM text route

The SimpleEngine MLLM text route passed the raw HF processor tokenizer to
mlx_lm.stream_generate, which defaults its stop set to
{tokenizer.eos_token_id} — dropping turn terminators that chat models
declare only in the config EOS list. Gemma 4 ends chat turns with
<turn|>=106 (and tool responses with <|tool_response>=50) while
tokenizer.eos_token stays <eos>=1, so every text-route generation sailed
through end-of-turn, leaked the marker as visible text, and free-ran to
max_tokens (finish_reason=length on a one-sentence answer). The existing
hard-coded Qwen3.5 <|im_end|> patch in the same block was this same bug
fixed per-model.

- add collect_eos_token_ids(): unions tokenizer eos_token_id /
  eos_token_ids with config.json + generation_config.json eos lists
- wrap the text-route tokenizer in mlx_lm TokenizerWrapper with that set
- refactor MLLMScheduler._get_stop_tokens (batched path, which already
  did this correctly) onto the shared helper
- tokenize the full prompt up front so usage.prompt_tokens is reported
  on every request; previously it was 0 unless the system-KV-cache
  branch (system message + ChatML markers) or specprefill happened to run

Verified live against gemma-4-31b-it (4-bit MLX, SimpleEngine): the
canonical tool-result follow-up now answers in one sentence with
finish_reason=stop and correct usage; both tests in
test_gemma4_batched_tool_loop.py pass against the served model.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…isabled

The allow_reasoning gate (PR waybarrios#537) skips the reasoning parser entirely
when enable_thinking=false, to keep implicit-thinking parsers from
swallowing plain content into a reasoning block. But models can open an
explicit reasoning block regardless of the template kwarg — Gemma 4 emits
<|channel>thought even when thinking is disabled — and with the parser
skipped the raw channel markers leak verbatim into message.content.

Refine the gate: when thinking is disabled, run extract_reasoning iff the
parser's explicit start/end markers are present in the output. With
explicit markers the implicit-swallowing hazard cannot occur; without
them behavior is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…okenizer

The text route now wraps the processor tokenizer in TokenizerWrapper, so
the routing test asserts delegation rather than identity, and the
chat-template-kwargs test's mock gains the bos_token/encode attrs the
up-front usage tokenization reads on every request.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aths

The non-streaming guard (previous commit) parses reasoning markers even
when thinking is disabled, but all three streaming paths (chat
completions, Anthropic messages, Responses) gated on _thinking_disabled
directly and never saw it — Gemma 4 streaming with thinking disabled
still leaked raw <|channel> markers into content.

Add the streaming counterpart: with thinking disabled the reasoning
parser stays off until the model emits the parser's explicit start/end
marker in the accumulated raw stream; from that point deltas are routed
through the parser. Parsed reasoning is suppressed (the request disabled
thinking) and only cleaned content is emitted, which also keeps the
Anthropic block choreography valid when a text block is already open.
Implicit-thinking parsers without explicit markers never latch, so the
PR waybarrios#537 swallowing hazard is unchanged.

Tests cover the marker guard in both the non-streaming extractor and the
chat-completions + Anthropic streaming paths; the streaming tests fail
without this commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The MLLMScheduler refactor onto collect_eos_token_ids changes behavior:
the old _get_stop_tokens read only generation_config.json, the helper
also reads config.json. Pin the Gemma 4 token set so the union (and the
processor/tokenizer shapes) can't regress silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ursk ursk force-pushed the fix/mllm-text-route-eos branch from 13edef3 to 2b931cf Compare June 12, 2026 13:13
@ursk

ursk commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

All three points addressed:

  • Rebased onto main (post-Fix hybrid cache snapshot aliasing #576). The conflict was in the text-route tokenizer setup and the _stream_generate_text preamble; resolution keeps Fix hybrid cache snapshot aliasing #576's _probe_system_kv_cache_support call and cache_blocking_controls gate alongside this PR's TokenizerWrapper wrap and up-front tokenization. Full suite after rebase: 2195 passed, 11 skipped.
  • Streaming is now covered (a2d9a41): all three streaming paths (chat completions, Anthropic messages, Responses) get the streaming counterpart of the marker guard — with thinking disabled the parser stays off until the parser's explicit marker appears in the accumulated raw stream, then deltas are routed through it. Parsed reasoning is suppressed (the request disabled thinking) and only cleaned content is emitted, which also keeps the Anthropic block sequence valid when the text block is already open. Implicit parsers without explicit marker attributes never latch, so the fix(reasoning): respect enable_thinking=false from chat_template_kwargs #537 hazard is untouched. Tests cover the guard in the non-streaming extractor and both the chat-completions and Anthropic streaming paths with Gemma 4 channel markers; the streaming tests fail without the fix commit.
  • Batched-path EOS change (2b931cf): the Gemma 4 set {1, 106, 50} is now pinned through MLLMScheduler._get_stop_tokens (both processor shapes), and the PR description calls out the config.json union as a behavioral change for models that list extra ids in the config EOS set.

@ursk

ursk commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Live E2E verification of the rebased branch is done — with one important caveat that turned into #613 / #614.

Running the rebased branch against gemma-4-26b-a4b (and 31B dense) initially failed every text-route request with There is no Stream(gpu, 2) in current thread. That is not this PR: pure main (a48c86c) fails identically, and git bisect lands on #595 — which merely enabled the TextModel route for Gemma 4, exposing a latent thread-affinity fault (lazy scaled-RoPE _freqs, excluded from parameters(), tagged to the load thread's stream; full diagnosis in #613). The targeted fix is #614.

With #614 applied on top of this branch, the full live gate passes on gemma-4-26b-a4b-it 4-bit (SimpleEngine, M3 Ultra), exercising exactly this PR's claims:

  • startup: Text route stop tokens: [1, 50, 106] — the pinned Gemma 4 config EOS set
  • tool-call leg: stop_reason: "tool_use", well-formed input, usage populated
  • canonical 3-turn tool-result follow-up (1.6k-token tool result): stop_reason: "end_turn", input_tokens: 1594 (was 0), 60 completion tokens, no <turn|> / <|channel> leakage, no free-running
  • streaming with thinking disabled (server default enable_thinking: false): clean text_deltas, no raw channel markers, no thinking blocks

So #614 is a soft prerequisite for observing this PR's text-route fix live on current main, though the two are independent changes. Review order is your call; they don't conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants