fix(engine): stop MLLM text route at the model's full config EOS set #610
ursk wants to merge 5 commits into
The fixes are right and the deep review came back clean, but #576 just landed on main and touches the same

The marker-detection guard for reasoning leakage only fires in

    # server.py, streaming path: never calls _extract_reasoning_and_tool_calls
    if not _thinking_disabled(request, chat_kwargs):
        ...

Not a regression (streaming was already broken), but the PR description claims the leak is fixed, and right now that is only true for non-streaming. Either cover streaming too or scope the claim. There are also no tests for the reasoning-marker fix in either path.

Second, the MLLMScheduler refactor changes behavior slightly: the old
…LM text route
The SimpleEngine MLLM text route passed the raw HF processor tokenizer to
mlx_lm.stream_generate, which defaults its stop set to
{tokenizer.eos_token_id} — dropping turn terminators that chat models
declare only in the config EOS list. Gemma 4 ends chat turns with
<turn|>=106 (and tool responses with <|tool_response>=50) while
tokenizer.eos_token stays <eos>=1, so every text-route generation sailed
through end-of-turn, leaked the marker as visible text, and free-ran to
max_tokens (finish_reason=length on a one-sentence answer). The existing
hard-coded Qwen3.5 <|im_end|> patch in the same block was this same bug
fixed per-model.
- add collect_eos_token_ids(): unions tokenizer eos_token_id /
eos_token_ids with config.json + generation_config.json eos lists
- wrap the text-route tokenizer in mlx_lm TokenizerWrapper with that set
- refactor MLLMScheduler._get_stop_tokens (batched path, which already
did this correctly) onto the shared helper
- tokenize the full prompt up front so usage.prompt_tokens is reported
on every request; previously it was 0 unless the system-KV-cache
branch (system message + ChatML markers) or specprefill happened to run
Verified live against gemma-4-31b-it (4-bit MLX, SimpleEngine): the
canonical tool-result follow-up now answers in one sentence with
finish_reason=stop and correct usage; both tests in
test_gemma4_batched_tool_loop.py pass against the served model.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
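For readers who want the shape of the EOS union without opening the diff, here is a minimal sketch. It is not the PR's collect_eos_token_ids(): the name collect_eos_ids_sketch, the FakeTokenizer class, and the choice to pass the two configs as already-parsed dicts are all assumptions made for illustration.

```python
from typing import Any, Mapping, Optional, Set


def _as_id_set(value: Any) -> Set[int]:
    """Normalize an EOS field that may be None, a single int, or a list of ints."""
    if value is None:
        return set()
    if isinstance(value, int):
        return {value}
    return {int(v) for v in value}


def collect_eos_ids_sketch(
    tokenizer: Any,
    config: Optional[Mapping[str, Any]] = None,
    generation_config: Optional[Mapping[str, Any]] = None,
) -> Set[int]:
    """Union every EOS declaration found on the tokenizer and in both configs."""
    ids: Set[int] = set()
    ids |= _as_id_set(getattr(tokenizer, "eos_token_id", None))
    ids |= _as_id_set(getattr(tokenizer, "eos_token_ids", None))
    for cfg in (config, generation_config):
        if cfg:
            ids |= _as_id_set(cfg.get("eos_token_id"))
    return ids


class FakeTokenizer:
    """Hypothetical stand-in: only <eos>=1 is visible on the tokenizer itself."""
    eos_token_id = 1


# Gemma 4 style config from the commit message: <eos>=1, <turn|>=106, <|tool_response>=50.
stop_ids = collect_eos_ids_sketch(FakeTokenizer(), config={"eos_token_id": [1, 106, 50]})
print(sorted(stop_ids))  # [1, 50, 106]
```

With only the tokenizer as input the result is {1}, which is the stop set the text route was effectively using before this commit.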
…isabled
The allow_reasoning gate (PR waybarrios#537) skips the reasoning parser entirely when enable_thinking=false, to keep implicit-thinking parsers from swallowing plain content into a reasoning block. But models can open an explicit reasoning block regardless of the template kwarg (Gemma 4 emits <|channel>thought even when thinking is disabled), and with the parser skipped the raw channel markers leak verbatim into message.content.
Refine the gate: when thinking is disabled, run extract_reasoning iff the parser's explicit start/end markers are present in the output. With explicit markers the implicit-swallowing hazard cannot occur; without them behavior is unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
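The refined gate in this commit can be sketched roughly as follows. ToyReasoningParser, split_output, and the [[think]] / [[/think]] markers are placeholders invented for the example; they are not the project's parser classes or Gemma 4's real channel markers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyReasoningParser:
    """Hypothetical parser with explicit start/end markers."""
    start_marker: str
    end_marker: str

    def extract_reasoning(self, text: str) -> Tuple[Optional[str], str]:
        """Split text into (reasoning, content) around the first marker pair."""
        start = text.find(self.start_marker)
        if start == -1:
            return None, text
        body_start = start + len(self.start_marker)
        end = text.find(self.end_marker, body_start)
        if end == -1:
            # Unterminated block: everything after the start marker is reasoning.
            return text[body_start:], text[:start]
        content = text[:start] + text[end + len(self.end_marker):]
        return text[body_start:end], content


def split_output(
    parser: ToyReasoningParser, text: str, thinking_disabled: bool
) -> Tuple[Optional[str], str]:
    """Run the parser unless thinking is disabled AND no explicit marker is present."""
    has_markers = parser.start_marker in text or parser.end_marker in text
    if thinking_disabled and not has_markers:
        return None, text  # gate stays closed: plain content passes through untouched
    return parser.extract_reasoning(text)


parser = ToyReasoningParser("[[think]]", "[[/think]]")
print(split_output(parser, "plain answer", thinking_disabled=True))
print(split_output(parser, "[[think]]hmm[[/think]]Answer", thinking_disabled=True))
```

The first call leaves the text alone because no marker is present; the second strips the block even though thinking is disabled, which is the case that used to leak.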
…okenizer
The text route now wraps the processor tokenizer in TokenizerWrapper, so the routing test asserts delegation rather than identity, and the chat-template-kwargs test's mock gains the bos_token/encode attrs the up-front usage tokenization reads on every request.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aths
The non-streaming guard (previous commit) parses reasoning markers even when thinking is disabled, but all three streaming paths (chat completions, Anthropic messages, Responses) gated on _thinking_disabled directly and never saw it, so Gemma 4 streaming with thinking disabled still leaked raw <|channel> markers into content.
Add the streaming counterpart: with thinking disabled the reasoning parser stays off until the model emits the parser's explicit start/end marker in the accumulated raw stream; from that point deltas are routed through the parser. Parsed reasoning is suppressed (the request disabled thinking) and only cleaned content is emitted, which also keeps the Anthropic block choreography valid when a text block is already open. Implicit-thinking parsers without explicit markers never latch, so the PR waybarrios#537 swallowing hazard is unchanged.
Tests cover the marker guard in both the non-streaming extractor and the chat-completions + Anthropic streaming paths; the streaming tests fail without this commit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
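A rough sketch of the latch described in this commit, under simplifying assumptions: the markers are placeholders rather than Gemma 4's, the latch fires on the start marker only, and the whole accumulated stream is re-scanned on every delta. The names clean_stream, _visible, and _held_back are hypothetical and do not come from server.py.

```python
from typing import Iterable, Iterator

START, END = "[[think]]", "[[/think]]"  # placeholder markers, not Gemma 4's


def _held_back(text: str) -> int:
    """Length of the longest tail of text that is a proper prefix of START."""
    for n in range(min(len(START) - 1, len(text)), 0, -1):
        if text.endswith(START[:n]):
            return n
    return 0


def _visible(raw: str) -> str:
    """Content with reasoning blocks removed; an unterminated block hides the rest."""
    out, pos = [], 0
    while True:
        start = raw.find(START, pos)
        if start == -1:
            out.append(raw[pos:])
            return "".join(out)
        out.append(raw[pos:start])
        end = raw.find(END, start + len(START))
        if end == -1:
            return "".join(out)
        pos = end + len(END)


def clean_stream(deltas: Iterable[str]) -> Iterator[str]:
    """Pass deltas through until START shows up in the raw stream, then filter."""
    raw, emitted, latched = "", 0, False
    for delta in deltas:
        raw += delta
        latched = latched or START in raw
        visible = _visible(raw) if latched else raw
        # Hold back a tail that might be the beginning of a marker split across deltas.
        safe = len(visible) - _held_back(visible)
        if safe > emitted:
            yield visible[emitted:safe]
            emitted = safe
    tail = _visible(raw) if latched else raw
    if len(tail) > emitted:
        yield tail[emitted:]  # flush anything held back at end of stream


chunks = ["Hel", "lo [[th", "ink]]secret[[/thi", "nk]]world"]
print("".join(clean_stream(chunks)))  # Hello world
```

A stream that never contains the start marker never latches, so its deltas come out unchanged, which mirrors the "implicit parsers never latch" point in the commit message.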
The MLLMScheduler refactor onto collect_eos_token_ids changes behavior: the old _get_stop_tokens read only generation_config.json, the helper also reads config.json. Pin the Gemma 4 token set so the union (and the processor/tokenizer shapes) can't regress silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
13edef3 to 2b931cf
All three points addressed:
Live E2E verification of the rebased branch is done, with one important caveat that turned into #613 / #614.

Running the rebased branch against gemma-4-26b-a4b (and 31B dense) initially failed every text-route request with

With #614 applied on top of this branch, the full live gate passes on gemma-4-26b-a4b-it 4-bit (SimpleEngine, M3 Ultra), exercising exactly this PR's claims.

So #614 is a soft prerequisite for observing this PR's text-route fix live on current main, though the two are independent changes. Review order is your call; they don't conflict.
Problem
On the SimpleEngine MLLM text route (text-only -> mlx_lm TextModel), generation never stops at the model's chat end-of-turn token. The route passes the raw HF processor tokenizer to mlx_lm.stream_generate, which wraps it with eos_token_ids={tokenizer.eos_token_id}, dropping turn terminators that chat models declare only in the config EOS list.

Gemma 4 is the clearest case: config.json declares eos_token_id: [1, 106, 50], where <turn|>=106 terminates chat turns and <|tool_response>=50 terminates tool responses, while tokenizer.eos_token stays <eos>=1. The model emits 106, the sampler doesn't stop, the marker detokenizes into visible text, and generation free-runs until max_tokens. Every text-route response ends with finish_reason: "length" and trailing garbage.

The hard-coded Qwen3.5 <|im_end|> patch already in that block is this same bug, previously fixed per-model.

Two adjacent defects in the same path, included here because they surface in the same repro:
- usage.prompt_tokens is 0 on the text route unless the system-KV-cache branch (requires a system message and ChatML <|im_start|> markers, which is never true for Gemma templates) or specprefill happens to run.
- Reasoning markers leak into content when thinking is disabled. The allow_reasoning gate (fix(reasoning): respect enable_thinking=false from chat_template_kwargs #537) skips the reasoning parser entirely under enable_thinking=false, but Gemma 4 opens <|channel>thought regardless, so raw channel markers end up in message.content.

Fix
- collect_eos_token_ids() (new, utils/tokenizer.py): unions tokenizer.eos_token_id / .eos_token_ids with the config.json + generation_config.json EOS lists. The text route now wraps its tokenizer in mlx_lm.TokenizerWrapper with that set. MLLMScheduler._get_stop_tokens (the batched path) is refactored onto the same helper.
  The old _get_stop_tokens read only generation_config.json; the shared helper also reads config.json. Models that list extra ids (e.g. pad/bos) in the config EOS set will now stop on tokens the batched path previously ignored; that union is the point of the fix. The Gemma 4 set {1, 106, 50} is pinned through MLLMScheduler._get_stop_tokens in tests/test_collect_eos_token_ids.py.
- Up-front prompt tokenization in _stream_generate_text so usage.prompt_tokens is always reported; the system-KV and specprefill branches reuse the tokens instead of re-encoding.
- Refined allow_reasoning gate, non-streaming and streaming: when thinking is disabled, run extract_reasoning iff the parser's explicit start/end markers are present in the output. With explicit markers, the implicit-thinking-swallows-content hazard that #537 guards against cannot occur; without them, behavior is unchanged.

Verification
Live against gemma-4-31b-it 4-bit MLX on SimpleEngine (M3 Ultra):
- Canonical tool-result follow-up (user → assistant(tool_call) → tool(long result)). Before: correct sentence, then <turn|> leak, then drift to the 256-token cap, finish_reason: "length", prompt_tokens: 0. After: one-sentence answer, finish_reason: "stop", 25 completion tokens, real prompt_tokens.
- finish_reason: "tool_calls" with well-formed arguments, usage populated.
- collect_eos_token_ids incl. the batched-path Gemma 4 pin (tests/test_collect_eos_token_ids.py).
- tests/test_chat_template_kwargs.py: non-streaming extractor with/without markers, plus chat-completions and Anthropic streaming with Gemma 4 channel markers under enable_thinking=false. The streaming tests fail without the streaming commit.
- main (post-#576, "Fix hybrid cache snapshot aliasing"); full suite: 2195 passed, 11 skipped.
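The before/after contrast in the first verification item can be mimicked with a toy decode loop. This is only an illustration of how the stop set drives finish_reason, not mlx_lm's generation code; toy_generate, model_tokens, and the token ids other than 1, 50, and 106 are made up for the example.

```python
import itertools
from typing import Iterable, Iterator, List, Set, Tuple


def model_tokens() -> Iterator[int]:
    """Hypothetical model output: a short answer, <turn|>=106, then endless drift."""
    return itertools.chain([10, 11, 12, 106], itertools.repeat(99))


def toy_generate(
    token_stream: Iterable[int], stop_ids: Set[int], max_tokens: int
) -> Tuple[List[int], str]:
    """Consume tokens until a stop id appears or max_tokens tokens were produced.

    Assumes the stream does not end on its own before max_tokens.
    """
    out: List[int] = []
    for tok in itertools.islice(token_stream, max_tokens):
        if tok in stop_ids:
            return out, "stop"
        out.append(tok)
    return out, "length"


# Before: only tokenizer.eos_token_id is in the stop set, so 106 leaks and output drifts.
print(toy_generate(model_tokens(), {1}, max_tokens=8))
# After: the full config EOS set stops the turn at 106.
print(toy_generate(model_tokens(), {1, 106, 50}, max_tokens=8))
```

In the first call the end-of-turn id 106 ends up inside the output and the loop runs to the cap, which corresponds to the leaked marker and finish_reason: "length" reported above.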