fix(qwen): accept current attention kwargs in batch patch#616
Conversation
0d2202e to
595dc47
Compare
janhilgard
left a comment
There was a problem hiding this comment.
Approving. This is a real production-incident fix for any stack that has already moved to mlx-vlm 0.6.2 — confirmed independently against my install.
Verified directly:
-
mlx_vlm.models.qwen3_5.language.Qwen3_5Attention.__call__on a freshmlx-vlm 0.6.2install reports the signature(self, x, mask=None, cache=None, position_ids=None, position_embeddings=(cos, sin) | None = None, target_verify: bool = False) -> mx.arraywhich is exactly what the patched
_patched_callnow accepts. Pre-PR, my localvllm-mlx-upstream/vllm_mlx/patches/qwen3_5_mllm.pystill has the old 4-arg signature, so the next restart of a Qwen 3.6 server (mine: 27B Heretic-v2 + native MTP, 35B-A3B) would land the exactTypeError: ... unexpected keyword argument 'position_embeddings'your description quotes. Silent HTTP 200 +finish_reason=error+ zero token accounting is the worst flavor of regression because dashboards stay green; this PR closes that gap. -
Both
_target_verify_linearsand_target_verify_left_padded_attentionexist onmlx_vlm.models.qwen3_5.languagein 0.6.2, so thegetattr(..., _default_...)resolver will pick up the live helpers rather than the fallbacks. The fallback strategy (running q/k/v projections in the default linear fashion, returningNonefrom the target-verify hook so the SDPA path takes over) is the right shape — old behavior preserved when those internals are absent. -
The
mask == "left_padded_decode"sentinel handling is guarded byisinstance(mask, str), so themx.array↔str__eq__ambiguity can't trip it. Settingmask = Noneafter detecting the sentinel and routing through_maybe_target_verify_attentionmatches what mlx-vlm now does internally. -
_apply_rotarykeeps three branches in priority order: (1)position_embeddingstuple wins when provided, (2)attention.rotary_emb.apply_rotarywhen available, (3) legacyrotary_emb(values, position_ids)+apply_multimodal_rotary_pos_emb. That covers the matrix of mlx-vlm 0.5.x / 0.6.x rotary surfaces.
Code-review notes:
-
_kv_seq_lencollapses the previous two-branch logic into one. Concretely the old branch was:position_ids is None:kv_seq_len = keys.shape[-2] + offset + 1regardless of cacheposition_idsgiven:kv_seq_len += offset + 1 if cache is not None else 0
The new helper does
length + offset + 1 if cache is not None else length. Functionally equivalent on every production path I know (cacheis always present for both prefill and decode in our stack), but the cache-is-Nonecorner case drops the+ offset + 1term where the old code would still add it. Worth a one-line comment explaining the new shape so the next reader doesn't try to "reintroduce" the old branch. -
The
getattr(qwen35_language, "_target_verify_linears", _default_target_verify_linears)pattern leans on private (_-prefixed) attribute names ofmlx-vlm. They're stable today and the fallbacks are safe, but it's the kind of thing that's worth flagging in a comment so the next mlx-vlm refactor doesn't silently regress to the fallback paths (which would re-disable target-verify performance without breaking anything). -
The PR bundles the bug fix with a fairly aggressive refactor into seven helpers (
_normalize_position_inputs,_position_ids_for_offset,_kv_seq_len,_apply_rotary,_slice_attention_mask,_maybe_target_verify_attention,_default_target_verify_*). Each helper is small and named well, and the test still exercises the same observable behavior. I'd normally lobby for separating "make it work with new ABI" from "refactor to helpers" so blame/bisect stays sharp, but the resulting reader experience is genuinely better than the previous one-big-function, and the helpers carve out exactly the seams future mlx-vlm ABI churn will need. Calling it acceptable. -
The new regression test
test_qwen35_patch_accepts_current_mlx_vlm_attention_kwargsvalidates kwargs acceptance but not target-verify semantics (the fake module doesn't expose_target_verify_left_padded_attention, so the fallback returnsNoneand the SDPA path runs). That's a fair scope — the alternative is wiring up a faketarget_verify_left_padded_attentionand asserting it gets called, which couples the test to mlx-vlm internals. Worth a follow-up test iftarget_verify=Trueever needs to be guaranteed at the patch level rather than at mlx-vlm's level. -
CI: lint ✓, type-check ✓, test-apple-silicon × 2 ✓, test-matrix × 4 ✓, tests ✓. All 9 green on
595dc47.
LGTM, please merge — this one belongs in the next release because it silently catches anyone on mlx-vlm>=0.6.0 running Qwen 3.5/3.6 artifacts on the CB+prefix path.
Summary
position_embeddingsandtarget_verifykwargsLocal repro / evidence
Before the local Runtime fix, Qwen 35B CB+prefix owner serving hit:
The client saw HTTP 200 with
finish_reason=error,content=null, and zero token accounting.Post-fix local evidence paths:
/opt/ai-runtime/run/non-live-jobs-ab/20260613T114937Z-qwen36_35b_cb_prefix/qwen35-cb-prefix-nothink-canary.json/opt/ai-runtime/run/non-live-jobs-ab/20260613T114937Z-qwen36_35b_cb_prefix/qwen35-cb-prefix-thinking-default-canary.json/opt/ai-runtime/run/non-live-jobs-ab/20260613T115016Z-qwen36_27b_cb_prefix/qwen27-cb-prefix-nothink-canary.json/opt/ai-runtime/run/non-live-jobs-ab/20260613T115016Z-qwen36_27b_cb_prefix/qwen27-cb-prefix-thinking-default-canary.jsonThose returned visible content,
finish_reason=stop, nonzero token accounting, androuting=textmodel_cb.Upstream code path compared
vllm_mlx/patches/qwen3_5_mllm.pytests/test_qwen35_mllm_patch.pyQwen3_5Attention.__call__is patched in vllm-mlx for BatchKVCache. The wrapper needed to match the current mlx-vlm call surface.Validation
AI_RUNTIME_OFFICIAL_LOAD=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_qwen35_mllm_patch.py -q->4 passedAI_RUNTIME_OFFICIAL_LOAD=1 /opt/ai-runtime/venv-live/bin/python -m py_compile vllm_mlx/patches/qwen3_5_mllm.pygit diff --check/opt/ai-runtime/run/upstream-local-repro-preflight/20260613T120702Z-upstream-local-repro-preflight.jsonNot claimed
This PR does not make a model-quality claim, does not change sampling, and does not change default/resident routing. It only keeps the Qwen BatchKVCache attention patch compatible with the current mlx-vlm attention kwargs.