fix(qwen): accept current attention kwargs in batch patch by Thump604 · Pull Request #616 · waybarrios/vllm-mlx

Thump604 · 2026-06-13T12:10:26Z

Summary

update the Qwen3.5/3.6 BatchKVCache attention patch to accept current mlx-vlm position_embeddings and target_verify kwargs
preserve the existing BatchKVCache offset handling and stale-position recovery behavior
add a regression test for the current mlx-vlm attention kwargs

Local repro / evidence

Before the local Runtime fix, Qwen 35B CB+prefix owner serving hit:

TypeError: patch_qwen35_attention_for_batching.<locals>._patched_call() got an unexpected keyword argument 'position_embeddings'

The client saw HTTP 200 with finish_reason=error, content=null, and zero token accounting.

Post-fix local evidence paths:

/opt/ai-runtime/run/non-live-jobs-ab/20260613T114937Z-qwen36_35b_cb_prefix/qwen35-cb-prefix-nothink-canary.json
/opt/ai-runtime/run/non-live-jobs-ab/20260613T114937Z-qwen36_35b_cb_prefix/qwen35-cb-prefix-thinking-default-canary.json
/opt/ai-runtime/run/non-live-jobs-ab/20260613T115016Z-qwen36_27b_cb_prefix/qwen27-cb-prefix-nothink-canary.json
/opt/ai-runtime/run/non-live-jobs-ab/20260613T115016Z-qwen36_27b_cb_prefix/qwen27-cb-prefix-thinking-default-canary.json

Those returned visible content, finish_reason=stop, nonzero token accounting, and routing=textmodel_cb.

Upstream code path compared

vllm_mlx/patches/qwen3_5_mllm.py
tests/test_qwen35_mllm_patch.py

Qwen3_5Attention.__call__ is patched in vllm-mlx for BatchKVCache. The wrapper needed to match the current mlx-vlm call surface.

Validation

AI_RUNTIME_OFFICIAL_LOAD=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_qwen35_mllm_patch.py -q -> 4 passed
AI_RUNTIME_OFFICIAL_LOAD=1 /opt/ai-runtime/venv-live/bin/python -m py_compile vllm_mlx/patches/qwen3_5_mllm.py
git diff --check
Runtime upstream-local-repro preflight artifact: /opt/ai-runtime/run/upstream-local-repro-preflight/20260613T120702Z-upstream-local-repro-preflight.json

Not claimed

This PR does not make a model-quality claim, does not change sampling, and does not change default/resident routing. It only keeps the Qwen BatchKVCache attention patch compatible with the current mlx-vlm attention kwargs.

janhilgard

Approving. This is a real production-incident fix for any stack that has already moved to mlx-vlm 0.6.2 — confirmed independently against my install.

Verified directly:

mlx_vlm.models.qwen3_5.language.Qwen3_5Attention.__call__ on a fresh mlx-vlm 0.6.2 install reports the signature
```
(self, x, mask=None, cache=None, position_ids=None,
 position_embeddings=(cos, sin) | None = None,
 target_verify: bool = False) -> mx.array
```
which is exactly what the patched _patched_call now accepts. Pre-PR, my local vllm-mlx-upstream/vllm_mlx/patches/qwen3_5_mllm.py still has the old 4-arg signature, so the next restart of a Qwen 3.6 server (mine: 27B Heretic-v2 + native MTP, 35B-A3B) would land the exact TypeError: ... unexpected keyword argument 'position_embeddings' your description quotes. Silent HTTP 200 + finish_reason=error + zero token accounting is the worst flavor of regression because dashboards stay green; this PR closes that gap.
Both _target_verify_linears and _target_verify_left_padded_attention exist on mlx_vlm.models.qwen3_5.language in 0.6.2, so the getattr(..., _default_...) resolver will pick up the live helpers rather than the fallbacks. The fallback strategy (running q/k/v projections in the default linear fashion, returning None from the target-verify hook so the SDPA path takes over) is the right shape — old behavior preserved when those internals are absent.
The mask == "left_padded_decode" sentinel handling is guarded by isinstance(mask, str), so the mx.array ↔ str __eq__ ambiguity can't trip it. Setting mask = None after detecting the sentinel and routing through _maybe_target_verify_attention matches what mlx-vlm now does internally.
_apply_rotary keeps three branches in priority order: (1) position_embeddings tuple wins when provided, (2) attention.rotary_emb.apply_rotary when available, (3) legacy rotary_emb(values, position_ids) + apply_multimodal_rotary_pos_emb. That covers the matrix of mlx-vlm 0.5.x / 0.6.x rotary surfaces.

Code-review notes:

_kv_seq_len collapses the previous two-branch logic into one. Concretely the old branch was:
- position_ids is None: kv_seq_len = keys.shape[-2] + offset + 1 regardless of cache
- position_ids given: kv_seq_len += offset + 1 if cache is not None else 0
The new helper does length + offset + 1 if cache is not None else length. Functionally equivalent on every production path I know (cache is always present for both prefill and decode in our stack), but the cache-is-None corner case drops the + offset + 1 term where the old code would still add it. Worth a one-line comment explaining the new shape so the next reader doesn't try to "reintroduce" the old branch.
The getattr(qwen35_language, "_target_verify_linears", _default_target_verify_linears) pattern leans on private (_-prefixed) attribute names of mlx-vlm. They're stable today and the fallbacks are safe, but it's the kind of thing that's worth flagging in a comment so the next mlx-vlm refactor doesn't silently regress to the fallback paths (which would re-disable target-verify performance without breaking anything).
The PR bundles the bug fix with a fairly aggressive refactor into seven helpers (_normalize_position_inputs, _position_ids_for_offset, _kv_seq_len, _apply_rotary, _slice_attention_mask, _maybe_target_verify_attention, _default_target_verify_*). Each helper is small and named well, and the test still exercises the same observable behavior. I'd normally lobby for separating "make it work with new ABI" from "refactor to helpers" so blame/bisect stays sharp, but the resulting reader experience is genuinely better than the previous one-big-function, and the helpers carve out exactly the seams future mlx-vlm ABI churn will need. Calling it acceptable.
The new regression test test_qwen35_patch_accepts_current_mlx_vlm_attention_kwargs validates kwargs acceptance but not target-verify semantics (the fake module doesn't expose _target_verify_left_padded_attention, so the fallback returns None and the SDPA path runs). That's a fair scope — the alternative is wiring up a fake target_verify_left_padded_attention and asserting it gets called, which couples the test to mlx-vlm internals. Worth a follow-up test if target_verify=True ever needs to be guaranteed at the patch level rather than at mlx-vlm's level.
CI: lint ✓, type-check ✓, test-apple-silicon × 2 ✓, test-matrix × 4 ✓, tests ✓. All 9 green on 595dc47.

LGTM, please merge — this one belongs in the next release because it silently catches anyone on mlx-vlm>=0.6.0 running Qwen 3.5/3.6 artifacts on the CB+prefix path.

Thump604 requested a review from janhilgard June 13, 2026 12:10

Thump604 assigned janhilgard Jun 13, 2026

fix(qwen): accept current attention kwargs in batch patch

595dc47

Thump604 force-pushed the 604/qwen-cb-prefix-compat branch from 0d2202e to 595dc47 Compare June 13, 2026 12:12

janhilgard approved these changes Jun 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(qwen): accept current attention kwargs in batch patch#616

fix(qwen): accept current attention kwargs in batch patch#616
Thump604 wants to merge 1 commit into
waybarrios:mainfrom
Thump604:604/qwen-cb-prefix-compat

Thump604 commented Jun 13, 2026

Uh oh!

janhilgard left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Thump604 commented Jun 13, 2026

Summary

Local repro / evidence

Upstream code path compared

Validation

Not claimed

Uh oh!

janhilgard left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants