fix(qwen): accept current attention kwargs in batch patch #616

Open
Thump604 wants to merge 1 commit into waybarrios:main from Thump604:604/qwen-cb-prefix-compat

Conversation

@Thump604

Collaborator

Summary

  • update the Qwen3.5/3.6 BatchKVCache attention patch to accept current mlx-vlm position_embeddings and target_verify kwargs
  • preserve the existing BatchKVCache offset handling and stale-position recovery behavior
  • add a regression test for the current mlx-vlm attention kwargs

Local repro / evidence

Before the local Runtime fix, Qwen 35B CB+prefix owner serving hit:

TypeError: patch_qwen35_attention_for_batching.<locals>._patched_call() got an unexpected keyword argument 'position_embeddings'

The client saw HTTP 200 with finish_reason=error, content=null, and zero token accounting.

Post-fix local evidence paths:

  • /opt/ai-runtime/run/non-live-jobs-ab/20260613T114937Z-qwen36_35b_cb_prefix/qwen35-cb-prefix-nothink-canary.json
  • /opt/ai-runtime/run/non-live-jobs-ab/20260613T114937Z-qwen36_35b_cb_prefix/qwen35-cb-prefix-thinking-default-canary.json
  • /opt/ai-runtime/run/non-live-jobs-ab/20260613T115016Z-qwen36_27b_cb_prefix/qwen27-cb-prefix-nothink-canary.json
  • /opt/ai-runtime/run/non-live-jobs-ab/20260613T115016Z-qwen36_27b_cb_prefix/qwen27-cb-prefix-thinking-default-canary.json

Those returned visible content, finish_reason=stop, nonzero token accounting, and routing=textmodel_cb.
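The failure mode above (HTTP 200 with an error finish reason) is easy to miss if a canary only checks the status code. A minimal sketch of a stricter check follows, assuming an OpenAI-style chat completion body; the function name `is_healthy_completion` and the field layout (`choices`, `message.content`, `usage.completion_tokens`) are assumptions for illustration, not taken from this PR or from the canary scripts.

```python
from typing import Any


def is_healthy_completion(status_code: int, body: dict[str, Any]) -> bool:
    """Return True only when the response carries real generated content.

    Assumes an OpenAI-style chat completion body (hypothetical layout).
    Mirrors the three signals quoted above: finish_reason, content, tokens.
    """
    if status_code != 200:
        return False
    choices = body.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    content = (choice.get("message") or {}).get("content")
    completion_tokens = (body.get("usage") or {}).get("completion_tokens", 0)
    return (
        choice.get("finish_reason") == "stop"
        and bool(content)
        and completion_tokens > 0
    )


# Shape of the pre-fix response described above: 200, but nothing generated.
broken = {
    "choices": [{"finish_reason": "error", "message": {"content": None}}],
    "usage": {"completion_tokens": 0},
}
# Shape of the post-fix response: visible content and nonzero accounting.
healthy = {
    "choices": [{"finish_reason": "stop", "message": {"content": "hello"}}],
    "usage": {"completion_tokens": 1},
}
```

A real canary would probably also accept other legitimate finish reasons such as a length cutoff; this sketch is deliberately strict.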

Upstream code path compared

  • vllm_mlx/patches/qwen3_5_mllm.py
  • tests/test_qwen35_mllm_patch.py

Qwen3_5Attention.__call__ is patched in vllm-mlx for BatchKVCache. The wrapper needed to match the current mlx-vlm call surface.
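The mismatch can be reproduced without mlx-vlm. The sketch below uses a hypothetical `FakeAttention` class as a stand-in for `Qwen3_5Attention`; the two wrapper functions only illustrate the signature difference and are not the code in `vllm_mlx/patches/qwen3_5_mllm.py`.

```python
class FakeAttention:
    """Hypothetical stand-in for the patched attention class."""


def old_patched_call(self, x, mask=None, cache=None, position_ids=None):
    # Old wrapper shape: knows nothing about the newer keyword arguments.
    return x


def new_patched_call(
    self,
    x,
    mask=None,
    cache=None,
    position_ids=None,
    position_embeddings=None,
    target_verify=False,
):
    # New wrapper shape: accepts the kwargs the caller now passes.
    return x


def accepts_current_kwargs(patched_call) -> bool:
    """Install a wrapper on the class, then call it the way a newer caller would."""
    FakeAttention.__call__ = patched_call
    try:
        FakeAttention()(
            [1.0, 2.0],
            mask=None,
            cache=None,
            position_ids=None,
            position_embeddings=("cos", "sin"),  # placeholder values
            target_verify=False,
        )
    except TypeError:
        # Same class of failure as the "unexpected keyword argument" error.
        return False
    return True
```

Here `accepts_current_kwargs(old_patched_call)` returns `False` and `accepts_current_kwargs(new_patched_call)` returns `True`, which is the property the new regression test is described as checking.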

Validation

  • AI_RUNTIME_OFFICIAL_LOAD=1 /opt/ai-runtime/venv-live/bin/python -m pytest tests/test_qwen35_mllm_patch.py -q -> 4 passed
  • AI_RUNTIME_OFFICIAL_LOAD=1 /opt/ai-runtime/venv-live/bin/python -m py_compile vllm_mlx/patches/qwen3_5_mllm.py
  • git diff --check
  • Runtime upstream-local-repro preflight artifact: /opt/ai-runtime/run/upstream-local-repro-preflight/20260613T120702Z-upstream-local-repro-preflight.json

Not claimed

This PR does not make a model-quality claim, does not change sampling, and does not change default/resident routing. It only keeps the Qwen BatchKVCache attention patch compatible with the current mlx-vlm attention kwargs.

@Thump604 Thump604 force-pushed the 604/qwen-cb-prefix-compat branch from 0d2202e to 595dc47 on June 13, 2026 12:12

@janhilgard janhilgard left a comment

Collaborator

Approving. This is a real production-incident fix for any stack that has already moved to mlx-vlm 0.6.2 — confirmed independently against my install.

Verified directly:

  • mlx_vlm.models.qwen3_5.language.Qwen3_5Attention.__call__ on a fresh mlx-vlm 0.6.2 install reports the signature

    (self, x, mask=None, cache=None, position_ids=None,
     position_embeddings=(cos, sin) | None = None,
     target_verify: bool = False) -> mx.array
    

    which is exactly what the patched _patched_call now accepts. Pre-PR, my local vllm-mlx-upstream/vllm_mlx/patches/qwen3_5_mllm.py still has the old 4-arg signature, so the next restart of a Qwen 3.6 server (mine: 27B Heretic-v2 + native MTP, 35B-A3B) would land the exact TypeError: ... unexpected keyword argument 'position_embeddings' your description quotes. Silent HTTP 200 + finish_reason=error + zero token accounting is the worst flavor of regression because dashboards stay green; this PR closes that gap.

  • Both _target_verify_linears and _target_verify_left_padded_attention exist on mlx_vlm.models.qwen3_5.language in 0.6.2, so the getattr(..., _default_...) resolver will pick up the live helpers rather than the fallbacks. The fallback strategy (running q/k/v projections in the default linear fashion, returning None from the target-verify hook so the SDPA path takes over) is the right shape — old behavior preserved when those internals are absent.

  • The mask == "left_padded_decode" sentinel handling is guarded by isinstance(mask, str), so the mx.array vs. str __eq__ ambiguity can't trip it. Setting mask = None after detecting the sentinel and routing through _maybe_target_verify_attention matches what mlx-vlm now does internally.

  • _apply_rotary keeps three branches in priority order: (1) position_embeddings tuple wins when provided, (2) attention.rotary_emb.apply_rotary when available, (3) legacy rotary_emb(values, position_ids) + apply_multimodal_rotary_pos_emb. That covers the matrix of mlx-vlm 0.5.x / 0.6.x rotary surfaces.
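Two of the patterns verified above, the getattr-based helper resolution and the isinstance-guarded sentinel check, can be sketched in isolation. Everything here is a stand-in: the modules are `types.SimpleNamespace` objects, `AmbiguousArray` is a hypothetical class that only imitates an array type whose `==` does not return a plain bool, and the function names are illustrative rather than the patch's own.

```python
import types


def default_target_verify_attention(*args, **kwargs):
    # Fallback hook: returning None means "let the regular SDPA path run".
    return None


def resolve_helper(module, name, default):
    """Prefer the helper on the module, else fall back to our default."""
    return getattr(module, name, default)


# Stand-in for a module that ships the private helper.
new_module = types.SimpleNamespace(
    _target_verify_left_padded_attention=lambda *a, **k: "live-helper"
)
# Stand-in for a module that predates the helper.
old_module = types.SimpleNamespace()


class AmbiguousArray:
    """Hypothetical array-like whose == result has no single truth value."""

    def __eq__(self, other):
        return self  # elementwise-style result, not a bool

    def __bool__(self):
        raise ValueError("truth value of an array is ambiguous")


def is_left_padded_sentinel(mask) -> bool:
    # isinstance runs first, so == is only ever evaluated between two strings.
    return isinstance(mask, str) and mask == "left_padded_decode"
```

With these definitions, `resolve_helper(old_module, "_target_verify_left_padded_attention", default_target_verify_attention)` yields the fallback, and `is_left_padded_sentinel(AmbiguousArray())` returns `False` without ever invoking the array comparison.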

Code-review notes:

  1. _kv_seq_len collapses the previous two-branch logic into one. Concretely the old branch was:

    • position_ids is None: kv_seq_len = keys.shape[-2] + offset + 1 regardless of cache
    • position_ids given: kv_seq_len += offset + 1 if cache is not None else 0

    The new helper does length + offset + 1 if cache is not None else length. Functionally equivalent on every production path I know (cache is always present for both prefill and decode in our stack), but in the corner case where both cache and position_ids are None, the new helper drops the + offset + 1 term that the old code would still add. Worth a one-line comment explaining the new shape so the next reader doesn't try to "reintroduce" the old branch.

  2. The getattr(qwen35_language, "_target_verify_linears", _default_target_verify_linears) pattern leans on private (_-prefixed) attribute names of mlx-vlm. They're stable today and the fallbacks are safe, but it's the kind of thing that's worth flagging in a comment so the next mlx-vlm refactor doesn't silently regress to the fallback paths (which would re-disable target-verify performance without breaking anything).

  3. The PR bundles the bug fix with a fairly aggressive refactor into seven helpers (_normalize_position_inputs, _position_ids_for_offset, _kv_seq_len, _apply_rotary, _slice_attention_mask, _maybe_target_verify_attention, _default_target_verify_*). Each helper is small and named well, and the test still exercises the same observable behavior. I'd normally lobby for separating "make it work with new ABI" from "refactor to helpers" so blame/bisect stays sharp, but the resulting reader experience is genuinely better than the previous one-big-function, and the helpers carve out exactly the seams future mlx-vlm ABI churn will need. Calling it acceptable.

  4. The new regression test test_qwen35_patch_accepts_current_mlx_vlm_attention_kwargs validates kwargs acceptance but not target-verify semantics (the fake module doesn't expose _target_verify_left_padded_attention, so the fallback returns None and the SDPA path runs). That's a fair scope: the alternative is wiring up a fake _target_verify_left_padded_attention and asserting it gets called, which couples the test to mlx-vlm internals. Worth a follow-up test if target_verify=True ever needs to be guaranteed at the patch level rather than at mlx-vlm's level.

  5. CI: lint ✓, type-check ✓, test-apple-silicon × 2 ✓, test-matrix × 4 ✓, tests ✓. All 9 green on 595dc47.
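The old-versus-new kv_seq_len comparison in note 1 can be checked by enumeration. This is a sketch built only from the description in that note, with the offset treated as a plain integer (an assumption; the real cache offset may be per-sequence) and hypothetical function names.

```python
from itertools import product


def old_kv_seq_len(length, offset, has_cache, has_position_ids):
    # Old two-branch logic as described in note 1.
    if not has_position_ids:
        return length + offset + 1  # added regardless of cache
    return length + (offset + 1 if has_cache else 0)


def new_kv_seq_len(length, offset, has_cache):
    # New single-expression helper as described in note 1.
    return length + offset + 1 if has_cache else length


def differing_cases(length=8, offset=3):
    """List the (has_cache, has_position_ids) combinations where the two disagree."""
    return [
        (has_cache, has_position_ids)
        for has_cache, has_position_ids in product([True, False], repeat=2)
        if old_kv_seq_len(length, offset, has_cache, has_position_ids)
        != new_kv_seq_len(length, offset, has_cache)
    ]
```

Under these assumptions the two versions disagree only when there is neither a cache nor position_ids, which matches the corner case the note calls out.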

LGTM, please merge. This one belongs in the next release because the failure silently hits anyone on mlx-vlm>=0.6.0 running Qwen 3.5/3.6 artifacts on the CB+prefix path.
