Fix Qwen3_5 mixed-bit per-tensor quantization keys not remapped through sanitize#1311

Open
ivaniguarans wants to merge 1 commit into
ml-explore:mainfrom
ivaniguarans:fix/qwen3-5-vl-sanitize-quantization

@ivaniguarans

Problem

Loading a Qwen3_5ForConditionalGeneration checkpoint with non-uniform per-tensor config["quantization"] overrides (notably Intel's AutoRound mixed-bit MLX builds) fails on main with:

ValueError: Expected shape (10240, 640) but received shape (10240, 960)
for parameter language_model.model.layers.0.linear_attn.in_proj_qkv.weight

Reproduced against Intel/Qwen3.6-27B-4.5b-mlx-AutoRound (459 per-tensor entries, 128 of which are real 6-bit overrides on top of a 4-bit global).
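
The two column counts in the error are consistent with a 4-bit versus 6-bit packing mismatch. The following is a minimal arithmetic sketch, assuming quantized weights are packed into 32-bit words along the input dimension. The input width of 5120 is inferred here from 640 × 32 / 4 rather than read from the checkpoint, and packed_columns is a hypothetical helper, not an mlx_lm function:

```python
def packed_columns(in_features: int, bits: int, word_bits: int = 32) -> int:
    """Number of packed words per row for a weight quantized to `bits` bits.

    Assumption: values are packed densely into `word_bits`-wide words along
    the input dimension.
    """
    total_bits = in_features * bits
    if total_bits % word_bits != 0:
        raise ValueError("row does not pack evenly into whole words")
    return total_bits // word_bits


# Assumed input width, inferred from the "expected" shape: 640 * 32 / 4.
in_features = 5120

expected_cols = packed_columns(in_features, bits=4)  # global 4-bit fallback
received_cols = packed_columns(in_features, bits=6)  # stored 6-bit override

print(expected_cols, received_cols)
```

Under those assumptions the 4-bit fallback yields the 640 columns the loader expected, while the 6-bit tensor on disk has the 960 columns it received.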

Root cause

class_predicate in mlx_lm/utils.py looks up paths in the post-sanitize namespace, but config["quantization"] keys are stored in the pre-sanitize namespace as they were written to config.json. For Qwen3.5_VL, Model.sanitize in mlx_lm/models/qwen3_5.py rewrites model.language_model.* → language_model.model.* (and drops vision_tower.* / model.visual.* entries), but the quantization config is never remapped to match. Every per-tensor override misses the lookup, falls back to the global bit-width, and the tensor is allocated at the wrong size, which fails the shape check on load.
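
The lookup miss can be illustrated with a small stand-alone sketch. It does not reproduce the real class_predicate. bits_for is a hypothetical stand-in, and the exact key form (module path without a .weight suffix) and the group_size value are assumptions made for illustration:

```python
# Hypothetical config["quantization"] contents: a 4-bit global plus one
# 6-bit per-tensor override, keyed in the pre-sanitize namespace.
quant_config = {
    "group_size": 64,
    "bits": 4,
    "model.language_model.layers.0.linear_attn.in_proj_qkv": {
        "group_size": 64,
        "bits": 6,
    },
}


def bits_for(path: str, quant_config: dict) -> int:
    """Pick the bit-width for `path`: per-tensor override if present, else global."""
    override = quant_config.get(path)
    if isinstance(override, dict):
        return override["bits"]
    return quant_config["bits"]


pre_sanitize = "model.language_model.layers.0.linear_attn.in_proj_qkv"
post_sanitize = "language_model.model.layers.0.linear_attn.in_proj_qkv"

hit = bits_for(pre_sanitize, quant_config)    # key as written in config.json
miss = bits_for(post_sanitize, quant_config)  # path the model actually uses

print(hit, miss)
```

The post-sanitize path finds no override and silently takes the global 4-bit setting, which is the fallback described above.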

Fix

Add a per-model sanitize_quantization(quant_config) method on Qwen3_5ForConditionalGeneration that mirrors the existing sanitize(weights) key transformation, and wire it in through a hasattr-discovered opt-in call at the _quantize call site in mlx_lm/utils.py, following the same pattern as the existing sanitize invocation a few lines above. Qwen3_5MoeForConditionalGeneration inherits the method through its class Model(Qwen3_5Model) subclassing, so MoE checkpoints pick up the prefix remap without a separate implementation. MoE-specific expert-key remaps such as experts.gate_up_proj → switch_mlp.gate_proj are out of scope for this change: no MoE mixed-bit checkpoint with per-tensor expert overrides has been reported yet.

Models without the method are unaffected. The hook is scoped to the top-level config["quantization"] branch; the legacy quantization_config path (AWQ/GPTQ/mxfp4) carries flat global configs without per-tensor overrides and does not exercise this bug.
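
A rough sketch of the remap and the opt-in hook is shown below. It is not the code in this PR. The function bodies, FakeModel, and prepare_quantization are hypothetical, and two points are assumptions: that global settings are the non-dict values, and that a bare key receives a "language_model." prefix.

```python
def sanitize_quantization(quant_config: dict) -> dict:
    """Return a new dict with per-tensor keys moved to the post-sanitize namespace."""
    remapped = {}
    for key, value in quant_config.items():
        if not isinstance(value, dict):
            # Assumption: global settings (bits, group_size, ...) are scalars.
            remapped[key] = value
            continue
        if key.startswith(("vision_tower", "model.visual")):
            continue  # vision entries are dropped, as sanitize drops the weights
        if key.startswith("model.language_model."):
            suffix = key[len("model.language_model."):]
            key = "language_model.model." + suffix
        elif not key.startswith("language_model."):
            # Assumption: bare keys are prefixed; language_model.* passes through.
            key = "language_model." + key
        remapped[key] = value
    return remapped


class FakeModel:
    """Hypothetical model that opts in by defining the method."""

    def sanitize_quantization(self, quant_config: dict) -> dict:
        return sanitize_quantization(quant_config)


def prepare_quantization(model, config: dict):
    """Hypothetical call-site helper: remap only when the model opts in."""
    quant = config.get("quantization")
    if quant is not None and hasattr(model, "sanitize_quantization"):
        quant = model.sanitize_quantization(quant)
    return quant


override = {"group_size": 64, "bits": 6}
config = {
    "quantization": {
        "group_size": 64,
        "bits": 4,
        "vision_tower.blocks.0.attn.qkv": override,
        "model.language_model.layers.0.mlp.down_proj": override,
        "language_model.model.layers.1.mlp.down_proj": override,
        "layers.2.mlp.down_proj": override,
    }
}

remapped = prepare_quantization(FakeModel(), config)
untouched = prepare_quantization(object(), config)  # no method, no remap
print(sorted(remapped))
```

Because the helper builds a new dict, the input config is left as it was, and a model without the method gets its quantization config back unchanged.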

Credit to @ivanfioravanti for the local-fix observation in #1214's comments.

Testing

  • Unit tests (TestQwen3_5SanitizeQuantization in tests/test_utils.py): one table-driven test covering all five key shapes the method handles (globals passthrough, vision_tower / model.visual drops, model.language_model.* rewrite, language_model.* passthrough, bare-key prefix), plus a non-mutation test confirming the input dict is left intact.
  • Loader-path integration test (TestUtils.test_load_model_qwen3_5_with_unsanitized_per_tensor_quantization): mirrors the existing test_load_model_gemma4_with_per_layer_projection_quantization pattern. Builds a tiny synthetic Qwen3.5 model, hand-edits config["quantization"] with an un-sanitized per-tensor override key, saves and reloads via load_model, and asserts the rewritten key appears in loaded_config["quantization"] (the un-sanitized form is gone), then runs a forward pass as a smoke test that load_weights doesn't raise on the rewritten config.
  • Real-checkpoint verification on M5 Max (128 GB) against Intel/Qwen3.6-27B-4.5b-mlx-AutoRound:
    • Without this patch: ValueError as quoted above.
    • With this patch: load succeeds in 1.3 s; generate(prompt="The capital of France is", max_tokens=8) returns "Paris.\n\n<think>\nHere's a".
  • Full python -m unittest discover tests/ shows no regressions (12 pre-existing failures, all verified to fail identically on upstream/main).

Fixes #1214.

@nastya236 added the bug label on Jun 4, 2026.
