Support Qwen3.5-MoE MoE MTP heads #84
Open
janfeddersen-wq wants to merge 1 commit into
The native Qwen MTP path assumed a dense single-MLP MTP head: a fixed
15-tensor gate and no expert stacking. Qwen3.5-MoE checkpoints whose MTP
head is itself an MoE block (router gate + per-expert MLPs + a shared
expert / shared_expert_gate) were therefore rejected with
`invalid-mtp-tensor-layout`, even though the runtime can already build
the block -- mlx-lm's qwen3_5 DecoderLayer instantiates SparseMoeBlock
whenever num_experts > 0.
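
To make the mismatch with a fixed-size gate concrete, here is a minimal sketch of how the MLP portion of an MoE MTP head's key set grows with `num_experts`. It is not the code in `artifacts.py`: the function name `expected_moe_mlp_keys`, the single MTP layer at index 0, the projection names (`gate_proj`, `up_proj`, `down_proj`), and the `.weight` suffix are all assumptions made for illustration.

```python
# Hypothetical sketch: enumerate the MLP tensor keys of an MoE MTP head.
# Projection names and the ".weight" suffix are assumptions, not taken
# from the actual checkpoint or from artifacts.py.
PROJS = ("gate_proj", "up_proj", "down_proj")


def expected_moe_mlp_keys(num_experts: int, layer: int = 0) -> set[str]:
    prefix = f"mtp.layers.{layer}.mlp"
    # Router gate and shared-expert gate: one tensor each.
    keys = {
        f"{prefix}.gate.weight",
        f"{prefix}.shared_expert_gate.weight",
    }
    for proj in PROJS:
        # The shared expert is a single dense MLP.
        keys.add(f"{prefix}.shared_expert.{proj}.weight")
        # Every routed expert carries its own copy of each projection.
        for i in range(num_experts):
            keys.add(f"{prefix}.experts.{i}.{proj}.weight")
    return keys


mlp_keys = expected_moe_mlp_keys(256)
print(len(mlp_keys))  # 773
```

For 256 experts this gives 773 MLP keys. If the dense set of 15 includes three MLP projection tensors (an inference, not something the PR states), then 15 - 3 + 773 = 785, which is consistent with the 785/785 tensor count reported for the test model.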
- artifacts: derive the expected MTP key set from num_experts for MoE
heads instead of the hard-coded dense 15, so the tensor gate passes
(no-op for dense heads).
- mtp_patch: stack per-expert mtp.layers.*.mlp.experts.{i}.* weights into
the switch_mlp layout SwitchGLU expects, mirroring the stacking mlx-lm
performs for the main decoder layers (no-op for dense heads).
- tests for the MoE tensor gate and the expert stacking.
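
The stacking step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `_stack_mtp_moe_experts` implementation: the name `stack_experts_sketch`, the `.weight` suffix, and the exact `switch_mlp` key naming are assumed, and the `stack` parameter defaults to building a plain Python list so the example runs without MLX. In the real runtime the stacking would presumably be an array operation such as `mx.stack`.

```python
import re
from collections import defaultdict

# Matches per-expert keys such as
# "mtp.layers.0.mlp.experts.3.up_proj.weight" (assumed naming).
EXPERT_KEY = re.compile(
    r"^(mtp\.layers\.\d+\.mlp)\.experts\.(\d+)\.(\w+)\.weight$"
)


def stack_experts_sketch(weights: dict, stack=list) -> dict:
    """Group per-expert tensors by (layer, projection) and stack them.

    Keys that do not match the per-expert pattern pass through
    unchanged, so a dense head is returned as-is.
    """
    out = {}
    groups = defaultdict(dict)
    for key, tensor in weights.items():
        m = EXPERT_KEY.match(key)
        if m is None:
            out[key] = tensor
            continue
        prefix, idx, proj = m.group(1), int(m.group(2)), m.group(3)
        groups[(prefix, proj)][idx] = tensor

    for (prefix, proj), by_idx in groups.items():
        n = len(by_idx)
        # Refuse to stack if an expert index is missing.
        if sorted(by_idx) != list(range(n)):
            raise ValueError(f"non-contiguous experts for {prefix}.{proj}")
        # Stack in expert order so index i stays expert i.
        out[f"{prefix}.switch_mlp.{proj}.weight"] = stack(
            [by_idx[i] for i in range(n)]
        )
    return out
```

Sorting by expert index before stacking matters: the router selects experts by index, so a stack built in dictionary order could silently pair router outputs with the wrong expert weights.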
Verified on Qwopus3.5-122B-A10B (qwen3_5_moe, 256 experts, bf16 MTP
sidecar grafted onto a 4-bit base): `mtplx inspect` passes (785/785
tensors) and MTP speculative decoding runs with ~70% depth-1 acceptance
(~2.1 tokens per target verify pass).
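
The acceptance figures can be reproduced from the per-depth counts given in the PR description's Verification section. The snippet below is only a worked check of that arithmetic; the variable names are illustrative and are not `mtplx` output fields beyond `accepted_by_depth`.

```python
# Counts reported in the PR description for the 120-token test run.
accepted_by_depth = [40, 19, 3]
drafted_by_depth = [57, 57, 56]
verify_passes = 57
total_tokens = 120

# Per-depth acceptance rate: accepted drafts / drafted tokens at that depth.
rates = [a / d for a, d in zip(accepted_by_depth, drafted_by_depth)]
tokens_per_verify = total_tokens / verify_passes

print([f"{r:.1%}" for r in rates])  # ['70.2%', '33.3%', '5.4%']
print(f"{tokens_per_verify:.2f}")   # 2.11
```

As a cross-check, if each verify pass emits one target token plus its accepted drafts (an assumption about how tokens are counted), the total would be 57 + 62 = 119, within one token of the reported 120. The sketch does not try to account for the difference.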
Summary
Adds support for Qwen3.5-MoE MTP heads to the native Qwen MTP path.
The MTP head on Qwen3.5-MoE checkpoints (e.g. Qwen/Qwen3.5-122B-A10B) is itself an MoE block (router `gate` + per-expert MLPs + a `shared_expert` and `shared_expert_gate`), whereas the existing path assumed a dense single-MLP head. As a result these models were recognized as `qwen3-next-mtp` but rejected at the tensor gate with `invalid-mtp-tensor-layout`.
No new backend is required: mlx-lm's `qwen3_5` `DecoderLayer` already instantiates `SparseMoeBlock` when `num_experts > 0`, so the MTP block builds correctly once the weights are stacked. The change is two small, dense-safe additions:
- `artifacts.py`: for MoE heads, derive the expected MTP key set from `num_experts` (`mtp.layers.*.mlp.{gate, experts.{i}.*, shared_expert.*, shared_expert_gate}`) instead of the fixed dense 15, so `inspect`'s tensor gate passes. No-op for dense heads.
- `mtp_patch.py`: `_stack_mtp_moe_experts` stacks per-expert `mtp.layers.*.mlp.experts.{i}.{proj}` tensors into the `switch_mlp` layout `SwitchGLU` expects, mirroring the stacking mlx-lm performs for the main decoder layers. No-op for dense heads.
Verification
Tested on Qwopus3.5-122B-A10B (`qwen3_5_moe`, 256 experts / 8 active), bf16 MTP sidecar grafted onto a 4-bit base:
- `mtplx inspect` → `can_run: true`, tensor gate 785/785, 0 missing / 0 extra.
- `accepted_by_depth = [40, 19, 3]` of `[57, 57, 56]` drafted → ~70% depth-1 acceptance, 120 tokens in 57 verify passes (~2.1 tokens/verify). A mis-loaded MoE head would accept ~0%.
- `pytest tests/test_artifacts.py tests/test_mtp_patch.py` green.
Known limitation: MoE exactness
At temperature 0, MTP vs non-MTP greedy decode is ~98% identical and re-converges immediately, but occasionally flips a single token. This is the MoE router hitting a near-tie that resolves differently under batched verification vs single-token autoregressive decode (a known MoE/FP effect), not a drafting error. The `max_diff = 0.0` exactness guarantee was established on a dense model; strict bit-exactness for MoE heads likely needs separate handling (e.g. fp32 router logits during verify). Flagging for discussion; happy to follow up.
🤖 Generated with Claude Code