Support Qwen3.5-MoE MoE MTP heads #84

Open: janfeddersen-wq wants to merge 1 commit into youssofal:main from janfeddersen-wq:qwen3-5-moe-mtp


@janfeddersen-wq

Summary

Adds support for Qwen3.5-MoE MTP heads to the native Qwen MTP path.

The MTP head on Qwen3.5-MoE checkpoints (e.g. Qwen/Qwen3.5-122B-A10B) is itself an MoE block — router gate + per-expert MLPs + a shared_expert and shared_expert_gate — whereas the existing path assumed a dense single-MLP head. As a result these models were recognized as qwen3-next-mtp but rejected at the tensor gate with invalid-mtp-tensor-layout.
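To make the layout difference concrete, here is a minimal sketch of how the MLP portion of the expected key set could be derived from num_experts. It follows the key pattern this PR names for the artifacts.py change; the function name `mtp_mlp_keys`, the `.weight` suffix, and the three projection names are assumptions for illustration, not the code in this PR.

```python
# Hypothetical sketch: enumerate the MLP tensor keys of one MTP layer.
# PROJS and the ".weight" suffix are assumptions, not taken from the PR diff.
PROJS = ("gate_proj", "up_proj", "down_proj")


def mtp_mlp_keys(layer: int, num_experts: int) -> set[str]:
    """Return the expected MLP keys for a dense or MoE MTP layer."""
    prefix = f"mtp.layers.{layer}.mlp"
    if num_experts <= 0:
        # Dense head: a single MLP with one tensor per projection.
        return {f"{prefix}.{p}.weight" for p in PROJS}
    # MoE head: router gate, shared-expert gate, shared expert, routed experts.
    keys = {f"{prefix}.gate.weight", f"{prefix}.shared_expert_gate.weight"}
    keys |= {f"{prefix}.shared_expert.{p}.weight" for p in PROJS}
    keys |= {
        f"{prefix}.experts.{i}.{p}.weight"
        for i in range(num_experts)
        for p in PROJS
    }
    return keys


dense_keys = mtp_mlp_keys(0, 0)
moe_keys = mtp_mlp_keys(0, 256)
print(len(dense_keys), len(moe_keys))
```

Under these assumptions a 256-expert head has 773 MLP keys in place of the dense head's 3. Swapping those into the fixed dense 15 gives 15 - 3 + 773 = 785, which is consistent with the 785/785 tensor-gate figure reported under Verification.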

No new backend is required: mlx-lm's qwen3_5 DecoderLayer already instantiates SparseMoeBlock when num_experts > 0, so the MTP block builds correctly once the weights are stacked. The change is two small, dense-safe additions:

  • artifacts.py: for MoE heads, derive the expected MTP key set from num_experts (mtp.layers.*.mlp.{gate, experts.{i}.*, shared_expert.*, shared_expert_gate}) instead of the fixed dense 15, so inspect's tensor gate passes. No-op for dense heads.
  • mtp_patch.py: _stack_mtp_moe_experts stacks per-expert mtp.layers.*.mlp.experts.{i}.{proj} tensors into the switch_mlp layout SwitchGLU expects, mirroring the stacking mlx-lm performs for the main decoder layers. No-op for dense heads.
  • Tests for the MoE tensor gate and the expert stacking.
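The expert stacking described above can be sketched roughly as follows. This is an illustrative stand-in, not the `_stack_mtp_moe_experts` implementation from this PR: the function name `stack_experts_sketch`, the `.weight` suffix, the projection names, and the `stack_fn` parameter are all assumptions. With MLX arrays one would presumably pass `mx.stack` as `stack_fn`; plain `list` is used here so the sketch runs without dependencies.

```python
from typing import Any, Callable, Dict, List

# Assumed projection names; the PR only refers to them as {proj}.
PROJS = ("gate_proj", "up_proj", "down_proj")


def stack_experts_sketch(
    weights: Dict[str, Any],
    num_experts: int,
    layer: int = 0,
    stack_fn: Callable[[List[Any]], Any] = list,
) -> Dict[str, Any]:
    """Fold per-expert tensors into one tensor per projection (expert axis first)."""
    prefix = f"mtp.layers.{layer}.mlp"
    probe = f"{prefix}.experts.0.{PROJS[0]}.weight"
    if num_experts <= 0 or probe not in weights:
        # Dense head, or experts already stacked: leave the dict untouched.
        return weights
    for proj in PROJS:
        per_expert = [
            weights.pop(f"{prefix}.experts.{i}.{proj}.weight")
            for i in range(num_experts)
        ]
        weights[f"{prefix}.switch_mlp.{proj}.weight"] = stack_fn(per_expert)
    return weights


# Tiny example: 2 experts, each "tensor" is a 1x1 nested list.
moe = {
    f"mtp.layers.0.mlp.experts.{i}.{p}.weight": [[float(i)]]
    for i in range(2)
    for p in PROJS
}
moe = stack_experts_sketch(moe, num_experts=2)

dense = {"mtp.layers.0.mlp.up_proj.weight": [[1.0]]}
dense_after = stack_experts_sketch(dict(dense), num_experts=0)
```

The sketch only handles `.weight` tensors; quantized checkpoints that carry extra per-tensor entries would need the same treatment for those keys.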

Verification

Tested on Qwopus3.5-122B-A10B (qwen3_5_moe, 256 experts / 8 active), bf16 MTP sidecar grafted onto a 4-bit base:

  • mtplx inspect reports can_run: true, tensor gate 785/785, 0 missing / 0 extra.
  • 120-token greedy run: accepted_by_depth = [40, 19, 3] of [57, 57, 56] drafted → ~70% depth-1 acceptance, 120 tokens in 57 verify passes (~2.1 tokens/verify). A mis-loaded MoE head would accept ~0%.
  • pytest tests/test_artifacts.py tests/test_mtp_patch.py green.
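The headline figures can be recomputed from the reported counts. The snippet below is plain arithmetic on the numbers quoted above; the variable names are chosen for illustration and are not claimed to match the tool's output format.

```python
# Counts copied from the 120-token greedy run reported above.
accepted_by_depth = [40, 19, 3]
drafted_by_depth = [57, 57, 56]
total_tokens = 120
verify_passes = 57

# Per-depth acceptance: accepted drafts divided by drafts attempted at that depth.
acceptance = [a / d for a, d in zip(accepted_by_depth, drafted_by_depth)]
tokens_per_verify = total_tokens / verify_passes

for depth, rate in enumerate(acceptance, start=1):
    print(f"depth {depth}: {rate:.1%}")
print(f"tokens per verify pass: {tokens_per_verify:.2f}")
```

Depth 1 works out to 40/57, about 70%, and 120/57 is about 2.1 tokens per verify pass, matching the figures quoted.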

Known limitation — MoE exactness

At temperature 0, MTP vs non-MTP greedy decode is ~98% identical and re-converges immediately, but occasionally flips a single token. This is the MoE router hitting a near-tie that resolves differently under batched verification vs single-token autoregressive decode (a known MoE/FP effect), not a drafting error. The max_diff = 0.0 exactness guarantee was established on a dense model; strict bit-exactness for MoE heads likely needs separate handling (e.g. fp32 router logits during verify). Flagging for discussion — happy to follow up.
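As a toy illustration of the near-tie effect (not the actual MLX kernels, and using IEEE fp16 via the standard library rather than bf16), the sketch below accumulates the same terms in two different orders. In half precision the rounded sums differ enough to flip a top-1 routing decision, while accumulation in a wider float does not. All values and helper names are made up for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(terms, round_fn):
    """Sum terms left to right, rounding after every addition."""
    total = 0.0
    for t in terms:
        total = round_fn(total + t)
    return total


def top1(logits):
    """Index of the largest logit (first index wins exact ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Expert 0's logit is a sum of partial products; expert 1's logit is fixed.
terms_a = [512.0, 0.25, 0.25, 0.25, 0.25]
logit_b = 512.5


def route(order, round_fn):
    return top1([accumulate(order, round_fn), logit_b])


large_first = terms_a
small_first = sorted(terms_a)

# Half precision: the result depends on accumulation order.
fp16_picks = (route(large_first, to_fp16), route(small_first, to_fp16))
# Wider accumulation (Python floats): both orders agree.
wide_picks = (route(large_first, float), route(small_first, float))
print(fp16_picks, wide_picks)
```

Here the large-first order loses every 0.25 increment to rounding, so expert 1 wins, whereas the small-first order reaches 513 and expert 0 wins. This is the kind of order sensitivity that keeping router logits in fp32 during verify would be expected to reduce, though whether that fully restores exactness for this model is untested.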

🤖 Generated with Claude Code

The native Qwen MTP path assumed a dense single-MLP MTP head: a fixed
15-tensor gate and no expert stacking. Qwen3.5-MoE checkpoints whose MTP
head is itself an MoE block (router gate + per-expert MLPs + a shared
expert / shared_expert_gate) were therefore rejected with
`invalid-mtp-tensor-layout`, even though the runtime can already build
the block -- mlx-lm's qwen3_5 DecoderLayer instantiates SparseMoeBlock
whenever num_experts > 0.

- artifacts: derive the expected MTP key set from num_experts for MoE
  heads instead of the hard-coded dense 15, so the tensor gate passes
  (no-op for dense heads).
- mtp_patch: stack per-expert mtp.layers.*.mlp.experts.{i}.* weights into
  the switch_mlp layout SwitchGLU expects, mirroring the stacking mlx-lm
  performs for the main decoder layers (no-op for dense heads).
- tests for the MoE tensor gate and the expert stacking.

Verified on Qwopus3.5-122B-A10B (qwen3_5_moe, 256 experts, bf16 MTP
sidecar grafted onto a 4-bit base): `mtplx inspect` passes (785/785
tensors) and MTP speculative decoding runs with ~70% depth-1 acceptance
(~2.1 tokens per target verify pass).