
Reproducing the 2.24× speedup: documented MLX patch alone doesn't get there — please publish the actual fork commit #64

@benjamin-levin

Hi @youssofal — first, thank you for the clear, comprehensive README and the verified benchmark numbers. The single-model native-MTP architecture is genuinely impressive work, and the public default model Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed runs cleanly out of the box.

I spent ~6 hours this week trying to reproduce the headline 2.24× / ~63 tok/s number on M4 Max 36 GB by following the public source path:

  1. ✅ Cloned and built MTPLX/native_extensions/verify_mlp (the gdn-tail + gate-up Metal kernels)
  2. ✅ Cloned ml-explore/mlx v0.31.2 and applied the documented qmv_fast_impl patch:
    • constexpr int num_simdgroups = 2; → 4;
    • int bn = 8; / MTL::Size group_dims(bk, 2, 1); → bn = 16; / MTL::Size group_dims(bk, 4, 1); in the qmv() launch config
    • #pragma clang loop unroll_count(4) on the inner row loop in qmv_fast_impl
  3. ✅ Wheel built (mlx-0.31.2.dev20260513+68cf2fd-cp313-cp313-macosx_15_0_arm64.whl), installed in the brew mtplx venv
  4. mtplx doctor --deep confirms patched MLX loads cleanly, variant: native_full_rowwise + enabled: True when MTPLX_NATIVE_MLP_ROWWISE=1

But the end-to-end result is identical to stock:

ctx tok/s, sustained tok/s, performance-cold
512 20.8
1k 15.4 15.5
4k 6.5 6.4
16k 1.9

(Native kernels are confirmed active via mlp_variant_stats.native_calls > 0 in the health endpoint. Patched MLX is confirmed loaded via mlx_fork.version = 0.31.2.dev20260513+68cf2fd. mlx_fork.ok = False because our build commit ≠ 2377a99f.)
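The mismatch behind mlx_fork.ok = False can be illustrated with a small sketch. How MTPLX actually performs this check is not documented in this issue; the prefix comparison on the "+<hash>" suffix below is an assumption, and both helper names are hypothetical.

```python
from typing import Optional


def local_commit(version: str) -> Optional[str]:
    """Return the part after '+' in a version string, or None if there is none."""
    _, sep, local = version.partition("+")
    return local if sep else None


def matches_fork(version: str, expected_commit: str) -> bool:
    """True if the version's commit suffix and the expected commit share a prefix.

    Assumption: short and long hashes are compared over their common length.
    """
    commit = local_commit(version)
    if not commit:
        return False
    n = min(len(commit), len(expected_commit))
    return commit[:n].lower() == expected_commit[:n].lower()


installed = "0.31.2.dev20260513+68cf2fd"  # version reported in this issue
expected = "2377a99f"                      # fork commit named in this issue

print(local_commit(installed))
print(matches_fork(installed, expected))
```

Under this assumed rule, a locally built wheel can never satisfy the check, no matter how faithfully it applies the README patch, because its commit hash is generated by the local build.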

Direct kernel microbench (Mac Studio M4 Max, mx.quantized_matmul, K=5120, N=5120, group_size=64, bits=4, 200 iters median):

| M | Stock MLX 0.31.2 (ms) | Patched, with unroll (ms) | Patched, no unroll (ms) |
|---|---|---|---|
| 1 | 0.134 | 0.151 (+12%) | 0.142 (+6%) |
| 3 | 0.173 | 0.210 (+21%) | 0.159 (-8%) |
| 4 | 0.172 | 0.218 (+27%) | 0.168 (-2%) |
| 6 | 0.200 | 0.238 (+19%) | 0.200 (0%) |
| 8 | 0.233 | 0.264 (+13%) | 0.236 (+1%) |
| 16 | 0.272 | 0.291 (+7%) | 0.266 (-2%) |
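For anyone who wants to repeat the measurement, here is a minimal sketch of a harness with the same parameters (K=5120, N=5120, group_size=64, bits=4, median over 200 iterations). It is one plausible setup, not necessarily the exact script behind the numbers above; the float16 inputs, the warmup count, and the function names `median_ms` and `bench_qmm` are assumptions.

```python
import statistics
import time


def median_ms(fn, iters=200, warmup=10):
    """Median wall-clock time of fn() in milliseconds over `iters` timed runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def bench_qmm(m, k=5120, n=5120, group_size=64, bits=4, iters=200):
    """Time mx.quantized_matmul for an (m, k) input against a quantized (n, k) weight."""
    import mlx.core as mx  # imported lazily so median_ms works without MLX installed

    x = mx.random.normal((m, k)).astype(mx.float16)  # dtype is an assumption
    w = mx.random.normal((n, k)).astype(mx.float16)
    w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
    mx.eval(x, w_q, scales, biases)  # materialize inputs before timing

    def step():
        y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True,
                                group_size=group_size, bits=bits)
        mx.eval(y)  # MLX is lazy; force the kernel to run inside the timed region

    return median_ms(step, iters=iters)
```

Calling `bench_qmm(m)` for m in (1, 3, 4, 6, 8, 16) under the stock wheel and again under the patched wheel should give a table comparable to the one above.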

Two observations:

  1. The unroll_count(4) pragma is a regression on M4 Max at every M tested (7-27% slower than stock). It is probably an M5 Max-specific tuning that doesn't transfer.
  2. Even without the pragma, the patch is at best 2-8% faster at M=3,4, and 6% slower at M=1. That is far from the kernel-level speedup the headline 2.24× would require if qmv is the dominant verify-MLP cost.
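To put a number on the second observation, here is a rough Amdahl's-law sketch. It covers only the conditional case where the speedup would have to come from qmv alone; the runtime fractions are hypothetical, since I have not profiled how much of a verify step is spent in qmv.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: end-to-end speedup when `fraction` of runtime gets `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def required_kernel_speedup(fraction, target):
    """Kernel speedup needed to reach `target` overall, or None if it is unreachable."""
    remaining = 1.0 / target - (1.0 - fraction)
    if remaining <= 0:
        return None  # the untouched part alone already exceeds the time budget
    return fraction / remaining


# Best measured case from the microbench: M=3 without the pragma, 0.173 ms -> 0.159 ms.
best_kernel = 0.173 / 0.159

for fraction in (0.5, 0.8, 1.0):  # hypothetical share of runtime spent in qmv
    print(fraction,
          round(overall_speedup(fraction, best_kernel), 3),
          required_kernel_speedup(fraction, 2.24))
```

Even if qmv accounted for all of the runtime, the best measured kernel gain would yield under 1.1× end to end, which is why I suspect the fork contains more than the three documented changes.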

Request: would you be willing to publish the actual contents of mlx-mtplx-0.31.2-qmm at commit 2377a99f? Either as a git remote URL, a public branch on a fork, or even a patch series attached to the repo. Specifically, if there are changes beyond what README.md lines 228-230 describe — e.g. additional kernel restructuring, register-pressure tuning, different reduction strategies, simdgroup-shared-memory work, anything in qmv_impl or related kernels — those are the bits the community can't reverse-engineer from the README.

Why this matters: MTPLX is currently the only documented native-MTP MLX runtime, and Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed is genuinely the best Apple-Silicon-local thinking-agent model I've tested. But the gap between "I followed the README" and "I get the headline number" is a UX cliff for anyone on M4 / M3 Max hardware. Publishing the actual fork — even as a sealed read-only branch — would let users on non-M5 hardware extract a tractable subset of the speedup.

If publication isn't possible (commercial, sponsored, in-flight upstream PR, etc.), that's a fair answer too — just knowing for sure would save reproducers a lot of time.

Hardware: Mac Studio M4 Max, 36 GB unified memory, iogpu.wired_limit_mb=32000, mx.set_wired_limit(31 GB). macOS 15.7.4. MTPLX 0.3.4 from your homebrew tap, Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed model. Stack baseline (plain mlx_lm.server, no MTP, no patches) gets 50.5 tok/s @ 128k context on the same hardware with Qwen3.6-35B-A3B-4bit, so the hardware itself isn't the constraint.

Thank you again for the project — happy to share any additional diagnostic data (mtplx doctor --deep --json, per-cycle traces, etc.) that would help triage.
