Reproducing the 2.24× speedup: documented MLX patch alone doesn't get there — please publish the actual fork commit
Hi @youssofal — first, thank you for the clear, comprehensive README and the verified benchmark numbers. The single-model native-MTP architecture is genuinely impressive work, and the public default model Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed runs cleanly out of the box.
I spent ~6 hours this week trying to reproduce the headline 2.24× / ~63 tok/s number on M4 Max 36 GB by following the public source path:
- ✅ Cloned and built MTPLX/native_extensions/verify_mlp (the gdn-tail + gate-up Metal kernels)
- ✅ Cloned ml-explore/mlx v0.31.2 and applied the documented qmv_fast_impl patch:
  - constexpr int num_simdgroups = 2; → 4
  - int bn = 8; / MTL::Size group_dims(bk, 2, 1); → bn = 16; / MTL::Size group_dims(bk, 4, 1); in qmv() launch config
  - #pragma clang loop unroll_count(4) on the inner row loop in qmv_fast_impl
- ✅ Wheel built (mlx-0.31.2.dev20260513+68cf2fd-cp313-cp313-macosx_15_0_arm64.whl), installed in the brew mtplx venv
- ✅ mtplx doctor --deep confirms patched MLX loads cleanly, variant: native_full_rowwise + enabled: True when MTPLX_NATIVE_MLP_ROWWISE=1
But the end-to-end result is identical to stock:
| ctx | tok/s, sustained | tok/s, performance-cold |
|---|---|---|
| 512 | — | 20.8 |
| 1k | 15.4 | 15.5 |
| 4k | 6.5 | 6.4 |
| 16k | 1.9 | — |
(Native kernels are confirmed active via mlp_variant_stats.native_calls > 0 in the health endpoint. Patched MLX is confirmed loaded via mlx_fork.version = 0.31.2.dev20260513+68cf2fd. mlx_fork.ok = False because our build commit ≠ 2377a99f.)
Direct kernel microbench (Mac Studio M4 Max, mx.quantized_matmul, K=5120, N=5120, group_size=64, bits=4, 200 iters median):
| M | Stock MLX 0.31.2 | Patched (with unroll) | Patched (no unroll) |
|---|---|---|---|
| 1 | 0.134 ms | 0.151 ms (+12%) | 0.142 ms (+6%) |
| 3 | 0.173 ms | 0.210 ms (+21%) | 0.159 ms (-8%) |
| 4 | 0.172 ms | 0.218 ms (+27%) | 0.168 ms (-2%) |
| 6 | 0.200 ms | 0.238 ms (+19%) | 0.200 ms (0%) |
| 8 | 0.233 ms | 0.264 ms (+13%) | 0.236 ms (+1%) |
| 16 | 0.272 ms | 0.291 ms (+7%) | 0.266 ms (-2%) |
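For anyone who wants to repeat the timing method, here is a minimal sketch of a median-of-N harness and the percentage column. It is stdlib-only and is not necessarily the script behind the numbers above; `median_ms`, `relative_change_pct`, and the stand-in `workload` are hypothetical names for illustration. When timing MLX, the callable should force evaluation (for example by wrapping the result in `mx.eval(...)`), since MLX evaluates lazily.

```python
import statistics
import time


def median_ms(fn, iters=200, warmup=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (kernel compilation, caches)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def relative_change_pct(candidate_ms, baseline_ms):
    """Percentage change vs. baseline; positive means the candidate is slower."""
    return (candidate_ms / baseline_ms - 1.0) * 100.0


def workload():
    # Stand-in so the sketch runs anywhere. For the real measurement this
    # would be a closure that runs the quantized matmul and forces evaluation.
    sum(i * i for i in range(1000))


print(f"stand-in workload: {median_ms(workload, iters=50, warmup=5):.3f} ms")

# M=3 row of the table: patched (with unroll) vs. stock.
print(f"{relative_change_pct(0.210, 0.173):+.0f}%")
```

The median is used rather than the mean so that occasional scheduling spikes do not skew sub-millisecond timings.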
Two observations:
- The unroll_count(4) pragma is a regression on M4 Max at every M tested (+7% to +27% slower than stock). It is probably an M5 Max-specific tuning that doesn't transfer.
- Even without the pragma, the patch is at best 2-8% faster at M=3,4 — far from the kernel-level speedup the headline 2.24× would require if qmv is the dominant verify-MLP cost.
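To put a rough bound on the second observation, here is a back-of-the-envelope Amdahl's-law sketch. The qmv time shares in the loop are hypothetical placeholders, not measurements, and the calculation ignores any gain that comes from MTP acceptance itself; it only asks how much an 8% faster kernel could move the end-to-end number.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: end-to-end speedup when `fraction` of the runtime
    becomes `kernel_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Best case in the microbench table: M=3 without unroll, 0.173 ms -> 0.159 ms.
kernel_speedup = 0.173 / 0.159

# Hypothetical shares of decode time spent in qmv (assumed, not measured).
for fraction in (0.25, 0.5, 0.75, 1.0):
    total = overall_speedup(fraction, kernel_speedup)
    print(f"qmv share {fraction:.0%}: {total:.3f}x overall")
```

Even if qmv accounted for all of the runtime, the overall gain under these assumptions would stay below 1.1×, which is why the documented patch alone seems unlikely to explain the headline figure.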
Request: would you be willing to publish the actual contents of mlx-mtplx-0.31.2-qmm at commit 2377a99f? Either as a git remote URL, a public branch on a fork, or even a patch series attached to the repo. Specifically, if there are changes beyond what README.md lines 228-230 describe — e.g. additional kernel restructuring, register-pressure tuning, different reduction strategies, simdgroup-shared-memory work, anything in qmv_impl or related kernels — those are the bits the community can't reverse-engineer from the README.
Why this matters: MTPLX is currently the only documented native-MTP MLX runtime, and Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed is genuinely the best Apple-Silicon-local thinking-agent model I've tested. But the gap between "I followed the README" and "I get the headline number" is a UX cliff for anyone on M4 / M3 Max hardware. Publishing the actual fork — even as a sealed read-only branch — would let users on non-M5 hardware extract a tractable subset of the speedup.
If publication isn't possible (commercial, sponsored, in-flight upstream PR, etc.), that's a fair answer too — just knowing for sure would save reproducers a lot of time.
Hardware: Mac Studio M4 Max, 36 GB unified memory, iogpu.wired_limit_mb=32000, mx.set_wired_limit(31 GB). macOS 15.7.4. MTPLX 0.3.4 from your homebrew tap, Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed model. Stack baseline (plain mlx_lm.server, no MTP, no patches) gets 50.5 tok/s @ 128k context on the same hardware with Qwen3.6-35B-A3B-4bit, so the hardware itself isn't the constraint.
Thank you again for the project — happy to share any additional diagnostic data (mtplx doctor --deep --json, per-cycle traces, etc.) that would help triage.