Reproducing the 2.24× speedup: documented MLX patch alone doesn't get there — please publish the actual fork commit

Hi @youssofal — first, thank you for the clear, comprehensive README and the verified benchmark numbers. The single-model native-MTP architecture is genuinely impressive work, and the public default model `Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed` runs cleanly out of the box.

I spent ~6 hours this week trying to reproduce the headline **2.24× / ~63 tok/s** number on an **M4 Max 36 GB** by following the public source path:

1. ✅ Cloned and built `MTPLX/native_extensions/verify_mlp` (the gdn-tail + gate-up Metal kernels)
2. ✅ Cloned `ml-explore/mlx v0.31.2` and applied the documented `qmv_fast_impl` patch:
   - `constexpr int num_simdgroups = 2;` → `4`
   - `int bn = 8;` / `MTL::Size group_dims(bk, 2, 1);` → `bn = 16;` / `MTL::Size group_dims(bk, 4, 1);` in `qmv()` launch config
   - `#pragma clang loop unroll_count(4)` on the inner row loop in `qmv_fast_impl`
3. ✅ Wheel built (`mlx-0.31.2.dev20260513+68cf2fd-cp313-cp313-macosx_15_0_arm64.whl`), installed in the brew mtplx venv
4. ✅ `mtplx doctor --deep` confirms patched MLX loads cleanly, `variant: native_full_rowwise` + `enabled: True` when `MTPLX_NATIVE_MLP_ROWWISE=1`

But the end-to-end result is **identical to stock**:

| ctx | tok/s, sustained | tok/s, performance-cold |
|---|---|---|
| 512 | — | 20.8 |
| 1k | 15.4 | 15.5 |
| 4k | 6.5 | 6.4 |
| 16k | 1.9 | — |

(Native kernels are confirmed active via `mlp_variant_stats.native_calls > 0` in the health endpoint. Patched MLX is confirmed loaded via `mlx_fork.version = 0.31.2.dev20260513+68cf2fd`. `mlx_fork.ok = False` because my build's commit is not `2377a99f`.)
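
For other reproducers, here is a minimal sketch of how those three health fields can be checked once the health JSON has been parsed into a dict. It assumes the payload nests exactly as the dotted names above suggest; `check_health` and the sample payload (including `native_calls: 1`) are hypothetical illustrations, not part of MTPLX.

```python
def check_health(health: dict) -> dict:
    """Summarise the health fields quoted in this issue.

    Assumption: the payload nests as the dotted names suggest, i.e.
    health["mlp_variant_stats"]["native_calls"] and health["mlx_fork"][...].
    """
    stats = health.get("mlp_variant_stats", {})
    fork = health.get("mlx_fork", {})
    return {
        # Native verify-MLP kernels count as active once they have been called.
        "native_kernels_active": stats.get("native_calls", 0) > 0,
        # Version string of the MLX build that was actually loaded.
        "mlx_version": fork.get("version", ""),
        # MTPLX's own verdict on whether the build matches its expected fork.
        "fork_ok": bool(fork.get("ok", False)),
    }


# Hypothetical payload shaped after the values reported above.
sample = {
    "mlp_variant_stats": {"native_calls": 1},
    "mlx_fork": {"version": "0.31.2.dev20260513+68cf2fd", "ok": False},
}
print(check_health(sample))
```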

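**Direct kernel microbench** (Mac Studio M4 Max, `mx.quantized_matmul`, K=5120, N=5120, group_size=64, bits=4, median of 200 iterations; times are latency per call, and percentages are the change vs. stock, so positive means slower):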

| M | Stock MLX 0.31.2 | Patched, with unroll (ms) | Patched, no unroll (ms) |
|---|---|---|---|
| 1 | 0.134 ms | 0.151 (+12%) | 0.142 (+6%) |
| 3 | 0.173 ms | 0.210 (+21%) | 0.159 (-8%) |
| 4 | 0.172 ms | 0.218 (+27%) | 0.168 (-2%) |
| 6 | 0.200 ms | 0.238 (+19%) | 0.200 (0%) |
| 8 | 0.233 ms | 0.264 (+13%) | 0.236 (+1%) |
| 16 | 0.272 ms | 0.291 (+7%) | 0.266 (-2%) |
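
A microbench of this shape can be sketched roughly as follows. This is not the exact script behind the table: the float16 inputs, the warmup count, and the helper names `median_ms` and `bench_qmv` are assumptions made for illustration. Run it once per MLX build (stock, patched) and compare the medians.

```python
import statistics
import time


def median_ms(fn, iters=200, warmup=20):
    """Median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def bench_qmv(M, K=5120, N=5120, group_size=64, bits=4, iters=200):
    """Median latency of one (M, K) x (N, K)^T quantized matmul."""
    import mlx.core as mx  # imported lazily so median_ms works without MLX

    # Assumption: float16 activations and weights; the issue does not state the dtype.
    x = mx.random.normal((M, K)).astype(mx.float16)
    w = mx.random.normal((N, K)).astype(mx.float16)
    w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
    mx.eval(x, w_q, scales, biases)  # materialise inputs before timing

    def step():
        # MLX is lazy, so mx.eval is what actually runs the kernel.
        mx.eval(mx.quantized_matmul(x, w_q, scales, biases, transpose=True,
                                    group_size=group_size, bits=bits))

    return median_ms(step, iters=iters)
```

For example, `for M in (1, 3, 4, 6, 8, 16): print(M, bench_qmv(M))` covers the same M values as the table.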

Two observations:

1. **The `unroll_count(4)` pragma is a regression on M4 Max at every M tested** (7-27% slower than stock). It is probably an M5 Max-specific tuning that doesn't transfer.
2. Even without the pragma, the patch is at best **2-8% faster**, and only at M=3,4; it is within ±2% of stock for M≥6 and 6% slower at M=1. That is far from the kernel-level speedup the headline 2.24× would require if qmv is the dominant verify-MLP cost.
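
To put a number on the second point, here is a back-of-the-envelope Amdahl's-law sketch. It treats the qmv kernel as the only thing the patch speeds up, which is a simplification, and the qmv shares of decode time (50%, 80%, 100%) are hypothetical values I picked for illustration, not measurements.

```python
def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `kernel_fraction` of the
    runtime is accelerated by a factor of `kernel_speedup`."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Best case in the table: M=3 without the pragma, 0.173 ms -> 0.159 ms.
best_kernel_speedup = 0.173 / 0.159

for fraction in (0.5, 0.8, 1.0):  # hypothetical shares of time spent in qmv
    overall = end_to_end_speedup(fraction, best_kernel_speedup)
    print(f"qmv share {fraction:.0%}: {overall:.2f}x end-to-end")
```

Even if qmv were 100% of the runtime, the best measured kernel gain caps the end-to-end speedup at about 1.09×, nowhere near 2.24×.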

**Request:** would you be willing to publish the actual contents of `mlx-mtplx-0.31.2-qmm` at commit `2377a99f`? Either as a git remote URL, a public branch on a fork, or even a patch series attached to the repo. Specifically, if there are changes beyond what `README.md` lines 228-230 describe — e.g. additional kernel restructuring, register-pressure tuning, different reduction strategies, simdgroup-shared-memory work, anything in `qmv_impl` or related kernels — those are the bits the community can't reverse-engineer from the README.

**Why this matters:** MTPLX is currently the only documented native-MTP MLX runtime, and `Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed` is genuinely the best Apple-Silicon-local thinking-agent model I've tested. But the gap between "I followed the README" and "I get the headline number" is a UX cliff for anyone on M4 / M3 Max hardware. Publishing the actual fork — even as a sealed read-only branch — would let users on non-M5 hardware extract a tractable subset of the speedup.

If publication isn't possible (commercial, sponsored, in-flight upstream PR, etc.), that's a fair answer too — just knowing for sure would save reproducers a lot of time.

**Hardware:** Mac Studio M4 Max, 36 GB unified memory, `iogpu.wired_limit_mb=32000`, `mx.set_wired_limit(31 GB)`, macOS 15.7.4. **Software:** MTPLX 0.3.4 from your homebrew tap, with the `Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed` model. For reference, a stock baseline (plain `mlx_lm.server`, no MTP, no patches) gets **50.5 tok/s @ 128k context** on the same hardware with Qwen3.6-35B-A3B-4bit, so the hardware itself isn't the constraint.

Thank you again for the project — happy to share any additional diagnostic data (`mtplx doctor --deep --json`, per-cycle traces, etc.) that would help triage.

