oQ-MTP performance on M1 not as expected #1097

@s-n-t

I've tested Qwen3.6-27B-oQ5-fp16-mtp and Qwen3.6-27B-oQ8-fp16-mtp, both generated from the original weights. The acceptance rate looks high:

2026-05-06 20:12:45,019 - omlx.patches.mlx_lm_mtp.batch_generator - INFO - MTP[1] finish=length tokens=128 cycles=68 accept=58/68 (85.3%) emits[init=2,draft=58,bonus=58,verify=10] timing[backbone=1305.3ms mtp=324.7ms sample=6296.6ms cache=17.4ms]
2026-05-06 20:12:45,056 - omlx.scheduler - INFO - Cache phase timings: cleanup_finished_sync=0.1ms/2, store_cache_main_eval=8.0ms/2, store_cache_main_prep=8.1ms/2
2026-05-06 20:13:12,767 - omlx.patches.mlx_lm_mtp.batch_generator - INFO - MTP path activated for uid=2 (model has mtp_forward, batch=1)
2026-05-06 20:13:21,280 - omlx.patches.mlx_lm_mtp.batch_generator - INFO - MTP[2] finish=length tokens=128 cycles=68 accept=58/68 (85.3%) emits[init=2,draft=58,bonus=58,verify=10] timing[backbone=1323.1ms mtp=325.9ms sample=6410.3ms cache=18.2ms]
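
To make the log line above easier to read, here is a minimal sketch that parses one `MTP[...]` line and reports the acceptance rate, tokens emitted per cycle, and each phase's share of the logged time. The field meanings are assumptions inferred from the field names (for example, that the `timing[...]` values are cumulative wall-clock times per phase for the request); `summarize_mtp_line` is a hypothetical helper, not part of oMLX.

```python
import re

# One MTP summary line copied from the log above (timestamp and logger name dropped).
LOG_LINE = (
    "MTP[1] finish=length tokens=128 cycles=68 accept=58/68 (85.3%) "
    "emits[init=2,draft=58,bonus=58,verify=10] "
    "timing[backbone=1305.3ms mtp=324.7ms sample=6296.6ms cache=17.4ms]"
)


def summarize_mtp_line(line):
    """Hypothetical helper: extract counts and timings from one MTP log line."""
    tokens = int(re.search(r"tokens=(\d+)", line).group(1))
    cycles = int(re.search(r"cycles=(\d+)", line).group(1))
    accepted, drafted = map(int, re.search(r"accept=(\d+)/(\d+)", line).groups())

    # Timing fields look like "name=123.4ms"; assumed to be per-phase totals.
    timing_body = re.search(r"timing\[([^\]]*)\]", line).group(1)
    timing = {k: float(v) for k, v in re.findall(r"(\w+)=([\d.]+)ms", timing_body)}
    total_ms = sum(timing.values())

    return {
        "accept_rate": accepted / drafted,
        "tokens_per_cycle": tokens / cycles,
        "total_ms": total_ms,
        "ms_per_token": total_ms / tokens,
        "share": {k: v / total_ms for k, v in timing.items()},
    }


summary = summarize_mtp_line(LOG_LINE)
print(f"accept rate:      {summary['accept_rate']:.1%}")
print(f"tokens per cycle: {summary['tokens_per_cycle']:.2f}")
for phase, share in sorted(summary["share"].items(), key=lambda kv: -kv[1]):
    print(f"{phase:>8}: {share:6.1%}")
```

If those assumptions hold, `sample` accounts for roughly 79% of the logged time in this line, far more than `backbone` and `mtp` combined, which may be worth checking when looking for the source of the slowdown.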

However, oQ5 runs slower with MTP enabled, and although oQ8 improves, the gain is smaller than I think is expected:

MTP ON

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ5-fp16-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          7308.2       65.63   140.1 tok/s    15.4 tok/s      15.643    73.6 tok/s    19.80 GB
pp4096/tg128         28089.0       65.23   145.8 tok/s    15.5 tok/s      36.373   116.1 tok/s    21.22 GB
pp8192/tg128         56390.1       64.54   145.3 tok/s    15.6 tok/s      64.586   128.8 tok/s    22.25 GB
pp16384/tg128       115483.6       70.19   141.9 tok/s    14.4 tok/s     124.397   132.7 tok/s    23.75 GB
pp32768/tg128       244621.0       75.65   134.0 tok/s    13.3 tok/s     254.228   129.4 tok/s    26.74 GB

MTP OFF

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ5-fp16-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          7318.8       59.80   139.9 tok/s    16.9 tok/s      14.913    77.2 tok/s    19.46 GB
pp4096/tg128         28026.7       61.19   146.1 tok/s    16.5 tok/s      35.798   118.0 tok/s    20.88 GB
pp8192/tg128         56369.0       62.09   145.3 tok/s    16.2 tok/s      64.254   129.5 tok/s    21.91 GB
pp16384/tg128       115442.2       66.20   141.9 tok/s    15.2 tok/s     123.849   133.3 tok/s    23.41 GB
pp32768/tg128       244644.6       73.36   133.9 tok/s    13.7 tok/s     253.961   129.5 tok/s    26.41 GB

MTP ON

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ8-fp16-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          7207.5       75.86   142.1 tok/s    13.3 tok/s      16.841    68.4 tok/s    28.81 GB
pp4096/tg128         27623.5       70.30   148.3 tok/s    14.3 tok/s      36.552   115.6 tok/s    30.26 GB
pp8192/tg128         55456.9       71.22   147.7 tok/s    14.2 tok/s      64.502   129.0 tok/s    31.29 GB
pp16384/tg128       113586.4       78.26   144.2 tok/s    12.9 tok/s     123.526   133.7 tok/s    32.79 GB
pp32768/tg128       240711.8       80.76   136.1 tok/s    12.5 tok/s     250.969   131.1 tok/s    35.79 GB

MTP OFF

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ8-fp16-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          7201.1       89.31   142.2 tok/s    11.3 tok/s      18.543    62.1 tok/s    28.34 GB
pp4096/tg128         27628.8       91.76   148.3 tok/s    11.0 tok/s      39.283   107.5 tok/s    29.80 GB
pp8192/tg128         55460.2       92.42   147.7 tok/s    10.9 tok/s      67.198   123.8 tok/s    30.82 GB
pp16384/tg128       113573.6       95.73   144.3 tok/s    10.5 tok/s     125.732   131.3 tok/s    32.32 GB
pp32768/tg128       240794.7      102.56   136.1 tok/s     9.8 tok/s     253.819   129.6 tok/s    35.32 GB
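
For a side-by-side view of the four tables above, the sketch below computes the decode speedup from MTP as `TPOT(off) / TPOT(on)` for each model and prompt length. The TPOT values are copied from the tables; the `TPOT_MS` layout and the `mtp_speedup` helper are only illustrative names.

```python
# TPOT in ms per prompt length, copied from the tables above: (MTP on, MTP off).
TPOT_MS = {
    "oQ5": {
        1024: (65.63, 59.80),
        4096: (65.23, 61.19),
        8192: (64.54, 62.09),
        16384: (70.19, 66.20),
        32768: (75.65, 73.36),
    },
    "oQ8": {
        1024: (75.86, 89.31),
        4096: (70.30, 91.76),
        8192: (71.22, 92.42),
        16384: (78.26, 95.73),
        32768: (80.76, 102.56),
    },
}


def mtp_speedup(on_ms, off_ms):
    """Speedup factor from enabling MTP; values below 1.0 mean MTP is slower."""
    return off_ms / on_ms


speedups = {
    model: {ctx: mtp_speedup(on, off) for ctx, (on, off) in rows.items()}
    for model, rows in TPOT_MS.items()
}

for model, rows in speedups.items():
    for ctx, factor in rows.items():
        print(f"{model} pp{ctx}/tg128: {factor:.2f}x")
```

By this measure oQ5 is about 3% to 9% slower with MTP on, while oQ8 gains roughly 1.18x to 1.31x. With 128 tokens emitted in 68 cycles (about 1.88 tokens per cycle), the ceiling would be near 1.88x if an MTP cycle cost the same as a normal decode step, which is an idealized assumption.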

Happy to test anything else if it helps.
