I am seeing a large gap between the local smoke benchmark and real agent API calls using the same target/draft pair.
Summary
On real chat/agent requests with ~8.4k prompt tokens, dflash serve is much slower than baseline autoregressive (AR) mode during decode.
The long prefill time is expected (because of the bigger prompt), but post-prefill decode is also much slower than plain AR decoding.
On a Mac M1 Max, plain AR runs at roughly 50-60 tok/s, while DFlash stays around 15-20 tok/s.
Environment
- dflash-mlx 0.1.5.1
- mlx 0.31.2
- mlx-lm 0.31.3
- Python 3.14.3
- Apple M1 Max, 64 GB
- target: mlx-community/Qwen3.6-35B-A3B-4bit
- draft: z-lab/Qwen3.6-35B-A3B-DFlash
What I ran
Smoke benchmark looked good:
- ~102 prompt tokens
- dflash tok/s: 110.97
- acceptance: 0.91
But real server/API diagnostics are very different.
1. verify_mode=auto
From .artifacts/dflash/diagnostics/20260511-165200-serve-basic
- request 1: 8389 prompt tokens, 246 generated, decode_tok_s=15.15, acceptance=0.561, tokens_per_cycle=2.28
- request 2: 8392 prompt tokens, 332 generated, decode_tok_s=19.60, acceptance=0.663, tokens_per_cycle=2.96
2. verify_mode=adaptive
From .artifacts/dflash/diagnostics/20260511-172625-serve-basic
- request 1: 8392 prompt tokens, 332 generated, decode_tok_s=20.14, acceptance=0.663, tokens_per_cycle=2.96
- request 2: 8389 prompt tokens, 206 generated, decode_tok_s=17.75, acceptance=0.597, tokens_per_cycle=2.48
- adaptive did engage on one request: adaptive_block_reductions=1, adaptive_block_cycles=12, adaptive_block_min=4
Better, but still far below plain AR.
3. verify_mode=adaptive + non-quantized draft
From .artifacts/dflash/diagnostics/20260511-172930-serve-basic
Serve command included --draft-quant none.
Results:
- request 1: 8389 prompt tokens, 432 generated, decode_tok_s=20.07, acceptance=0.669, tokens_per_cycle=3.02
- request 2: 8392 prompt tokens, 324 generated, decode_tok_s=19.33, acceptance=0.660, tokens_per_cycle=2.95
Slightly better drafting, but not enough, and memory use increased.
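To make the gap easier to reason about, here is a rough back-of-the-envelope sketch of what one draft+verify cycle costs in the runs above. It assumes that decode_tok_s is generated tokens divided by decode wall time and that tokens_per_cycle is the number of tokens committed per draft+verify cycle; both are my reading of the diagnostics, not documented definitions. The AR baseline of 55 tok/s is simply the midpoint of the 50-60 tok/s range I observed.

```python
# Figures copied from the diagnostics above: (label, decode_tok_s, tokens_per_cycle)
runs = [
    ("auto / req 1", 15.15, 2.28),
    ("auto / req 2", 19.60, 2.96),
    ("adaptive / req 1", 20.14, 2.96),
    ("adaptive / req 2", 17.75, 2.48),
    ("adaptive + unquantized draft / req 1", 20.07, 3.02),
    ("adaptive + unquantized draft / req 2", 19.33, 2.95),
]

AR_TOK_S = 55.0  # assumption: midpoint of the observed 50-60 tok/s AR range


def cycle_ms(decode_tok_s, tokens_per_cycle):
    """Estimated wall time of one draft+verify cycle, in milliseconds."""
    return 1000.0 * tokens_per_cycle / decode_tok_s


def break_even_tokens(decode_tok_s, tokens_per_cycle, ar_tok_s=AR_TOK_S):
    """Tokens per cycle needed to match AR if the cycle time stayed the same."""
    return ar_tok_s * tokens_per_cycle / decode_tok_s


for label, tok_s, tpc in runs:
    print(
        f"{label:38s} cycle={cycle_ms(tok_s, tpc):6.1f} ms  "
        f"break-even={break_even_tokens(tok_s, tpc):4.1f} tok/cycle"
    )
```

Under those assumptions every run lands at roughly 140 to 153 ms per cycle, against about 18 ms for a single AR step at 55 tok/s. At that cycle cost, DFlash would need around 8 tokens per cycle just to match plain AR, while these requests commit 2.3 to 3.0.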
What I checked already
I do not see an obvious serve-vs-benchmark config mismatch in the core runtime:
- profile: balanced
- split-sdpa enabled
- draft_sink_size=64
- draft_window_size=1024
- verify_len_cap=0
- target_fa_window=0
- draft block_size=16
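One more consistency check on the reported metrics, in case it helps narrow things down. The acceptance and tokens_per_cycle values in the serve diagnostics appear to fit acceptance ≈ 1 - 1/tokens_per_cycle. That relation is inferred purely from the numbers in this report and is an assumption, not something I found in the dflash-mlx documentation. If it holds, it gives an implied tokens-per-cycle figure for the smoke benchmark, which only reports acceptance.

```python
# (acceptance, tokens_per_cycle) pairs copied from the serve diagnostics above
pairs = [
    (0.561, 2.28),
    (0.663, 2.96),
    (0.597, 2.48),
    (0.669, 3.02),
    (0.660, 2.95),
]

BLOCK_SIZE = 16          # draft block_size from the config above
SMOKE_ACCEPTANCE = 0.91  # from the smoke benchmark
SMOKE_TOK_S = 110.97     # from the smoke benchmark


def acceptance_from_tpc(tpc):
    # Hypothetical relation inferred from the data: one token per cycle
    # comes from the target, the rest are accepted draft tokens.
    return 1.0 - 1.0 / tpc


def tpc_from_acceptance(acc):
    # Inverse of the hypothetical relation above.
    return 1.0 / (1.0 - acc)


max_err = max(abs(acceptance_from_tpc(t) - a) for a, t in pairs)
smoke_tpc = tpc_from_acceptance(SMOKE_ACCEPTANCE)
smoke_cycle_ms = 1000.0 * smoke_tpc / SMOKE_TOK_S

print(f"max |predicted - reported| acceptance: {max_err:.4f}")
print(f"implied smoke tokens_per_cycle: {smoke_tpc:.1f} (block_size={BLOCK_SIZE})")
print(f"implied smoke cycle time: {smoke_cycle_ms:.0f} ms")
```

If the inferred relation is right, the smoke benchmark commits about 11 tokens per cycle out of a block of 16, versus 2.3 to 3.0 on the real requests. That would point at the drop in accepted draft tokens on long agent prompts as the main difference, rather than at the serve configuration.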
Question
Is this expected for long-context real chat/agent workloads?
If not, what should I try next? Are there any recommended serve settings for long-context agentic traffic?