M1 Max: Long-context API requests are slower with DFlash than basic decoding over Qwen3.6-35B-A3B #32

@hdmi

Description

I am seeing a large gap between the local smoke benchmark and real agent API calls using the same target/draft pair.

Summary

On real chat/agent requests with ~8.4k prompt tokens, dflash serve is much slower than baseline autoregressive (AR) mode during decode.
The long prefill time is expected (because of the bigger prompt), but post-prefill decode is also much worse than plain decoding.
On the M1 Max, plain AR runs at roughly 50-60 tok/s, while DFlash stays around 15-20 tok/s.

Environment

  • dflash-mlx 0.1.5.1
  • mlx 0.31.2
  • mlx-lm 0.31.3
  • Python 3.14.3
  • Apple M1 Max, 64 GB
  • target: mlx-community/Qwen3.6-35B-A3B-4bit
  • draft: z-lab/Qwen3.6-35B-A3B-DFlash

What I ran

Smoke benchmark looked good:

  • ~102 prompt tokens
  • dflash tok/s: 110.97
  • acceptance: 0.91

But real server/API diagnostics are very different.

1. verify_mode=auto

From .artifacts/dflash/diagnostics/20260511-165200-serve-basic

  • request 1: 8389 prompt tokens, 246 generated, decode_tok_s=15.15, acceptance=0.561, tokens_per_cycle=2.28
  • request 2: 8392 prompt tokens, 332 generated, decode_tok_s=19.60, acceptance=0.663, tokens_per_cycle=2.96
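
A quick consistency check on the two metrics above, as a sketch. The reported values fit tokens_per_cycle ≈ 1 / (1 − acceptance), which would hold if acceptance is the share of generated tokens that came from accepted draft tokens and each cycle also emits one target token. That definition is an assumption inferred from the numbers, not something confirmed by dflash-mlx, and the function name is illustrative.

```python
def tokens_per_cycle_from_acceptance(acceptance: float) -> float:
    """Estimate tokens per cycle from acceptance.

    Assumption (not confirmed by dflash-mlx): each cycle yields one target
    token plus the accepted draft tokens, and acceptance is the fraction of
    all generated tokens that were accepted draft tokens.
    """
    if not 0.0 <= acceptance < 1.0:
        raise ValueError("acceptance must be in [0, 1)")
    return 1.0 / (1.0 - acceptance)


# (acceptance, reported tokens_per_cycle) from the verify_mode=auto run
reported = [(0.561, 2.28), (0.663, 2.96)]

for acceptance, tpc in reported:
    estimate = tokens_per_cycle_from_acceptance(acceptance)
    print(f"acceptance={acceptance:.3f} reported={tpc:.2f} estimated={estimate:.2f}")
```

If that relation holds, the smoke benchmark's acceptance of 0.91 would correspond to roughly 11 tokens per cycle, versus 2 to 3 on these long-context requests.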

2. verify_mode=adaptive

From .artifacts/dflash/diagnostics/20260511-172625-serve-basic

  • request 1: 8392 prompt tokens, 332 generated, decode_tok_s=20.14, acceptance=0.663, tokens_per_cycle=2.96
  • request 2: 8389 prompt tokens, 206 generated, decode_tok_s=17.75, acceptance=0.597, tokens_per_cycle=2.48
  • adaptive did engage on one request: adaptive_block_reductions=1, adaptive_block_cycles=12, adaptive_block_min=4

Better, but still far below plain AR.

3. verify_mode=adaptive + non-quantized draft

From .artifacts/dflash/diagnostics/20260511-172930-serve-basic

Serve command included --draft-quant none.

Results:

  • request 1: 8389 prompt tokens, 432 generated, decode_tok_s=20.07, acceptance=0.669, tokens_per_cycle=3.02
  • request 2: 8392 prompt tokens, 324 generated, decode_tok_s=19.33, acceptance=0.660, tokens_per_cycle=2.95

Drafting is slightly better, but not enough, and memory use increased.
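
To put the three runs above in per-cycle terms, here is a rough back-of-the-envelope sketch. It assumes decode_tok_s is generated tokens divided by decode time, so that one draft+verify cycle takes tokens_per_cycle / decode_tok_s seconds, and it assumes the 50-60 tok/s plain AR figure applies at the same ~8.4k context. The labels and helper names are illustrative, not dflash-mlx output.

```python
# (label, decode_tok_s, tokens_per_cycle) copied from the diagnostics above
runs = [
    ("auto, request 1", 15.15, 2.28),
    ("auto, request 2", 19.60, 2.96),
    ("adaptive, request 1", 20.14, 2.96),
    ("adaptive, request 2", 17.75, 2.48),
    ("adaptive + unquantized draft, request 1", 20.07, 3.02),
    ("adaptive + unquantized draft, request 2", 19.33, 2.95),
]

AR_TOK_S = 50.0  # low end of the 50-60 tok/s plain AR range from the Summary


def cycle_ms(decode_tok_s: float, tokens_per_cycle: float) -> float:
    """Estimated wall time of one draft+verify cycle, in milliseconds."""
    return 1000.0 * tokens_per_cycle / decode_tok_s


def break_even_tokens_per_cycle(cycle_time_ms: float, ar_tok_s: float) -> float:
    """Tokens per cycle needed for DFlash to match plain AR throughput."""
    return cycle_time_ms / 1000.0 * ar_tok_s


for label, tok_s, tpc in runs:
    ms = cycle_ms(tok_s, tpc)
    needed = break_even_tokens_per_cycle(ms, AR_TOK_S)
    print(f"{label}: {ms:.0f} ms/cycle, got {tpc:.2f} tok/cycle, "
          f"need {needed:.1f} to match {AR_TOK_S:.0f} tok/s AR")
```

Under those assumptions every run lands at about 140 to 153 ms per cycle, while one plain AR step at 50-60 tok/s takes about 17 to 20 ms. A cycle would therefore need to yield roughly 7 to 9 tokens to break even, against the 2.3 to 3.0 observed, which suggests per-cycle cost matters here as well as acceptance.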

What I checked already

I do not see an obvious serve-vs-benchmark config mismatch in the core runtime:

  • profile: balanced
  • split-sdpa enabled
  • draft_sink_size=64
  • draft_window_size=1024
  • verify_len_cap=0
  • target_fa_window=0
  • draft block_size=16

Question

Is this expected for long-context real chat/agent workloads?
If not, what should I try next? Are there any recommended serve settings for long-context agentic traffic?
