M1 Max: Long-context API requests are slower with DFlash than basic decoding over Qwen3.6-35B-A3B

I am seeing a large gap between the local smoke benchmark and real agent API calls using the same target/draft pair.

## Summary

On real chat/agent requests with ~8.4k prompt tokens, `dflash serve` decodes much slower than baseline autoregressive (AR) mode.
The long prefill time is expected given the larger prompt, but post-prefill decode is also much worse than plain decoding.
On the M1 Max, plain AR runs at roughly `50-60 tok/s`, while DFlash stays around `15-20 tok/s`.

## Environment

- `dflash-mlx` `0.1.5.1`
- `mlx` `0.31.2`
- `mlx-lm` `0.31.3`
- Python `3.14.3`
- Apple M1 Max, 64 GB
- target: `mlx-community/Qwen3.6-35B-A3B-4bit`
- draft: `z-lab/Qwen3.6-35B-A3B-DFlash`

## What I ran

Smoke benchmark looked good:

- ~102 prompt tokens
- `dflash tok/s`: `110.97`
- acceptance: `0.91`

But real server/API diagnostics are very different.

### 1. `verify_mode=auto`

From `.artifacts/dflash/diagnostics/20260511-165200-serve-basic`

- request 1: `8389` prompt tokens, `246` generated, `decode_tok_s=15.15`, `acceptance=0.561`, `tokens_per_cycle=2.28`
- request 2: `8392` prompt tokens, `332` generated, `decode_tok_s=19.60`, `acceptance=0.663`, `tokens_per_cycle=2.96`

### 2. `verify_mode=adaptive`

From `.artifacts/dflash/diagnostics/20260511-172625-serve-basic`

- request 1: `8392` prompt tokens, `332` generated, `decode_tok_s=20.14`, `acceptance=0.663`, `tokens_per_cycle=2.96`
- request 2: `8389` prompt tokens, `206` generated, `decode_tok_s=17.75`, `acceptance=0.597`, `tokens_per_cycle=2.48`
- adaptive did engage on one request: `adaptive_block_reductions=1`, `adaptive_block_cycles=12`, `adaptive_block_min=4`

Better, but still far below plain AR.

### 3. `verify_mode=adaptive` + non-quantized draft

From `.artifacts/dflash/diagnostics/20260511-172930-serve-basic`

Serve command included `--draft-quant none`.

Results:

- request 1: `8389` prompt tokens, `432` generated, `decode_tok_s=20.07`, `acceptance=0.669`, `tokens_per_cycle=3.02`
- request 2: `8392` prompt tokens, `324` generated, `decode_tok_s=19.33`, `acceptance=0.660`, `tokens_per_cycle=2.95`

Drafting is slightly better, but not enough to matter, and memory use increased.
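
As a rough sanity check on the three runs above, the cost of one draft+verify cycle can be estimated from the reported metrics. This is only a back-of-the-envelope sketch: it assumes (my reading of the metric names, not confirmed against the code) that `decode_tok_s` is generated tokens divided by decode wall time and that `tokens_per_cycle` is the average number of tokens emitted per cycle. The helper names are made up for illustration.

```python
# (label, decode_tok_s, tokens_per_cycle), copied from the diagnostics above.
RUNS = [
    ("auto, request 1", 15.15, 2.28),
    ("auto, request 2", 19.60, 2.96),
    ("adaptive, request 1", 20.14, 2.96),
    ("adaptive, request 2", 17.75, 2.48),
    ("adaptive + draft-quant none, request 1", 20.07, 3.02),
    ("adaptive + draft-quant none, request 2", 19.33, 2.95),
]

# Plain AR decode range observed on the same machine (tok/s).
AR_TOK_S = (50.0, 60.0)


def cycle_ms(decode_tok_s: float, tokens_per_cycle: float) -> float:
    """Estimated wall time of one draft+verify cycle, in milliseconds.

    Assumption: decode_tok_s == tokens_per_cycle / seconds_per_cycle.
    """
    return 1000.0 * tokens_per_cycle / decode_tok_s


def break_even_tokens(cycle_time_ms: float, ar_tok_s: float) -> float:
    """Tokens per cycle needed just to match plain AR at ar_tok_s."""
    return cycle_time_ms / 1000.0 * ar_tok_s


if __name__ == "__main__":
    for label, tok_s, tpc in RUNS:
        ms = cycle_ms(tok_s, tpc)
        low, high = (break_even_tokens(ms, rate) for rate in AR_TOK_S)
        print(
            f"{label:40s} cycle ~{ms:6.1f} ms, "
            f"break-even {low:.1f}-{high:.1f} tokens/cycle (observed {tpc})"
        )
```

Under those assumptions every request lands at roughly 140-153 ms per cycle, so matching plain AR would take roughly 7 to 9 tokens per cycle, against the 2.3-3.0 observed. If that reading is right, the per-cycle cost at ~8.4k context matters at least as much as the lower acceptance rate.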

## What I checked already

I do not see an obvious serve-vs-benchmark config mismatch in the core runtime:

- profile: `balanced`
- `split-sdpa` enabled
- `draft_sink_size=64`
- `draft_window_size=1024`
- `verify_len_cap=0`
- `target_fa_window=0`
- draft `block_size=16`

## Question

Is this expected for long-context real chat/agent workloads?
If not, what should I try next? Are there any recommended serve settings for long-context agentic traffic?

