First, thank you for the very fast turnaround on releasing z-lab/Qwen3.5-122B-A10B-DFlash (per #81 — released 2026-04-25). Sharing day-one feedback from M1 Ultra + SwiftLM SSD-streaming, since this is the most memory-constrained large-MoE setup the draft is likely to land on.
## Setup

- Hardware: Apple M1 Ultra, 64 GB unified memory, macOS 26.x, internal NVMe
- Backend: SwiftLM b598-era + recent local fixes (cherry-picked off SharpAI/main), DFlash code path active. The target's Qwen3MoE+DFlash extension picks up Qwen3.5-122B-A10B-4bit correctly:

  ```
  [SwiftLM] DFlash: target model supports DFlashTargetModel
  [SwiftLM] DFlash draft model loaded (block_size=16, 6 target layers, mask_token=248077)
  [SwiftLM] Draft model loaded successfully (16 block size, DFlash mode)
  [SwiftLM] Using speculative decoding: …Qwen3.5-122B-A10B-DFlash → …Qwen3.5-122B-A10B-4bit (DFlash block-diffusion)
  ```

- Target: mlx-community/Qwen3.5-122B-A10B-4bit (69.6 GB, 48 layers, A10B active per token)
- SwiftLM flags: `--stream-experts --dflash --draft-model …Qwen3.5-122B-A10B-DFlash`, with `SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16` (our standard SSD-stream config; also used for the baseline below)
## Results — generation throughput

Streaming bench via `/v1/chat/completions`, single request, `temperature: 0.6`. Same three prompts as the baseline measurement.
| Configuration | Short (~126 tok in) | Medium (~400 tok in) | Long (~800 tok in) |
|---|---|---|---|
| `--stream-experts` baseline (no DFlash) | 6.30 tok/s · 153 tok generated · `stop` | 6.11 tok/s · 246 tok · `stop` | 6.22 tok/s · 800 tok · `length` |
| `--stream-experts --dflash …-DFlash` | 6.30 tok/s · 200 tok · `finish_reason=null` | 2.78 tok/s · 395 tok · `finish_reason=null` | server crashed mid-run |
DFlash is net-negative on this hardware: parity on short, −55% on medium, server crash on long.
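For reference, the per-prompt measurement loop looks roughly like this (a sketch, not the actual bench script: the endpoint and sampling settings come from the setup above, but the payload shape and the one-chunk-per-token counting are my assumptions):

```python
import json
import time
import urllib.request

def build_payload(prompt, max_tokens, temperature=0.6):
    # Same sampling settings as the bench runs above.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": True,
    }

def bench(prompt, max_tokens, url="http://localhost:8002/v1/chat/completions"):
    """Single streaming request; returns (tok/s, tokens generated, finish_reason)."""
    req = urllib.request.Request(
        url,
        json.dumps(build_payload(prompt, max_tokens)).encode(),
        {"Content-Type": "application/json"},
    )
    tokens, finish, start = 0, None, time.monotonic()
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode().strip()
            if not line.startswith("data: ") or line.endswith("[DONE]"):
                continue
            choice = json.loads(line[len("data: "):])["choices"][0]
            if choice.get("delta", {}).get("content"):
                tokens += 1  # assumes one SSE chunk per generated token
            finish = choice.get("finish_reason") or finish
    return tokens / (time.monotonic() - start), tokens, finish
```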
## Acceptance pattern (this is the interesting part)
The DFlash cycle log shows a clear pathological pattern across hundreds of cycles:
```
[DFlash] Cycle 180: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 181: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 182: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 183: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 184: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 185: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
```

…repeated for 200+ cycles, with acceptanceLen always 0 or 1.
`acceptanceLen` is consistently 0 or 1 out of `block_size=16`. The expected/healthy range for a well-aligned DFlash draft is much higher (your README's MLX section implies the draft is meant to commit several tokens per block on average).
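For anyone reproducing, the acceptance numbers can be pulled straight out of the cycle log; a quick parsing sketch (the regex just matches the log lines quoted above):

```python
import re

CYCLE_RE = re.compile(
    r"\[DFlash\] Cycle (\d+):.*acceptanceLen=(\d+), commitCount=(\d+)"
)

def acceptance_stats(log_text):
    """Mean acceptanceLen / commitCount across all DFlash cycle lines."""
    rows = [tuple(map(int, m.groups())) for m in CYCLE_RE.finditer(log_text)]
    if not rows:
        return None
    accept = [a for _, a, _ in rows]
    commit = [c for _, _, c in rows]
    return {
        "cycles": len(rows),
        "mean_accept": sum(accept) / len(accept),
        "mean_commit": sum(commit) / len(commit),
    }
```

On the six cycles quoted above this gives a mean acceptanceLen of ≈0.67 and a mean commitCount of ≈1.67.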
So we're paying a 17-position verify pass for roughly one committed token. Under `--stream-experts`, that verify pass means SSD reads for 17 positions × 8 experts each, ≈136 expert-weight reads per layer per cycle, versus ~8 reads per layer per token in vanilla decoding. That fan-out accounts for most of the regression.
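To make that arithmetic explicit, a back-of-envelope cost model (a worst-case sketch: it assumes every expert read misses the expert cache and hits the SSD, which overstates real cost but shows the shape of the problem):

```python
def reads_per_committed_token(verify_positions, experts_per_token, accepted_per_cycle):
    """Expert-weight reads per layer per committed token, assuming every
    read misses the expert cache (worst case for SSD streaming)."""
    return verify_positions * experts_per_token / accepted_per_cycle

# Vanilla decoding: 1 position, 8 routed experts, 1 token committed.
vanilla = reads_per_committed_token(1, 8, 1)    # 8 reads/layer/token
# DFlash here: 17-position verify, ~1 accepted token per cycle.
dflash = reads_per_committed_token(17, 8, 1)    # 136 reads/layer/token
# On read volume alone, DFlash only breaks even once ~17 tokens commit
# per cycle, i.e. when essentially the whole block is accepted.
```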
## Crash on long prompt
The server crashed silently partway through the long-prompt bench. The DFlash cycle log stops abruptly at `Cycle ~202`, followed by a single `.`, and then the process is gone: no backtrace, no OOM marker visible in stdout/stderr. `vm_stat` run immediately afterwards showed plenty of free pages, so it doesn't look like classic system-wide memory pressure. It could be MLX-internal (KV cache + draft block buffers compounding under sustained verify) or a DFlash-specific edge case. Happy to instrument and re-run if useful.
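For the instrumented re-run, even a dumb wrapper makes the silent death less silent (a sketch; swap the `sleep 1` placeholder for the actual SwiftLM invocation from the repro section):

```shell
# Capture stderr and the exit status even when the process dies without a word.
( sleep 1 ) 2> server.err   # placeholder for the SwiftLM launch
echo "exit=$?" | tee server.exit
# macOS keeps crash reports here even when nothing reaches stderr:
ls -t ~/Library/Logs/DiagnosticReports 2>/dev/null | head -n 3
```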
## Possible directions
Two hypotheses I'd love your read on:
1. Draft–target distribution mismatch at the MoE routing layer. If the 4-layer draft's hidden state at the routing boundaries differs slightly from the target's, the draft's predicted block routes through the "wrong" experts and the target rejects almost everything. Is the published draft a stable checkpoint, or is there a known iteration coming with better acceptance?

2. `--stream-experts` interaction, similar in spirit to SwiftLM's #72 (SSD streaming + vanilla `--draft-model` causing fan-out, ultimately auto-capped to 1 draft token). DFlash bypasses that auto-cap because it's a different code path. Worth knowing whether you've validated the 122B draft on a swap-bound / out-of-core path, or only on fully-resident-in-RAM setups.
If a fresh draft checkpoint or a recommended config (smaller `block_size`, a sliding-window cap, etc.) would change the picture, I'm happy to re-bench and report back.
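On hypothesis 1, a cheap diagnostic would be dumping router logits at the same positions from the draft and target and checking top-k expert overlap. A sketch (pure illustration: the logit dumps would require instrumented forward passes, and the `k=8` default mirrors the fan-out estimate above):

```python
def topk_expert_overlap(draft_logits, target_logits, k=8):
    """Mean fraction of top-k routed experts shared between draft and
    target at each position. Inputs: per-position lists of router logits."""
    def topk(row):
        # Indices of the k largest logits = the experts the router would pick.
        return set(sorted(range(len(row)), key=lambda i: -row[i])[:k])
    overlaps = [len(topk(d) & topk(t)) / k
                for d, t in zip(draft_logits, target_logits)]
    return sum(overlaps) / len(overlaps)
```

Identical routing gives 1.0; if low overlap lines up with the rejected blocks, that would support the routing-mismatch story.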
## Repro
```shell
SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16 \
SwiftLM \
  --model <path>/Qwen3.5-122B-A10B-4bit \
  --port 8002 \
  --stream-experts \
  --dflash \
  --draft-model <path>/Qwen3.5-122B-A10B-DFlash
```
Bench script (single-request streaming, three prompts at 200/400/800 max-tokens) is the same one used for the M1 Ultra baseline numbers in SwiftLM #84. Happy to share JSON / logs if useful.
Cross-reference: also flagging this on the SwiftLM side since it intersects with their SSD-stream + DFlash code path.