
Qwen3.5-122B-A10B-DFlash + SwiftLM SSD-stream on M1 Ultra: low acceptance + server crash #91

@ericjlake

Description

First, thank you for the very fast turnaround on releasing z-lab/Qwen3.5-122B-A10B-DFlash (per #81 — released 2026-04-25). Sharing day-one feedback from M1 Ultra + SwiftLM SSD-streaming, since this is the most memory-constrained large-MoE setup the draft is likely to land on.

Setup

  • Hardware: Apple M1 Ultra, 64 GB unified memory, macOS 26.x, internal NVMe
  • Backend: SwiftLM b598-era + recent local fixes (cherry-picked off SharpAI/main), DFlash code path active. The target's Qwen3MoE+DFlash extension picks up Qwen3.5-122B-A10B-4bit correctly:
    [SwiftLM] DFlash: target model supports DFlashTargetModel
    [SwiftLM] DFlash draft model loaded (block_size=16, 6 target layers, mask_token=248077)
    [SwiftLM] Draft model loaded successfully (16 block size, DFlash mode)
    [SwiftLM] Using speculative decoding: …Qwen3.5-122B-A10B-DFlash → …Qwen3.5-122B-A10B-4bit (DFlash block-diffusion)
    
  • Target: mlx-community/Qwen3.5-122B-A10B-4bit (69.6 GB, 48 layers, ~10B params active per token)
  • SwiftLM flags: --stream-experts --dflash --draft-model …Qwen3.5-122B-A10B-DFlash, SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16 (our standard SSD-stream config; used for the baseline below as well)

Results — generation throughput

Streaming bench via /v1/chat/completions, single request, temperature: 0.6. Same three prompts as the baseline measurement.

| Configuration | Short (~126 tok in) | Medium (~400 tok in) | Long (~800 tok in) |
| --- | --- | --- | --- |
| --stream-experts baseline (no DFlash) | 6.30 tok/s · 153 tok generated · stop | 6.11 tok/s · 246 tok · stop | 6.22 tok/s · 800 tok · length |
| --stream-experts --dflash …-DFlash | 6.30 tok/s · 200 tok · finish_reason=null | 2.78 tok/s · 395 tok · finish_reason=null | server crashed mid-run |

DFlash is net-negative on this hardware: parity on short, −55% on medium, server crash on long.

Acceptance pattern (this is the interesting part)

The DFlash cycle log shows a clear pathological pattern across hundreds of cycles:

[DFlash] Cycle 180: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 181: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 182: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 183: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 184: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 185: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
…repeated 200+ cycles, acceptanceLen always 0 or 1

acceptanceLen is consistently 0 or 1 out of block_size=16. The expected/healthy range for a well-aligned DFlash draft is much higher (your README's MLX section implies the draft is meant to commit several tokens per block on average).
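
For anyone who wants to double-check their own runs, a minimal scrape of the cycle log (assuming the exact line format shown above; adjust the regex if your build logs differently) gives the acceptance distribution. Feed it the server log on stdin:

```python
# Summarize acceptanceLen from [DFlash] cycle log lines on stdin.
# Assumes the exact log format quoted in this issue.
import re
import sys
from collections import Counter

pat = re.compile(
    r"\[DFlash\] Cycle (\d+): blockLen=(\d+), verifyLen=(\d+), "
    r"acceptanceLen=(\d+), commitCount=(\d+)"
)

counts = Counter()
total_cycles = 0
total_accepted = 0
for line in sys.stdin:
    m = pat.search(line)
    if m is None:
        continue
    accepted = int(m.group(4))
    counts[accepted] += 1
    total_cycles += 1
    total_accepted += accepted

if total_cycles:
    print(f"{total_cycles} cycles, mean acceptanceLen = "
          f"{total_accepted / total_cycles:.2f}")
    for k in sorted(counts):
        print(f"  acceptanceLen={k}: {counts[k]} cycles")
```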

So we're paying a 17-position verify pass — which on --stream-experts means SSD reads for 17 positions × 8 experts each = ~136 expert weight reads per layer per cycle, vs ~8 reads per layer per token in vanilla — for ~1 committed token. That fan-out is most of the regression.
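
To put numbers on that, here's a first-order read-cost model. The assumptions (verify touches block_size + 1 positions, 8 experts per position per layer as in the figure above, no expert reuse across positions, every read missing the MoE cache) are all pessimistic, but they show how far acceptance would have to climb to close the gap:

```python
# First-order SSD read-cost model for --stream-experts decoding.
# ASSUMPTIONS (not measured): verify covers block_size + 1 positions,
# 8 experts activated per position per layer (per the ~8 reads/layer/token
# figure above), no expert overlap between positions, every read misses
# the MoE cache.
EXPERTS_PER_TOKEN = 8

def reads_per_committed_token(block_size: int, mean_commit: float) -> float:
    """Expert weight reads per layer, per committed token, under DFlash."""
    return (block_size + 1) * EXPERTS_PER_TOKEN / mean_commit

vanilla = EXPERTS_PER_TOKEN  # one position per committed token, no draft
for commit in (1.0, 1.5, 2.0, 8.0):
    r = reads_per_committed_token(16, commit)
    print(f"block=16, mean commit={commit}: {r:6.1f} reads/token "
          f"({r / vanilla:4.1f}x vanilla)")
```

Under these assumptions even a mean commit of 8 tokens/cycle would still cost ~2x vanilla reads per committed token, which is why a smaller block_size or an auto-cap (see hypothesis 2 below) looks attractive on swap-bound setups.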

Crash on long prompt

The server crashed silently partway through the long-prompt bench. The DFlash cycle log stops abruptly around Cycle 202, followed by a single "." character, and then the process is gone: no backtrace, no OOM marker in stdout/stderr. vm_stat run immediately afterward showed plenty of free pages, so it doesn't look like classic system-wide memory pressure; it could be MLX-internal (KV cache plus draft block buffers compounding under sustained verify) or a DFlash-specific edge case. Happy to instrument and re-run if useful.
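
Concretely, I can leave a trivial memory sampler like this running next to the server during the long-prompt re-run (macOS vm_stat, sampled every 2 s), so the last snapshot before the crash is on record:

```python
# Poll vm_stat every 2 s and append to a trace file, so the final memory
# snapshot before a silent crash is preserved. macOS-only (uses vm_stat).
import subprocess
import time

with open("vmstat_trace.log", "a") as f:
    while True:
        snap = subprocess.run(["vm_stat"], capture_output=True,
                              text=True).stdout
        f.write(f"--- {time.strftime('%H:%M:%S')} ---\n{snap}")
        f.flush()
        time.sleep(2)
```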

Possible directions

Two hypotheses I'd love your read on:

  1. Draft–target distribution mismatch at the MoE routing layer. If the 4-layer draft's hidden state at the routing boundaries differs slightly from the target's, the draft's predicted block routes through "wrong" experts and the target rejects almost everything. Is the published draft a stable checkpoint, or is there a known iteration coming with better acceptance?

  2. --stream-experts interaction, similar in spirit to SwiftLM's #72 (SSD streaming + vanilla --draft-model causing fan-out, ultimately auto-capped to 1 draft token). DFlash bypasses that auto-cap because it's a different code path. Worth knowing whether you've validated the 122B draft on a swap-bound / out-of-core path, or only fully-resident-in-RAM setups.

If a fresh draft checkpoint or recommended config (smaller block_size, sliding-window cap, etc.) would change the picture, happy to re-bench and report back.

Repro

SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16 \
  SwiftLM \
    --model <path>/Qwen3.5-122B-A10B-4bit \
    --port 8002 \
    --stream-experts \
    --dflash \
    --draft-model <path>/Qwen3.5-122B-A10B-DFlash

Bench script (single-request streaming, three prompts at 200/400/800 max-tokens) is the same one used for the M1 Ultra baseline numbers in SwiftLM #84. Happy to share JSON / logs if useful.
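
For reference, the bench loop amounts to this sketch (prompt bodies elided; SSE parsing assumes OpenAI-style "data:" lines, and counting one token per content delta is an approximation):

```python
# Single-request streaming bench against SwiftLM's OpenAI-compatible endpoint.
# The "…" prompt bodies are placeholders; one token per content delta is
# an approximation of true token throughput.
import json
import time
import requests

URL = "http://localhost:8002/v1/chat/completions"
RUNS = [
    ("short", "…~126-token prompt…", 200),
    ("medium", "…~400-token prompt…", 400),
    ("long", "…~800-token prompt…", 800),
]

for name, prompt, max_tokens in RUNS:
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "stream": True,
    }
    n_tokens, t0 = 0, time.time()
    with requests.post(URL, json=body, stream=True, timeout=900) as resp:
        for raw in resp.iter_lines():
            if not raw or not raw.startswith(b"data: "):
                continue
            payload = raw[len(b"data: "):]
            if payload == b"[DONE]":
                break
            chunk = json.loads(payload)
            if chunk["choices"][0].get("delta", {}).get("content"):
                n_tokens += 1
    dt = time.time() - t0
    print(f"{name}: {n_tokens} deltas in {dt:.1f}s -> {n_tokens / dt:.2f} tok/s")
```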

Cross-reference: also flagging this on the SwiftLM side since it intersects with their SSD-stream + DFlash code path.
