DFlash vLLM for DGX Spark
Plug & Play Block-Diffusion Speculative Decoding, optionally stacked with TurboQuant KV compression
docker pull ghcr.io/aeon-7/vllm-dflash:latest
Read First • Quick Start • Performance • TurboQuant • Which config? • Config • Troubleshooting
| Component | Requirement |
|---|---|
| GPU | NVIDIA Blackwell with compute capability SM120 / SM121: RTX 5090 (SM120), DGX Spark GB10 (SM121). B200/GB200 are Blackwell but report SM100 and are untested |
| Architecture | aarch64 (ARM64) for DGX Spark; x86_64 should work for B200 but is untested |
| Unified / GPU memory | ≥ 64 GB free (DGX Spark has 128 GB unified) |
| Disk | ≥ 50 GB free: 18 GB image + 20 GB model + 4 GB drafter + cache |
| NVIDIA driver | ≥ 580.x (GB10 support) |
| Docker | ≥ 25.x with nvidia-container-toolkit installed |
Will NOT work on: H100/H200 (Hopper), A100 (Ampere), L40/L4, RTX 40-series, or any pre-Blackwell hardware. NVFP4 tensor cores are Blackwell-exclusive.
- First boot takes 5–7 minutes after the model is on disk (weight load + CUDA graph capture + FlashInfer NVFP4 autotune). Subsequent boots with cached JIT artifacts are ~2 min.
- First inference request takes ~30 seconds even after the server reports healthy — vLLM does one-time CUDA graph specialization on the first real input. Always warm up 1–3 requests before measuring latency.
- Model download is 20 GB; DFlash drafter auto-downloads another ~4 GB on first container start.
- DFlash acceptance is content-dependent: expect roughly 54–64 tok/s on code/reasoning and about 30 tok/s on free prose (single stream). Random-token adversarial inputs will show no speedup.
- Set VLLM_API_KEY as a persistent env var, not a one-shot $(openssl rand ...) expansion inside docker run; you need the same value to call the API later.
- TurboQuant is optional. The default image does not include it. See TurboQuant (optional) for when to enable it.
- Config sensitivity: the default tuned settings (MAX_NUM_SEQS=16, MAX_NUM_BATCHED_TOKENS=32768) are sized for DGX Spark's 128 GB. On smaller Blackwell cards (e.g. RTX 5090 32 GB) drop MAX_MODEL_LEN and GPU_MEMORY_UTILIZATION proportionally.
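One way to keep the key stable across shells and container restarts is to generate it once and read it back from a file afterwards. The sketch below assumes a hypothetical helper name (ensure_api_key) and a key-file path of your choosing; neither ships with the image.

```shell
# ensure_api_key FILE
# Hypothetical helper (not part of the image): create a 64-hex-char key in
# FILE on first use, then print the stored key on every later call.
ensure_api_key() {
  key_file="$1"
  if [ ! -s "$key_file" ]; then
    # umask 077 in a subshell so the new file is readable only by its owner
    ( umask 077
      od -An -N32 -tx1 /dev/urandom | tr -d ' \n' > "$key_file" )
  fi
  cat "$key_file"
}

# Usage (the path is an assumption; pick any location you back up):
#   export VLLM_API_KEY="$(ensure_api_key "$HOME/.vllm_api_key")"
```

Because the helper only generates a key when the file is missing or empty, re-running it never rotates the key by accident.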
Check the Troubleshooting section — most problems are one of: queue saturation at high concurrency, CUDA OOM from oversized MAX_MODEL_LEN, or model-name mismatch between the mount and the API call.
A pre-built vLLM container tuned for NVIDIA DGX Spark (GB10 Blackwell), serving AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4, a 27B hybrid linear-attention + full-attention model, NVFP4-quantized, vision-capable:
- DFlash block-diffusion speculative decoding (k=15) — 2–5× faster decode than vanilla vLLM depending on prompt class
- NVFP4 quantization with AWQ calibration — native Blackwell FP4 tensor cores
- OpenAI-compatible /v1/chat/completions, /v1/completions, /v1/models, /health
- Optional TurboQuant KV-cache compression for long-context / high-concurrency; see TurboQuant
All benchmarks below are on DGX Spark GB10 (128 GB unified memory, 273 GB/s LPDDR5X).
Three copy-paste steps. Plan for ~40 minutes end-to-end the first time — 20 min for model download (depends on your connection), 5–7 min for cold boot, a few seconds per warmup request.
pip install "huggingface_hub[cli]"
huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir /models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
# Generate + remember an API key (save it; you'll need it for every request)
export VLLM_API_KEY=$(openssl rand -hex 32)
echo "export VLLM_API_KEY=$VLLM_API_KEY" >> ~/.bashrc # optional: persist across shells
docker run -d --name vllm-dflash \
--gpus all --network host --ipc host --ulimit memlock=-1:-1 \
-v /models/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target:ro \
-e MODEL_PATH=/models/target \
-e SERVED_MODEL_NAME=qwen35-dflash \
-e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
-e DFLASH_NUM_SPEC_TOKENS=15 \
-e MAX_MODEL_LEN=65536 \
-e MAX_NUM_SEQS=16 \
-e MAX_NUM_BATCHED_TOKENS=32768 \
-e GPU_MEMORY_UTILIZATION=0.85 \
-e ATTENTION_BACKEND=flash_attn \
-e VLLM_API_KEY=$VLLM_API_KEY \
ghcr.io/aeon-7/vllm-dflash:latest
The drafter (z-lab/Qwen3.5-27B-DFlash) auto-downloads on first run (~4 GB). Requests address the model by the SERVED_MODEL_NAME you set above (qwen35-dflash here).
# Wait for healthy (cold start ~5–7 min — watch docker logs -f vllm-dflash if curious)
until curl -sf http://localhost:8000/health; do sleep 10; done
echo "server up"
# Warm up (first request does CUDA graph specialization — expect ~30 s)
for i in 1 2 3; do
curl -sf -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{"model":"qwen35-dflash","messages":[{"role":"user","content":"hi"}],"max_tokens":16,"temperature":0}' \
> /dev/null && echo "warmup $i ok"
done
# Real test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "qwen35-dflash",
"messages": [{"role": "user", "content": "Write a binary search in Python with type hints."}],
"max_tokens": 256,
"temperature": 0
}'
You're running a 27B multimodal model with 2–5× speculative-decoding speedup on a 128 GB Spark.
All numbers below are on unmodified ghcr.io/aeon-7/vllm-dflash:latest running
AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4, DFlash k=15, 65K context, with the
recommended configuration above. Measurements use natural-language prompts with
temperature=0 for full determinism. See BENCHMARKS.md for the
reproducible script.
Post-warmup, 3 runs, variance <0.3%.
| Prompt style | Tok/s | TPOT p50 | Notes |
|---|---|---|---|
| Code (algorithm + docstrings) | 64.0 | 15.2 ms | Highly patterned — DFlash excels |
| Reasoning (math step-by-step) | 54.0 | 18.4 ms | Structured, predictable |
| Dialogue (chat continuation) | 38.4 | 26.0 ms | Natural conversational |
| Prose (free-form essay) | 29.5 | 33.6 ms | Creative text — DFlash hardest to apply |
DFlash acceptance length (tokens accepted per 15-token draft) ranges from ~2.0 on prose to ~5.5 on code. Per-position acceptance decays from ~78% at position 0 to <3% by position 8.
Code (best case for DFlash):
| Concurrency | Aggregate tok/s | Median per-req | TTFT p50 | TPOT p50 |
|---|---|---|---|---|
| c=1 | 64.0 | 64.0 tok/s | 239 ms | 15.2 ms |
| c=4 | 181.5 | 45.4 tok/s | 408 ms | 21.2 ms |
| c=8 | 262.8 | 32.9 tok/s | 564 ms | 29.4 ms |
| c=16 | 327.9 | 20.5 tok/s | 884 ms | 47.1 ms |
Prose (worst case):
| Concurrency | Aggregate tok/s | Median per-req | TTFT p50 | TPOT p50 |
|---|---|---|---|---|
| c=1 | 29.5 | 29.5 tok/s | 225 ms | 33.6 ms |
| c=4 | 83.5 | 21.1 tok/s | 432 ms | 46.8 ms |
| c=8 | 122.4 | 15.3 tok/s | 557 ms | 64.4 ms |
| c=16 | 151.8 | 9.5 tok/s | 860 ms | 104 ms |
At c=16 the container serves 328 tok/s on coding / 152 tok/s on prose, with TTFT below 900 ms.
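The per-request column follows from the aggregate: dividing aggregate tok/s by the concurrency gives the mean rate each request sees, and comparing that with the c=1 rate shows how much single-stream speed is traded for total throughput. A minimal sketch using the table values; the function names are illustrative, and the mean it computes can differ slightly from the reported median.

```shell
# per_request_tps AGGREGATE_TPS CONCURRENCY -> mean tok/s seen by one request
per_request_tps() {
  awk -v agg="$1" -v c="$2" 'BEGIN { printf "%.1f\n", agg / c }'
}

# per_request_share AGGREGATE_TPS CONCURRENCY SINGLE_STREAM_TPS
# -> percent of the c=1 speed that each request keeps under load
per_request_share() {
  awk -v agg="$1" -v c="$2" -v one="$3" 'BEGIN { printf "%.0f\n", 100 * agg / (c * one) }'
}

per_request_tps 327.9 16          # code, c=16  -> 20.5
per_request_share 327.9 16 64.0   # code, c=16  -> 32
per_request_tps 151.8 16          # prose, c=16 -> 9.5
```

At c=16 each request keeps about a third of its single-stream speed while the container delivers roughly five times the single-stream aggregate.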
| Metric | Value |
|---|---|
| Peak single-stream | 64.0 tok/s (code) |
| Peak aggregate (c=16) | 327.9 tok/s (code), 151.8 tok/s (prose) |
| TPOT p50 range | 15 ms (code, c=1) → 104 ms (prose, c=16) |
| TTFT p50 range | 225 ms (c=1) → 884 ms (c=16) |
| Model size | ~20 GB (NVFP4) |
| KV headroom | 70 GiB free after weights + graphs |
| Max context | 65K default (model supports up to 262K) |
| Your workload looks like… | Recommended | Why |
|---|---|---|
| Single interactive chat user, short-to-medium context | Baseline (this image, defaults above) | 3% overhead isn't worth the extra moving part |
| 4–16 concurrent users, typical chat <16K | Baseline | Tuned config already handles this well |
| Agent fleet: 32–128 concurrent, tool-calling under 16K | TurboQuant hybrid | KV capacity is the bottleneck, not decode throughput |
| Long-context work >32K, single-session | Depends — measure first | TurboQuant saves memory but -11% decode at 32K |
| Getting OOM at current concurrency / context | TurboQuant hybrid | 3.76× KV compression recovers the headroom |
| Latency-critical single-prompt serving | Baseline | Every 3% decode matters when p50 is 15 ms |
Short version: TurboQuant is a capacity unlock, not a throughput unlock. Enable it when you're memory-bound (many concurrent sessions or long contexts); skip it when you're compute-bound (few sessions, short prompts).
For long-context or high-concurrency workloads, the container can be extended with 0xSero/turboquant KV-cache compression (4-bit keys, 3-bit values, Hadamard-rotation + Lloyd-Max codebooks — paper: arXiv:2504.19874).
TurboQuant is not enabled in the default image. To use it, build the extension
Dockerfile in turboquant/ which pip-installs the plugin and wires
it in via a Python .pth bootstrap.
Measured on the same model + tuned config; values are aggregate tok/s. TurboQuant overhead is roughly 2–4% across all modes, concurrencies, and prompt styles, which is essentially free on short-to-medium outputs.
Code prompts:
| Concurrency | TQ off | TQ capture_only | TQ hybrid | Δ hybrid vs off |
|---|---|---|---|---|
| c=1 | 64.02 | 61.50 | 61.71 | -3.61% |
| c=4 | 181.47 | 175.71 | 175.79 | -3.13% |
| c=8 | 262.77 | 255.19 | 252.78 | -3.80% |
| c=16 | 327.89 | 314.93 | 318.36 | -2.91% |
Prose prompts:
| Concurrency | TQ off | TQ capture_only | TQ hybrid | Δ hybrid vs off |
|---|---|---|---|---|
| c=1 | 29.46 | 28.14 | 28.49 | -3.29% |
| c=4 | 83.53 | 80.28 | 80.72 | -3.36% |
| c=8 | 122.41 | 117.67 | 119.17 | -2.65% |
| c=16 | 151.81 | 147.43 | 148.80 | -1.98% |
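The Δ column is the relative change in aggregate throughput between hybrid mode and TurboQuant off. A small sketch for recomputing it from your own runs; pct_delta is an illustrative name, not a tool shipped with the image.

```shell
# pct_delta BASELINE MEASURED -> signed percent change, two decimals
pct_delta() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%+.2f\n", 100 * (cur - base) / base }'
}

pct_delta 64.02 61.71     # code, c=1   -> -3.61
pct_delta 151.81 148.80   # prose, c=16 -> -1.98
```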
TurboQuant's hybrid-mode decode cost is flat until the 128-token ring buffer overflows, then grows with context length because each decode step has to dequantize more compressed history. Short-to-medium contexts see no penalty; decode slows measurably at 32K+.
| Context tokens | TQ off decode | TQ hybrid decode | Δ |
|---|---|---|---|
| 4,000 | 31.81 tok/s | 33.35 tok/s | +4.85% |
| 16,000 | 23.92 tok/s | 24.20 tok/s | +1.18% |
| 32,000 | 19.43 tok/s | 17.22 tok/s | -11.38% |
Enable TurboQuant for:
- Multi-session long-context serving: the real win is KV capacity, letting you hold more simultaneous sessions at full context (not visible in c=1 microbenchmarks)
- Agentic workloads with long rolling context where freeing compressed history recovers VRAM for the next request
- Any use case hitting OOM on long contexts under default KV
Skip TurboQuant for:
- Short-context single-user chat (<16K): the decode overhead isn't worth the complexity when there's no capacity pressure
- Pure latency-critical 32K+ single-request paths — you'll eat the ~11% decode cost without the capacity payoff
| TQ_MODE | What it does |
|---|---|
| off | Plugin installed but dormant; zero overhead |
| capture_only | Captures K/V into compressed store; attention still uses paged cache |
| hybrid | Attention reads from compressed history beyond a 128-token ring buffer |
| full_tq | (experimental) TQ handles prefill too |
Build and run the TurboQuant variant:
cd turboquant
docker build -t vllm-dflash-tq:latest .
docker run -d --name vllm-dflash-tq \
--gpus all --network host --ipc host --ulimit memlock=-1:-1 \
-v /models/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target:ro \
-e MODEL_PATH=/models/target \
-e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
-e DFLASH_NUM_SPEC_TOKENS=15 \
-e MAX_MODEL_LEN=65536 \
-e MAX_NUM_SEQS=16 \
-e MAX_NUM_BATCHED_TOKENS=32768 \
-e GPU_MEMORY_UTILIZATION=0.85 \
-e ATTENTION_BACKEND=flash_attn \
-e ENABLE_TURBOQUANT=1 \
-e TQ_MODE=hybrid \
-e TQ_KEY_BITS=4 \
-e TQ_VALUE_BITS=3 \
vllm-dflash-tq:latest
0xSero/turboquant currently requires a small patch to be CUDA-graph-safe
(PR #12). The Dockerfile in
turboquant/ applies that patch automatically. Once the PR is merged upstream,
the patch step will be removed.
| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | required | Local path to target model |
| DFLASH_DRAFTER | required | HF repo or path of the DFlash drafter |
| DFLASH_NUM_SPEC_TOKENS | 15 | Speculative token count per draft |
| MAX_MODEL_LEN | 65536 | Maximum sequence length (model supports up to 262144) |
| MAX_NUM_SEQS | 16 | Concurrent sequences (default was 4; 16 is the Spark sweet spot) |
| MAX_NUM_BATCHED_TOKENS | 32768 | Scheduler token budget (default was 8192; 32768 unblocks c=8+) |
| GPU_MEMORY_UTILIZATION | 0.85 | VRAM fraction; keep at 0.85 on Spark to avoid swap |
| ATTENTION_BACKEND | flash_attn | Try TRITON_ATTN if you hit FA kernel bugs |
| VLLM_API_KEY | unset | Bearer token required for all endpoints when set. Generate + save this value; you need the same string for every client call |
| SERVED_MODEL_NAME | basename($MODEL_PATH) | Name clients pass as "model": in requests. Set this explicitly to avoid confusion |
| EXTRA_ARGS | unset | Passed verbatim to vllm serve |
| Variable | Default | Description |
|---|---|---|
| TQ_MODE | hybrid | off / capture_only / hybrid / full_tq |
| TQ_KEY_BITS | 4 | Key quantization bits (3–4 typical) |
| TQ_VALUE_BITS | 3 | Value quantization bits (2–4; 2 loses quality) |
| TQ_RING_CAPACITY | 128 | Exact-precision tokens at tail of context |
| TQ_INITIAL_LAYERS | 4 | First N layers get key_bits+1 for quality |
The three env vars that matter most on GB10:
MAX_NUM_SEQS=16 # was 4 — unlocks c=8+ without queue saturation
MAX_NUM_BATCHED_TOKENS=32768 # was 8192 — matches scheduler's spec-decode headroom
GPU_MEMORY_UTILIZATION=0.85 # safe headroom; don't push higher on 128 GB unified
At defaults, c=8 hit TTFT p50 of 14.7 seconds due to queue saturation. With the tuned config, c=8 drops to 817 ms and c=16 becomes usable.
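To keep the tuned values from being dropped when a docker run command is retyped, they can live in a file that Docker reads with its --env-file flag. A sketch, assuming a hypothetical file name (dflash.env) and helper name; the variable names and values are the ones from the Quick Start.

```shell
# write_tuned_env FILE
# Hypothetical helper: write the tuned GB10 settings, one VAR=value per line,
# in the plain format that `docker run --env-file` reads (no quotes, no export).
write_tuned_env() {
  cat > "$1" <<'EOF'
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=16
MAX_NUM_BATCHED_TOKENS=32768
GPU_MEMORY_UTILIZATION=0.85
ATTENTION_BACKEND=flash_attn
EOF
}

# Usage (file name is an assumption). Keep the volume mount and the remaining
# -e flags from the Quick Start command alongside --env-file:
#   write_tuned_env dflash.env
#   docker run -d --name vllm-dflash --env-file dflash.env ... ghcr.io/aeon-7/vllm-dflash:latest
```

The API key is deliberately left out of the file so that it is not stored next to ordinary tuning values.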
Container restarts or hangs on startup
First boot takes 5–7 minutes on DGX Spark:
- ~2 min: weight load
- ~1 min: DFlash drafter download (first run only; cached to volume after)
- ~2 min: CUDA graph capture + FlashInfer NVFP4 GEMM autotune
docker logs -f vllm-dflash
Look for Application startup complete. If you see a Traceback, grab the full error text and file an issue.
API returns 404 / "model not found"
The model name in your request must match SERVED_MODEL_NAME (default: basename of MODEL_PATH). If you mounted at /models/target, the served name is target unless you set SERVED_MODEL_NAME explicitly. The Quick Start sets SERVED_MODEL_NAME=qwen35-dflash — use that exact string in "model": fields.
Check what's actually served:
curl -sf http://localhost:8000/v1/models -H "Authorization: Bearer $VLLM_API_KEY" | jq
First request takes 30+ seconds even though /health is OK
Expected. vLLM does one-time CUDA graph specialization on the first real input. Always fire 1–3 warmup requests before benchmarking. See the Quick Start step 3 loop.
TTFT blows up to 10+ seconds at concurrency
You're queue-bound. The legacy default was MAX_NUM_SEQS=4 / MAX_NUM_BATCHED_TOKENS=8192, which saturates at c=8 with DFlash spec-decode. The Quick Start uses the tuned values (MAX_NUM_SEQS=16, MAX_NUM_BATCHED_TOKENS=32768); make sure your docker run has them.
CUDA out of memory
Lower GPU_MEMORY_UTILIZATION to 0.80 or drop MAX_MODEL_LEN (e.g., 65536 → 32768 → 16384). Spark's 128 GB is unified — the GPU shares it with the host, so leave 15–20 GB headroom. If you need more concurrent sessions at long context, enable TurboQuant for ~3.76× KV compression.
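Because the memory is unified, whatever fraction vLLM does not claim is what the host has left. The arithmetic can be checked before restarting the container. A sketch with an illustrative function name; it treats the full 128 GB as available, which slightly overstates the real headroom.

```shell
# host_headroom_gb TOTAL_GB UTILIZATION -> GB left outside vLLM's share
host_headroom_gb() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f\n", total * (1 - util) }'
}

host_headroom_gb 128 0.85   # -> 19.2
host_headroom_gb 128 0.80   # -> 25.6
```

At 0.85 the host keeps about 19 GB, inside the 15–20 GB guideline; dropping to 0.80 widens that to about 26 GB.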
"Cannot copy between CPU and CUDA tensors" when enabling TurboQuant
You're running an unpatched 0xSero/turboquant. Use the extension Dockerfile in turboquant/ — it pins to the fix-branch hosting PR #12 which makes the QJL quantizer CUDA-graph-safe.
DFlash acceptance rate looks low
DFlash is content-sensitive. Expected accepted-tokens-per-draft (k=15):
| Prompt type | Accepted/draft | Effective speedup |
|---|---|---|
| Code / reasoning | ~5+ | ~2.1× |
| Dialogue | ~3 | ~1.5× |
| Free prose | ~2 | ~1.3× |
| Random / adversarial | ~1.5 | ~1.0× |
See the BENCHMARKS.md acceptance-profile table for exact per-position numbers.
Image won't pull or fails on my GPU
The image requires a Blackwell GPU with SM120 or SM121 (see Hardware requirements at the top). If nvidia-smi shows compute capability 8.x (Ampere) or 9.0 (Hopper), this image will not run. NVFP4 tensor cores are Blackwell-exclusive.
DFlash speeds up generation by speculating multiple tokens per step using a small draft model, then verifying them against the target model in a single forward pass. The drafter here is a 5-layer Qwen3 variant fine-tuned to predict the next 15 tokens from the target's intermediate hidden states at layers (1, 16, 31, 46, 61).
Key properties:
- Lossless — every accepted token matches what greedy decoding would produce
- Suited to memory-bandwidth-bound hardware: a single target-model pass verifies many candidate tokens
- Content-adaptive — structured text (code, math) wins more than free prose
See paper: arXiv:2602.06036.
NVIDIA's FP4 format (E2M1) is a native tensor-core datatype on Blackwell (B200, GB10, RTX 50×0). Unlike older INT4/GPTQ which introduce visible degradation, NVFP4 with AWQ_FULL calibration is effectively lossless. Our image autodetects NVFP4 checkpoints and routes through FlashInfer CUTLASS kernels.
Weights + activations are quantized; KV cache stays in BF16 by default (use TurboQuant to compress KV as well).
The model has 64 transformer layers arranged in a hybrid pattern:
- 48 linear-attention layers (Gated DeltaNet / Mamba-style recurrent state)
- 16 full-attention layers (classical attention with KV cache)
- 1 MTP head (used as the DFlash drafter anchor)
DFlash's target_layer_ids=[1,16,31,46,61] are the hidden-state checkpoints the
drafter consumes. TurboQuant compresses only the 16 full-attention layers' KV
cache; linear-attention layers have no K/V to compress (their recurrent state
is already compact).
DGX Spark is memory-bandwidth-bound (273 GB/s LPDDR5X unified). MoE experts require scatter/gather across the unified memory, which defeats the bandwidth budget. A dense 27B moves a predictable 20 GB of weights per token — ideal for the Spark's memory architecture. On coding/reasoning benchmarks it rivals or beats larger MoE variants that would OOM or thrash on this hardware.
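The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: if every generated token requires streaming the full ~20 GB of weights once, plain one-token-per-pass decoding cannot exceed bandwidth divided by weight size. The sketch below uses only figures quoted in this README and illustrative function names. It ignores KV-cache reads, activations, and kernel overhead, so treat the result as an estimate, not a measured baseline.

```shell
# decode_ceiling_tps BANDWIDTH_GB_PER_S WEIGHTS_GB
# -> upper bound on tok/s when each token streams all weights once
decode_ceiling_tps() {
  awk -v bw="$1" -v w="$2" 'BEGIN { printf "%.2f\n", bw / w }'
}

# speedup_over MEASURED_TPS CEILING_TPS -> ratio, one decimal
speedup_over() {
  awk -v m="$1" -v c="$2" 'BEGIN { printf "%.1f\n", m / c }'
}

ceiling="$(decode_ceiling_tps 273 20)"   # 13.65 tok/s
speedup_over 64.0 "$ceiling"             # code, c=1  -> 4.7
speedup_over 29.5 "$ceiling"             # prose, c=1 -> 2.2
```

Under these assumptions the measured single-stream rates sit roughly 2–5× above the one-token-per-pass ceiling, which is possible only because each verification pass can commit several drafted tokens.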
- DFlash: Zheng et al., ICLR 2026 (arXiv:2602.06036)
- TurboQuant: Zandieh et al., ICLR 2026 (arXiv:2504.19874); this container uses 0xSero/turboquant as the plugin
- Model: AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4
- Drafter: z-lab/Qwen3.5-27B-DFlash
- vLLM: vllm-project/vllm 0.19.1
GPL-3.0 (inherited from 0xSero/turboquant when the TurboQuant extension is enabled;
the base DFlash container is MIT). See LICENSE.
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.



