
Add DFlash speculative decoding implementation#3

Open
0xClandestine wants to merge 3 commits into TheTom:main from 0xClandestine:dflash-upstream

Conversation


0xClandestine commented Apr 25, 2026

Summary

Implements DFlash (arXiv:2602.06036) - Block-Diffusion Speculative Decoding for lossless LLM acceleration on Apple Silicon.

What's Changed

New Module: swift/Sources/DFlash/ (1,800+ lines)

| File | Description |
| --- | --- |
| DFlashCore.swift | Protocol definitions |
| DFlashDraftModel.swift | Complete draft model |
| DFlashRuntime.swift | Main speculative decoding runtime |
| DFlashEngines.swift | Verify/rollback engines |
| DFlashDraftBackend.swift | Draft generation helper |
| DFlash+MLXLLM.swift | NEW: Model registry with 50+ models |

Supported Models (50+ MLXLLM models)

Pure Attention Models (dflashIsHybridGDN = false)

LlamaModel, Qwen3Model, Qwen2Model, GemmaModel, Gemma2Model, Gemma3TextModel, Gemma4Model, Gemma3nTextModel, PhiModel, Phi3Model, PhiMoEModel, MistralModel, CohereModel, Starcoder2Model, SmolLMModel, NanoChatModel, Internlm2Model, and more.

Hybrid Models (dflashIsHybridGDN = true)

Qwen35Model, Qwen3MoEModel, Qwen3NextModel, DeepseekV3Model, MiniMaxModel, MiniMaxM2Model, GraniteMoeHybridModel, LFM2Model, LFM2MoEModel, AfMoEModel, GLM4MoEModel, and more.

Adding DFlash to a New Model

Use `generateDFlashConformance()` to create extension code:

let code = generateDFlashConformance(modelName: "Qwen3Model", isHybrid: false)
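
For reference, a rough sketch of what the generated extension might look like. This is an assumption for illustration, not the generator's actual output; only `DFlashTargetModel` and `dflashIsHybridGDN` are taken from this PR:

```swift
// Hypothetical shape of the generated conformance; the real code comes
// from generateDFlashConformance() and may carry more requirements.
extension Qwen3Model: DFlashTargetModel {
    // Pure-attention model: no hybrid GDN state to roll back.
    public var dflashIsHybridGDN: Bool { false }
}
```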

Testing

- `swift build`: OK
- `swift test --filter DFlashIntegrationTests`: 2/2 passed

Based on DFlash (arXiv:2602.06036) - Block-Diffusion Speculative Decoding
for lossless acceleration on Apple Silicon.

## What's New

### Core Module (swift/Sources/DFlash/)

- DFlashCore.swift - Protocol definitions for DFlashTargetModel,
  DFlashDraftModelProtocol, DFlashDraftCacheProtocol,
  DFlashRollbackCacheProtocol, DFlashEngineProtocol, DFlashEvent,
  DFlashSummary, DFlashConfiguration

- DFlashDraftModel.swift - Complete draft model implementation with
  cross-attention, RoPE, sink-window cache

- DFlashEngines.swift - Verify/rollback engines (FullAttentionEngine,
  HybridGDNEngine stub)

- DFlashRuntime.swift - Main speculative decoding runtime with prefill,
  block-diffusion drafting, verify, accept/reject, rollback (the
  verify/accept step is sketched after this list)

- DFlashDraftBackend.swift - Draft generation helper

- DFlashTargetModelExtensions.swift - Model conformance examples
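
The verify/accept step in the runtime follows the standard greedy speculative-decoding pattern. A minimal sketch, with all names hypothetical and prefill, caches, and sampling omitted:

```swift
// Accept the longest prefix of the draft block that the target model
// reproduces; on the first mismatch, emit the target's own token.
// targetArgmax must hold draft.count + 1 target predictions (one per
// draft position, plus one past the block for the bonus token).
func verifyBlock(draft: [Int], targetArgmax: [Int]) -> (accepted: [Int], next: Int) {
    var accepted: [Int] = []
    for (i, token) in draft.enumerated() {
        guard targetArgmax[i] == token else {
            // Mismatch: reject the rest of the block; the caller rolls
            // the caches back to this position.
            return (accepted, targetArgmax[i])
        }
        accepted.append(token)
    }
    // Whole block accepted; take the target's bonus token.
    return (accepted, targetArgmax[draft.count])
}
```

Every accepted token is one the target model would have produced anyway, which is what makes the acceleration lossless.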

### Tests (swift/Tests/DFlashTests/)

- Unit tests for token utilities, cache management, config
- Integration tests for engine creation

### Package Updates (swift/Package.swift)

- Added DFlash static library target with MLXNN/MLX dependencies
- Added test target for DFlashTests

## Architecture Highlights

1. Protocol-based for easy extension to new model types
2. Abstraction layers for engines, caches, draft backend (engine sketch below)
3. Extensible for hybrid GDN models with tape-based rollback
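
One plausible shape for the engine abstraction in point 2. The requirement names here are assumptions; the real protocol lives in DFlashCore.swift:

```swift
import MLX

// Hypothetical engine surface: one engine per architecture family owns
// verification and rollback. Pure-attention engines trim the KV cache;
// hybrid GDN engines need tape-based rollback of recurrent state.
protocol DFlashEngineSketch {
    // Compare a drafted block against target logits; return the number
    // of accepted tokens.
    func verify(draftTokens: [Int], targetLogits: MLXArray) -> Int
    // Undo target-side state past the given position after a rejection.
    func rollback(to position: Int)
}
```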

## Next Steps to Complete Integration

1. Add DFlashTargetModel conformance to actual models
2. Implement callCapturing() on model containers (see the sketch after this list)
3. Add vsm_engine_dflash_* C API functions to Bridge.swift
4. Train/convert DFlash draft models for target architectures
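
For step 2, a rough idea of the capture hook. The PR defines a DFlashForwardWithCapture protocol (see the commit notes below); the signature here is an assumption:

```swift
import MLX

// Assumed shape of the hidden-state capture hook: run the model and
// also return the intermediate hidden states the draft model
// cross-attends to. The actual requirement is DFlashForwardWithCapture.
protocol DFlashForwardWithCaptureSketch {
    func callCapturing(_ tokens: MLXArray) -> (logits: MLXArray, hidden: MLXArray)
}
```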

Builds and tests pass.

This commit adds:
- DFlashForwardWithCapture protocol for models that support hidden state capture
- DFlashModelConformanceTemplate with lists of supported pure-attention and hybrid models
- Embedding.asLinear helper for tied weights
- Complete template documentation for adding conformance to any model

This commit adds:
- DFlashForwardWithCapture protocol for hidden state capture
- DFlashSupportedModels listing all ~50 MLXLLM models organized by type:
  - Pure attention models (Llama, Qwen3, Gemma, Phi, etc.)
  - Hybrid GDN models (Qwen3.5, Qwen3Next, DeepSeekV3, MiniMax, etc.)
- DFlashModelRegistry with model lists
- DFlashConformanceStatus for tracking conformance state
- generateDFlashConformance() template generator for easy extension creation
- Embedding.asLinear helper for tied weight models (sketched below)
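
A minimal sketch of what the tied-weights helper likely does, assuming MLXNN's Embedding; the method name is suffixed to mark it as illustrative:

```swift
import MLX
import MLXNN

// Tied weights: reuse the embedding matrix as the output (unembedding)
// projection by multiplying hidden states with its transpose.
extension Embedding {
    func asLinearSketch(_ x: MLXArray) -> MLXArray {
        matmul(x, weight.T)
    }
}
```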
TheTom added a commit that referenced this pull request May 7, 2026
Field reports from the v0.5.1 alpha (Tom's buddy) surfaced 5 obvious
bugs and 2 non-obvious ones (Metal-side; tracked separately). This
release fixes the obvious ones and locks them down with regression tests.

Bugs fixed:
- #1 vllm not declared as runtime dep. `pip install vllm-swift==0.5.1`
  left users at ModuleNotFoundError on first `vllm-swift serve`.
  pyproject now declares vllm>=0.10. Side benefit: narrows pip's
  resolver window, stops --pre pulling rc/dev safetensors / tokenizers
  / transformers.
- #3 reasoning-budget bump clobbered explicit small max_tokens. A client
  sending max_tokens=64 got completion_tokens=20480 because the bump
  fired unconditionally. Explicit values under 1024 are now respected
  (verified with curl smoke tests, "say hello", and token-count probes).
  The OpenCode/Hermes 4K-8K starvation case still bumps as before.
- #7 message.reasoning not normalized to message.reasoning_content.
  Some vLLM versions emit `reasoning` (their newer naming). Normalize
  to the OpenAI-standard `reasoning_content` so OpenAI clients (Hermes,
  openai-python) see the field they expect. Original `reasoning`
  preserved for back-compat.
- #6 longctx splice spammed 8 chunks regardless of relevance. Trivial
  "say hello" produced prompt_tokens=5423. Added cosine-score >= 0.20
  floor (env-tunable via LONGCTX_RELEVANCE_FLOOR) that drops noise
  chunks before splicing.
- #2 --max-model-len exceeding the model's max_position_embeddings was
  only caught later, when vLLM rejected prompts with a less specific
  error. A pre-flight check now reads the model's config.json and warns
  with the actual numbers ("65536 exceeds 40960; recommend
  --max-model-len 40960").

Plus a CI-fixing pass: tests/test_longctx_endpoint.py had stale
imports flagged by ruff (F811/F401 + I001; the v0.5.1 commit's CI
failed on this). ruff now passes clean.

8 new regression tests in tests/test_longctx_endpoint.py pin all five
behaviors. 505/505 tests pass total.

NOT fixed in this release (separate Metal-kernel investigation):
- #4 KV-cache corruption signature under turbo4v2 4-bit + sustained
  decode. Workaround: drop --additional-config or use kv_bits: 8
  (asymmetric K8/V4) for the same scheme.
- #5 4× decode throughput decay (128 → 30 tok/s monotonic) — likely
  same root cause as #4. Same workaround.

Versions caught up:
  pyproject.toml      0.5.1 → 0.5.2
  __init__.py         0.5.1 → 0.5.2
  homebrew formula    0.5.1 → 0.5.2; bottle SHAs cleared
  scripts/build_bottle.sh  0.5.1 → 0.5.2

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
