Add DFlash speculative decoding implementation #3
Open
0xClandestine wants to merge 3 commits into
Based on DFlash (arXiv:2602.06036) - Block-Diffusion Speculative Decoding for lossless acceleration on Apple Silicon.

## What's New

### Core Module (swift/Sources/DFlash/)

- DFlashCore.swift - Protocol definitions for DFlashTargetModel, DFlashDraftModelProtocol, DFlashDraftCacheProtocol, DFlashRollbackCacheProtocol, DFlashEngineProtocol, DFlashEvent, DFlashSummary, DFlashConfiguration
- DFlashDraftModel.swift - Complete draft model implementation with cross-attention, RoPE, sink-window cache
- DFlashEngines.swift - Verify/rollback engines (FullAttentionEngine, HybridGDNEngine stub)
- DFlashRuntime.swift - Main speculative decoding runtime with prefill, block-diffusion drafting, verify, accept/reject, rollback
- DFlashDraftBackend.swift - Draft generation helper
- DFlashTargetModelExtensions.swift - Model conformance examples

### Tests (swift/Tests/DFlashTests/)

- Unit tests for token utilities, cache management, config
- Integration tests for engine creation

### Package Updates (swift/Package.swift)

- Added DFlash static library target with MLXNN/MLX dependencies
- Added test target for DFlashTests

## Architecture Highlights

1. Protocol-based for easy extension to new model types (see the sketch below)
2. Abstraction layers for engines, caches, draft backend
3. Extensible for hybrid GDN models with tape-based rollback

## Next Steps to Complete Integration

1. Add DFlashTargetModel conformance to actual models
2. Implement callCapturing() on model containers
3. Add vsm_engine_dflash_* C API functions to Bridge.swift
4. Train/convert DFlash draft models for target architectures

Builds and tests pass.
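For orientation, a minimal sketch of how the protocol layering might fit together. The protocol names come from DFlashCore.swift above, but the method signatures here are assumptions, not the merged API:

```swift
import MLX

// Hypothetical shapes only: the real definitions live in DFlashCore.swift
// and may differ. This sketches how the protocol layering fits together.
public protocol DFlashRollbackCacheProtocol {
    /// Discard cache entries past `position` after a rejected draft block.
    func trim(to position: Int)
}

public protocol DFlashTargetModel {
    /// Forward pass returning next-token logits used for verification.
    func logits(for tokens: MLXArray) -> MLXArray
}

public protocol DFlashDraftModelProtocol {
    /// Propose a block of draft tokens from the current context.
    func draftBlock(context: MLXArray, blockSize: Int) -> MLXArray
}

public protocol DFlashEngineProtocol {
    /// Verify a drafted block against the target; return the accepted count.
    func verify(draft: MLXArray, target: any DFlashTargetModel) -> Int
    /// Roll the cache back to the last accepted position.
    func rollback(_ cache: any DFlashRollbackCacheProtocol, to position: Int)
}
```

The value of this layering is that a new model only needs DFlashTargetModel conformance, while engines and caches (full-attention vs. hybrid GDN) vary independently behind their own protocols.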
This commit adds:

- DFlashForwardWithCapture protocol for models that support hidden state capture (a hedged sketch follows below)
- DFlashModelConformanceTemplate with lists of supported pure-attention and hybrid models
- Embedding.asLinear helper for tied weights
- Complete template documentation for adding conformance to any model
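A plausible shape for the capture protocol. The `callCapturing()` name appears in the PR's next steps; the `captureLayer` parameter and tuple return are assumptions:

```swift
import MLX

// Assumed shape of the capture hook (the commit's actual protocol may
// differ): a normal forward pass that also returns the hidden states
// the DFlash draft model conditions on.
public protocol DFlashForwardWithCapture {
    /// Run the model, capturing hidden states at `layer` alongside logits.
    func callCapturing(
        _ tokens: MLXArray,
        captureLayer layer: Int
    ) -> (logits: MLXArray, hidden: MLXArray)
}
```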
This commit adds:

- DFlashForwardWithCapture protocol for hidden state capture
- DFlashSupportedModels listing all ~50 MLXLLM models organized by type:
  - Pure attention models (Llama, Qwen3, Gemma, Phi, etc.)
  - Hybrid GDN models (Qwen3.5, Qwen3Next, DeepSeekV3, MiniMax, etc.)
- DFlashModelRegistry with model lists
- DFlashConformanceStatus for tracking conformance state
- generateDFlashConformance() template generator for easy extension creation
- Embedding.asLinear helper for tied weight models
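The tied-weights helper likely amounts to reusing the embedding matrix as the output projection. A minimal sketch, assuming MLXNN's Embedding exposes its weight (recent MLXNN releases may already ship an equivalent, in which case the extension is unnecessary):

```swift
import MLX
import MLXNN

// Sketch of the tied-weights helper: project hidden states back onto
// the vocabulary with the transposed embedding matrix, so tied-weight
// models can skip a separate LM head.
extension Embedding {
    public func asLinear(_ x: MLXArray) -> MLXArray {
        matmul(x, weight.T)
    }
}
```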
TheTom added a commit that referenced this pull request on May 7, 2026:
Field reports from the v0.5.1 alpha (Tom's buddy) surfaced 5 obvious bugs and 2 non-obvious ones (Metal-side; tracked separately). This release fixes the obvious ones and locks them down with regression tests.

Bugs fixed:

- #1 vllm not declared as a runtime dep. `pip install vllm-swift==0.5.1` left users at ModuleNotFoundError on first `vllm-swift serve`. pyproject now declares vllm>=0.10. Side benefit: narrows pip's resolver window and stops --pre pulling rc/dev safetensors / tokenizers / transformers.
- #3 reasoning-budget bump clobbered an explicit small max_tokens. A client sent max_tokens=64 and got completion_tokens=20480 because the bump fired unconditionally. The client's value is now respected when it is set below 1024 (curl smokes, "say hello", token-count probes). The OpenCode/Hermes 4K-8K starvation case still bumps as before.
- #7 message.reasoning not normalized to message.reasoning_content. Some vLLM versions emit `reasoning` (their newer naming). Normalize to the OpenAI-standard `reasoning_content` so OpenAI clients (Hermes, openai-python) see the field they expect. The original `reasoning` field is preserved for back-compat.
- #6 longctx splice spammed 8 chunks regardless of relevance. A trivial "say hello" produced prompt_tokens=5423. Added a cosine-score >= 0.20 floor (env-tunable via LONGCTX_RELEVANCE_FLOOR) that drops noise chunks before splicing; see the sketch after these notes.
- #2 --max-model-len exceeding the model's max_position_embeddings. A pre-flight check now reads the model's config.json and warns with actual numbers ("65536 exceeds 40960; recommend --max-model-len 40960") instead of letting vLLM reject prompts later with a less specific error.

Plus a CI-fixing pass: tests/test_longctx_endpoint.py had stale imports flagged by ruff F811/F401 + I001 (the v0.5.1 commit's CI failed on this). All ruff lint is clean now. 8 new regression tests in tests/test_longctx_endpoint.py pin all five behaviors. 505/505 tests pass total.

NOT fixed in this release (separate Metal-kernel investigation):

- #4 KV-cache corruption signature under turbo4v2 4-bit + sustained decode. Workaround: drop --additional-config or use kv_bits: 8 (asymmetric K8/V4) for the same scheme.
- #5 4× decode throughput decay (128 → 30 tok/s, monotonic); likely the same root cause as #4. Same workaround.

Version bumps:

- pyproject.toml 0.5.1 → 0.5.2
- __init__.py 0.5.1 → 0.5.2
- homebrew formula 0.5.1 → 0.5.2; bottle SHAs cleared
- scripts/build_bottle.sh 0.5.1 → 0.5.2

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
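The #6 fix lives in the Python server code, but the floor logic itself is simple. An illustrative Swift sketch, where `Chunk` and `filterForSplice` are hypothetical names; the 0.20 default and the LONGCTX_RELEVANCE_FLOOR env var come from the release notes:

```swift
import Foundation

// Illustrative only: the actual #6 fix is in the Python server code.
struct Chunk {
    let text: String
    let cosineScore: Double  // similarity between chunk and query
}

func filterForSplice(_ candidates: [Chunk]) -> [Chunk] {
    // Floor defaults to 0.20; operators can tune it via the environment.
    let floor = ProcessInfo.processInfo.environment["LONGCTX_RELEVANCE_FLOOR"]
        .flatMap(Double.init) ?? 0.20
    // Drop noise chunks so trivial prompts don't balloon prompt_tokens.
    return candidates.filter { $0.cosineScore >= floor }
}
```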
Summary
Implements DFlash (arXiv:2602.06036) - Block-Diffusion Speculative Decoding for lossless LLM acceleration on Apple Silicon.
What's Changed
New Module: swift/Sources/DFlash/ (1,800+ lines)
Supported Models (50+ MLXLLM models)
Pure Attention Models (dflashIsHybridGDN = false)
LlamaModel, Qwen3Model, Qwen2Model, GemmaModel, Gemma2Model, Gemma3TextModel, Gemma4Model, Gemma3nTextModel, PhiModel, Phi3Model, PhiMoEModel, MistralModel, CohereModel, Starcoder2Model, SmolLMModel, NanoChatModel, Internlm2Model, and more.
Hybrid Models (dflashIsHybridGDN = true)
Qwen35Model, Qwen3MoEModel, Qwen3NextModel, DeepseekV3Model, MiniMaxModel, MiniMaxM2Model, GraniteMoeHybridModel, LFM2Model, LFM2MoEModel, AfMoEModel, GLM4MoEModel, and more.
Adding DFlash to a New Model
Use generateDFlashConformance() to create extension code:
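A plausible sketch of the call and its output, where the modelName:/isHybridGDN: parameter labels and the emitted snippet are assumptions rather than the generator's confirmed signature:

```swift
// Hypothetical usage; the real parameter labels may differ.
let source = generateDFlashConformance(
    modelName: "Qwen3Model",  // any model from DFlashModelRegistry
    isHybridGDN: false        // true for GDN hybrids needing tape rollback
)
print(source)
// Might emit something along the lines of:
//
// extension Qwen3Model: DFlashTargetModel {
//     public var dflashIsHybridGDN: Bool { false }
//     // plus callCapturing() wiring hidden-state capture into forward
// }
```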
Testing
swift build - OK
swift test --filter DFlashIntegrationTests - 2/2 passed