Add DFlash speculative decoding implementation #3
Open
0xClandestine wants to merge 3 commits into
Based on DFlash (arXiv:2602.06036) - Block-Diffusion Speculative Decoding for lossless acceleration on Apple Silicon.

## What's New

### Core Module (swift/Sources/DFlash/)

- DFlashCore.swift - Protocol definitions for DFlashTargetModel, DFlashDraftModelProtocol, DFlashDraftCacheProtocol, DFlashRollbackCacheProtocol, DFlashEngineProtocol, DFlashEvent, DFlashSummary, DFlashConfiguration
- DFlashDraftModel.swift - Complete draft model implementation with cross-attention, RoPE, sink-window cache
- DFlashEngines.swift - Verify/rollback engines (FullAttentionEngine, HybridGDNEngine stub)
- DFlashRuntime.swift - Main speculative decoding runtime with prefill, block-diffusion drafting, verify, accept/reject, rollback
- DFlashDraftBackend.swift - Draft generation helper
- DFlashTargetModelExtensions.swift - Model conformance examples

### Tests (swift/Tests/DFlashTests/)

- Unit tests for token utilities, cache management, config
- Integration tests for engine creation

### Package Updates (swift/Package.swift)

- Added DFlash static library target with MLXNN/MLX dependencies
- Added test target for DFlashTests

## Architecture Highlights

1. Protocol-based for easy extension to new model types (see the sketch below)
2. Abstraction layers for engines, caches, draft backend
3. Extensible for hybrid GDN models with tape-based rollback

## Next Steps to Complete Integration

1. Add DFlashTargetModel conformance to actual models
2. Implement callCapturing() on model containers
3. Add vsm_engine_dflash_* C API functions to Bridge.swift
4. Train/convert DFlash draft models for target architectures

Builds and tests pass.
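For orientation, a minimal sketch of how the protocol layering might fit together. The protocol names come from DFlashCore.swift above, but the method signatures here are assumptions, not the merged API:

```swift
import MLX

// Hypothetical shapes only: the real definitions live in DFlashCore.swift
// and may differ. This sketches how the protocol layering fits together.
public protocol DFlashRollbackCacheProtocol {
    /// Discard cache entries past `position` after a rejected draft block.
    func trim(to position: Int)
}

public protocol DFlashTargetModel {
    /// Forward pass returning next-token logits used for verification.
    func logits(for tokens: MLXArray) -> MLXArray
}

public protocol DFlashDraftModelProtocol {
    /// Propose a block of draft tokens from the current context.
    func draftBlock(context: MLXArray, blockSize: Int) -> MLXArray
}

public protocol DFlashEngineProtocol {
    /// Verify a drafted block against the target; return the accepted count.
    func verify(draft: MLXArray, target: any DFlashTargetModel) -> Int
    /// Roll the cache back to the last accepted position.
    func rollback(_ cache: any DFlashRollbackCacheProtocol, to position: Int)
}
```

The value of this layering is that a new model only needs DFlashTargetModel conformance, while engines and caches (full-attention vs. hybrid GDN) vary independently behind their own protocols.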
This commit adds:

- DFlashForwardWithCapture protocol for models that support hidden state capture (a hedged sketch follows below)
- DFlashModelConformanceTemplate with lists of supported pure-attention and hybrid models
- Embedding.asLinear helper for tied weights
- Complete template documentation for adding conformance to any model
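A plausible shape for the capture protocol. The `callCapturing()` name appears in the PR's next steps; the `captureLayer` parameter and tuple return are assumptions:

```swift
import MLX

// Assumed shape of the capture hook (the commit's actual protocol may
// differ): a normal forward pass that also returns the hidden states
// the DFlash draft model conditions on.
public protocol DFlashForwardWithCapture {
    /// Run the model, capturing hidden states at `layer` alongside logits.
    func callCapturing(
        _ tokens: MLXArray,
        captureLayer layer: Int
    ) -> (logits: MLXArray, hidden: MLXArray)
}
```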
This commit adds:

- DFlashForwardWithCapture protocol for hidden state capture
- DFlashSupportedModels listing all ~50 MLXLLM models organized by type:
  - Pure attention models (Llama, Qwen3, Gemma, Phi, etc.)
  - Hybrid GDN models (Qwen3.5, Qwen3Next, DeepSeekV3, MiniMax, etc.)
- DFlashModelRegistry with model lists
- DFlashConformanceStatus for tracking conformance state
- generateDFlashConformance() template generator for easy extension creation
- Embedding.asLinear helper for tied weight models
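The tied-weights helper likely amounts to reusing the embedding matrix as the output projection. A minimal sketch, assuming MLXNN's Embedding exposes its weight (recent MLXNN releases may already ship an equivalent, in which case the extension is unnecessary):

```swift
import MLX
import MLXNN

// Sketch of the tied-weights helper: project hidden states back onto
// the vocabulary with the transposed embedding matrix, so tied-weight
// models can skip a separate LM head.
extension Embedding {
    public func asLinear(_ x: MLXArray) -> MLXArray {
        matmul(x, weight.T)
    }
}
```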
TheTom added a commit that referenced this pull request on May 7, 2026:
Field reports from the v0.5.1 alpha (Tom's buddy) surfaced 5 obvious bugs and 2 non-obvious ones (Metal-side; tracked separately). This release fixes the obvious ones and locks them down with regression tests.

Bugs fixed:

- #1 vllm not declared as a runtime dep. `pip install vllm-swift==0.5.1` left users at ModuleNotFoundError on first `vllm-swift serve`. pyproject now declares vllm>=0.10. Side benefit: narrows pip's resolver window and stops --pre pulling rc/dev safetensors / tokenizers / transformers.
- #3 reasoning-budget bump clobbered an explicit small max_tokens. A client sent max_tokens=64 and got completion_tokens=20480 because the bump fired unconditionally. The client's value is now respected when it is set below 1024 (curl smokes, "say hello", token-count probes). The OpenCode/Hermes 4K-8K starvation case still bumps as before.
- #7 message.reasoning not normalized to message.reasoning_content. Some vLLM versions emit `reasoning` (their newer naming). Normalize to the OpenAI-standard `reasoning_content` so OpenAI clients (Hermes, openai-python) see the field they expect. The original `reasoning` field is preserved for back-compat.
- #6 longctx splice spammed 8 chunks regardless of relevance. A trivial "say hello" produced prompt_tokens=5423. Added a cosine-score >= 0.20 floor (env-tunable via LONGCTX_RELEVANCE_FLOOR) that drops noise chunks before splicing; see the sketch after these notes.
- #2 --max-model-len exceeding the model's max_position_embeddings. A pre-flight check now reads the model's config.json and warns with actual numbers ("65536 exceeds 40960; recommend --max-model-len 40960") instead of letting vLLM reject prompts later with a less specific error.

Plus a CI-fixing pass: tests/test_longctx_endpoint.py had stale imports flagged by ruff F811/F401 + I001 (the v0.5.1 commit's CI failed on this). All ruff lint is clean now. 8 new regression tests in tests/test_longctx_endpoint.py pin all five behaviors. 505/505 tests pass total.

NOT fixed in this release (separate Metal-kernel investigation):

- #4 KV-cache corruption signature under turbo4v2 4-bit + sustained decode. Workaround: drop --additional-config or use kv_bits: 8 (asymmetric K8/V4) for the same scheme.
- #5 4× decode throughput decay (128 → 30 tok/s, monotonic); likely the same root cause as #4. Same workaround.

Version bumps:

- pyproject.toml 0.5.1 → 0.5.2
- __init__.py 0.5.1 → 0.5.2
- homebrew formula 0.5.1 → 0.5.2; bottle SHAs cleared
- scripts/build_bottle.sh 0.5.1 → 0.5.2

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
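The #6 fix lives in the Python server code, but the floor logic itself is simple. An illustrative Swift sketch, where `Chunk` and `filterForSplice` are hypothetical names; the 0.20 default and the LONGCTX_RELEVANCE_FLOOR env var come from the release notes:

```swift
import Foundation

// Illustrative only: the actual #6 fix is in the Python server code.
struct Chunk {
    let text: String
    let cosineScore: Double  // similarity between chunk and query
}

func filterForSplice(_ candidates: [Chunk]) -> [Chunk] {
    // Floor defaults to 0.20; operators can tune it via the environment.
    let floor = ProcessInfo.processInfo.environment["LONGCTX_RELEVANCE_FLOOR"]
        .flatMap(Double.init) ?? 0.20
    // Drop noise chunks so trivial prompts don't balloon prompt_tokens.
    return candidates.filter { $0.cosineScore >= floor }
}
```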
Summary
Implements DFlash (arXiv:2602.06036) - Block-Diffusion Speculative Decoding for lossless LLM acceleration on Apple Silicon.
What's Changed
New Module: swift/Sources/DFlash/ (1,800+ lines)
Supported Models (50+ MLXLLM models)
Pure Attention Models (dflashIsHybridGDN = false)
LlamaModel, Qwen3Model, Qwen2Model, GemmaModel, Gemma2Model, Gemma3TextModel, Gemma4Model, Gemma3nTextModel, PhiModel, Phi3Model, PhiMoEModel, MistralModel, CohereModel, Starcoder2Model, SmolLMModel, NanoChatModel, Internlm2Model, and more.
Hybrid Models (dflashIsHybridGDN = true)
Qwen35Model, Qwen3MoEModel, Qwen3NextModel, DeepseekV3Model, MiniMaxModel, MiniMaxM2Model, GraniteMoeHybridModel, LFM2Model, LFM2MoEModel, AfMoEModel, GLM4MoEModel, and more.
Adding DFlash to a New Model
Use generateDFlashConformance() to create extension code:
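A plausible sketch of the call and its output, where the modelName:/isHybridGDN: parameter labels and the emitted snippet are assumptions rather than the generator's confirmed signature:

```swift
// Hypothetical usage; the real parameter labels may differ.
let source = generateDFlashConformance(
    modelName: "Qwen3Model",  // any model from DFlashModelRegistry
    isHybridGDN: false        // true for GDN hybrids needing tape rollback
)
print(source)
// Might emit something along the lines of:
//
// extension Qwen3Model: DFlashTargetModel {
//     public var dflashIsHybridGDN: Bool { false }
//     // plus callCapturing() wiring hidden-state capture into forward
// }
```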
Testing
swift build - OK
swift test --filter DFlashIntegrationTests - 2/2 passed