Turbo dflash#103
Conversation
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from the target model at specific layers
- Uses a feature fusion layer to compress target features
- Generates draft tokens with a single-layer decoder
- Maps the draft vocabulary to the target vocabulary via a d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for the speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
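For illustration, a minimal sketch of the d2t mapping mentioned above; the tensor layout and helper name here are assumptions for exposition, not the PR's actual code:

```cpp
// Minimal sketch (assumed layout): the EAGLE3 draft head predicts ids in a reduced
// draft vocabulary; the d2t tensor maps each draft id to the target model's vocabulary.
#include <cstdint>
#include <vector>

// d2t: int32 lookup table of size n_vocab_draft, loaded from the draft GGUF (assumption)
inline int32_t draft_to_target(const std::vector<int32_t> & d2t, int32_t draft_id) {
    return d2t[draft_id]; // token id in the target vocabulary
}
```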
Trying to take this for a spin; it looks like a bunch of attention biases were recently renamed upstream (and merged to the turboquant fork): Got it compiling here: Now the question is how to test it on llama-server (the dflash flag doesn't seem to be allowed?)
I naively resolved the merge conflicts and enabled @aminya Any pointers on what it would take to make this work with
Yes. The server isn't set up, but the CLI works. I couldn't get the speedups I was expecting when using Qwopus. I will try with more models.
Currently only `llama_set_eagle3` is called. You should similarly add a call to `llama_set_dflash`.

`llama-cpp-turboquant/tools/server/server-context.cpp:801`:

```cpp
if (params_base.speculative.eagle3) {
    // EAGLE3 current limitation: extracted target features are per-context; multiple slots would overwrite each other
    if (params_base.n_parallel > 1) {
        SRV_ERR("%s", "EAGLE3 speculative decoding is not supported with n_parallel > 1\n");
        return false;
    }
    llama_set_eagle3(ctx, model_dft.get());
    SRV_INF("%s", "EAGLE3 feature extraction enabled on target model\n");
}
// TODO: params_base.speculative.dflash
```
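For reference, a sketch of what the corresponding DFlash branch could look like, mirroring the EAGLE3 block above; `llama_set_dflash` and the `n_parallel` restriction are assumptions here, to be confirmed against the PR:

```cpp
// Hypothetical sketch, mirroring the EAGLE3 branch above (not merged code):
if (params_base.speculative.dflash) {
    // assumption: the same per-context limitation applies as for EAGLE3
    if (params_base.n_parallel > 1) {
        SRV_ERR("%s", "DFlash speculative decoding is not supported with n_parallel > 1\n");
        return false;
    }
    llama_set_dflash(ctx, model_dft.get()); // assumed to mirror llama_set_eagle3
    SRV_INF("%s", "DFlash enabled on target model\n");
}
```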
Thanks @taniguchi-taku-softm. Got it running on @aminya Do you think it's moot trying to get this technique working for a VRAM-starved setup like mine? I was thinking a tiny draft model like DFlash could help, even at the expense of consuming some VRAM from the main model.
Is Vulkan supported?
@nybblr I've tried many different patches and configurations over the weekend for my single 3090 setup. There's no benefit from DFlash that I can see. I cannot reproduce any of the claimed speedups in real workflows.
Overview
This cherry-picks the DFlash PR and fixes its merge conflicts.
ggml-org#22105
This PR adds DFlash speculative decoding to llama.cpp, achieving up to 8x speedup (Qwen3) with full numerical equivalence to the original reference implementation.
Compared to EAGLE3, which uses an autoregressive draft and generates one token per draft step, DFlash produces an entire block of candidates in a single draft forward pass, resulting in higher per-iteration draft throughput. However, DFlash relies on multiple transformer layers for its draft model, whereas EAGLE3 uses only a single transformer layer.
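As a rough illustration of the difference in draft-side structure (hypothetical helper names, not the llama.cpp API):

```cpp
// Illustrative sketch only: per-token autoregressive drafting (EAGLE3-style) vs.
// single-pass block drafting (DFlash-style). The helpers below are hypothetical stubs.
#include <cstdint>
#include <vector>

using token_id = int32_t;

token_id              draft_decode_one(token_id prev);   // 1 draft forward -> 1 token (stub)
std::vector<token_id> draft_decode_block(int n_draft);    // 1 draft forward -> n_draft tokens (stub)

// EAGLE3-style: n_draft sequential draft forwards per target iteration
std::vector<token_id> draft_autoregressive(token_id last, int n_draft) {
    std::vector<token_id> out;
    for (int i = 0; i < n_draft; ++i) {
        last = draft_decode_one(last);
        out.push_back(last);
    }
    return out;
}

// DFlash-style: a single draft forward produces the whole candidate block
std::vector<token_id> draft_block(int n_draft) {
    return draft_decode_block(n_draft);
}
// Either way, the target then verifies the candidate block in one batched forward
// and accepts the longest matching prefix.
```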
There is still meaningful headroom for further performance improvements in the current implementation, summarized in the Future Performance Work section below.
Performance Evaluation (NVIDIA L40S 48GB)
Numbers below were collected with `--draft-max 16`, `--temp 0 --top-k 1 --seed 42`, n=256. Baseline is `llama-cli` running the target model alone with the same sampling parameters.

Qwen3-8B
Draft: `z-lab/Qwen3-8B-DFlash` (bf16), Target: `Qwen/Qwen3-8B` (bf16)

Qwen3-4B
Draft: `z-lab/Qwen3-4B-DFlash` (bf16), Target: `Qwen/Qwen3-4B` (bf16)

GPT-OSS-20B
Draft: `z-lab/gpt-oss-20b-DFlash` (bf16), Target: `openai/gpt-oss-20b` (bf16)

For MoE targets (gpt-oss-20b), DFlash speedup is generally smaller than for dense attention targets because more experts get activated during the parallel verification step than during single-token autoregressive decoding (same observation as in ggml-org#18039 for gpt-oss EAGLE3).
Qwen3.5-4B (With Performance Issue)
Draft: `z-lab/Qwen3.5-4B-DFlash` (bf16), Target: `Qwen/Qwen3.5-4B` (bf16)

Speedup is intrinsically limited on hybrid target models:
- During target verify, KV / recurrent state is written for the full `[id_last + draft block]` before acceptance is known.
- Pure-attention targets can discard rejected suffixes with `seq_rm`; hybrid targets cannot, because recurrent state is not decomposable by token position.
- See `examples/speculative-simple/speculative-simple.cpp` for the current workaround, discussed further under Future Performance Work.

How to run DFlash in llama.cpp
Step 1: Convert models to GGUF
Step 2: Build llama.cpp
Step 3: Run DFlash speculative decoding
Future Performance Work
KV cache / graph reuse for the DFlash decoder
The DFlash decoder currently rebuilds its graph every iteration (`graphs reused = 0`). The main cause is that `cross.n_enc` (the length of `accumulated_target_ctx`) grows monotonically, which changes the shape of `target_ctx` and invalidates all downstream tensor shapes.

Possible improvements:
- Add a draft-side KV cache to the DFlash decoder. This would make the implementation closer to the original reference: committed target-context K/V would be materialized once and reused across iterations, instead of recomputing K/V from the full accumulated context every step. This reduces draft-side compute and also makes graph shapes much more stable, which should improve graph reuse. Since the DFlash decoder attention includes both cross-attention and self-attention, the current llama.cpp implementation does not support this pattern well.
- Keep the current no-cache design, but fix the `target_ctx` input shape. Instead of letting `target_ctx` grow every iteration, reserve a fixed-size buffer, track the active length separately, and mask out the padded region in attention. This preserves the current semantics while allowing the decoder graph to be reused; a rough sketch follows below. This method is not ideal compared to using a KV cache.
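A sketch of the fixed-capacity variant, under the assumption that the decoder exposes `target_ctx` and its attention mask as graph inputs; the names here are illustrative, not the PR's:

```cpp
// Hypothetical sketch: allocate target_ctx at a fixed maximum length so the decoder
// graph shape no longer depends on cross.n_enc, and mask the padded tail in attention.
#include "ggml.h"

static void build_fixed_target_ctx_inputs(struct ggml_context * ctx0,
                                          int64_t n_embd, int64_t n_enc_max, int64_t n_tokens) {
    // allocated once with the maximum encoder length -> constant graph shape
    struct ggml_tensor * target_ctx = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd,    n_enc_max);
    struct ggml_tensor * kq_mask    = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_enc_max, n_tokens);
    // at eval time: copy the n_enc active feature columns into target_ctx and fill
    //   kq_mask[j, i] = (j < n_enc) ? 0.0f : -INFINITY;
    // so attention never reads the padded region.
    (void) target_ctx;
    (void) kq_mask;
}
```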
Hybrid target model performance improvement (For all speculative decoding methods)

Hybrid targets (e.g. Qwen3.5) are slower because the problem is no longer just draft-side graph reuse. During target verify, llama.cpp writes KV / recurrent state for the full draft block before acceptance is known. Pure-attention target models can discard rejected suffixes with `seq_rm`, but hybrid targets cannot, because their recurrent state is not decomposable by token position.

The current workaround is to replay the accepted prefix through the target whenever part of the draft block is rejected, so the recurrent state is rebuilt from a known-good point. This is correct, but each rejected step may require one extra target forward, which is the main reason hybrid speedup lags pure-attention targets.
A more fundamental future improvement would be target-side deferred commit (SGLang Implementation): verify would compute temporary recurrent states, and only the accepted-prefix state would be committed. That would remove replay from the hybrid path, but it requires deeper changes to llama.cpp’s recurrent-state update flow.
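A conceptual sketch of what deferred commit could look like (hypothetical types and helpers; llama.cpp's actual recurrent-state update flow is different and more involved):

```cpp
// Conceptual sketch only: verification writes into scratch recurrent states, and only
// the state at the accepted-prefix boundary is committed; rejected states are dropped.
#include <vector>

struct recurrent_state { std::vector<float> data; };

// hypothetical: verify the draft block, filling one scratch state per draft position,
// and return how many draft tokens were accepted.
int verify_block(const recurrent_state & committed, int n_draft,
                 std::vector<recurrent_state> & scratch);

void verify_with_deferred_commit(recurrent_state & committed, int n_draft) {
    std::vector<recurrent_state> scratch(n_draft);
    const int n_accept = verify_block(committed, n_draft, scratch);
    if (n_accept > 0) {
        committed = scratch[n_accept - 1]; // commit only the accepted-prefix state
    }
    // no replay of accepted tokens is needed; rejected suffix states are simply discarded
}
```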
Note this applies to all hybrid models used as target models in speculative decoding methods, not just DFlash.
More (Low Priority)
Requirements