Please take a look at this discussion on Reddit (Qwen-Scope news thread): https://www.reddit.com/r/LocalLLaMA/comments/1szrbub/comment/oj3yzop/
And this comment:
https://www.reddit.com/r/LocalLLaMA/comments/1szrbub/comment/oj3se4q/?context=3
The moderator deleted the comment...
Anyway, in the Qwen-Scope thread, one user asked:
Soooooooo, did I not get something, or is this perfect for speculative decoding?
Another user replied:
I fed an article about DFlash and Qwen-Scope to an LLM, and, summing it all up in the final paragraphs, it said, quote:
Summary for speculative decoding developers: If you're building something like DFlash or EAGLE-3, then Qwen-Scope is your X-ray machine. It lets you understand exactly what your drafter should be "peeking at" from the large model. It turns the "black box" of hidden states into a clear list of instructions: "Right now we're writing Python code, use features #102 and #554." This will make token block predictions much more accurate, which directly translates into speedup. Instead of 6x speedup with DFlash, using SAE guidance could help break through the ceiling and achieve even higher performance by reducing the number of rejected blocks.
So in theory, yes - but this is purely for Qwen models. For other models, you'd have to train your own SAE. Thankfully, Qwen has shared how to do it, and thanks to them for that.
One simply needs to ask Jian Chen about this.
You can distill this SAE into a smaller model and then use that.
You can take the dataset on which DFlash is trained, run it through the SAE, and in theory obtain more precise semantic labels for training the drafter.
In both cases, the DFlash architecture needs to be changed. Maybe other labs will take this on, or maybe not.
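The labeling idea from the comment above can be sketched in a few lines: run the base model's hidden states through an SAE encoder and keep the top-k most active features per token as "semantic labels" for drafter training. This is a toy illustration with made-up shapes and a random stand-in encoder; the names `sae_encode`, `top_k_labels`, and all dimensions are assumptions, not Qwen-Scope's or DFlash's actual API.

```python
# Hypothetical sketch: deriving top-k SAE feature "labels" from
# base-model hidden states, as speculated in the thread. All names
# and shapes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64      # hidden size of the base model (toy value)
N_FEATURES = 256  # SAE dictionary size (toy value)
TOP_K = 4         # how many active features to keep as "labels"

# Stand-in SAE encoder: a linear map into feature space followed by
# ReLU, which enforces non-negative (and, in a trained SAE, sparse)
# activations. A real SAE would load trained weights here.
W_enc = rng.normal(scale=0.1, size=(D_MODEL, N_FEATURES))
b_enc = np.zeros(N_FEATURES)

def sae_encode(hidden):
    """Encode hidden states (seq_len, D_MODEL) into SAE feature space."""
    return np.maximum(hidden @ W_enc + b_enc, 0.0)

def top_k_labels(hidden, k=TOP_K):
    """Return the ids of the k most active SAE features per token."""
    acts = sae_encode(hidden)
    # Sort activations descending along the feature axis, keep first k.
    return np.argsort(-acts, axis=-1)[:, :k]

# Toy stand-in for a DFlash-style training batch: 5 token positions.
hidden_states = rng.normal(size=(5, D_MODEL))
labels = top_k_labels(hidden_states)
print(labels.shape)  # (5, TOP_K): k feature ids per token
```

These per-token feature ids are the kind of extra supervision signal the commenter imagines feeding to a drafter; whether that actually reduces rejected blocks is an open question.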
Here is the reply to this comment:
I don't think there has been any previous attempt to use an SAE for speculative decoding (maybe by "sparsifying the features" / selectively using the top-k related features...? Assuming the features are of good quality, can it really be faster, or, as my hunch says, will it be heavily bottlenecked?) or for DFlash-type adapter model training.
What do you think about this? Is there any ground for future research or improvements to DFlash?