Rollback support for speculative decoding? #117
Rollback is currently not implemented, but it wouldn't be super-hard to add. However, there are two other APIs that are relevant:
Another API we may want to add: in some situations it won't be possible to compute masks for all draft tokens, so one would have to do rejection sampling in that case. Note that rejection sampling is not equivalent to mask-and-sample for top_p/top_k (but it is equivalent for temperature and argmax). Let me know if any of these help!
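To make the top_p caveat concrete, here is a small sketch (all function names are illustrative, not llguidance APIs) comparing the two strategies analytically: masking disallowed tokens before top_p truncation versus applying top_p first and then conditioning on the allowed set (which is the distribution rejection sampling converges to). Under top_p the two can disagree:

```python
def top_p_set(probs, p):
    # Smallest set of highest-probability tokens whose mass reaches p.
    order = sorted(range(len(probs)), key=lambda t: -probs[t])
    kept, mass = [], 0.0
    for t in order:
        kept.append(t)
        mass += probs[t]
        if mass >= p:
            break
    return kept

def mask_then_top_p(probs, allowed, p):
    # Strategy A: mask disallowed tokens, renormalize, then apply top_p.
    z = sum(probs[t] for t in allowed)
    renorm = [probs[t] / z if t in allowed else 0.0 for t in range(len(probs))]
    kept = top_p_set(renorm, p)
    z2 = sum(renorm[t] for t in kept)
    return {t: renorm[t] / z2 for t in kept}

def top_p_then_reject(probs, allowed, p):
    # Strategy B: apply top_p first, then condition on the allowed set.
    kept = [t for t in top_p_set(probs, p) if t in allowed]
    z = sum(probs[t] for t in kept)
    return {t: probs[t] / z for t in kept}

probs = [0.5, 0.3, 0.2]
allowed = {1, 2}
print(mask_then_top_p(probs, allowed, 0.7))   # {1: 0.6, 2: 0.4}
print(top_p_then_reject(probs, allowed, 0.7)) # {1: 1.0}
```

With pure temperature sampling or argmax there is no truncation step, so the order of masking and sampling doesn't matter and the two strategies coincide.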
Actually, let me keep this open until a Python interface is available. Right now, the Python bindings use Constraint, which wraps TokenParser; that may not be the best way forward.
Does llguidance support a state rollback primitive for use in draft-model speculative decoding (where draft tokens are generated subject to guidance, but only a prefix of them is accepted for continued generation)?
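For clarity, this is the usage pattern being asked about, sketched against a toy stateful parser. `ToyParser` and its methods are purely illustrative stand-ins for a constraint engine, not the llguidance API:

```python
class ToyParser:
    """A toy constraint engine with a rollback primitive (illustrative only)."""

    def __init__(self, allowed_next):
        # allowed_next: maps a tuple of consumed tokens to the set of
        # tokens permitted next.
        self.allowed_next = allowed_next
        self.tokens = []

    def mask(self):
        # Set of tokens allowed at the current parser state.
        return self.allowed_next.get(tuple(self.tokens), set())

    def consume(self, tok):
        assert tok in self.mask()
        self.tokens.append(tok)

    def rollback(self, n):
        # Discard the last n consumed tokens (the rejected draft suffix).
        del self.tokens[len(self.tokens) - n:]

# Speculative step: the draft model proposes 3 tokens under guidance,
# but the target model accepts only the first 2.
grammar = {(): {1}, (1,): {2}, (1, 2): {3, 4}, (1, 2, 3): {9}}
p = ToyParser(grammar)
for tok in [1, 2, 3]:   # advance through the draft tokens
    p.consume(tok)
p.rollback(1)           # target rejected the last draft token
print(p.tokens)         # [1, 2]
print(p.mask())         # {3, 4} -- mask for the resampled position
```

Without a rollback primitive, the fallback is to re-feed the accepted prefix through a fresh parser, which costs O(prefix length) per speculative step instead of O(rejected suffix length).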
As of now, the only structured-output backend in vLLM that supports this feature is xGrammar. I am curious whether this exists in llguidance, or whether it is on the roadmap and compatible with the design.
Thanks to all maintainers for a great contribution to the open-source community.