
Rollback support for speculative decoding? #117

Open
benchislett opened this issue Feb 7, 2025 · 2 comments · Fixed by #126

Comments

@benchislett

Does llguidance support a state rollback primitive for use in draft-model speculative decoding (where some tokens need to be generated subject to guidance, and then only some of those tokens are accepted for continued generation)?

As of now, the only structured output backend in vLLM which supports this feature is xGrammar. I am curious if this exists in llguidance, or if it is on the roadmap / compatible with the design.

Thanks to all maintainers for a great contribution to the open-source community.

@mmoskal
Collaborator

mmoskal commented Feb 7, 2025

Rollback is currently not implemented, but it wouldn't be super-hard to add.

However, there are two other APIs that are relevant:

  • you can clone the whole constraint, either sharing or not sharing lexer state (the lexer state is protected by a mutex, so if it is shared, the clones cannot compute masks in parallel)
  • you can validate a number of tokens (the validate_tokens_raw() method) in the current context, without modifying the state of the constraint - this is quite cheap
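Combined, the two APIs above give a rollback substitute: check the draft tokens cheaply without mutating state, then advance a clone while the original stays put. A toy sketch of that pattern - `ToyConstraint`, its digit-only "grammar", and the method names other than `validate_tokens_raw()` are my assumptions for illustration, not the actual llguidance Python API:

```python
import copy

class ToyConstraint:
    """Toy stand-in for a grammar constraint (NOT llguidance):
    it accepts only digit tokens, tracking what was consumed."""
    def __init__(self):
        self.consumed = []

    def validate_tokens_raw(self, tokens):
        # Return how many leading tokens are acceptable from the
        # current state, without mutating it (the cheap check).
        n = 0
        for t in tokens:
            if not t.isdigit():
                break
            n += 1
        return n

    def clone(self):
        # Deep-copy so speculative work advances a scratch copy.
        return copy.deepcopy(self)

    def consume_token(self, tok):
        assert self.validate_tokens_raw([tok]) == 1
        self.consumed.append(tok)

def speculate(constraint, draft_tokens):
    # Cheap, non-mutating validation of the draft sequence.
    n_valid = constraint.validate_tokens_raw(draft_tokens)
    accepted = draft_tokens[:n_valid]
    # Advance a clone; the untouched original substitutes for rollback.
    scratch = constraint.clone()
    for tok in accepted:
        scratch.consume_token(tok)
    return accepted, scratch

c = ToyConstraint()
accepted, scratch = speculate(c, ["1", "2", "x", "3"])
print(accepted)          # ['1', '2']
print(c.consumed)        # [] - original untouched
print(scratch.consumed)  # ['1', '2']
```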

Another API we may want to add is compute_mask_after_tokens(), which would save the constraint state, consume a number of tokens, compute the mask, and restore the state (this would be easier than a general rollback).
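The proposed compute_mask_after_tokens() amounts to save, consume, compute, restore - which can equivalently be done by working on a copy and discarding it. A sketch under invented assumptions (`ToyDigits` is a made-up stand-in grammar, not llguidance):

```python
import copy

class ToyDigits:
    """Toy grammar (an assumption, not llguidance): at most 3 digit
    tokens total; compute_mask() returns the allowed next tokens."""
    def __init__(self):
        self.count = 0

    def consume_token(self, tok):
        assert tok in self.compute_mask()
        if tok != "<eos>":
            self.count += 1

    def compute_mask(self):
        if self.count < 3:
            return {"0", "1", "2", "<eos>"}
        return {"<eos>"}

def compute_mask_after_tokens(constraint, tokens):
    # "Save" by deep-copying, consume the tokens on the copy, and
    # compute the mask there; the original is never touched, so no
    # explicit restore is needed - easier than general rollback.
    scratch = copy.deepcopy(constraint)
    for t in tokens:
        scratch.consume_token(t)
    return scratch.compute_mask()

c = ToyDigits()
mask = compute_mask_after_tokens(c, ["1", "2", "0"])
print(sorted(mask))  # ['<eos>'] - after 3 digits only <eos> remains
print(c.count)       # 0 - original state untouched
```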

In some situations it won't be possible to compute masks for all draft tokens, so one would have to fall back to rejection sampling. Note that rejection sampling is not equivalent to mask-and-sample under top-p/top-k sampling (but is equivalent under temperature scaling and argmax).

Let me know if any of these help!

@mmoskal
Copy link
Collaborator

mmoskal commented Feb 21, 2025

Actually, let me keep this open until the Python interface is available. Right now, Python uses Constraint, which wraps TokenParser, and that may not be the best way forward.

@mmoskal mmoskal reopened this Feb 21, 2025