Reimplement guided decoding with xgrammar for PyTorch Engine #4028
Motivation
The original outlines-based guided decoding in the PyTorch engine has three major problems:

1. Token-level mismatch & poor performance: outlines works on characters, not on tokens.
2. Outdated & incompatible dependency: `outlines<0.1.0`, which is more than one year old and hard-coupled to numpy 1.x.
3. Life-cycle bug.

xgrammar is a token-level, GPU-native grammar engine:
Modification
- `guided_process.py` (new): a `GuidedDecodingManager` that wraps `xgrammar.compile_json_schema()` / `compile_regex_grammar()` for grammar compilation and `allocate_token_bitmask()` / `apply_token_bitmask_inplace()` for token masking. Compiled processors are kept in a processor cache keyed by `session_id + seq_id`.
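A minimal sketch of that per-sequence processor cache, in plain Python. `compile_fn` stands in for the xgrammar compile calls and is hypothetical; the real manager holds compiled grammars, not dicts:

```python
from typing import Callable, Dict, Tuple


class GuidedDecodingManager:
    """Sketch: cache one compiled grammar processor per (session_id, seq_id)."""

    def __init__(self, compile_fn: Callable[[str], object]):
        # compile_fn stands in for xgrammar.compile_json_schema() /
        # compile_regex_grammar(); the real calls return a compiled grammar.
        self._compile = compile_fn
        self._processors: Dict[Tuple[int, int], object] = {}

    def get_processor(self, session_id: int, seq_id: int, schema: str):
        # compile once per sequence, then reuse across decode steps
        key = (session_id, seq_id)
        if key not in self._processors:
            self._processors[key] = self._compile(schema)
        return self._processors[key]

    def remove(self, session_id: int, seq_id: int) -> None:
        # drop the processor when the sequence ends
        self._processors.pop((session_id, seq_id), None)
```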
- `logits_process.py`: removes `_guided_sampling` (and `guided_input_ids`, …). `FusedLogitsProcessor` receives a `GuidedDecodingManager` instance and, in `forward()`:
  - batch-allocates one bitmask tensor
  - fills it for every guided sequence
  - applies it in-place on the GPU

  `accept_token()` advances each matcher.
- `model_agent.py` / `sampling.py`:
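The bitmask mechanics can be sketched without GPU code: one packed bitmask row per guided sequence, applied by forcing disallowed logits to -inf. This is a plain-Python illustration with hypothetical names; xgrammar's `allocate_token_bitmask()` / `apply_token_bitmask_inplace()` do the same thing on tensors:

```python
import math
from typing import List, Set

WORD = 32  # pack the vocab bitmask into 32-bit words


def allocate_bitmask(batch: int, vocab: int) -> List[List[int]]:
    # one row of ceil(vocab / 32) words per sequence, all bits cleared
    return [[0] * ((vocab + WORD - 1) // WORD) for _ in range(batch)]


def fill_bitmask(row: List[int], allowed: Set[int]) -> None:
    # set the bit for every token the grammar currently allows
    for tok in allowed:
        row[tok // WORD] |= 1 << (tok % WORD)


def apply_bitmask_inplace(logits: List[float], row: List[int]) -> None:
    # mask disallowed tokens so sampling can never pick them
    for tok in range(len(logits)):
        if not (row[tok // WORD] >> (tok % WORD)) & 1:
            logits[tok] = -math.inf
```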
  `model_agent` keeps the singleton `GuidedDecodingManager`. `ARSamplingStrategy` builds `session_ctx` (session/seq IDs) and a `session_to_cleanup` list; `SamplingInputs` carries the two new fields instead of `guided_input_ids`.
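The new plumbing can be pictured roughly as follows. The field names come from this PR, but the dataclass shape itself is a hypothetical sketch, not the actual lmdeploy definition:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SamplingInputs:
    # session_ctx maps each batch index to its (session_id, seq_id),
    # replacing the old guided_input_ids field
    session_ctx: Dict[int, Tuple[int, int]] = field(default_factory=dict)
    # sessions whose processors should be deleted on the next forward
    session_to_cleanup: List[int] = field(default_factory=list)
```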
- `engine.py`: `end_session()` now calls `sampling_strategy.on_session_end()` → `session_to_cleanup` → the next forward deletes the processors, guaranteeing immediate release.
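The deferred-release flow above, sketched in plain Python. Apart from `session_to_cleanup` and `on_session_end()`, the names here are illustrative:

```python
class SamplingStrategy:
    """Sketch: defer processor deletion to the next forward pass."""

    def __init__(self, processors):
        # processors: {(session_id, seq_id): compiled matcher}
        self.processors = processors
        self.session_to_cleanup = []

    def on_session_end(self, session_id: int) -> None:
        # called from engine.end_session(); just record the id
        self.session_to_cleanup.append(session_id)

    def forward(self) -> None:
        # drop every processor belonging to an ended session,
        # then clear the pending list
        for sid in self.session_to_cleanup:
            for key in [k for k in self.processors if k[0] == sid]:
                del self.processors[key]
        self.session_to_cleanup.clear()
```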
- requirements: remove `outlines<0.1.0`, add `xgrammar` for all backends (cuda/rocm/ascend/camb/maca).
- tests