Guided decoding with xgrammar for TurboMind #3965
Conversation
good job!
Done
Should be tentatively solved in #4028
May update the "structed_output.md"
Done
Motivation
LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks guided decoding, a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native guided decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / choice / regex grammars into token-level FSMs and applies them with negligible overhead.
Modification
Build-system
- Adds `xgrammar` as a header-only dependency via CMake `FetchContent` (CUDA & Python bindings disabled).
- Exposes the `xgrammar::tokenizer_info` and `xgrammar::grammar_compiler` symbols under `lmdeploy::xgrammar`.

Core C++ changes
- The `DynamicDecodeLayer` pipeline is extended with two new layers (see the sketch after this list):
  - `GuidedDecodeMaskLayer`: in `setup()`, compiles or reuses the grammar and builds a per-request token bitmask; in `forward()`, launches a light CUDA kernel to mask disallowed logits to `-INF`.
  - `GuidedDecodeUpdateLayer`: in `forward()`, calls `matcher->AcceptToken(output_id)` to advance the FSM.
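For reference, the mask-then-accept loop these two layers implement can be sketched with xgrammar's Python bindings; the PR itself does this in C++/CUDA inside `DynamicDecodeLayer`, and the tokenizer, logits, and greedy sampling below are placeholders, not LMDeploy internals:

```python
# Sketch of the mask -> sample -> update flow, assuming xgrammar's
# Python bindings. `tokenizer` and `logits` are stand-ins.
import torch
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)      # cached per grammar
compiled = compiler.compile_builtin_json_grammar()  # or compile_json_schema(...)
matcher = xgr.GrammarMatcher(compiled)              # one per request

vocab_size = tokenizer_info.vocab_size
bitmask = xgr.allocate_token_bitmask(1, vocab_size)  # int32, 32 tokens/word
logits = torch.randn(1, vocab_size)                  # stand-in for model output

# GuidedDecodeMaskLayer::forward(): build the bitmask for the current
# FSM state, then set every disallowed logit to -inf before sampling.
matcher.fill_next_token_bitmask(bitmask)
xgr.apply_token_bitmask_inplace(logits, bitmask)

# Sample from the masked distribution (greedy here for simplicity).
token_id = int(logits.argmax(dim=-1))

# GuidedDecodeUpdateLayer::forward(): advance the FSM with the token.
assert matcher.accept_token(token_id)
```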
Python frontend
- Re-uses the `guided_decoding` utilities from the PyTorch backend; no new API surface.
- `turbo.TurboMindEngine` now accepts the same `response_format=` / `guided_json=` / `guided_choice=` arguments (see the usage sketch below).
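On the frontend, usage through the high-level `pipeline` API might look like the sketch below, assuming the `response_format` contract stays identical to the PyTorch backend's structured-output support; the model name and schema are placeholders:

```python
# Hypothetical usage sketch: JSON-schema guided decoding on TurboMind,
# assuming response_format behaves exactly as in the PyTorch backend.
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat",  # placeholder model
                backend_config=TurbomindEngineConfig())

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
gen_config = GenerationConfig(response_format=dict(
    type="json_schema", json_schema=dict(name="person", schema=schema)))

print(pipe(["Describe a person as JSON."], gen_config=gen_config))
```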