Guided decoding with xgrammar for TurboMind #3965
Conversation
good job!
Done
Should be tentatively solved in #4028
May update the "structed_output.md"
Done
Motivation
LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks guided decoding, a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native guided decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / choice / regex grammars into token-level FSMs and applies them with negligible overhead.
Modification
Build-system
- Adds `xgrammar` as a header-only dependency via CMake `FetchContent` (CUDA & Python bindings disabled).
- Exposes the `xgrammar::tokenizer_info` and `xgrammar::grammar_compiler` symbols under `lmdeploy::xgrammar`.

Core C++ changes
- The `DynamicDecodeLayer` pipeline is extended with two new layers (see the sketch after this list):
  - `GuidedDecodeMaskLayer`: in `setup()`, compiles or reuses the grammar and builds a per-request token bitmask; in `forward()`, launches a light CUDA kernel to mask disallowed logits to `-INF`.
  - `GuidedDecodeUpdateLayer`: in `forward()`, calls `matcher->AcceptToken(output_id)` to advance the FSM.
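For reference, the mask-then-accept loop these two layers implement can be sketched with xgrammar's Python bindings; the PR itself does this in C++/CUDA inside `DynamicDecodeLayer`, and the tokenizer, logits, and greedy sampling below are placeholders, not LMDeploy internals:

```python
# Sketch of the mask -> sample -> update flow, assuming xgrammar's
# Python bindings. `tokenizer` and `logits` are stand-ins.
import torch
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)      # cached per grammar
compiled = compiler.compile_builtin_json_grammar()  # or compile_json_schema(...)
matcher = xgr.GrammarMatcher(compiled)              # one per request

vocab_size = tokenizer_info.vocab_size
bitmask = xgr.allocate_token_bitmask(1, vocab_size)  # int32, 32 tokens/word
logits = torch.randn(1, vocab_size)                  # stand-in for model output

# GuidedDecodeMaskLayer::forward(): build the bitmask for the current
# FSM state, then set every disallowed logit to -inf before sampling.
matcher.fill_next_token_bitmask(bitmask)
xgr.apply_token_bitmask_inplace(logits, bitmask)

# Sample from the masked distribution (greedy here for simplicity).
token_id = int(logits.argmax(dim=-1))

# GuidedDecodeUpdateLayer::forward(): advance the FSM with the token.
assert matcher.accept_token(token_id)
```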
Python frontend
- Re-uses the `guided_decoding` utilities from the PyTorch backend; no new API surface.
- `turbo.TurboMindEngine` now accepts the same `response_format=` / `guided_json=` / `guided_choice=` arguments (see the usage sketch below).
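On the frontend, usage through the high-level `pipeline` API might look like the sketch below, assuming the `response_format` contract stays identical to the PyTorch backend's structured-output support; the model name and schema are placeholders:

```python
# Hypothetical usage sketch: JSON-schema guided decoding on TurboMind,
# assuming response_format behaves exactly as in the PyTorch backend.
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat",  # placeholder model
                backend_config=TurbomindEngineConfig())

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
gen_config = GenerationConfig(response_format=dict(
    type="json_schema", json_schema=dict(name="person", schema=schema)))

print(pipe(["Describe a person as JSON."], gen_config=gen_config))
```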