
Conversation

@windreamer (Collaborator) commented Sep 12, 2025

Motivation

LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding, a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token-level FSMs and applies them with negligible overhead.

Modification

  1. Build-system

    • Add xgrammar as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
    • Export xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols under lmdeploy::xgrammar.
  2. Core C++ changes

    • DynamicDecodeLayer pipeline extended with two new layers:
      • GuidedDecodeMaskLayer: in setup(), compiles or reuses the grammar and builds a per-request token bitmask; in forward(), launches a lightweight CUDA kernel that masks disallowed logits to -INF (a conceptual sketch of the mask/update flow follows this list).
      • GuidedDecodeUpdateLayer: in forward(), calls matcher->AcceptToken(output_id) to advance the FSM.
    • Grammar compiler cache (LRU, keyed by schema hash) shared across all sessions to avoid re-compilation (a minimal cache sketch also follows this list).
  3. Python frontend

    • Re-use the existing guided_decoding utilities from the PyTorch backend; no new API surface.
    • turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments (a usage sketch follows this list).
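
For illustration, here is a conceptual Python sketch of what the two decode layers do, written against xGrammar's documented Python bindings rather than the C++ API this PR actually uses; the model name is a placeholder and the exact call signatures should be treated as assumptions from xGrammar's quick-start.

```python
# Conceptual Python equivalent of GuidedDecodeMaskLayer / GuidedDecodeUpdateLayer.
# The PR implements this in C++ inside DynamicDecodeLayer; names here follow
# xGrammar's Python quick-start and are illustrative only.
import torch
import xgrammar as xgr
from transformers import AutoConfig, AutoTokenizer

model_id = "internlm/internlm2_5-7b-chat"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer,
                                                    vocab_size=config.vocab_size)
compiler = xgr.GrammarCompiler(tokenizer_info)
compiled = compiler.compile_builtin_json_grammar()  # or compile_json_schema(...)
matcher = xgr.GrammarMatcher(compiled)

# "Mask" step: build a token bitmask and push disallowed logits to -inf.
bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
matcher.fill_next_token_bitmask(bitmask)
logits = torch.randn(1, config.vocab_size)  # stand-in for model output
xgr.apply_token_bitmask_inplace(logits, bitmask)

# Sampling happens here; then the "update" step advances the FSM,
# mirroring matcher->AcceptToken(output_id) in the C++ layer.
next_token = int(torch.argmax(logits, dim=-1))
assert matcher.accept_token(next_token)
```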
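
The compiler cache amortizes grammar compilation across requests. Below is a minimal sketch of the idea (key by schema hash, evict least-recently-used entries); the class name and capacity are hypothetical stand-ins, not the PR's actual C++ implementation.

```python
# Minimal sketch of an LRU grammar-compiler cache keyed by schema hash.
# Hypothetical stand-in for the PR's C++ cache; capacity is arbitrary.
import hashlib
from collections import OrderedDict

class GrammarCache:
    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._entries: "OrderedDict[str, object]" = OrderedDict()

    def get_or_compile(self, schema: str, compile_fn):
        key = hashlib.sha256(schema.encode("utf-8")).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as recently used
            return self._entries[key]
        grammar = compile_fn(schema)            # expensive: compile once
        self._entries[key] = grammar
        if len(self._entries) > self.capacity:  # evict least-recently-used
            self._entries.popitem(last=False)
        return grammar
```

Sharing cache entries across sessions is safe because compiled grammars are stateless; per-request state lives in the matcher, a distinction that comes up again later in this thread.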
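
Since the API mirrors the PyTorch backend, existing structured-output calls should work unchanged on TurboMind. A usage sketch under that assumption; the model name, schema, and response_format layout follow the PyTorch backend's existing convention and are illustrative.

```python
# Usage sketch: guided decoding through the TurboMind backend.
# Assumes the PyTorch backend's response_format convention carries over
# unchanged; model name and schema below are placeholders.
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

guide = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=TurbomindEngineConfig())
response = pipe(
    ["Introduce yourself as a JSON object."],
    gen_config=GenerationConfig(
        response_format=dict(
            type="json_schema",
            json_schema=dict(name="profile", schema=guide),
        )),
)
print(response[0].text)
```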

Checklist

  • Pre-commit hooks (clang-format, flake8, mypy) passed.
  • Documentation updated

@windreamer changed the title from "Guided decoding with xgrammar" to "[WIP] Guided decoding with xgrammar" on Sep 12, 2025
@windreamer force-pushed the guided_decoding_with_xgrammar branch 3 times, most recently from 8b3e766 to 8fd6d05 on September 12, 2025 09:44
@shell-nlp (Contributor) commented

good job!

@windreamer force-pushed the guided_decoding_with_xgrammar branch 25 times, most recently from 0362250 to 8bcbfff on September 22, 2025 12:41
@windreamer force-pushed the guided_decoding_with_xgrammar branch from 9817089 to 4516ac7 on October 9, 2025 07:35
@windreamer changed the title from "Guided decoding with xgrammar" to "Guided decoding with xgrammar for TurboMind" on Oct 9, 2025
@windreamer marked this pull request as ready for review on October 9, 2025 07:36
@windreamer (Collaborator, Author) commented

> Could we split this PR into two separate ones? One for the TurboMind engine and another for the PyTorch engine.

Done

@windreamer (Collaborator, Author) commented

> I don't know much about guided decoding, but I think there are bugs in the PyTorch implementation (on the main branch): the matcher is maintained in instances of RegexLogitsProcessor or JSONLogitsProcessor, and _get_guided_logits_processor only caches 32 instances. Different requests with the same guide would get the same (stateful) processor, and old processors would be evicted once more than 32 guided requests come in.

You are right! It is a bit tough...

Should be tentatively resolved in #4028

@windreamer requested a review from lzhangzz on October 13, 2025 09:01
@lvhan028 (Collaborator) commented

Please update "structed_output.md"

@windreamer (Collaborator, Author) commented

> Please update "structed_output.md"

Done

@lvhan028 merged commit aef6363 into InternLM:main on Oct 13, 2025
9 checks passed
@windreamer deleted the guided_decoding_with_xgrammar branch on October 13, 2025 11:39

Labels

enhancement (New feature or request)


Development

Successfully merging this pull request may close these issues.

[Feature] Will the turbomind backend support guided_decoding?
