
Conversation

@windreamer commented Oct 9, 2025

Motivation

The original outlines-based guided decoding in the PyTorch engine has three major problems:

  1. Token-level mismatch & poor performance
    Outlines works on characters, not on tokens.

    • It builds the FSM on the string vocabulary and then maps every character transition back to the token space at each forward pass.
    • This character-to-token conversion dominates the latency; for long schemas or large vocabularies the overhead is often larger than the model forward pass itself.
  2. Outdated & incompatible dependency

    • We are pinned to outlines<0.1.0, which is more than a year old and hard-coupled to numpy 1.x.
    • Upgrading to the newest outlines requires a full rewrite of our logits-processor layer because the internal FSM and tokenizer APIs have been redesigned and no longer expose the hooks we rely on.
  3. Life-cycle bug

    • The global LRU cache (size=32) keeps processors alive across sessions.
    • When the 33rd distinct guide appears, an old but still-running processor is evicted and its matcher state is lost; the next request with the same guide reuses a dirty matcher and generates illegal tokens.

xgrammar is a token-level, GPU-native grammar engine:

  • The FSM is compiled directly on the tokenizer vocabulary, eliminating the character-token round trip.
  • Allowed-token bitmasks are generated per decoding step and applied to the logits in place on the GPU, giving excellent performance (see the sketch below).
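For context, a minimal sketch of this token-level flow (not code from this PR). It assumes the public xgrammar Python API (`TokenizerInfo.from_huggingface`, `GrammarCompiler.compile_json_schema`, `GrammarMatcher`, `allocate_token_bitmask`, `apply_token_bitmask_inplace`, `accept_token`); the model id and schema are placeholders.

```python
# Illustrative sketch of xgrammar's token-level guided decoding flow.
import torch
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-id")  # placeholder model id
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)

# Compile once per schema; the FSM lives in token space, so there is no
# per-step character-to-token mapping.
compiled = compiler.compile_json_schema('{"type": "object"}')
matcher = xgr.GrammarMatcher(compiled)

# Per decoding step: fill a bitmask of allowed tokens and mask the logits
# in place (assumes a CUDA device).
bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
matcher.fill_next_token_bitmask(bitmask, index=0)
logits = torch.randn(1, tokenizer_info.vocab_size, device="cuda")
xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))

# After sampling, advance the matcher with the chosen token.
next_token = int(torch.argmax(logits, dim=-1))
matcher.accept_token(next_token)
```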

Modification

  1. guided_process.py

    • New GuidedDecodingManager that wraps xgrammar.
    • compile_json_schema() / compile_regex_grammar()
    • allocate_token_bitmask() / apply_token_bitmask_inplace()
    • Per-(session_id, seq_id) processor cache
  2. logits_process.py

    • Remove all outlines glue code (_guided_sampling, guided_input_ids, …)
    • FusedLogitsProcessor receives a GuidedDecodingManager instance
    • Inside forward() (see the sketch after this list):
      – batch-allocate one bitmask tensor
      – fill it for every guided sequence
      – apply it in place on the GPU
    • After sampling, accept_token() advances each matcher
  3. model_agent.py / sampling.py

    • model_agent keeps the singleton GuidedDecodingManager
    • ARSamplingStrategy builds session_ctx (session/seq IDs) and the session_to_cleanup list
    • SamplingInputs carries the two new fields instead of guided_input_ids
  4. engine.py

    • end_session() now calls sampling_strategy.on_session_end(), which adds the session to session_to_cleanup; the next forward pass deletes the corresponding processors, guaranteeing immediate release.
  5. requirements

    • Drop outlines<0.1.0, add xgrammar for all backends (cuda/rocm/ascend/camb/maca).
  6. tests

    • Re-enable PyTorch-backend grammar tests that were previously skipped.
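The sketch below illustrates how the pieces above could fit together. The names `GuidedDecodingManager`, `session_ctx`, and `session_to_cleanup` mirror this description, but the class layout and method bodies are illustrative assumptions, not the merged code.

```python
# Hypothetical sketch of the per-step guided-decoding flow (illustrative only).
from dataclasses import dataclass, field
from typing import Dict, Tuple

import xgrammar as xgr


@dataclass
class GuidedDecodingManager:
    """Caches one GrammarMatcher per (session_id, seq_id)."""
    tokenizer_info: xgr.TokenizerInfo
    compiler: xgr.GrammarCompiler = None
    matchers: Dict[Tuple[int, int], xgr.GrammarMatcher] = field(default_factory=dict)

    def __post_init__(self):
        self.compiler = xgr.GrammarCompiler(self.tokenizer_info)

    def get_matcher(self, session_id: int, seq_id: int, json_schema: str):
        # Compile and cache lazily; one matcher per sequence keeps state isolated.
        key = (session_id, seq_id)
        if key not in self.matchers:
            compiled = self.compiler.compile_json_schema(json_schema)
            self.matchers[key] = xgr.GrammarMatcher(compiled)
        return self.matchers[key]

    def release_session(self, session_id: int):
        # Called when a session ends: drop every matcher belonging to it.
        for key in [k for k in self.matchers if k[0] == session_id]:
            del self.matchers[key]


def apply_guided_mask(manager, session_ctx, logits):
    """Batch-allocate one bitmask, fill it per guided row, apply on the GPU.

    Assumes every row in the batch is guided; unguided rows would need
    separate handling.
    """
    bitmask = xgr.allocate_token_bitmask(logits.shape[0], manager.tokenizer_info.vocab_size)
    for row, (session_id, seq_id, schema) in enumerate(session_ctx):
        matcher = manager.get_matcher(session_id, seq_id, schema)
        matcher.fill_next_token_bitmask(bitmask, index=row)
    xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
    return logits


def accept_sampled_tokens(manager, session_ctx, next_tokens):
    # After sampling: advance each matcher with its sampled token.
    for row, (session_id, seq_id, schema) in enumerate(session_ctx):
        manager.get_matcher(session_id, seq_id, schema).accept_token(int(next_tokens[row]))
```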

@windreamer changed the title from "Reimplement guided decoding with xgrammar for TurboMind" to "Reimplement guided decoding with xgrammar for Pytorch" on Oct 9, 2025
@windreamer force-pushed the guided_decoding_with_xgrammar_pt branch 2 times, most recently from 893563c to 0b2c106, on October 10, 2025 07:33
@windreamer changed the title from "Reimplement guided decoding with xgrammar for Pytorch" to "Reimplement guided decoding with xgrammar for PyTorch Engine" on Oct 11, 2025
@windreamer force-pushed the guided_decoding_with_xgrammar_pt branch from 2a3bbdf to 18628b4 on October 11, 2025 07:58
@windreamer force-pushed the guided_decoding_with_xgrammar_pt branch from 65afd9b to 443db85 on October 13, 2025 11:39
@windreamer marked this pull request as ready for review on October 13, 2025 11:40
@windreamer requested a review from grimoire on October 13, 2025 11:41
@windreamer force-pushed the guided_decoding_with_xgrammar_pt branch from d4b6ddb to b7c5426 on October 15, 2025 05:04
@windreamer requested a review from grimoire on October 15, 2025 06:03
@grimoire left a comment

LGTM

@windreamer requested a review from lvhan028 on October 15, 2025 07:36
@lvhan028 added the enhancement (New feature or request) label on Oct 15, 2025
@lvhan028 merged commit 1d20160 into InternLM:main on Oct 15, 2025
22 checks passed
@windreamer deleted the guided_decoding_with_xgrammar_pt branch on October 16, 2025 02:21