fix: fix guided decoding state corruption in turbomind when tp>1 #4167

windreamer · 2025-11-28T07:37:17Z

Motivation

When serving models with tensor parallelism (--tp >= 2), enabling guided decoding (e.g., response_format: {"type": "json_schema"}) causes segmentation faults.

Root Cause: The guided decoding state management layers (GuidedDecodeMaskLayer and GuidedDecodeUpdateLayer) were being instantiated on all tensor parallelism (TP) ranks. Each rank independently modified shared decoding state, causing race conditions and memory corruption. In distributed inference, only rank 0 should orchestrate guided decoding logic while other ranks perform parallel computation.

Modification

Eliminate shared mutable state by allocating an independent GrammarMatcher instance per TP rank. Enhance the decoding pipeline with rank awareness so each thread operates on its own isolated matcher, ensuring thread safety while maintaining deterministic behavior across ranks.

Fixes #4152

irexyc · 2025-11-28T07:44:52Z

The sampled tokens are not broadcast from tp_rank0, so all ranks must run the same sampling process to make sure the next token is the same on every rank.

I think the problem may be that the state of GrammarMatcher cannot be shared by all ranks. Currently, all ranks share the same std::shared_ptr<xgrammar::GrammarMatcher> in the request.
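
To illustrate the aliasing described above, here is a minimal sketch. ToyMatcher, ShareOneMatcher, and StepAllRanks are hypothetical names invented for this example and are not part of xgrammar or turbomind. The ranks are simulated one after another, so the sketch only shows the logical double update; in the engine the ranks are concurrent threads, and unsynchronized access to one object is additionally a data race.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for xgrammar::GrammarMatcher: it only counts accepted tokens.
struct ToyMatcher {
    int accepted = 0;
    void AcceptToken(int /*token_id*/) { ++accepted; }
};

// Builds the handles that `tp` ranks would hold when the request owns a single matcher.
inline std::vector<std::shared_ptr<ToyMatcher>> ShareOneMatcher(int tp) {
    assert(tp > 0);
    auto matcher = std::make_shared<ToyMatcher>();
    // Copying a shared_ptr copies the handle, not the ToyMatcher behind it.
    return std::vector<std::shared_ptr<ToyMatcher>>(tp, matcher);
}

// Simulates one decoding step in which every rank updates the matcher it holds.
// Ranks run sequentially here; in the engine they would be concurrent threads.
inline void StepAllRanks(std::vector<std::shared_ptr<ToyMatcher>>& per_rank, int token_id) {
    for (auto& m : per_rank) {
        m->AcceptToken(token_id);
    }
}
```

With two ranks, a single decoding step advances the one underlying matcher twice, so its state no longer matches the generated sequence.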

windreamer · 2025-11-28T08:00:43Z

GrammarMatcher

OK, so we need to copy the GrammarMatcher instead?

windreamer · 2025-11-28T08:06:13Z

The sampled tokens are not broadcast from tp_rank0, so all ranks must run the same sampling process to make sure the next token is the same on every rank.

I think the problem may be that the state of GrammarMatcher cannot be shared by all ranks. Currently, all ranks share the same std::shared_ptr<xgrammar::GrammarMatcher> in the request.

I believe a quick fix would be to execute GuidedDecodeUpdateLayer only on rank 0, since that is the only place where we modify GrammarMatcher. But I have no idea how to ensure that GuidedDecodeUpdateLayer::Forward is called only after all ranks have finished their sampling.

irexyc · 2025-11-28T08:56:45Z

I believe a quick fix would be to execute GuidedDecodeUpdateLayer only on rank 0.

All ranks should have the same next token, and the next token is computed by dynamic decoding, so we should make sure the dynamic decoding process is the same on all ranks.

I think we can construct n_ranks matchers here, like r->matchers = ... https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/engine/model_request.cc#L131C9-L131C19

and choose the corresponding matcher here, like matchers_.push_back(r->matchers[rank_]);
https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/layers/sampling_layers/GuidedDecodeMaskLayer.cc#L36
https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/layers/sampling_layers/GuidedDecodeUpdateLayer.cc#L32
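
A minimal sketch of this suggestion, under stated assumptions: ToyMatcher, ToyRequest, MakeRequest, and ToyUpdateLayer are hypothetical stand-ins written for this example, not the actual turbomind or xgrammar types. Only the `r->matchers[rank_]` selection mirrors the snippet quoted above.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for xgrammar::GrammarMatcher: tracks bracket nesting depth.
class ToyMatcher {
public:
    // Returns false and leaves the state unchanged if the token is not allowed.
    bool AcceptToken(char token) {
        if (token == '{') { ++depth_; return true; }
        if (token == '}' && depth_ > 0) { --depth_; return true; }
        return false;
    }
    int depth() const { return depth_; }

private:
    int depth_ = 0;
};

// Hypothetical request type that holds one matcher per TP rank.
struct ToyRequest {
    std::vector<std::shared_ptr<ToyMatcher>> matchers;
};

// Creates n_ranks independent matchers (the `r->matchers = ...` step).
inline ToyRequest MakeRequest(int n_ranks) {
    ToyRequest r;
    r.matchers.reserve(n_ranks);
    for (int i = 0; i < n_ranks; ++i) {
        r.matchers.push_back(std::make_shared<ToyMatcher>());
    }
    return r;
}

// Hypothetical layer: at setup it keeps only the matcher that belongs to its own rank.
class ToyUpdateLayer {
public:
    explicit ToyUpdateLayer(int rank) : rank_(rank) {}

    void Setup(const ToyRequest& r) {
        assert(rank_ >= 0 && rank_ < static_cast<int>(r.matchers.size()));
        matchers_.push_back(r.matchers[rank_]);
    }

    // Feeds the sampled token to every matcher this rank tracks.
    void Forward(char token) {
        for (auto& m : matchers_) {
            m->AcceptToken(token);
        }
    }

private:
    int rank_;
    std::vector<std::shared_ptr<ToyMatcher>> matchers_;
};
```

Because every rank samples the same token and owns its own matcher, the matchers advance in lockstep without any cross-rank synchronization. The cost is one matcher instance per rank for each request.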

lzhangzz · 2025-12-01T10:21:58Z

src/turbomind/layers/sampling_layers/GuidedDecodeUpdateLayer.cc

Pass h_tp_group to GuidedDecodeUpdateLayer

Call h_tp_group->Sync() here so that all ranks have finished filling their host mask buffers before any rank tries to update matcher state.

Call AcceptToken only when h_tp_group->rank() == 0.

In addition, a stream sync is required after the copy. need_apply, as in GuidedDecodeMaskLayer, is needed here to avoid the copy / sync cost when guided decoding is not needed.
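
The control flow proposed in this review comment can be sketched as follows. This is an illustration under assumptions, not the actual layer code: ToyMatcher and UpdateStep are hypothetical, and the barrier is passed in as a callable standing in for h_tp_group->Sync() so that the sketch runs without a real thread group or CUDA stream.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for xgrammar::GrammarMatcher: records accepted tokens.
struct ToyMatcher {
    std::vector<int> accepted;
    void AcceptToken(int token_id) { accepted.push_back(token_id); }
};

// Sketch of one rank's update step. `sync` stands in for h_tp_group->Sync().
// Returns true if this rank updated the shared matcher.
inline bool UpdateStep(int rank,
                       bool need_apply,
                       const std::function<void()>& sync,
                       ToyMatcher& shared_matcher,
                       int token_id) {
    if (!need_apply) {
        // No guided request in this batch: skip the copy and the sync entirely.
        return false;
    }
    // The device-to-host copy of the sampled token and the stream sync would go here.
    sync();  // wait until every rank has finished filling its host mask buffer
    if (rank == 0) {
        shared_matcher.AcceptToken(token_id);
        return true;
    }
    return false;
}
```

Every rank has to reach the same need_apply decision. If one rank skipped the barrier while the others waited on it, the waiting ranks would block indefinitely.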

windreamer · 2025-12-01

Sadly, not only AcceptToken but also FillNextTokenBitmask modifies the shared matcher state. So if we need multiple syncs and a shared bitmask, I believe that will kill performance.

So I took @irexyc's advice to just duplicate the state for each thread.

lvhan028 · 2025-12-01T13:22:20Z

I vote for executing GuidedDecodeUpdateLayer only on rank 0 and calling GuidedDecodeUpdateLayer::Forward only after all ranks have finished sampling.
It doesn't make sense to me that ModelRequest holds a tp_size_ field.

windreamer requested review from irexyc and lzhangzz November 28, 2025 07:37

windreamer self-assigned this Nov 28, 2025

windreamer marked this pull request as ready for review November 28, 2025 07:37

windreamer marked this pull request as draft November 28, 2025 08:00

windreamer force-pushed the fix_guided_decoding_tp branch from 8ada1ea to e7a7055 Compare November 28, 2025 10:05

windreamer marked this pull request as ready for review November 28, 2025 10:11

fix: fix guided decoding state corruption in turbomind when tp>1

adb0148

windreamer force-pushed the fix_guided_decoding_tp branch from e7a7055 to adb0148 Compare November 28, 2025 10:27

irexyc approved these changes Nov 28, 2025

windreamer requested a review from lvhan028 November 29, 2025 03:22

lzhangzz reviewed Dec 1, 2025
