
UPSTREAM PR #21870: common: skip reasoning budget sampler when no budget is requested #1349

Open

loci-dev wants to merge 2 commits into main from loci/pr-21870-fix-reasoning-budget-skip-unlimited

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21870

Overview

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets created unconditionally even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3, which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens, but it still runs per-token checks (start-tag matching in the IDLE state, token-to-piece conversion plus UTF-8 checks in the COUNTING state).
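
To make that overhead concrete, here is a minimal sketch of the kind of per-token state machine described above, assuming an unlimited budget of INT_MAX. All identifiers (`budget_ctx`, `budget_apply`, the `<think>` tag literal) are illustrative assumptions, not the actual llama.cpp names:

```cpp
#include <climits>
#include <string>

// Illustrative sketch of the per-token work; not the actual llama.cpp sampler code.
enum class budget_state { IDLE, COUNTING, DONE };

struct budget_ctx {
    budget_state state   = budget_state::IDLE;
    int          n_think = 0;        // tokens seen inside the thinking block
    int          budget  = INT_MAX;  // "no budget" still pays for the checks below
    std::string  start_tag = "<think>";
    std::string  window;             // recent detokenized text used for tag matching
};

// Called once per sampled token, with the token already converted to a text piece.
void budget_apply(budget_ctx & ctx, const std::string & piece) {
    switch (ctx.state) {
        case budget_state::IDLE:
            // per-token string append + substring search for the start tag
            ctx.window += piece;
            if (ctx.window.find(ctx.start_tag) != std::string::npos) {
                ctx.state = budget_state::COUNTING;
                ctx.window.clear();
            }
            break;
        case budget_state::COUNTING:
            // per-token piece inspection (UTF-8 validation would happen here),
            // then a budget comparison that can never trigger for INT_MAX
            ctx.n_think++;
            if (ctx.n_think >= ctx.budget) {
                ctx.state = budget_state::DONE;  // a real sampler would force the end tag
            }
            break;
        case budget_state::DONE:
            break;
    }
}
```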

More importantly, the mere existence of the sampler (i.e., a non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU on every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
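
A minimal sketch of the resulting creation condition, using assumed names (`sampling_params_sketch`, `want_budget_sampler`); only the `reasoning_budget_tokens >= 0` check and the -1 default come from the change described above:

```cpp
#include <string>

// Hypothetical parameter subset, for illustration only.
struct sampling_params_sketch {
    std::string thinking_start_tag;
    std::string thinking_end_tag;
    int         reasoning_budget_tokens = -1;  // -1 = no budget requested (default)
};

// Sketch of the changed creation condition; the surrounding names are assumptions.
bool want_budget_sampler(const sampling_params_sketch & params) {
    const bool has_thinking_tags =
        !params.thinking_start_tag.empty() && !params.thinking_end_tag.empty();

    // Before: any model defining thinking tags got the sampler, even with the
    // default budget of -1, which also disabled backend sampling.
    // After:  -1 (unlimited) skips the sampler; 0, 128, 1024, ... still create it.
    return has_thinking_tags && params.reasoning_budget_tokens >= 0;
}
```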

Should fix #21784

Additional information

Note that I do not have a Vulkan environment, so I can reproduce neither the original degradation nor the performance effect of the fix. However, I did toggle the backend-sampling flag manually on CUDA, and my commit gave me around a 5% improvement (~215 t/s to ~225 t/s). In any case, the change is logically cleaner and should be an improvement in all cases.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES: I asked my local agent to identify potential sources of the performance regression on Vulkan, and then I manually tested and validated the logic.

Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so that the thinking-block grammar suppression from #20970 still works when tools are in use. This way, the sampler is skipped only when no budget is set and the grammar is not lazy.
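
Under the same assumptions as the earlier sketch, the combined condition would look roughly like this (illustrative signature; the parameter names mirror those referenced above):

```cpp
// Sketch of the refined condition after the review feedback; not the exact source.
bool want_budget_sampler_v2(bool has_thinking_tags, int reasoning_budget_tokens, bool grammar_lazy) {
    // Keep the sampler if an explicit budget was requested OR a lazy grammar
    // needs the thinking-block suppression from #20970; skip it only when
    // neither applies.
    return has_thinking_tags && (reasoning_budget_tokens >= 0 || grammar_lazy);
}
```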

loci-review Bot commented Apr 14, 2026

Overview

Analysis of 127,120 functions across 15 binaries shows minimal performance impact from commits optimizing sampler initialization logic. 74 functions modified (0.058%), 0 new, 0 removed. Changes isolated to common/sampling.cpp with no impact on critical inference paths (matrix operations, attention, quantization, KV cache).

Power consumption changes:

  • build.bin.llama-tts: -0.055%
  • build.bin.llama-bench: +0.005%
  • build.bin.llama-cvector-generator: -0.025%
  • build.bin.llama-tokenize: +0.023%
  • build.bin.llama-quantize: +0.041%
  • build.bin.libmtmd.so: -0.001%
  • build.bin.libllama.so: 0.0%
  • build.bin.libggml-cpu.so: 0.0%
  • build.bin.libggml.so: 0.0%
  • build.bin.libggml-base.so: 0.0%
  • build.bin.llama-minicpmv-cli: 0.0%
  • build.bin.llama-llava-cli: 0.0%
  • build.bin.llama-gguf-split: 0.0%
  • build.bin.llama-gemma3-cli: 0.0%
  • build.bin.llama-qwen2vl-cli: 0.0%

All changes within measurement noise (±0.055%).

Function Analysis

common_sampler_accept() (build.bin.llama-tts)

  • Response time: 655ns → 614ns (-6.3%, -41ns)
  • Throughput time: 221ns → 180ns (-18.6%, -41ns)
  • Source: Compiler optimization from sampler initialization changes in common/sampling.cpp. Function called per accepted token during inference; 41ns improvement compounds across generation sequences.

std::vector::end() (build.bin.llama-tts)

  • Response time: 265ns → 81ns (-69.3%, -183ns)
  • Throughput time: 243ns → 60ns (-75.4%, -183ns)
  • Source: Compiler optimization consolidated entry block (9→7 blocks), no source changes. STL template instantiation improvement.

httplib::encode_uri() (build.bin.llama-tts)

  • Response time: 851ns → 963ns (+13.2%, +112ns)
  • Throughput time: 670ns → 783ns (+16.8%, +112ns)
  • Source: Compiler-introduced entry-point trampoline overhead. Non-critical HTTP operation during initialization only.

Remaining analyzed functions are STL template instantiations showing compiler-driven variations (±40-180ns) in non-critical utility code with negligible practical impact.

Additional Findings

Core inference libraries (libllama.so, libggml-cpu.so, libggml.so, libggml-base.so) show 0.0% power consumption change, confirming no modifications to performance-critical operations. The source code changes successfully optimize sampler initialization (skip reasoning budget sampler when budget unlimited and lazy grammar inactive) without affecting inference performance. All 73 non-application functions are compiler optimization artifacts, not functional changes.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 7 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19.