
UPSTREAM PR #21870: common: skip reasoning budget sampler when no budget is requested #1349

Open

loci-dev wants to merge 2 commits into main from loci/pr-21870-fix-reasoning-budget-skip-unlimited

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21870

Overview

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets created unconditionally even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3, which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens, but it still runs per-token checks (start-tag matching in the IDLE state, token-to-piece conversion plus UTF-8 checks in the COUNTING state).
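
To make that overhead concrete, here is a minimal sketch of the kind of per-token state machine described above, assuming an unlimited budget of INT_MAX. All identifiers (`budget_ctx`, `budget_apply`, the `<think>` tag literal) are illustrative assumptions, not the actual llama.cpp names:

```cpp
#include <climits>
#include <string>

// Illustrative sketch of the per-token work; not the actual llama.cpp sampler code.
enum class budget_state { IDLE, COUNTING, DONE };

struct budget_ctx {
    budget_state state   = budget_state::IDLE;
    int          n_think = 0;        // tokens seen inside the thinking block
    int          budget  = INT_MAX;  // "no budget" still pays for the checks below
    std::string  start_tag = "<think>";
    std::string  window;             // recent detokenized text used for tag matching
};

// Called once per sampled token, with the token already converted to a text piece.
void budget_apply(budget_ctx & ctx, const std::string & piece) {
    switch (ctx.state) {
        case budget_state::IDLE:
            // per-token string append + substring search for the start tag
            ctx.window += piece;
            if (ctx.window.find(ctx.start_tag) != std::string::npos) {
                ctx.state = budget_state::COUNTING;
                ctx.window.clear();
            }
            break;
        case budget_state::COUNTING:
            // per-token piece inspection (UTF-8 validation would happen here),
            // then a budget comparison that can never trigger for INT_MAX
            ctx.n_think++;
            if (ctx.n_think >= ctx.budget) {
                ctx.state = budget_state::DONE;  // a real sampler would force the end tag
            }
            break;
        case budget_state::DONE:
            break;
    }
}
```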

More importantly, the mere existence of the sampler (i.e., a non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU on every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
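
A minimal sketch of the resulting creation condition, using assumed names (`sampling_params_sketch`, `want_budget_sampler`); only the `reasoning_budget_tokens >= 0` check and the -1 default come from the change described above:

```cpp
#include <string>

// Hypothetical parameter subset, for illustration only.
struct sampling_params_sketch {
    std::string thinking_start_tag;
    std::string thinking_end_tag;
    int         reasoning_budget_tokens = -1;  // -1 = no budget requested (default)
};

// Sketch of the changed creation condition; the surrounding names are assumptions.
bool want_budget_sampler(const sampling_params_sketch & params) {
    const bool has_thinking_tags =
        !params.thinking_start_tag.empty() && !params.thinking_end_tag.empty();

    // Before: any model defining thinking tags got the sampler, even with the
    // default budget of -1, which also disabled backend sampling.
    // After:  -1 (unlimited) skips the sampler; 0, 128, 1024, ... still create it.
    return has_thinking_tags && params.reasoning_budget_tokens >= 0;
}
```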

Should fix #21784

Additional information

Note that I do not have a Vulkan environment, so I can reproduce neither the original degradation nor the performance effect of the fix. However, I did toggle the backend-sampling flag manually on CUDA, and my commit gave me around a 5% improvement (~215 t/s to ~225 t/s). In any case, the change is logically cleaner and should be an improvement in all cases.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES: I asked my local agent to identify potential sources of the performance regression on Vulkan, and then I manually tested and validated the logic.

Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so that the thinking-block grammar suppression from #20970 still works when tools are in use. This way, the sampler is skipped only when no budget is set and the grammar is not lazy.
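
Under the same assumptions as the earlier sketch, the combined condition would look roughly like this (illustrative signature; the parameter names mirror those referenced above):

```cpp
// Sketch of the refined condition after the review feedback; not the exact source.
bool want_budget_sampler_v2(bool has_thinking_tags, int reasoning_budget_tokens, bool grammar_lazy) {
    // Keep the sampler if an explicit budget was requested OR a lazy grammar
    // needs the thinking-block suppression from #20970; skip it only when
    // neither applies.
    return has_thinking_tags && (reasoning_budget_tokens >= 0 || grammar_lazy);
}
```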

loci-review Bot commented Apr 14, 2026

Overview

Analysis of 127,120 functions across 15 binaries shows minimal performance impact from commits optimizing sampler initialization logic. 74 functions modified (0.058%), 0 new, 0 removed. Changes isolated to common/sampling.cpp with no impact on critical inference paths (matrix operations, attention, quantization, KV cache).

Power consumption changes:

  • build.bin.llama-tts: -0.055%
  • build.bin.llama-bench: +0.005%
  • build.bin.llama-cvector-generator: -0.025%
  • build.bin.llama-tokenize: +0.023%
  • build.bin.llama-quantize: +0.041%
  • build.bin.libmtmd.so: -0.001%
  • build.bin.libllama.so: 0.0%
  • build.bin.libggml-cpu.so: 0.0%
  • build.bin.libggml.so: 0.0%
  • build.bin.libggml-base.so: 0.0%
  • build.bin.llama-minicpmv-cli: 0.0%
  • build.bin.llama-llava-cli: 0.0%
  • build.bin.llama-gguf-split: 0.0%
  • build.bin.llama-gemma3-cli: 0.0%
  • build.bin.llama-qwen2vl-cli: 0.0%

All changes within measurement noise (±0.055%).

Function Analysis

common_sampler_accept() (build.bin.llama-tts)

  • Response time: 655ns → 614ns (-6.3%, -41ns)
  • Throughput time: 221ns → 180ns (-18.6%, -41ns)
  • Source: Compiler optimization from sampler initialization changes in common/sampling.cpp. Function called per accepted token during inference; 41ns improvement compounds across generation sequences.

std::vector::end() (build.bin.llama-tts)

  • Response time: 265ns → 81ns (-69.3%, -183ns)
  • Throughput time: 243ns → 60ns (-75.4%, -183ns)
  • Source: Compiler optimization consolidated entry block (9→7 blocks), no source changes. STL template instantiation improvement.

httplib::encode_uri() (build.bin.llama-tts)

  • Response time: 851ns → 963ns (+13.2%, +112ns)
  • Throughput time: 670ns → 783ns (+16.8%, +112ns)
  • Source: Compiler-introduced entry-point trampoline overhead. Non-critical HTTP operation during initialization only.

Remaining analyzed functions are STL template instantiations showing compiler-driven variations (±40-180ns) in non-critical utility code with negligible practical impact.

Additional Findings

Core inference libraries (libllama.so, libggml-cpu.so, libggml.so, libggml-base.so) show 0.0% power consumption change, confirming no modifications to performance-critical operations. The source code changes successfully optimize sampler initialization (skip reasoning budget sampler when budget unlimited and lazy grammar inactive) without affecting inference performance. All 73 non-application functions are compiler optimization artifacts, not functional changes.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 7 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19.