UPSTREAM PR #21870: common: skip reasoning budget sampler when no budget is requested (#1349)
Conversation
After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3, which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so that the thinking-block grammar suppression from #20970 still works when tools are in use. This way, the sampler is skipped only when no budget is set AND the grammar is not lazy.
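The combined condition after the follow-up can be sketched as below. This is a hedged illustration, not the actual llama.cpp code; `reasoning_budget_tokens` and `grammar_lazy` are stand-in names for the real fields.

```cpp
#include <cassert>

// Sketch: the reasoning budget sampler is created when a budget is explicitly
// set (>= 0) OR when the grammar is lazy (tools in use), so the #20970
// thinking-block suppression keeps working. Skipping it in the remaining case
// (unlimited budget, non-lazy grammar) keeps backend sampling enabled.
// Names are illustrative, not the actual llama.cpp identifiers.
bool should_create_rbudget_sampler(int reasoning_budget_tokens, bool grammar_lazy) {
    return reasoning_budget_tokens >= 0 || grammar_lazy;
}
```

Only the `(-1, false)` case skips the sampler; an explicit budget of 0 still creates it, which is what suppresses thinking output entirely.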
Overview

Analysis of 127,120 functions across 15 binaries shows minimal performance impact from commits optimizing sampler initialization logic. 74 functions modified (0.058%), 0 new, 0 removed. Changes isolated to non-critical utility code.

Power consumption changes: all within measurement noise (±0.055%).

Function Analysis

Remaining analyzed functions are STL template instantiations showing compiler-driven variations (±40-180 ns) in non-critical utility code with negligible practical impact.

Additional Findings

Core inference libraries (libllama.so, libggml-cpu.so, libggml.so, libggml-base.so) show 0.0% power consumption change, confirming no modifications to performance-critical operations. The source code changes successfully optimize sampler initialization (skip the reasoning budget sampler when the budget is unlimited and lazy grammar is inactive) without affecting inference performance. All 73 non-application functions are compiler optimization artifacts, not functional changes.

💬 Questions? Tag @loci-dev
Force-pushed from 7638ab4 to f1b46d5
Note
Source pull request: ggml-org/llama.cpp#21870
Overview
After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even if no budget is configured. The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).
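The per-token work described above can be illustrated with a minimal state machine. This is a hypothetical sketch of the behavior, not llama.cpp's implementation; in the real sampler the COUNTING branch also performs token-to-piece conversion and UTF-8 validation.

```cpp
#include <string>

// Sketch of the per-token overhead: in IDLE the sampler only matches pieces
// against the thinking start tag; once COUNTING, every sampled token is
// inspected (piece conversion + UTF-8 checks in the real code) and counted.
// Even with an unlimited budget, this ran on every generated token.
// Names and structure are illustrative.
enum class rb_state { IDLE, COUNTING };

struct rb_sampler_sketch {
    rb_state    state = rb_state::IDLE;
    std::string start_tag;
    int         counted = 0;

    void accept_piece(const std::string & piece) {
        if (state == rb_state::IDLE) {
            if (piece == start_tag) {       // start-tag matching
                state = rb_state::COUNTING;
            }
        } else {
            counted++;                      // per-token checks happen here
        }
    }
};
```

Because `accept_piece` runs for every token, the cost is paid regardless of whether the budget can ever be exhausted, which is why not creating the sampler at all is the cleaner fix for the unlimited case.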
More importantly, the existence of the sampler (i.e., non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).
So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
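The budget semantics before and after the fix can be sketched as follows. Function and variable names are hypothetical, chosen only to mirror the description above.

```cpp
#include <climits>

// Before the fix: -1 ("unlimited") was converted to INT_MAX and a sampler was
// created anyway, which disabled backend sampling and added per-token checks.
int effective_budget(int reasoning_budget_tokens) {
    return reasoning_budget_tokens < 0 ? INT_MAX : reasoning_budget_tokens;
}

// After the fix: the added `>= 0` gate means no sampler exists for the
// unlimited case, so backend sampling stays enabled.
bool sampler_created(int reasoning_budget_tokens) {
    return reasoning_budget_tokens >= 0;
}
```

An explicit budget of 0 is still a valid, finite budget and keeps creating the sampler; only the sentinel -1 is treated as "no budget requested".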
Should fix #21784
Additional information
Note that I do not have a Vulkan environment, so I can neither reproduce the original degradation nor verify that the fix restores performance there. However, I did toggle the backend-sampling flag manually on CUDA, and my commit gave me around a 5% improvement (~215 t/s vs ~225 t/s). In any case, the change is logically cleaner and should be an improvement in all cases.
Requirements