-
Notifications
You must be signed in to change notification settings - Fork 12.9k
Closed
Labels
Description
Name and Version
$./bin/llama-cli --version
version: 6161 (291f531cd)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.6.0
Operating systems
Mac
GGML backends
Metal
Hardware
M3 Max 64GB
Models
- Thinking model: IBM Granite 3.2 8b
- Non-thinking model: IBM Granite Code 2b
Problem description & steps to reproduce
Description
This issue is a bug in a recently introduced change to support the "enable_thinking"
toggle for Qwen3. It relates to this comment thread: https://github.com/ggml-org/llama.cpp/pull/13196/files#r2282737348.
The problem is that in response to https://github.com/ggml-org/llama.cpp/pull/13196/files#r2134714258, the logical condition of inputs.enable_thinking
was inverted in a056e53. The result is that for a model that does support thinking, but needs to explicitly disable it (--reasoning-budget 0
), attempting to apply the chat template (either for /apply-template
or a full prefill operation), will result in triggering the error "Assistant response prefill is incompatible with enable_thinking."
.
Repro
# Boot the server with Granite 3.2 8B and thinking disabled
./bin/llama-server -hf ibm-research/granite-3.2-8b-instruct-GGUF --jinja --reasoning-budget 0
# Send apply-template with a final assistant turn
curl http://localhost:8081/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}]}'
Response
{"error":{"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}}
Expected
{"prompt":"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: August 18, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}
First Bad Commit
Relevant log output
main: server is listening on http://127.0.0.1:8081 - starting the main loop
srv update_slots: all slots are idle
got exception: {"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}
srv log_server_r: request: POST /apply-template 127.0.0.1 500
arichiardi