Eval bug: Thinking model with thinking disabled cannot use /apply-template with final assistant turn #15401

@gabe-l-hart

Name and Version

$ ./bin/llama-cli --version
version: 6161 (291f531cd)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.6.0

Operating systems

Mac

GGML backends

Metal

Hardware

M3 Max 64GB

Models

  • Thinking model: IBM Granite 3.2 8b
  • Non-thinking model: IBM Granite Code 2b

Problem description & steps to reproduce

Description

This is a bug in the recently introduced support for the "enable_thinking" toggle for Qwen3. It relates to this comment thread: https://github.com/ggml-org/llama.cpp/pull/13196/files#r2282737348.

The problem is that, in response to https://github.com/ggml-org/llama.cpp/pull/13196/files#r2134714258, the logical condition on inputs.enable_thinking was inverted in a056e53. As a result, for a model that supports thinking but has it explicitly disabled (--reasoning-budget 0), applying the chat template to a conversation that ends with an assistant turn (either via /apply-template or a full prefill operation) triggers the error "Assistant response prefill is incompatible with enable_thinking.".
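To make the inversion concrete, here is a minimal Python model of the behavior described above. It is not the llama.cpp code: the function names and the two-argument signature are hypothetical, and the sketch only assumes that the check depends on whether thinking is enabled and whether the last message is an assistant turn.

```python
ERROR = "Assistant response prefill is incompatible with enable_thinking."


def prefill_check_reported(enable_thinking: bool, last_role: str) -> None:
    """Hypothetical model of the behavior reported in this issue:
    the prefill is rejected when thinking is DISABLED."""
    if last_role == "assistant" and not enable_thinking:
        raise RuntimeError(ERROR)


def prefill_check_expected(enable_thinking: bool, last_role: str) -> None:
    """Hypothetical model of the behavior this issue expects:
    the prefill is rejected only when thinking is ENABLED."""
    if last_role == "assistant" and enable_thinking:
        raise RuntimeError(ERROR)


def rejects(check, enable_thinking: bool, last_role: str) -> bool:
    """Return True if the given check raises for these inputs."""
    try:
        check(enable_thinking, last_role)
    except RuntimeError:
        return True
    return False
```

With thinking disabled and a final assistant turn (the repro case), `rejects(prefill_check_reported, False, "assistant")` is true while `rejects(prefill_check_expected, False, "assistant")` is false, which matches the difference between the response and the expected output in this report.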

Repro

# Boot the server with Granite 3.2 8B and thinking disabled
./bin/llama-server -hf ibm-research/granite-3.2-8b-instruct-GGUF --jinja --reasoning-budget 0 --port 8081
# Send apply-template with a final assistant turn
curl http://localhost:8081/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}]}'

Response

{"error":{"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}}

Expected

{"prompt":"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: August 18, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}
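For checking a fix against this repro, the two response bodies in this report can be told apart mechanically. The helper below is a rough sketch: `prefill_rendered` is a hypothetical name, and it only assumes the two JSON shapes shown in this report (an `error` object on failure, a `prompt` string on success).

```python
import json


def prefill_rendered(body: str, assistant_text: str) -> bool:
    """Return True if an /apply-template response body contains a prompt
    that ends with the prefilled assistant text, False otherwise."""
    data = json.loads(body)
    # The failing response in this report carries an "error" object.
    if "error" in data:
        return False
    prompt = data.get("prompt")
    # The expected response carries a "prompt" string ending with the
    # unfinished assistant turn.
    return isinstance(prompt, str) and prompt.endswith(assistant_text)
```

Passing the body returned by the curl command in the repro together with "hi hi" should return False on an affected build and True once the expected prompt is returned.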

First Bad Commit

a056e53

Relevant log output

main: server is listening on http://127.0.0.1:8081 - starting the main loop
srv  update_slots: all slots are idle
got exception: {"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}
srv  log_server_r: request: POST /apply-template 127.0.0.1 500
