End-to-end inference test fails with --cache-type turbo4 on CPU

Description:

I attempted to run an end-to-end inference test using your code. I wrote a simple script to ask a basic question, and found that when using --cache-type turbo4, the model returns incorrect results.

Test case:

Question: "1+1等于几？只回答数字" (What is 1+1? Answer with only the number)

Expected answer: 2

Observed behavior:

The response contained 0 characters of content and 102 characters of reasoning

The reasoning output was garbled/nonsensical

Debug output:

text
[DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions
[DEBUG] Prompt: '1+1等于几？只回答数字'
[DEBUG] Waiting for response...
[DEBUG] Response: 0 chars content, 102 chars reasoning
[思考过程]:
----------------------------------------
<think>
嗯，用户，我现在需要回答的问题是“1等于几？”首先，我需要思考一下这个问题，用户可能是在数学题目的，或者数学题号，或者数学题，或者数学题，或者数学题，或者数学题，或者数学题题号，或者数学题
Additional context:
The NIAH test I attempted to run also failed with --cache-type turbo4. This occurs when running on CPU — the turbo4 cache type cannot be executed on CPU.

Environment:

Hardware: CPU (not GPU)

Cache type: turbo4 (fails)

Expected behavior:

The model should return the correct answer (2) for the simple arithmetic question

The turbo4 cache type should either work correctly on CPU or provide a clear error message that CPU is not supported

Potential issue:

--cache-type turbo4 may have GPU-specific dependencies or memory layout requirements that are not met on CPU

The fallback or error handling for unsupported hardware appears to produce silent corruption (garbled output) rather than a clear error



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

End-to-end inference test fails with --cache-type turbo4 on CPU #72

text
[DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions
[DEBUG] Prompt: '1+1等于几？只回答数字'
[DEBUG] Waiting for response...
[DEBUG] Response: 0 chars content, 102 chars reasoning
[思考过程]:

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

End-to-end inference test fails with --cache-type turbo4 on CPU #72

Description

text [DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions [DEBUG] Prompt: '1+1等于几？只回答数字' [DEBUG] Waiting for response... [DEBUG] Response: 0 chars content, 102 chars reasoning [思考过程]:

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

text
[DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions
[DEBUG] Prompt: '1+1等于几？只回答数字'
[DEBUG] Waiting for response...
[DEBUG] Response: 0 chars content, 102 chars reasoning
[思考过程]: