Description:
I attempted to run an end-to-end inference test using your code. I wrote a simple script to ask a basic question, and found that when using --cache-type turbo4, the model returns incorrect results.
Test case:
Question: "1+1等于几?只回答数字" (What is 1+1? Answer with only the number)
Expected answer: 2
Observed behavior:
The response contained 0 characters of content and 102 characters of reasoning
The reasoning output was garbled/nonsensical
Debug output:
text
[DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions
[DEBUG] Prompt: '1+1等于几?只回答数字'
[DEBUG] Waiting for response...
[DEBUG] Response: 0 chars content, 102 chars reasoning
[思考过程]:
嗯,用户,我现在需要回答的问题是“1等于几?”首先,我需要思考一下这个问题,用户可能是在数学题目的,或者数学题号,或者数学题,或者数学题,或者数学题,或者数学题,或者数学题题号,或者数学题
Additional context:
The NIAH test I attempted to run also failed with --cache-type turbo4. This occurs when running on CPU — the turbo4 cache type cannot be executed on CPU.
Environment:
Hardware: CPU (not GPU)
Cache type: turbo4 (fails)
Expected behavior:
The model should return the correct answer (2) for the simple arithmetic question
The turbo4 cache type should either work correctly on CPU or provide a clear error message that CPU is not supported
Potential issue:
--cache-type turbo4 may have GPU-specific dependencies or memory layout requirements that are not met on CPU
The fallback or error handling for unsupported hardware appears to produce silent corruption (garbled output) rather than a clear error
Description:
I attempted to run an end-to-end inference test using your code. I wrote a simple script to ask a basic question, and found that when using --cache-type turbo4, the model returns incorrect results.
Test case:
Question: "1+1等于几?只回答数字" (What is 1+1? Answer with only the number)
Expected answer: 2
Observed behavior:
The response contained 0 characters of content and 102 characters of reasoning
The reasoning output was garbled/nonsensical
Debug output:
text
嗯,用户,我现在需要回答的问题是“1等于几?”首先,我需要思考一下这个问题,用户可能是在数学题目的,或者数学题号,或者数学题,或者数学题,或者数学题,或者数学题,或者数学题题号,或者数学题 Additional context: The NIAH test I attempted to run also failed with --cache-type turbo4. This occurs when running on CPU — the turbo4 cache type cannot be executed on CPU.[DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions
[DEBUG] Prompt: '1+1等于几?只回答数字'
[DEBUG] Waiting for response...
[DEBUG] Response: 0 chars content, 102 chars reasoning
[思考过程]:
Environment:
Hardware: CPU (not GPU)
Cache type: turbo4 (fails)
Expected behavior:
The model should return the correct answer (2) for the simple arithmetic question
The turbo4 cache type should either work correctly on CPU or provide a clear error message that CPU is not supported
Potential issue:
--cache-type turbo4 may have GPU-specific dependencies or memory layout requirements that are not met on CPU
The fallback or error handling for unsupported hardware appears to produce silent corruption (garbled output) rather than a clear error