Skip to content

End-to-end inference test fails with --cache-type turbo4 on CPU #72

@MELANCHOLY828

Description

@MELANCHOLY828

Description:

I attempted to run an end-to-end inference test using your code. I wrote a simple script to ask a basic question, and found that when using --cache-type turbo4, the model returns incorrect results.

Test case:

Question: "1+1等于几?只回答数字" (What is 1+1? Answer with only the number)

Expected answer: 2

Observed behavior:

The response contained 0 characters of content and 102 characters of reasoning

The reasoning output was garbled/nonsensical

Debug output:

text
[DEBUG] Sending query to http://127.0.0.1:8094/v1/chat/completions
[DEBUG] Prompt: '1+1等于几?只回答数字'
[DEBUG] Waiting for response...
[DEBUG] Response: 0 chars content, 102 chars reasoning
[思考过程]:

嗯,用户,我现在需要回答的问题是“1等于几?”首先,我需要思考一下这个问题,用户可能是在数学题目的,或者数学题号,或者数学题,或者数学题,或者数学题,或者数学题,或者数学题题号,或者数学题 Additional context: The NIAH test I attempted to run also failed with --cache-type turbo4. This occurs when running on CPU — the turbo4 cache type cannot be executed on CPU.

Environment:

Hardware: CPU (not GPU)

Cache type: turbo4 (fails)

Expected behavior:

The model should return the correct answer (2) for the simple arithmetic question

The turbo4 cache type should either work correctly on CPU or provide a clear error message that CPU is not supported

Potential issue:

--cache-type turbo4 may have GPU-specific dependencies or memory layout requirements that are not met on CPU

The fallback or error handling for unsupported hardware appears to produce silent corruption (garbled output) rather than a clear error

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions