Description
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Pytorch version
2.6.0
Model
No response
Describe the bug
I am using the ExLlamaV2 base generator with generate_simple and the following parameters:
temperature = 0
top_p = 1
stop_token = tokenizer.eos_token_id
max_new_tokens = 3000
The model is the 4-bit GPTQ quantized version of Llama 3.3 70B.
I have tried the same model on vLLM and it works like a charm, with no repetition of the response or of parts of it.
This is my prompt:
Prompt = Use your reasoning and solve the following equation step by step: x* (sin(x) + x) = 0
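For context, this is roughly how I set things up and call the generator. It is a minimal sketch rather than my exact script: the model path is hypothetical, and the exact exllamav2 names (ExLlamaV2Config, ExLlamaV2BaseGenerator, ExLlamaV2Sampler.Settings, generate_simple and its stop_token keyword) are assumed from the library's standard examples and may differ slightly in the version I am running.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/path/to/Llama-3.3-70B-GPTQ-4bit"   # hypothetical path

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.0   # effectively greedy decoding
settings.top_p = 1.0

prompt = "Use your reasoning and solve the following equation step by step: x* (sin(x) + x) = 0"

output = generator.generate_simple(
    prompt,
    settings,
    3000,                                  # max new tokens
    stop_token = tokenizer.eos_token_id,   # stop at EOS
)
print(output)
```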
The response is strange. It looks like this:
Step 1: Break down the problem into smaller pieces.
Step 2: Since the right-hand side is 0, perhaps either x is zero or sin(x) is equal to -x.
...
Step 6: Hence the answer is 0.
Step 1: There are multiple ways to answer this problem; we should first break down the problem into more manageable pieces.
Step 2: Remember the right-hand side is 0, hence either x = 0 or sin(x) + x = 0.
Step 6: Therefore x can only be 0.
That was just one example, where the response is repeated only twice; sometimes it is repeated 4 times, and sometimes only part of the response is repeated.
Does anyone know a solution or a parameter that can prevent this?
I have tried tweaking temperature, stop_token, and repetition_penalty (roughly as sketched below), but no luck. The same model works with no problem in vLLM.
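The tweaks looked roughly like this. The values are only examples, and token_repetition_penalty is my assumption for the name of the repetition-penalty attribute on ExLlamaV2Sampler.Settings:

```python
# Variations tried on the sampler settings (example values, not an exhaustive list)
settings.temperature = 0.7                 # also tried non-zero temperatures
settings.token_repetition_penalty = 1.15   # assumed attribute name for the repetition penalty
# Different stop_token values were also passed to generate_simple, with no change in behavior.
```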
Note: Due to this repetition, the time it takes to respond to the prompt is also high; what should take 20s at most takes about 3 minutes.
Reproduction steps
Described above.
Expected behavior
Generate response with no repetition.
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.