Description
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Pytorch version
2.6.0
Model
No response
Describe the bug
I am using the ExLlamaV2 base generator with generate_simple and the following parameters:
temperature = 0
top_p = 1
stop_token = tokenizer.eos_token_id
max_new_tokens = 3000
The model is the 4-bit GPTQ quantized version of Llama 3.3 70B.
I have tried the same model on vLLM and it works like a charm, with no repetition of the response or of parts of it.
This is my prompt:
Prompt = Use your reasoning and solve the following equation step by step: x* (sin(x) + x) = 0
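For context, this is roughly how I set things up and call the generator. It is a minimal sketch rather than my exact script: the model path is hypothetical, and the exact exllamav2 names (ExLlamaV2Config, ExLlamaV2BaseGenerator, ExLlamaV2Sampler.Settings, generate_simple and its stop_token keyword) are assumed from the library's standard examples and may differ slightly in the version I am running.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/path/to/Llama-3.3-70B-GPTQ-4bit"   # hypothetical path

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.0   # effectively greedy decoding
settings.top_p = 1.0

prompt = "Use your reasoning and solve the following equation step by step: x* (sin(x) + x) = 0"

output = generator.generate_simple(
    prompt,
    settings,
    3000,                                  # max new tokens
    stop_token = tokenizer.eos_token_id,   # stop at EOS
)
print(output)
```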
The response is strange. It looks like this:
Step 1: Break down the problem into smaller pieces.
Step 2: Since the right-hand side is 0, perhaps either x is zero or sin(x) is equal to -x.
...
Step 6: Hence the answer is 0.
Step 1: There are multiple ways to answer this problem; we should first break down the problem into more manageable pieces.
Step 2: Remember the right-hand side is 0, hence either x = 0 or sin(x) + x = 0.
Step 6: Therefore x can only be 0.
That was just one example, where the response is repeated only twice; sometimes it is repeated 4 times, and sometimes only part of the response is repeated.
Does anyone know a solution or a parameter that can prevent this?
I have tried tweaking temperature, stop_token, and repetition_penalty (roughly as sketched below), but no luck. The same model works with no problem in vLLM.
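The tweaks looked roughly like this. The values are only examples, and token_repetition_penalty is my assumption for the name of the repetition-penalty attribute on ExLlamaV2Sampler.Settings:

```python
# Variations tried on the sampler settings (example values, not an exhaustive list)
settings.temperature = 0.7                 # also tried non-zero temperatures
settings.token_repetition_penalty = 1.15   # assumed attribute name for the repetition penalty
# Different stop_token values were also passed to generate_simple, with no change in behavior.
```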
Note: Due to this repetition, the time it takes to respond to the prompt is also high; what should take 20s at most takes about 3 minutes.
Reproduction steps
Described above.
Expected behavior
Generate response with no repetition.
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.