Description
OS
Linux
GPU Library
CUDA 12.x
Python version
3.10
Pytorch version
2.6.0+cu124
Model
https://huggingface.co/LatentWanderer/THUDM_GLM-4-32B-0414-6.5bpw-h8-exl2
Describe the bug
When running the model with examples/chat.py, it only produces nonsense output.
Reproduction steps
Run `python ./examples/chat.py -m THUDM_GLM-4-32B-0414-6.5bpw-h8-exl2 --mode glm` and give it any prompt.
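If it helps isolate the problem, the issue should also be reproducible without chat.py and without the GLM prompt template. Below is a minimal sketch against the dev-branch exllamav2 dynamic generator API as I understand it (the model path and gpu_split mirror my setup; adjust as needed):

```python
# Minimal repro sketch; bypasses chat.py and the glm chat template entirely.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("THUDM_GLM-4-32B-0414-6.5bpw-h8-exl2")  # model directory
model = ExLlamaV2(config)
model.load(gpu_split=[20, 20])  # same split as the chat.py run above
cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)

# paged=False may be needed here if flash-attn is not installed
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Raw completion with no chat template, to check whether the broken output
# is independent of the glm prompt format.
print(generator.generate(prompt="The capital of France is", max_new_tokens=32))
```

A raw completion like this should rule the prompt template in or out as the cause.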
Expected behavior
The model should produce a coherent reply to the prompt.
Logs
```
-- Model: THUDM_GLM-4-32B-0414-6.5bpw-h8-exl2
-- Options: ['gpu_split: 20,20']
-- Loading tokenizer...
-- Loading model...
-- Loading model...
-- Prompt format: glm
-- System prompt:
You are a helpful AI assistant.
User: Who are you
siegeseice
siegele e ebe ce e treats
cosea ceta seatste te pe ce se coast cops te taipse t se
t t te cop cap te taits taup seat taic ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta t ta ta ta ta
ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta_print ta ta ta ta taic ta ta ta ta_tr ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta tata ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta_t ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta_tr ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta ta tac ta ta ta ta ta ta ta ta ta tacer ta tata ta ta ta ta ta ta ta ta ta
```
Additional context
I pulled the latest dev branch and ran `pip install . --upgrade`
before trying to use the model. Changing the temperature or sampler settings does not seem to change anything.
The exl3 version works as expected; both quants were created from the same source files.
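For reference, this is roughly how the sampler claim above can be verified (a sketch reusing the `generator` from the snippet earlier; the `gen_settings` parameter and `Settings` attribute names are my reading of the current API). Forcing top_k = 1 makes decoding effectively greedy, so if the output is still garbage, sampling is not the cause:

```python
from exllamav2.generator import ExLlamaV2Sampler

# Effectively greedy decoding: top_k = 1 always selects the argmax token,
# removing any influence of temperature or other sampler settings.
settings = ExLlamaV2Sampler.Settings()
settings.top_k = 1
settings.temperature = 1.0
settings.token_repetition_penalty = 1.0

print(generator.generate(
    prompt="The capital of France is",
    max_new_tokens=32,
    gen_settings=settings,
))
```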
Acknowledgements
- I have looked for similar issues before submitting this one.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.