Support q2-k to q4-k #434


Open
wenhuach21 opened this issue Feb 12, 2025 · 5 comments
Comments

@wenhuach21
Contributor

Need to support double quantization (as used by q2_k to q4_k) in the algorithm part.
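For context on what "double quant" means here: the k-quant formats (q2_k through q4_k) in llama.cpp quantize the weights asymmetrically in small groups, then quantize the resulting per-group scales and mins a second time (6-bit for q4_k) under a single fp16 super-scale per super-block of 256 values. The PyTorch sketch below only illustrates that two-level scheme; the function name, grouping and rounding are illustrative, not the auto-round implementation.

import torch

# Minimal sketch of double (nested) quantization as used by k-quants such as q4_k:
# 4-bit asymmetric weight codes per group of 32, with the per-group scales/mins
# re-quantized to 6 bits under one fp16 super-scale per super-block of 8 groups.
# Illustrative only -- not the auto-round or llama.cpp implementation.
def double_quant_asym_sketch(tensor, bits=4, group_size=32, groups_per_super=8):
    qmax = 2 ** bits - 1
    w = tensor.reshape(-1, group_size)                         # assumes numel % 256 == 0
    wmin = w.min(dim=1, keepdim=True).values.clamp(max=0)      # per-group minimum (<= 0)
    scale = ((w.max(dim=1, keepdim=True).values - wmin) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round((w - wmin) / scale), 0, qmax)  # first level: 4-bit codes

    # second level: quantize the scales and mins themselves to 6 bits,
    # keeping only one fp16 d_scale / d_wmin_m per super-block of 8 groups
    s = scale.reshape(-1, groups_per_super)
    m = (-wmin).reshape(-1, groups_per_super)
    d_scale = (s.max(dim=1, keepdim=True).values / 63).clamp(min=1e-8)
    d_wmin_m = (m.max(dim=1, keepdim=True).values / 63).clamp(min=1e-8)
    q_scale = torch.clamp(torch.round(s / d_scale), 0, 63)     # 6-bit scale codes
    q_wmin_m = torch.clamp(torch.round(m / d_wmin_m), 0, 63)   # 6-bit min codes
    return q, q_scale, q_wmin_m, d_scale.half(), d_wmin_m.half()

Dequantization reverses both levels: scale = d_scale * q_scale, min = d_wmin_m * q_wmin_m, and each weight is reconstructed as scale * q - min.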

@jerryzh168

jerryzh168 commented Apr 11, 2025

Hi @wenhuach21 @n1ck-guo, does export for q4_k work right now? I tried to adapt it for torchao and then serve the result with vllm (vllm serve ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf --tokenizer microsoft/Phi-4-mini-instruct --device cuda -O3), but there seems to be a shape mismatch:

  File ".../llama.cpp/gguf-py/gguf/gguf_reader.py", line 364, in _build_tensors
    data = self._get(data_offs, item_type, item_count).reshape(np_dims),
ValueError: cannot reshape array of size 1536 into shape (3072,)

Can you help me take a look at https://gist.github.com/jerryzh168/fac8f8c8f89c65ef7cc3d76fdc74ba04#file-gistfile1-txt-L48? I am wondering if the argument list for ggml_quant is correct:

data = ggml_quant(
    float_data,
    data_qtype.name.lower(),
    scale,
    None,
    wmin_m=wmin_m,
    d_scale=d_scale,
    d_wmin_m=d_wmin_m)

float_data is the original floating-point data, and scale, wmin_m, d_scale and d_wmin_m are calculated with quant_tensor_asym_dq (auto-round/auto_round/data_type/int.py, line 77 in 37341f5):

def quant_tensor_asym_dq(tensor, bits=4, group_size=-1, v=0, min_scale=1.0, max_scale=1.0, scale_dtype=torch.float16,
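One thing that might help narrow the shape mismatch down, independently of the argument list: in the llama.cpp Q4_K layout every super-block of 256 weights packs to 144 bytes (2 + 2 bytes for the fp16 d/dmin, 12 bytes of packed 6-bit scales/mins, 128 bytes of 4-bit codes), so the expected byte size of a quantized tensor follows directly from its element count, and the 1536-vs-3072 error above says the array the reader built has half as many items as the shape it expects. A small check along these lines (the constants are from the Q4_K layout; the helper name is made up):

QK_K = 256              # super-block size used by the k-quants
Q4_K_BLOCK_BYTES = 144  # 2 (d) + 2 (dmin) + 12 (6-bit scales/mins) + 128 (4-bit codes)

def expected_q4_k_nbytes(n_elements: int) -> int:
    # size the packed buffer written for a Q4_K tensor should have
    assert n_elements % QK_K == 0, "Q4_K tensors must hold a multiple of 256 elements"
    return n_elements // QK_K * Q4_K_BLOCK_BYTES

# e.g. a 3072-element row should pack to 12 * 144 = 1728 bytes
print(expected_q4_k_nbytes(3072))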

@n1ck-guo
Contributor

(quoting @jerryzh168's comment above about the q4_k export and the shape-mismatch error)
Thank you for the report; we will check the related issues immediately.

@jerryzh168

@n1ck-guo if you want to repro the issue, here are the steps:

  1. create a conda env
  2. patch pytorch/ao#2042 ("Using autoround's implementation for gguf q4_k") into torchao
  3. install torchao from source: python setup.py develop
  4. use https://gist.github.com/jerryzh168/898b2d84c380fdd8d10ee97c5546af85 to upload the checkpoint
  5. patch #504 ("[not4land] temp change to convert torchao checkpoint to gguf")
  6. use https://gist.github.com/jerryzh168/25f6d2fd0687d1df1246c55706f061e7 to convert the model to gguf
  7. serve with vllm: vllm serve ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf --tokenizer microsoft/Phi-4-mini-instruct --device cuda -O3
     where ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf is the gguf file generated in step 6.

@n1ck-guo
Contributor

We have tested the q4_k_s export code. It works well for some other models, but for microsoft/Phi-4-mini-instruct the export fails with an error. This is because our code relies on the original export code from llama.cpp (convert_hf_to_gguf.py), which does not seem to work well with Phi-4.
We will also try to reproduce the problem with the steps you provided and track down the cause.
Thank you again for the report; we will do our best to resolve it.

@n1ck-guo
Contributor

@jerryzh168 Thank you for waiting. This issue seems to be caused by the llama.cpp version. Could you please try PR #524 together with the latest gguf-py?
You can install the latest gguf-py with the following command:
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp/gguf-py && pip install .
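If the conversion goes through with the newer gguf-py, one quick way to confirm the file now deserializes cleanly is to walk it with gguf-py's GGUFReader (the same reader that raised the reshape error above); the path below is just the file from the repro steps:

from gguf import GGUFReader

# open the converted checkpoint and list every tensor's logical shape and quant type
reader = GGUFReader("./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf")
for t in reader.tensors:
    print(t.name, list(t.shape), t.tensor_type.name)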
