-
To judge whether or not this method would be worthwhile, please do a comparison vs. llama.cpp/ggml quantization methods like q8_0 (1 scale per 32 values) instead of "int8" (what I assume to be 1 scale per tensor). Generally speaking, I think that the precision of q8_0 and q6_K is sufficiently high and that efforts in that BPW range should focus primarily on optimizing performance.
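For concreteness, here is a rough numpy sketch of the difference I mean between per-tensor int8 and block-wise (q8_0-style) scaling; it is illustrative only, not the actual ggml block layout or kernels:

```python
# Illustrative comparison: one scale per tensor vs. one scale per block of 32
# values (roughly the q8_0 idea). Not the real ggml implementation.
import numpy as np

def quantize_int8_per_tensor(x):
    """One scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def quantize_q8_0_like(x, block=32):
    """One scale per block of `block` values."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero in all-zero blocks
    q = np.round(x / scales).astype(np.int8)
    return q, scales.squeeze(1)

# Weights whose magnitude varies along the tensor, where block-wise scales help.
x = np.random.randn(4096).astype(np.float32) * np.linspace(0.1, 2.0, 4096, dtype=np.float32)

q_t, s_t = quantize_int8_per_tensor(x)
q_b, s_b = quantize_q8_0_like(x)

err_t = np.abs(x - q_t * s_t).mean()
err_b = np.abs(x.reshape(-1, 32) - q_b * s_b[:, None]).mean()
print(f"per-tensor int8 MAE: {err_t:.5f}   block-wise q8_0-like MAE: {err_b:.5f}")
```

The point being that q8_0 already recovers most of the precision lost to outliers within a tensor, so any comparison should be against that baseline rather than plain per-tensor int8.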
-
Have you tested inference speed vs Q8_0 on the same CPU? I think this is the most important part (aside from model size). Do you have numbers for this? llama.cpp is generally faster than PyTorch on CPU, so you cannot properly compare the two without a llama.cpp implementation. There is already a "lossless" compression algorithm for Efficient GPU Inference (DFloat11), with a compression ratio of 1.42x (assuming I am calculating this correctly). Its main issue is decompression time, so it is not ideal for use on CPU.
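Back-of-the-envelope on that ratio, assuming DFloat11 averages roughly 11.25 bits per BF16 weight (that per-weight figure is my assumption, inferred from its reported ~70% of original size):

```python
# Rough check of the 1.42x figure. The 11.25 bits/weight average is an assumption.
bf16_bits = 16.0
dfloat11_bits = 11.25
print(f"compression ratio ~ {bf16_bits / dfloat11_bits:.2f}x")  # ~1.42x
```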
-
Hello llama.cpp team and community,
I've developed and benchmarked a new 8-bit quantization method, Product Quantization with Residuals (PQ-R), that shows significant quality improvements over standard INT8 methods on CPU.
My goal was to find a practical way to achieve high-fidelity 8-bit quantization without the extreme performance cost of methods like full K-Means. The results are very promising.
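To give a concrete picture of the general approach, here is a minimal numpy sketch of the generic product-quantization-plus-residual idea under assumed parameters (8-bit codebooks, sub-vectors of length 2). This is not my proprietary pqr_core.py, which differs in important details; it only illustrates the two-stage encode/decode structure:

```python
# Generic PQ + residual sketch (illustrative; NOT the proprietary pqr_core.py).
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means over the rows of x; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # squared distances
        assign = d.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(0)  # keep old centroid if cluster is empty
    return centroids, assign

def pq_residual_encode(w, sub_dim=2, k=256, seed=0):
    """Stage 1: PQ each sub-vector to an 8-bit code. Stage 2: PQ the residual error."""
    rows, cols = w.shape
    assert cols % sub_dim == 0
    codes, res_codes, books, res_books = [], [], [], []
    for s in range(cols // sub_dim):
        block = w[:, s * sub_dim:(s + 1) * sub_dim]
        cb, idx = kmeans(block, k, seed=seed)            # stage 1: coarse codebook
        residual = block - cb[idx]
        rcb, ridx = kmeans(residual, k, seed=seed + 1)   # stage 2: residual codebook
        codes.append(idx.astype(np.uint8))
        res_codes.append(ridx.astype(np.uint8))
        books.append(cb)
        res_books.append(rcb)
    return codes, res_codes, books, res_books

def pq_residual_decode(codes, res_codes, books, res_books):
    """Reconstruction is just two table lookups per sub-vector, summed."""
    blocks = [books[s][codes[s]] + res_books[s][res_codes[s]] for s in range(len(codes))]
    return np.concatenate(blocks, axis=1)

w = np.random.randn(512, 64).astype(np.float32)
codes, rcodes, books, rbooks = pq_residual_encode(w)
w_hat = pq_residual_decode(codes, rcodes, books, rbooks)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

With sub-vectors of length 2 and two 8-bit code stages, storage works out to roughly 8 bits per weight plus codebook overhead, i.e. the same ballpark as the 8-bit budget discussed above, while reconstruction stays a pair of table lookups rather than a full K-Means pass at inference time.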
I've published a full technical write-up with all my benchmark results, methodology, and graphs on Medium:
[Medium article]
My full testing suite is also available for review on GitHub:
[GitHub repository]
The core pqr_core.py implementation is currently proprietary, but I believe this technique could be a powerful addition to the quantization methods available in llama.cpp. I would be very open to discussing a potential integration or collaboration. Is this something that could be of interest to the project?
Thank you for your time and for the incredible work on llama.cpp.