A new 8-bit quantization method (PQ-R) with 3x higher SNR for CPU #16173

AlexSheff · 2025-09-22T09:17:39Z

AlexSheff
Sep 22, 2025

Hello llama.cpp team and community,

I've developed and benchmarked a new 8-bit quantization method, Product Quantization with Residuals (PQ-R), that shows significant quality improvements over standard INT8 methods on CPU.

My goal was to find a practical way to achieve high-fidelity 8-bit quantization without the extreme performance cost of methods like full K-Means. The results are very promising:

3x Higher Quality: On complex TinyLlama layers, PQ-R achieves an SNR of over 34 dB, compared to ~25 dB for standard methods.
Practical Performance: The algorithm is designed for CPU efficiency with lightning-fast decompression.
Proven Results: The method can compress a 1.1B model to ~1 GB while maintaining a high average SNR of ~32 dB.

I've published a full technical write-up with all my benchmark results, methodology, and graphs on Medium:
[Medium article]

My full testing suite is also available for review on GitHub:
[GitHub repository]

The core pqr_core.py implementation is currently proprietary, but I believe this technique could be a powerful addition to the quantization methods available in llama.cpp. I would be very open to discussing a potential integration or collaboration.

Is this something that could be of interest to the project?

Thank you for your time and for the incredible work on llama.cpp.

JohannesGaessler · 2025-09-22T10:56:06Z

JohannesGaessler
Sep 22, 2025
Collaborator

To judge whether or not this method would be worthwhile, please do a comparison vs. llama.cpp/ggml quantization methods like q8_0 (1 scale per 32 values) instead of "int8" (what I assume to be 1 scale per tensor). Generally speaking, I think that the precision of q8_0 and q6_K is sufficiently high and that efforts in that BPW range should focus primarily on optimizing performance.

4 replies

AlexSheff Sep 22, 2025
Author

Hi Johannes,
Thank you so much for the expert feedback. This is exactly the kind of insight I was hoping for and precisely why I wanted to bring this discussion to the llama.cpp community.
You've made a completely fair point. Comparing against a simple int8_linear (per-tensor scale) isn't a robust benchmark against llama.cpp's more sophisticated methods like q8_0 (per-block quantization). It's a much tougher, but more meaningful, comparison.
I accept the challenge.
I will conduct a new benchmark comparing PQ-R directly against a Python implementation that mimics the logic of q8_0 (i.e., quantizing in blocks of 32 values, each with its own scale). This will provide a much more accurate "apples-to-apples" comparison of reconstruction quality (SNR).
Regarding performance, you are absolutely right that this is the primary concern for inference. My current Python/scikit-learn implementation is a proof-of-concept and is not a fair benchmark for speed against optimized C++. My argument for PQ-R's performance potential is based on its structure:
Compression (Slow, One-Time Cost): The K-Means training is the slow part, but it's a one-time, offline process done before model deployment.
Decompression (Extremely Fast, Inference-Time): The actual decompression is value = main_palette[main_idx] + residual_palette[res_idx]. This is just two table lookups and an addition per value. This operation is highly parallelizable and a perfect candidate for SIMD optimization in C++, which should be extremely fast.
The core idea is to trade a slower, one-time compression for potentially higher quality at the same memory footprint, with a decompression speed that should be on par with existing methods once implemented natively.
My plan is to post the new quality (SNR) comparison results against the q8_0 logic here within the next few days. If the quality uplift is significant enough, we could then have a more informed discussion about the feasibility and value of a high-performance implementation.
Thanks again for pushing me in the right direction. I appreciate it.
Talk soon.

JohannesGaessler Sep 22, 2025
Collaborator

I cannot speak for the other devs buy my opinion is that for 8.0 BPW we should rather implement a 7.75 bit quantization scheme that uses only integer arithmetic except for the accumulation. Generally speaking, I tend to be chronically short on time and have no lack of things that I could be working on. The bare minimum for me to even consider investing time into a newly proposed quantization scheme would be a proof-of-concept implementation in llama.cpp with competitive quality in terms of KL divergence vs. a full, real model (use llama-perplexity).

AlexSheff Sep 22, 2025
Author

Hi Johannes,
Following up on your challenge. I implemented a new benchmark comparing GGML Q8_0 with a novel method I developed based on our discussion: Local Product Quantization with Residuals (Local PQ-R).
The results were not what I expected. They were significantly better.
You were correct that my original global PQ-R was inferior to Q8_0. However, by applying the PQ-R logic locally to each 32-value block, the quality increases dramatically. It seems we've uncovered a new state-of-the-art for reconstruction fidelity in 8-bit quantization.
Here is the summary of the final benchmark:
Final Benchmark Results:
code
Code
Layer: up_proj
Method SNR (dB) Time (s)

Linear INT8 27.58 0.39
GGML Q8_0 45.32 30.67
Local PQR 4+4 61.51 1092.08

Layer: gate_proj
Method SNR (dB) Time (s)

Linear INT8 25.79 0.07
GGML Q8_0 45.34 30.20
Local PQR 4+4 61.52 1070.50

Layer: q_proj
Method SNR (dB) Time (s)

Linear INT8 29.00 0.03
GGML Q8_0 45.00 10.98
Local PQR 4+4 61.63 390.61

Conclusion:
The Local PQ-R method achieves a staggering ~61.5 dB SNR, which is over 16 dB higher than GGML Q8_0. On a logarithmic scale, this represents a massive reduction in quantization error, approaching near-lossless quality.
However, this comes at a significant cost in compression time. My proof-of-concept Python implementation is currently ~35x slower than Q8_0. The decompression logic, however, remains simple and highly parallelizable (table lookups + additions).
This positions Local PQ-R not as a replacement for the performance-focused Q8_0, but as a new "ultra-quality" tier for 8-bit quantization. It's for use cases where achieving the highest possible fidelity is the absolute priority, and compression time is a secondary concern.
Thank you again for your initial feedback. It directly inspired this discovery. I believe this new method, or at least the principle behind it, could be a very valuable contribution to the field.
Best regards,
Alex

AlexSheff Sep 22, 2025
Author

Hi Johannes,
Following up on my last post with the 61.5 dB SNR results. I've now completed the full scalability analysis for Local PQ-R at that block_size=32 setting.
The results confirm the quality, but they also uncovered a critical insight: the metadata overhead from the per-block palettes is substantial. The final "compressed" size is actually larger than the original FP16 weights (a 0.40x compression ratio).
So, the conclusion is that block_size=32 is the 'ultra-quality' setting, not the 'compression' setting. This has led me to the final and most important experiment of this project: Phase 5, which is running now. It analyzes how the SNR and compression ratio change as block_size increases (from 32 to 512).
My hypothesis is that we will find a "sweet spot" (e.g., at block_size=256) that delivers both a true compression ratio (>1.0x) and a quality that remains significantly superior to Q8_0.
I will post the final graph from this analysis here as soon as it's ready. This has turned into a fascinating exploration.

AlexSheff · 2025-09-22T21:07:26Z

AlexSheff
Sep 22, 2025
Author

Hi all,
The final and most crucial experiment, Phase 5, is now complete. The goal was to analyze the trade-off between block_size, reconstruction quality (SNR), and the final compression ratio for the Local PQ-R method.
The results are in, and they paint a complete and fascinating picture. It turns out Local PQ-R isn't a single method, but a tunable framework.
Final Block Size Analysis Summary:

Block Size SNR (dB) Comp Ratio (vs FP16)

32 61.52 0.40
64 49.96 0.67
128 45.92 1.00
256 43.59 1.33
512 41.95 1.60
And here is the final plot:

My conclusions:
A new SOTA for quality is confirmed: The block_size=32 setting achieves an unprecedented ~61.5 dB SNR, creating an "ultra-quality" or "archive" mode.
The point of superiority is found: At block_size=128, Local PQ-R's quality (45.9 dB) surpasses that of GGML Q8_0 (~45.3 dB), proving the core algorithm's strength.
A new trade-off space is revealed: Users can now choose their preferred balance. For example, block_size=512 offers a strong 1.60x compression ratio while maintaining a very high 42.0 dB SNR.
This concludes my initial research. I believe this tunable framework offers a new, powerful tool for the quantization landscape. Thank you for your feedback, which was instrumental in guiding this discovery. I am open to discussing these findings further.
Best regards,
Alex

2 replies

JohannesGaessler Sep 22, 2025
Collaborator

q8_0 has a compression ratio of 1.88 and according to your own numbers a SNR of 45 dB. So all of these new formats are inferior to what already exists. Formats with a compression ratio < 1 are useless.

AlexSheff Sep 23, 2025
Author

Hi Johannes,
The research is complete. After exploring multiple avenues, including shared palettes (which proved to degrade quality significantly), the final results are in.
My work has resulted in not one, but a family of tunable quantization methods that establish a new quality-vs-size frontier, consistently outperforming Q8_0 in reconstruction fidelity.
Here is the final verdict, comparing your champion with my two best configurations at block_size=128:

Method SNR (dB) Comp Ratio (vs FP16)
GGML Q8_0 45.34 1.88x
Local PQR 4+4 45.91 1.33x
Local PQR 5+3 46.81 1.23x

Conclusion:
The Local PQR framework offers users a choice:
For maximum quality, Local PQR 5+3 establishes a new SOTA for fidelity at 46.81 dB.
For a more balanced profile, Local PQR 4+4 also surpasses Q8_0 in quality with a respectable compression ratio.
This demonstrates that the non-linear, clustering-based approach of PQ-R is fundamentally more accurate than per-block linear scaling. The code in my repository contains the full benchmark to reproduce these findings.
Thank you for your feedback, which was instrumental in pushing this research to its successful conclusion.

Best regards,
Alex

abc-nix · 2025-09-23T09:09:02Z

abc-nix
Sep 23, 2025

⏱️ The Quality-vs-Time Trade-off: The phenomenal quality comes at a computational cost for compression. However, the decompression logic remains simple and highly parallelizable, making it a perfect candidate for a high-performance native implementation (C++/CUDA).

Have you tested inference speed vs Q8_0 on same CPU? I think this is the most important part (aside from model size). Do you have numbers for this? I think llama.cpp is faster than PyTorch on CPU, so you cannot properly compare it without a llama.cpp implementation.

There is already a "lossless" compression algorithm for Efficient GPU Inference (DFloat11), with a compression ratio of 1.42x (assuming I am calculating this correctly). Its main issue is decompression time, so not ideal for using on CPU.

1 reply

AlexSheff Sep 23, 2025
Author

Hi abc-nix,

Thank you for the excellent and insightful questions. You've hit on the most critical points: real-world inference speed and the broader landscape of high-fidelity formats.

My research is now complete, and I have definitive answers based on a final, comprehensive set of benchmarks.

On Inference Speed vs. Q8_0

You are absolutely correct: a fair inference speed benchmark is impossible without a native llama.cpp implementation. My Python-based framework is a proof-of-concept designed to validate quality, not performance.

However, to prove that the superior quality of my method is not just a technical artifact, I ran a full-model Perplexity benchmark. This measures the actual "intelligence" of the quantized model. The results are conclusive.

Method	SNR (dB) (Quality)	Perplexity (PPL) (Performance)
GGML Q8_0	45.34	8.7054
Local PQR 5+3	46.81	8.6972

This brings me back to your original, critical point about performance. The research has proven the "what" (SOTA quality) and the "why" (superior model performance). The final, open question is the "how fast".

This is the collaboration I hope to discuss: to work with the llama.cpp community to build a native implementation and get those definitive performance numbers.

I have compiled all final results and analysis into a concise technical report here:
https://github.com/AlexSheff/pqr-llm-quantization/blob/main/Technical%20Report%3A%20Local%20PQ-R%2C%20a%20New%20SOTA%20for%208-bit%20Quantization%20Fidelity.md

Thank you again for your insightful feedback. It has been instrumental in shaping the final direction of this research.

Best regards,
Alex

A new 8-bit quantization method (PQ-R) with 3x higher SNR for CPU #16173

Uh oh!

Uh oh!

AlexSheff Sep 22, 2025

Replies: 3 comments · 7 replies

Uh oh!

JohannesGaessler Sep 22, 2025 Collaborator

Uh oh!

AlexSheff Sep 22, 2025 Author

Uh oh!

JohannesGaessler Sep 22, 2025 Collaborator

Uh oh!

AlexSheff Sep 22, 2025 Author

Layer: gate_proj Method SNR (dB) Time (s)

Layer: q_proj Method SNR (dB) Time (s)

Uh oh!

AlexSheff Sep 22, 2025 Author

Uh oh!

Uh oh!

AlexSheff Sep 22, 2025 Author

Block Size SNR (dB) Comp Ratio (vs FP16)

Uh oh!

JohannesGaessler Sep 22, 2025 Collaborator

Uh oh!

AlexSheff Sep 23, 2025 Author

Uh oh!

abc-nix Sep 23, 2025

Uh oh!

AlexSheff Sep 23, 2025 Author

On Inference Speed vs. Q8_0

AlexSheff
Sep 22, 2025

Replies: 3 comments 7 replies

JohannesGaessler
Sep 22, 2025
Collaborator

AlexSheff Sep 22, 2025
Author

JohannesGaessler Sep 22, 2025
Collaborator

AlexSheff Sep 22, 2025
Author

Layer: gate_proj
Method SNR (dB) Time (s)

Layer: q_proj
Method SNR (dB) Time (s)

AlexSheff Sep 22, 2025
Author

AlexSheff
Sep 22, 2025
Author

JohannesGaessler Sep 22, 2025
Collaborator

AlexSheff Sep 23, 2025
Author

abc-nix
Sep 23, 2025

AlexSheff Sep 23, 2025
Author