Compression instead of Quantization? #14784
-
The compression idea comes around at least once or twice a year. It's not viable because the weights (at around 8 bits/weight) are essentially at Shannon capacity (maximum entropy) already. That means if you try to compress an 8-bit quant with an entropy compressor such as LZ you will see little to no compression. By information-theoretic arguments there is essentially nothing to be gained from an entropy compressor once the weights are at maximum entropy (dropping the tech lingo: the bits in the weights don't have any redundancy left to eliminate, so any form of lossless source coding is a predetermined bust).

Quantization is a form of lossy compression, and there are Avogadro's number of different schemes out there which try to optimize it against some arbitrary "goodness" metric. I don't really trust any of them, since most go after a global L2 minimization of quantization noise, which may or may not have anything to do with the actual performance of any given model. I also have zero confidence in perplexity as a robust optimization metric.

When I optimize my hybrid layer quants I base the optimization on testing whether the model can actually solve prompts, i.e. closed loop. I have a bunch of hybrid quants on my HF page https://huggingface.co/steampunque where I tune the layer quants on a model-by-model basis. With this technique I can achieve a significant reduction in model size compared to homogeneous layer quants. It takes a lot of patience on some models, which behave very squirrelly / non-intuitively in terms of losing knowledge or reasoning as a function of the per-layer quants. I think there is a lot more that can be done by factoring some information-theory techniques into the mix, which I haven't got around to yet but is on my todo list. It's a theoretically hard problem because quantization is nonlinear, and how different models respond to nonlinear distortion as a function of layer is not a straightforward problem to analyze, nor is there any obvious optimization metric to use.
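If anyone wants to sanity-check the entropy argument themselves, here is a minimal sketch (Python standard library only, with a hypothetical `model-Q8_0.gguf` file name) that streams an already-quantized GGUF file through a general-purpose entropy coder and reports the ratio:

```python
# Run an already-quantized GGUF file through zlib and see how little it shrinks.
# Path below is hypothetical; any Q8_0 / Q4_K model file will do.
import zlib

path = "model-Q8_0.gguf"  # hypothetical file name
chunk_size = 1 << 20      # 1 MiB chunks keep memory use flat

raw_size = 0
compressed_size = 0
comp = zlib.compressobj(level=9)

with open(path, "rb") as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        raw_size += len(chunk)
        compressed_size += len(comp.compress(chunk))
compressed_size += len(comp.flush())

print(f"raw: {raw_size / 1e9:.2f} GB, zlib: {compressed_size / 1e9:.2f} GB, "
      f"ratio: {compressed_size / raw_size:.3f}")
# Expect a ratio close to 1.0: most of the small gain comes from metadata and
# any fp16/fp32 tensors that were left unquantized, not from the quantized weights.
```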
AI is not in its infancy; I would put it in its late teens as of today, i.e. right on the verge of becoming very competent. Newton did his gravity work at 18/19, so I would not underestimate the potential of a high-powered teen. The "Attention Is All You Need" paper came out in 2017, which is ancient history on today's exponential rate of technology advancement. A few new ideas are trickling out here and there, but for the most part today's SOTA models are extremely powerful and quite mature across many different domains, even the open-weight ones.
-
This is completely missing my point. You're talking about lossless compression; I'm talking about lossy compression. A PNG is already at Shannon capacity after compressing a BMP (which usually achieves very low compression ratios), yet JPEG compression can go much, much further. At worst JPEG gets to 0.1x of the PNG size, and it can compress much further depending on how much quality (accuracy) you're willing to sacrifice. The question is not whether an FFT can lossy-compress the data (it definitely can), the question is whether the results are still reasonable after doing so, which, from what you're telling me, hasn't been tried. An FFT can even compress random noise. It does a terrible job of it, but it can do it. The same concept applies here.
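For what it's worth, the transform-domain idea is easy to prototype. Here is a minimal sketch, assuming numpy only and a random matrix standing in for a real weight tensor (not any particular codec): keep the largest 10% of the 2-D FFT coefficients and measure the reconstruction error.

```python
# Lossy transform-domain compression sketch: keep only the largest FFT
# coefficients of a "weight" matrix and check the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)  # stand-in for real weights

F = np.fft.fft2(W)
keep = 0.10                                       # fraction of coefficients retained
threshold = np.quantile(np.abs(F), 1.0 - keep)
F_sparse = np.where(np.abs(F) >= threshold, F, 0)  # zero out the small coefficients

W_hat = np.fft.ifft2(F_sparse).real
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"kept {keep:.0%} of coefficients, relative L2 error {rel_err:.3f}")
# Note: a real codec would also have to store the indices of the kept coefficients.
# For i.i.d. Gaussian "weights" the spectrum is flat, so the error stays large; the
# open question is whether real LLM weight matrices have enough spectral structure
# to reach an acceptable error at a useful keep ratio.
```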
-
Hi!
I come from a graphics development background. I'm still learning about LLMs and AI in general.
One thing that struck me is that llama-cpp doesn't seem to take advantage of compression schemes outside of quantization methods:
The idea is to fit big models in a reasonable amount of VRAM while losing some precision; quantization methods already prove there is room for acceptable loss. Also, lower VRAM usage means less data to move per token, and with that higher performance (unless ALU bound) and lower power consumption.
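A rough back-of-envelope for that bandwidth point (assumed numbers only: a 70B-parameter model, ~1000 GB/s of memory bandwidth, purely memory-bound decode where every weight byte is read once per generated token, and approximate bits-per-weight figures for the quant formats):

```python
# Upper-bound token rate for memory-bound decode: bandwidth / weight size.
params = 70e9            # assumed 70B-parameter model
bandwidth_gbs = 1000.0   # assumed GPU memory bandwidth in GB/s

for name, bits_per_weight in [("fp16", 16.0), ("q8_0", 8.5), ("q4_K", 4.5)]:
    size_gb = params * bits_per_weight / 8 / 1e9
    tok_per_s = bandwidth_gbs / size_gb
    print(f"{name}: ~{size_gb:.0f} GB of weights, ~{tok_per_s:.1f} tok/s upper bound")
```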
Furthermore, quantization could be used together with compression: K-quants store similar numbers together, which should significantly improve the accuracy of BC compression.
Am I the first to propose these things? Has anyone tried?
I realize AI is in its infancy, and these optimizations, which I'd consider basic in my own field, may simply not have been tried (at least in the open-source AI space) because the field is too novel.