Compression instead of Quantization? #14784
-
The compression idea comes around at least once or twice a year. It's not viable because the weights (at around 8 bits/weight) are essentially at Shannon capacity (maximum entropy) already. That means if you try to compress an 8-bit quant with an entropy compressor such as LZ you will see little to no compression. By information-theoretic arguments there is essentially nothing to be gained from an entropy compressor once the weights are at maximum entropy (dropping the tech lingo: the bits in the weights don't have any redundancy left to eliminate, so any form of lossless source coding is a predetermined bust).

Quantization is a form of lossy compression, and there are Avogadro's number of different schemes out there which try to optimize it against some arbitrary "goodness" metric. I don't really trust any of them, since most go after a global L2 minimization of quantization noise, which may or may not have anything to do with the actual performance of any given model. I also have zero confidence in perplexity as a robust optimization metric.

When I optimize my hybrid layer quants I base the optimization on testing whether the model can actually solve prompts, i.e. closed loop. I have a bunch of hybrid quants on my HF page https://huggingface.co/steampunque where I tune the layer quants on a model-by-model basis. With this technique I can achieve a significant reduction in model size compared to homogeneous layer quants. It takes a lot of patience on some models, which behave very squirrelly / non-intuitively in terms of losing knowledge or reasoning as a function of the per-layer quants. I think there is a lot more that can be done by factoring some information-theory techniques into the mix, which I haven't got around to yet but is on my todo list. It's a theoretically hard problem because quantization is nonlinear, and how different models respond to nonlinear distortion as a function of layer is not a straightforward problem to analyze, nor is there any obvious optimization metric to use.
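If anyone wants to sanity-check the entropy argument themselves, here is a minimal sketch (Python standard library only, with a hypothetical `model-Q8_0.gguf` file name) that streams an already-quantized GGUF file through a general-purpose entropy coder and reports the ratio:

```python
# Run an already-quantized GGUF file through zlib and see how little it shrinks.
# Path below is hypothetical; any Q8_0 / Q4_K model file will do.
import zlib

path = "model-Q8_0.gguf"  # hypothetical file name
chunk_size = 1 << 20      # 1 MiB chunks keep memory use flat

raw_size = 0
compressed_size = 0
comp = zlib.compressobj(level=9)

with open(path, "rb") as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        raw_size += len(chunk)
        compressed_size += len(comp.compress(chunk))
compressed_size += len(comp.flush())

print(f"raw: {raw_size / 1e9:.2f} GB, zlib: {compressed_size / 1e9:.2f} GB, "
      f"ratio: {compressed_size / raw_size:.3f}")
# Expect a ratio close to 1.0: most of the small gain comes from metadata and
# any fp16/fp32 tensors that were left unquantized, not from the quantized weights.
```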
AI is not in its infancy; I would put it in its late teens as of today, i.e. right on the verge of becoming very competent. Newton did his gravity work at 18/19, so I would not underestimate the potential of a high-powered teen. The "Attention Is All You Need" paper came out in 2017, which is ancient history on today's exponential rate of technology advancement. A few new ideas are trickling out here and there, but for the most part today's SOTA models are extremely powerful and quite mature across many different domains, even the open-weight ones.
-
This is completely missing my point. You're talking about lossless compression; I'm talking about lossy compression. A PNG is already at Shannon capacity after compressing a BMP (which usually achieves very low compression ratios), yet JPEG compression can go much, much further. At worst JPEG gets to 0.1x of the PNG size, and it can compress much further depending on how much quality (accuracy) you're willing to sacrifice. The question is not whether an FFT can lossy-compress the data (it definitely can), the question is whether the results are still reasonable after doing so, which, from what you're telling me, hasn't been tried. An FFT can even compress random noise. It does a terrible job of it, but it can do it. The same concept applies here.
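For what it's worth, the transform-domain idea is easy to prototype. Here is a minimal sketch, assuming numpy only and a random matrix standing in for a real weight tensor (not any particular codec): keep the largest 10% of the 2-D FFT coefficients and measure the reconstruction error.

```python
# Lossy transform-domain compression sketch: keep only the largest FFT
# coefficients of a "weight" matrix and check the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)  # stand-in for real weights

F = np.fft.fft2(W)
keep = 0.10                                       # fraction of coefficients retained
threshold = np.quantile(np.abs(F), 1.0 - keep)
F_sparse = np.where(np.abs(F) >= threshold, F, 0)  # zero out the small coefficients

W_hat = np.fft.ifft2(F_sparse).real
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"kept {keep:.0%} of coefficients, relative L2 error {rel_err:.3f}")
# Note: a real codec would also have to store the indices of the kept coefficients.
# For i.i.d. Gaussian "weights" the spectrum is flat, so the error stays large; the
# open question is whether real LLM weight matrices have enough spectral structure
# to reach an acceptable error at a useful keep ratio.
```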
-
Hi!
I come from a graphics development background. I'm still learning about LLMs and AI in general.
One thing that struck me is that llama-cpp doesn't seem to take advantage of compression schemes outside of quantization methods:
The idea is to fit big models in a reasonable amount of VRAM while losing some precision; quantization methods already prove there is room for acceptable loss. Also, lower VRAM usage means less data to move per token, and with that higher performance (unless ALU bound) and lower power consumption.
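A rough back-of-envelope for that bandwidth point (assumed numbers only: a 70B-parameter model, ~1000 GB/s of memory bandwidth, purely memory-bound decode where every weight byte is read once per generated token, and approximate bits-per-weight figures for the quant formats):

```python
# Upper-bound token rate for memory-bound decode: bandwidth / weight size.
params = 70e9            # assumed 70B-parameter model
bandwidth_gbs = 1000.0   # assumed GPU memory bandwidth in GB/s

for name, bits_per_weight in [("fp16", 16.0), ("q8_0", 8.5), ("q4_K", 4.5)]:
    size_gb = params * bits_per_weight / 8 / 1e9
    tok_per_s = bandwidth_gbs / size_gb
    print(f"{name}: ~{size_gb:.0f} GB of weights, ~{tok_per_s:.1f} tok/s upper bound")
```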
Furthermore, quantization could be used together with compression: K-quants store similar numbers together, which should significantly improve the accuracy of BC compression.
Am I the first to propose these things? Has anyone tried?
I realize AI is in its infancy, and these optimizations, which I'd consider basic in my own field, may simply not have been tried (at least in the open-source AI space) because the field is too novel.