bartowski 
posted an update Aug 15, 2024
As some of you know, I try to convert models to either FP32 or BF16, depending on their size, before doing imatrix and quantization

Today I decided to see if that matters, and the results have me.. for lack of a better word, perplexed

My setup:

Mistral Nemo Instruct 2407
- convert to FP32, calculate imatrix, quantize to Q8_0 and Q4_K_M
- convert to FP16, calculate imatrix, quantize to Q8_0 and Q4_K_M

I calculated the kld base from the FP32 model:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-f32.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld -ngl 35 -fa -sm row

then calculated the divergence itself for each like so:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-Q8_0.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld --kl-divergence -ngl 50 -fa -sm row

Q4_K_M from FP16 and FP32 were similar, trading blows across statistics, which is odd since I expected FP32 to be strictly better, but it's not

Q8_0 is where things get weird. Despite each file being a slightly different size, and the sha256sum of course being different, they each get *completely identical* scores, down to 6 decimal places of precision on the statistics.

How is this possible? Is there something I don't understand about llama.cpp that makes it always convert to fp16 before it does quantization? Am I wasting time using FP32/BF16??

Here's a table showing the main results:

| Metric | Q4_K_M from FP32 | Q4_K_M from FP16 | Q8_0 from FP32 | Q8_0 from FP16 |
| --- | --- | --- | --- | --- |
| Mean PPL(Q) | 6.445459 ± 0.039767 | 6.445574 ± 0.039771 | 6.344932 ± 0.038989 | 6.344932 ± 0.038989 |
| Mean PPL(base) | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.62% | 99.62% | 99.98% | 99.98% |
| Mean PPL(Q)/PPL(base) | 1.017104 ± 0.000548 | 1.017122 ± 0.000549 | 1.001241 ± 0.000131 | 1.001241 ± 0.000131 |
| Mean KLD | 0.018110 ± 0.000112 | 0.018119 ± 0.000114 | 0.000859 ± 0.000005 | 0.000859 ± 0.000005 |
| Maximum KLD | 3.371759 | 2.833701 | 0.377813 | 0.377813 |
| Median KLD | 0.009176 | 0.009167 | 0.000549 | 0.000549 |
| Mean Δp | -0.256 ± 0.010 % | -0.251 ± 0.010 % | -0.017 ± 0.002 % | -0.017 ± 0.002 % |
| RMS Δp | 3.966 ± 0.033 % | 3.978 ± 0.033 % | 0.848 ± 0.007 % | 0.848 ± 0.007 % |
| Same top p | 93.893 ± 0.062 % | 93.864 ± 0.062 % | 98.515 ± 0.031 % | 98.515 ± 0.031 % |
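
For anyone unfamiliar with the KLD rows, here is a rough sketch of how statistics like these can be computed from two sets of logits. It is an illustration, not llama.cpp's actual code: the function names, the random logits, and the assumption that "Same top p" means the fraction of positions where both models pick the same top token are all mine.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kld_stats(base_logits, quant_logits):
    """Per-token KL(base || quant) summary, plus top-token agreement."""
    lp = log_softmax(np.asarray(base_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(quant_logits, dtype=np.float64))
    # KL divergence at each position: sum_i p_i * (log p_i - log q_i)
    kld = (np.exp(lp) * (lp - lq)).sum(axis=-1)
    same_top = float((lp.argmax(axis=-1) == lq.argmax(axis=-1)).mean())
    return {
        "mean_kld": float(kld.mean()),
        "median_kld": float(np.median(kld)),
        "max_kld": float(kld.max()),
        "same_top": same_top,
    }

# Hypothetical data: 256 token positions, vocabulary of 1000 entries.
rng = np.random.default_rng(0)
base = rng.normal(size=(256, 1000))
noisy = base + rng.normal(scale=0.05, size=base.shape)  # stand-in for a quantized model

identical = kld_stats(base, base)
perturbed = kld_stats(base, noisy)
print(identical)
print(perturbed)
```

Two models that produce bit-identical logits give a KLD of exactly zero at every position, which is why identical Q8_0 statistics suggest identical quantized weights.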

Assuming the original weights are BF16, the q8_0 results are to be expected. There is no precision loss when converting BF16 to FP32. Assuming the absolute values of all weights are in the interval [6e-5, 65504], they can also be converted to FP16 without precision loss (FP16 has more mantissa bits than BF16, so only its narrower exponent range can cause loss). And even if individual weights were to fall outside this interval, the scales for quantized data are always FP16, so it probably would still result in the exact same models.
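
The argument above can be checked with a small NumPy sketch. This is a simplified illustration under my own assumptions, not llama.cpp's implementation: BF16 is emulated by truncating FP32 to its upper 16 bits, `quantize_q8_0` is a hypothetical Q8_0-style routine (blocks of 32, one FP16 scale per block, int8 values), and `np.round` rounds halves to even, which may differ from the rounding used in the real code.

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 by keeping only the upper 16 bits of each FP32 value."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def via_fp16(x):
    """Round-trip FP32 values through FP16, as an FP16 conversion would."""
    return x.astype(np.float16).astype(np.float32)

def quantize_q8_0(x, block=32):
    """Simplified Q8_0-style quantization: FP16 scale + int8 values per block."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    d = amax / np.float32(127.0)
    inv = np.divide(np.float32(1.0), d, out=np.zeros_like(d), where=d > 0)
    q = np.round(x * inv).astype(np.int8)
    return d.astype(np.float16), q

# Hypothetical weights with magnitudes well inside the FP16 normal range.
rng = np.random.default_rng(0)
mag = rng.uniform(1e-3, 1.0, size=4096)
sign = rng.choice([-1.0, 1.0], size=4096)
w = to_bf16((mag * sign).astype(np.float32))

# 1) BF16 -> FP16 is exact for in-range values.
lossless = np.array_equal(via_fp16(w), w)

# 2) So both conversion paths feed identical inputs to the quantizer.
d32, q32 = quantize_q8_0(w)
d16, q16 = quantize_q8_0(via_fp16(w))
same_q8 = np.array_equal(d32, d16) and np.array_equal(q32, q16)

# 3) A weight far below the FP16 range is flushed to zero by FP16,
#    yet it rounds to the int8 value 0 on both paths anyway.
w_tiny = w.copy()
w_tiny[0] = to_bf16(np.array([1e-9], dtype=np.float32))[0]
tiny_changed = not np.array_equal(via_fp16(w_tiny), w_tiny)
dt32, qt32 = quantize_q8_0(w_tiny)
dt16, qt16 = quantize_q8_0(via_fp16(w_tiny))
same_q8_tiny = np.array_equal(dt32, dt16) and np.array_equal(qt32, qt16)

print(lossless, same_q8, tiny_changed, same_q8_tiny)
```

In this toy setup the out-of-range weight is altered by the FP16 conversion, but the Q8_0-style output is unchanged because the value is far smaller than one quantization step of its block.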

The quantization process for k-quants is more complicated than for q8_0. Intuitively I would assume that the differences seen there are the result of arithmetic operations where the model values are cast to different data types so the floating point rounding error ends up slightly different.


Yeah, the BF16 -> FP32 being lossless makes sense to me. I'm just surprised that BF16 -> FP16 -> Q8 is identical to BF16 -> FP32 -> Q8. Unless ALL values are within that range, as you mentioned, I would expect at minimum some noise

I could possibly find a way to check if all the weights are in that interval, and if they are, that would mean that fp16 is also lossless I suppose
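
A check along those lines might look like the sketch below. It assumes the weights are already available as NumPy FP32 arrays (how they get loaded is left out), and the tensor names and values are made up for illustration. It reports both the interval test and a direct FP16 round-trip test, since the round trip is the more precise criterion.

```python
import numpy as np

FP16_MIN_NORMAL = float(np.finfo(np.float16).tiny)  # 2**-14, about 6.1e-5
FP16_MAX = float(np.finfo(np.float16).max)          # 65504.0

def fp16_report(name, w):
    """Report whether a tensor survives conversion to FP16 unchanged."""
    w = np.asarray(w, dtype=np.float32)
    mag = np.abs(w[w != 0])  # zero is always exactly representable
    in_range = bool(np.all((mag >= FP16_MIN_NORMAL) & (mag <= FP16_MAX)))
    with np.errstate(over="ignore"):  # values above 65504 overflow to inf
        roundtrip = w.astype(np.float16).astype(np.float32)
    exact = bool(np.array_equal(roundtrip, w))
    return {"tensor": name, "in_range": in_range, "exact_roundtrip": exact}

# Hypothetical tensors standing in for real model weights.
tensors = {
    "blk.0.attn_q.weight": np.array([0.5, -0.25, 0.0078125], dtype=np.float32),
    "blk.0.ffn_up.weight": np.array([0.5, 1e-9, 70000.0], dtype=np.float32),
}

reports = [fp16_report(name, w) for name, w in tensors.items()]
for r in reports:
    print(r)
```

If every tensor reports an exact round trip, converting that model to FP16 loses nothing relative to FP32.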

But basically you're suggesting that at the end of the day, whether I convert to FP32, BF16, or FP16 (assuming a BF16 origin), the arithmetic in llama.cpp will make it so that it's irrelevant?

@bartowski Your quantized versions are very good! Thanks for carrying out quantization of my many models.