UPSTREAM PR #21652: Prevent the sum of the dequantized activation in q8_1 from overflowing (#1350)
Overview

Analysis of 125,386 functions across 15 binaries revealed 5 modified, 2 new, 0 removed, and 125,379 unchanged functions. Changes focus on adding FP16 overflow protection to the Q8_1 quantization reference implementation, with minimal system-wide impact.

Additional Findings

Commits (48f1d71, d071411, 835acb7) systematically added overflow protection across the Q8_1 quantization implementations. The 2% performance cost in the reference implementation prevents catastrophic numerical failures when activation values exceed the FP16 representable range (±65504). Production inference paths (libllama.so, libggml-cpu.so) show zero power consumption change, confirming no impact on SIMD-optimized or GPU-accelerated inference workloads. The net effect on the memory allocation chain is positive.
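In essence, the protection saturates a value at the finite fp16 limit before the narrowing conversion. Here is a minimal sketch of the idea; `saturate_to_fp16_range` is a hypothetical helper name, not the patch's actual code, which applies this inside ggml's quantization routines:

```c
#include <math.h>

/* Largest finite IEEE 754 half-precision (fp16) value. */
#define FP16_FINITE_MAX 65504.0f

/* Hypothetical helper: saturate to the finite fp16 range before
 * narrowing, so an out-of-range value becomes +/-65504 instead of Inf. */
static inline float saturate_to_fp16_range(float x) {
    return fminf(fmaxf(x, -FP16_FINITE_MAX), FP16_FINITE_MAX);
}
```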
Note: Source pull request: ggml-org/llama.cpp#21652
Overview
During Mistral 4 small quantization and subsequent testing, I found that the PPL of Q4_1 ended up as NaN. When testing for the reason, it only happened when later FFN_DOWN layers were quantized to Q4_1, i.e. the plain quantization works as expected, but the run with --tensor-type ffn_down=q4_1 gets NaN for the PPL.

After digging around with Claude and debug code, I found that 16 Q8_1 blocks have s = Inf because the fp16 value is overflowing.
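For context, here is a minimal sketch of the Q8_1 reference quantization path, paraphrasing ggml's quantize_row_q8_1_ref for a single block. The struct and function names are simplifications, and plain float stands in for ggml_half so the sketch compiles standalone; the comment on the final store marks where the real code overflows:

```c
#include <stdint.h>
#include <math.h>

#define QK8_1 32

/* Simplified Q8_1 block. In ggml, d and s are stored as ggml_half
 * (fp16, finite range +/-65504); float is used here for brevity. */
typedef struct {
    float  d;           /* scale */
    float  s;           /* d * sum(qs), cached for the dot-product offset term */
    int8_t qs[QK8_1];   /* quantized values */
} block_q8_1_sketch;

static void quantize_block_q8_1(const float * x, block_q8_1_sketch * y) {
    float amax = 0.0f;
    for (int j = 0; j < QK8_1; j++) {
        amax = fmaxf(amax, fabsf(x[j]));
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    int sum = 0;
    for (int j = 0; j < QK8_1; j++) {
        y->qs[j] = (int8_t) roundf(x[j] * id);
        sum += y->qs[j];
    }
    y->d = d;
    /* In ggml this is GGML_FP32_TO_FP16(sum * d): with large enough
     * activations |sum * d| exceeds 65504, the fp16 store becomes Inf,
     * and the Q4_1 dot product that later consumes s turns into NaN. */
    y->s = (float) sum * d;
}
```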
Additional information
I ran the same model with the updated activation code and it yielded a PPL of 5.5535 +/- 0.1235. For completeness, I also tested with ignoring the pre-computed s value and recalculating the result as f32 (see the sketch below), and got a PPL of 5.5725 +/- 0.12469.

Note that in either case, the PPL without this change was NaN, so while this clamping is lossy, it does result in a model that produces literally anything at all instead of failing spectacularly.

Note that this only updates the reference, AVX2, AVX1, and CUDA implementations; I am not familiar enough with the other archs to touch those.
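The "recalculate as f32" variant mentioned above might look roughly like this, reusing the simplified block layout from the earlier sketch; `recompute_s_f32` is a hypothetical name, and the PR itself keeps the clamped pre-computed s instead:

```c
/* Rebuild s from the int8 quants and the scale entirely in f32,
 * ignoring the (possibly overflowed) pre-computed fp16 value.
 * |sum| <= 127 * 32, so this cannot overflow in float. */
static float recompute_s_f32(const block_q8_1_sketch * y) {
    int sum = 0;
    for (int j = 0; j < QK8_1; j++) {
        sum += y->qs[j];
    }
    return (float) sum * y->d;
}
```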
Mistral 4 small PPL before these changes
Mistral 4 small PPL after these changes
Also tested on a Q4_1 quant of Qwen 3.5 9B and got identical PPL results both with and without this change.

Qwen 3.5 9B before these changes
Qwen 3.5 9B after these changes
Requirements