UPSTREAM PR #21652: Prevent the sum of the dequantized activation in q8_1 from overflowing #1350

Open
loci-dev wants to merge 3 commits into main from loci/pr-21652-mistral4-q4_0

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21652

Overview

During Mistral 4 small quantization and subsequent testing, I found that the Q4_1 quant produced a PPL of NaN.

Investigating the cause, the NaN only appeared when the later FFN_DOWN layers were quantized to Q4_1, i.e.:

llama-quantize ./Mistral-Small-4-119B-2603-bf16.gguf Mistral-Small-4-119B-2603-Q4_0.gguf Q4_0

Works as expected, but:

llama-quantize --tensor-type ffn_down=q4_1 ./Mistral-Small-4-119B-2603-bf16.gguf Mistral-Small-4-119B-2603-Q4_0.gguf Q4_0

(note the --tensor-type ffn_down=q4_1) produces a PPL of NaN.

After digging around with Claude and some debug code, I found that 16 Q8_1 blocks end up with s = Inf because the fp16 value overflows.

In Claude's words:

Q8_1's s field stores sum * d in fp16 (max 65504), but when activation values in a 32-element block are large and same-sign, sum * d ≈ 32 * amax can exceed 65504. The max finite |s| is only 410, so the 16 overflowing blocks are massive outliers — their activations must be ~160x larger than typical.

Additional information

I ran the same model with the updated activation code, which yielded a PPL of 5.5535 +/- 0.1235

For completeness, I also tested ignoring the pre-computed s value and recalculating the sums in f32, which gave a PPL of 5.5725 +/- 0.12469

Note that in either case the PPL without this change was NaN, so while this clamping is lossy, it results in a model that produces literally anything at all instead of failing spectacularly.

Note that this only updates the reference, AVX2, AVX1, and CUDA implementations; I'm not familiar enough with the other architectures to touch those.

Mistral 4 small PPL before these changes
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan,[65]nan,[66]nan,[67]nan,[68]nan,[69]nan,[70]nan,[71]nan,[72]nan,[73]nan,[74]nan,[75]nan,[76]nan,[77]nan,[78]nan,[79]nan,[80]nan,[81]nan,[82]nan,[83]nan,[84]nan,[85]nan,[86]nan,[87]nan,[88]nan,[89]nan,[90]nan,[91]nan,[92]nan,[93]nan,[94]nan,[95]nan,[96]nan,[97]nan,[98]nan,[99]nan,[100]nan,
Unexpected negative standard deviation of log(prob)
Mistral 4 small PPL after these changes
[1]3.4955,[2]5.1043,[3]4.3632,[4]4.0977,[5]4.2305,[6]4.4037,[7]4.5087,[8]4.5073,[9]4.4639,[10]4.5297,[11]4.5263,[12]4.5587,[13]4.7890,[14]4.8800,[15]4.9211,[16]5.1019,[17]4.9447,[18]5.0835,[19]5.3203,[20]5.2572,[21]5.2755,[22]5.2618,[23]5.2392,[24]5.0943,[25]4.9508,[26]4.8844,[27]4.7809,[28]4.7516,[29]4.6887,[30]4.6612,[31]4.7435,[32]4.7837,[33]4.9112,[34]4.9299,[35]4.9606,[36]5.0317,[37]5.1810,[38]5.2880,[39]5.2647,[40]5.3024,[41]5.3430,[42]5.3550,[43]5.3820,[44]5.4165,[45]5.3988,[46]5.3975,[47]5.4033,[48]5.4926,[49]5.5758,[50]5.5696,[51]5.5633,[52]5.5685,[53]5.5910,[54]5.6138,[55]5.6804,[56]5.6624,[57]5.7382,[58]5.7408,[59]5.7636,[60]5.8286,[61]5.8487,[62]5.8500,[63]5.8483,[64]5.8840,[65]5.9183,[66]5.9979,[67]6.0463,[68]6.0607,[69]6.0864,[70]6.0978,[71]6.1055,[72]6.0793,[73]6.1318,[74]6.1260,[75]6.1347,[76]6.1301,[77]6.1563,[78]6.1131,[79]6.1374,[80]6.0724,[81]6.0041,[82]5.9799,[83]5.9689,[84]5.9874,[85]5.9820,[86]5.9677,[87]5.9715,[88]6.0430,[89]6.0806,[90]6.0899,[91]6.0997,[92]6.0917,[93]6.1328,[94]6.1264,[95]6.1512,[96]6.1638,[97]6.1756,[98]6.1676,[99]6.1591,[100]6.1809,
Final estimate: PPL = 6.1809 +/- 0.09843

I also tested a Q4_1 quant of Qwen 3.5 9B and got identical PPL results with and without this change.

Qwen 3.5 9B before these changes
[1]5.4693,[2]7.8183,[3]7.9967,[4]7.6863,[5]7.6045,[6]7.8830,[7]8.1620,[8]8.6953,[9]9.0948,[10]9.4159,[11]9.2208,[12]9.2591,[13]9.7531,[14]9.2597,[15]9.1784,[16]9.2925,[17]8.7051,[18]8.7208,[19]8.6739,[20]8.6143,[21]8.3104,[22]8.2161,[23]7.9049,[24]7.5473,[25]7.4064,[26]7.2133,[27]7.0963,[28]7.0035,[29]6.9969,[30]6.9612,[31]6.9099,[32]6.9075,[33]6.8637,[34]6.9363,[35]7.0285,[36]7.1741,[37]7.2542,[38]7.2405,[39]7.2368,[40]7.2920,[41]7.3035,[42]7.3447,[43]7.3416,[44]7.3447,[45]7.4416,[46]7.4029,[47]7.5285,[48]7.5930,[49]7.5287,[50]7.5751,[51]7.5716,[52]7.6133,[53]7.6466,[54]7.6818,[55]7.6809,[56]7.6989,[57]7.7229,[58]7.7238,[59]7.7321,[60]7.7508,[61]7.7775,[62]7.8264,[63]7.8687,[64]7.9271,[65]7.9943,[66]8.0362,[67]8.1292,[68]8.1672,[69]8.1757,[70]8.1486,[71]8.2084,[72]8.2052,[73]8.2490,[74]8.2449,[75]8.2189,[76]8.2017,[77]8.2362,[78]8.2535,[79]8.1724,[80]8.1116,[81]8.0884,[82]8.1005,[83]8.1097,[84]8.1072,[85]8.1208,[86]8.1595,[87]8.1614,[88]8.1653,[89]8.1234,[90]8.0978,[91]8.0926,[92]8.0734,[93]8.0991,[94]8.1069,[95]8.1173,[96]8.1096,[97]8.0955,[98]8.0777,[99]8.0775,[100]8.0963,
Final estimate: PPL = 8.0963 +/- 0.12933
Qwen 3.5 9B after these changes
[1]5.4693,[2]7.8183,[3]7.9967,[4]7.6863,[5]7.6045,[6]7.8830,[7]8.1620,[8]8.6953,[9]9.0948,[10]9.4159,[11]9.2208,[12]9.2591,[13]9.7531,[14]9.2597,[15]9.1784,[16]9.2925,[17]8.7051,[18]8.7208,[19]8.6739,[20]8.6143,[21]8.3104,[22]8.2161,[23]7.9049,[24]7.5473,[25]7.4064,[26]7.2133,[27]7.0963,[28]7.0035,[29]6.9969,[30]6.9612,[31]6.9099,[32]6.9075,[33]6.8637,[34]6.9363,[35]7.0285,[36]7.1741,[37]7.2542,[38]7.2405,[39]7.2368,[40]7.2920,[41]7.3035,[42]7.3447,[43]7.3416,[44]7.3447,[45]7.4416,[46]7.4029,[47]7.5285,[48]7.5930,[49]7.5287,[50]7.5751,[51]7.5716,[52]7.6133,[53]7.6466,[54]7.6818,[55]7.6809,[56]7.6989,[57]7.7229,[58]7.7238,[59]7.7321,[60]7.7508,[61]7.7775,[62]7.8264,[63]7.8687,[64]7.9271,[65]7.9943,[66]8.0362,[67]8.1292,[68]8.1672,[69]8.1757,[70]8.1486,[71]8.2084,[72]8.2052,[73]8.2490,[74]8.2449,[75]8.2189,[76]8.2017,[77]8.2362,[78]8.2535,[79]8.1724,[80]8.1116,[81]8.0884,[82]8.1005,[83]8.1097,[84]8.1072,[85]8.1208,[86]8.1595,[87]8.1614,[88]8.1653,[89]8.1234,[90]8.0978,[91]8.0926,[92]8.0734,[93]8.0991,[94]8.1069,[95]8.1173,[96]8.1096,[97]8.0955,[98]8.0777,[99]8.0775,[100]8.0963,
Final estimate: PPL = 8.0963 +/- 0.12933

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, Claude was used extensively for discovering the issue through trial/error and debugging code


loci-review Bot commented Apr 15, 2026

Overview

Analysis of 125,386 functions across 15 binaries revealed 5 modified, 2 new, 0 removed, and 125,379 unchanged functions. Changes focus on adding FP16 overflow protection to Q8_1 quantization reference implementation, with minimal system-wide impact.

Power Consumption Changes:

  • build.bin.libggml-base.so: +0.113% (+84.07 nJ)
  • All other binaries (libllama.so, libggml-cpu.so, libggml.so, libmtmd.so, llama-bench, llama-cvector-generator, llama-tts, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli): 0.000% change

Function Analysis

quantize_row_q8_1_ref (build.bin.libggml-base.so)

  • Response time: 2,350ns → 2,398ns (+48ns, +2.05%)
  • Throughput time: 1,444ns → 1,479ns (+34ns, +2.37%)
  • Source change: Added FP16 range clamping via fminf(65504.0f, fmaxf(-65504.0f, sum*d)) to prevent overflow in Q8_1 quantization sum field
  • Justification: Intentional correctness fix preventing numerical instability in Q4_1/Q5_1 dot products. Reference implementation not used in production inference hot paths. Regression is acceptable trade-off for preventing inference corruption.

_S_max_size (build.bin.libggml-base.so)

  • Response time: 262ns → 140ns (-121ns, -46.35%)
  • Throughput time: 225ns → 103ns (-121ns, -54.05%)
  • Source change: None. STL internal function improved through compiler optimizations (better instruction selection, code consolidation)
  • Impact: Incidental improvement in vector allocation operations

_M_realloc_insert (build.bin.libggml-base.so)

  • Response time: 10,335ns → 10,362ns (+27ns, +0.26%)
  • Throughput time: 297ns → 322ns (+25ns, +8.42%)
  • Source change: None. STL vector reallocation showing expected recompilation variation
  • Impact: Negligible - within measurement noise for model loading operations

Additional Findings

Commits (48f1d71, d071411, 835acb7) systematically added overflow protection across Q8_1 quantization implementations. The 2% performance cost in the reference implementation prevents catastrophic numerical failures when activation values exceed FP16 representable range (±65504). Production inference paths (libllama.so, libggml-cpu.so) show zero power consumption change, confirming no impact on SIMD-optimized or GPU-accelerated inference workloads. Net effect on memory allocation chain is positive: _S_max_size improvement (-121ns) offsets _M_realloc_insert regression (+26ns).

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 6 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 02:19