UPSTREAM PR #21652: Prevent the sum of the dequantized activation in q8_1 from overflowing #1350

Open
loci-dev wants to merge 3 commits into main from loci/pr-21652-mistral4-q4_0

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21652

Overview

During Mistral 4 small quantization and subsequent testing, I found that the Q4_1 quant produced a PPL of NaN.

Investigating the cause, the NaN only appeared when the later FFN_DOWN layers were quantized to Q4_1, i.e.:

llama-quantize ./Mistral-Small-4-119B-2603-bf16.gguf Mistral-Small-4-119B-2603-Q4_0.gguf Q4_0

Works as expected, but:

llama-quantize --tensor-type ffn_down=q4_1 ./Mistral-Small-4-119B-2603-bf16.gguf Mistral-Small-4-119B-2603-Q4_0.gguf Q4_0

(note the --tensor-type ffn_down=q4_1) produces a PPL of NaN.

After digging around with Claude and some debug code, I found that 16 Q8_1 blocks end up with s = Inf because the fp16 value overflows.

In Claude's words:

Q8_1's s field stores sum * d in fp16 (max 65504), but when activation values in a 32-element block are large and same-sign, sum * d ≈ 32 * amax can exceed 65504. The max finite |s| is only 410, so the 16 overflowing blocks are massive outliers — their activations must be ~160x larger than typical.

Additional information

I ran the same model with the updated activation code, which yielded a PPL of 5.5535 +/- 0.1235

For completeness, I also tested ignoring the pre-computed s value and recalculating the sums in f32, which gave a PPL of 5.5725 +/- 0.12469

Note that in either case the PPL without this change was NaN, so while this clamping is lossy, it results in a model that produces literally anything at all instead of failing spectacularly.

Note that this only updates the reference, AVX2, AVX1, and CUDA implementations; I'm not familiar enough with the other architectures to touch those.

Mistral 4 small PPL before these changes
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan,[65]nan,[66]nan,[67]nan,[68]nan,[69]nan,[70]nan,[71]nan,[72]nan,[73]nan,[74]nan,[75]nan,[76]nan,[77]nan,[78]nan,[79]nan,[80]nan,[81]nan,[82]nan,[83]nan,[84]nan,[85]nan,[86]nan,[87]nan,[88]nan,[89]nan,[90]nan,[91]nan,[92]nan,[93]nan,[94]nan,[95]nan,[96]nan,[97]nan,[98]nan,[99]nan,[100]nan,
Unexpected negative standard deviation of log(prob)
Mistral 4 small PPL after these changes
[1]3.4955,[2]5.1043,[3]4.3632,[4]4.0977,[5]4.2305,[6]4.4037,[7]4.5087,[8]4.5073,[9]4.4639,[10]4.5297,[11]4.5263,[12]4.5587,[13]4.7890,[14]4.8800,[15]4.9211,[16]5.1019,[17]4.9447,[18]5.0835,[19]5.3203,[20]5.2572,[21]5.2755,[22]5.2618,[23]5.2392,[24]5.0943,[25]4.9508,[26]4.8844,[27]4.7809,[28]4.7516,[29]4.6887,[30]4.6612,[31]4.7435,[32]4.7837,[33]4.9112,[34]4.9299,[35]4.9606,[36]5.0317,[37]5.1810,[38]5.2880,[39]5.2647,[40]5.3024,[41]5.3430,[42]5.3550,[43]5.3820,[44]5.4165,[45]5.3988,[46]5.3975,[47]5.4033,[48]5.4926,[49]5.5758,[50]5.5696,[51]5.5633,[52]5.5685,[53]5.5910,[54]5.6138,[55]5.6804,[56]5.6624,[57]5.7382,[58]5.7408,[59]5.7636,[60]5.8286,[61]5.8487,[62]5.8500,[63]5.8483,[64]5.8840,[65]5.9183,[66]5.9979,[67]6.0463,[68]6.0607,[69]6.0864,[70]6.0978,[71]6.1055,[72]6.0793,[73]6.1318,[74]6.1260,[75]6.1347,[76]6.1301,[77]6.1563,[78]6.1131,[79]6.1374,[80]6.0724,[81]6.0041,[82]5.9799,[83]5.9689,[84]5.9874,[85]5.9820,[86]5.9677,[87]5.9715,[88]6.0430,[89]6.0806,[90]6.0899,[91]6.0997,[92]6.0917,[93]6.1328,[94]6.1264,[95]6.1512,[96]6.1638,[97]6.1756,[98]6.1676,[99]6.1591,[100]6.1809,
Final estimate: PPL = 6.1809 +/- 0.09843

I also tested a Q4_1 quant of Qwen 3.5 9B and got identical PPL results with and without this change.

Qwen 3.5 9B before these changes
[1]5.4693,[2]7.8183,[3]7.9967,[4]7.6863,[5]7.6045,[6]7.8830,[7]8.1620,[8]8.6953,[9]9.0948,[10]9.4159,[11]9.2208,[12]9.2591,[13]9.7531,[14]9.2597,[15]9.1784,[16]9.2925,[17]8.7051,[18]8.7208,[19]8.6739,[20]8.6143,[21]8.3104,[22]8.2161,[23]7.9049,[24]7.5473,[25]7.4064,[26]7.2133,[27]7.0963,[28]7.0035,[29]6.9969,[30]6.9612,[31]6.9099,[32]6.9075,[33]6.8637,[34]6.9363,[35]7.0285,[36]7.1741,[37]7.2542,[38]7.2405,[39]7.2368,[40]7.2920,[41]7.3035,[42]7.3447,[43]7.3416,[44]7.3447,[45]7.4416,[46]7.4029,[47]7.5285,[48]7.5930,[49]7.5287,[50]7.5751,[51]7.5716,[52]7.6133,[53]7.6466,[54]7.6818,[55]7.6809,[56]7.6989,[57]7.7229,[58]7.7238,[59]7.7321,[60]7.7508,[61]7.7775,[62]7.8264,[63]7.8687,[64]7.9271,[65]7.9943,[66]8.0362,[67]8.1292,[68]8.1672,[69]8.1757,[70]8.1486,[71]8.2084,[72]8.2052,[73]8.2490,[74]8.2449,[75]8.2189,[76]8.2017,[77]8.2362,[78]8.2535,[79]8.1724,[80]8.1116,[81]8.0884,[82]8.1005,[83]8.1097,[84]8.1072,[85]8.1208,[86]8.1595,[87]8.1614,[88]8.1653,[89]8.1234,[90]8.0978,[91]8.0926,[92]8.0734,[93]8.0991,[94]8.1069,[95]8.1173,[96]8.1096,[97]8.0955,[98]8.0777,[99]8.0775,[100]8.0963,
Final estimate: PPL = 8.0963 +/- 0.12933
Qwen 3.5 9B after these changes
[1]5.4693,[2]7.8183,[3]7.9967,[4]7.6863,[5]7.6045,[6]7.8830,[7]8.1620,[8]8.6953,[9]9.0948,[10]9.4159,[11]9.2208,[12]9.2591,[13]9.7531,[14]9.2597,[15]9.1784,[16]9.2925,[17]8.7051,[18]8.7208,[19]8.6739,[20]8.6143,[21]8.3104,[22]8.2161,[23]7.9049,[24]7.5473,[25]7.4064,[26]7.2133,[27]7.0963,[28]7.0035,[29]6.9969,[30]6.9612,[31]6.9099,[32]6.9075,[33]6.8637,[34]6.9363,[35]7.0285,[36]7.1741,[37]7.2542,[38]7.2405,[39]7.2368,[40]7.2920,[41]7.3035,[42]7.3447,[43]7.3416,[44]7.3447,[45]7.4416,[46]7.4029,[47]7.5285,[48]7.5930,[49]7.5287,[50]7.5751,[51]7.5716,[52]7.6133,[53]7.6466,[54]7.6818,[55]7.6809,[56]7.6989,[57]7.7229,[58]7.7238,[59]7.7321,[60]7.7508,[61]7.7775,[62]7.8264,[63]7.8687,[64]7.9271,[65]7.9943,[66]8.0362,[67]8.1292,[68]8.1672,[69]8.1757,[70]8.1486,[71]8.2084,[72]8.2052,[73]8.2490,[74]8.2449,[75]8.2189,[76]8.2017,[77]8.2362,[78]8.2535,[79]8.1724,[80]8.1116,[81]8.0884,[82]8.1005,[83]8.1097,[84]8.1072,[85]8.1208,[86]8.1595,[87]8.1614,[88]8.1653,[89]8.1234,[90]8.0978,[91]8.0926,[92]8.0734,[93]8.0991,[94]8.1069,[95]8.1173,[96]8.1096,[97]8.0955,[98]8.0777,[99]8.0775,[100]8.0963,
Final estimate: PPL = 8.0963 +/- 0.12933

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, Claude was used extensively for discovering the issue through trial/error and debugging code


loci-review Bot commented Apr 15, 2026

Overview

Analysis of 125,386 functions across 15 binaries revealed 5 modified, 2 new, 0 removed, and 125,379 unchanged functions. Changes focus on adding FP16 overflow protection to Q8_1 quantization reference implementation, with minimal system-wide impact.

Power Consumption Changes:

  • build.bin.libggml-base.so: +0.113% (+84.07 nJ)
  • All other binaries (libllama.so, libggml-cpu.so, libggml.so, libmtmd.so, llama-bench, llama-cvector-generator, llama-tts, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli): 0.000% change

Function Analysis

quantize_row_q8_1_ref (build.bin.libggml-base.so)

  • Response time: 2,350ns → 2,398ns (+48ns, +2.05%)
  • Throughput time: 1,444ns → 1,479ns (+34ns, +2.37%)
  • Source change: Added FP16 range clamping via fminf(65504.0f, fmaxf(-65504.0f, sum*d)) to prevent overflow in Q8_1 quantization sum field
  • Justification: Intentional correctness fix preventing numerical instability in Q4_1/Q5_1 dot products. Reference implementation not used in production inference hot paths. Regression is acceptable trade-off for preventing inference corruption.

_S_max_size (build.bin.libggml-base.so)

  • Response time: 262ns → 140ns (-121ns, -46.35%)
  • Throughput time: 225ns → 103ns (-121ns, -54.05%)
  • Source change: None. STL internal function improved through compiler optimizations (better instruction selection, code consolidation)
  • Impact: Incidental improvement in vector allocation operations

_M_realloc_insert (build.bin.libggml-base.so)

  • Response time: 10,335ns → 10,362ns (+27ns, +0.26%)
  • Throughput time: 297ns → 322ns (+25ns, +8.42%)
  • Source change: None. STL vector reallocation showing expected recompilation variation
  • Impact: Negligible - within measurement noise for model loading operations

Additional Findings

Commits (48f1d71, d071411, 835acb7) systematically added overflow protection across Q8_1 quantization implementations. The 2% performance cost in the reference implementation prevents catastrophic numerical failures when activation values exceed FP16 representable range (±65504). Production inference paths (libllama.so, libggml-cpu.so) show zero power consumption change, confirming no impact on SIMD-optimized or GPU-accelerated inference workloads. Net effect on memory allocation chain is positive: _S_max_size improvement (-121ns) offsets _M_realloc_insert regression (+26ns).

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 6 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 02:19