imatrix : use GGUF to store importance matrices #9400

Open
compilade wants to merge 27 commits into master

Conversation

compilade (Collaborator) commented Sep 10, 2024

Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

There are multiple problems with imatrix which this PR addresses:

  • Ad-hoc format which isn't really readable by other projects (and which can only be extended backward-compatibly by appending more data at the end)
  • Non-deterministic tensor order depending on unordered_map iteration order (makes sha256sum useless to compare imatrix files made on the same dataset)
  • Broken behavior at small -ub (intermediate saves happen way too often)
  • Can't use bigger batch size than chunk size

Summary of changes

  • Use GGUF to store imatrix data.
    • general.type is imatrix
    • no general.architecture
      • can't really know the architecture from old imatrix files.
    • store *.in_sum2 and *.counts for each tensor with imatrix data (see the sketch after this list).
      • *.in_sum2 are the per-channel sums of squared activations
        • Stored in F32, like before.
      • *.counts are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which is used by llama-quantize)
        • Why not simply store the mean? To allow merging imatrix files together with --in-file.
        • It's stored in F32 even though it holds integer values, because when calculating the mean it would be converted to F32 anyway to perform the division.
  • Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf
    • Conversion is either unnecessary (llama-quantize can still read the old format, with a warning) or can be done directly with llama-imatrix (when the output file has the .gguf suffix).
  • Like llama-perplexity since #5946 (perplexity : support using multiple sequences to allow larger batch sizes), allow computing multiple chunks per batch with llama-imatrix
    • This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
  • Use fused-multiply-add (with std::fma) when accumulating the sums of activations
    • (decided against using it for now, for easier comparisons with llama-imatrix from master)
    • Shouldn't hurt to somewhat reduce rounding errors
      • (obviously f64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing only one line in IMatrixStats: vector<float> values to vector<double> values.)
  • Sort the tensor names before serializing
    • This makes the tensor order deterministic, because otherwise it depended on the iteration order of unordered_map.
      • Determinism between runs means sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.
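For illustration, here is a minimal sketch of how such a file could be consumed, assuming the gguf-py GGUFReader API and the *.in_sum2 / *.counts naming described above (the file name is a placeholder; this is not code from this PR):

# Minimal sketch: read an imatrix GGUF and recover per-channel mean squared
# activations from the *.in_sum2 and *.counts tensors described above.
import numpy as np
from gguf import GGUFReader  # gguf-py package from the llama.cpp repository

reader = GGUFReader("imatrix.gguf")

sums   = {}  # base tensor name -> per-channel sums of squared activations
counts = {}  # base tensor name -> number of activations (tokens) per matrix

for t in reader.tensors:
    data = np.asarray(t.data, dtype=np.float32).reshape(-1)
    if t.name.endswith(".in_sum2"):
        sums[t.name.removesuffix(".in_sum2")] = data
    elif t.name.endswith(".counts"):
        counts[t.name.removesuffix(".counts")] = data

for name, in_sum2 in sums.items():
    cnt = counts[name]                 # one count per matrix (per expert for stacked MoE tensors)
    s = in_sum2.reshape(len(cnt), -1)  # per-channel sums, one row per count
    # Mean squared activations, as llama-quantize would compute them; rows with
    # a zero count (never-solicited experts) are left for the consumer to handle.
    mean = s / np.where(cnt > 0.0, cnt, 1.0)[:, None]
    print(name, mean.shape)

Merging imatrix files (what --in-file does) then amounts to summing the corresponding *.in_sum2 and *.counts tensors element-wise before taking the mean, which is why the means themselves are not stored.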

TODO

  • Compare old llama-quantize using an old imatrix.dat with new llama-quantize using the converted imatrix.gguf
    • Seemed to work, but might need to re-test. The resulting quantized model(s) should have the same sha256sum.
  • Test new llama-imatrix at different batch sizes
    • Same checksums with -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512)
  • Perplexity test(s) with i-quants with old llama-imatrix vs new llama-imatrix
  • Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
  • Test --in-file with llama-imatrix
    • single .imatrix or .gguf imatrix (for round-trip conversions)
    • multiple (for merging)
  • (maybe) Implement cleaner general.architecture exclusion.
    • Currently, this uses a subclass to make self.add_architecture() a no-op (see the sketch after this list), but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* (in GGUFWriter) which expect self.arch to be something.
    • Or maybe the architecture should be included?
      • What about conversions from older imatrix.dat files?
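For the subclass mentioned in the last TODO item, a minimal sketch of the idea (the class name is hypothetical and this is not the PR's actual code; it assumes gguf-py's GGUFWriter with its add_architecture and add_type methods):

import gguf

class GGUFWriterNoArch(gguf.GGUFWriter):
    # A GGUFWriter that never writes general.architecture (illustrative only).
    def add_architecture(self) -> None:
        # No-op: imatrix files converted from old imatrix.dat don't know the
        # model architecture, so the general.architecture key is not written.
        pass

writer = GGUFWriterNoArch("imatrix.gguf", arch="")  # arch left empty on purpose
writer.add_type("imatrix")                          # general.type = "imatrix"
# ... add the *.in_sum2 / *.counts tensors and write out the file as usual ...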

compilade added the enhancement, breaking change, refactoring, examples, python, and Review Complexity : Medium labels on Sep 10, 2024
compilade and others added 2 commits September 10, 2024 11:51
Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <[email protected]>
@compilade compilade marked this pull request as draft September 13, 2024 03:11
compilade (Collaborator, Author)

I'm setting this to "draft", because of concerns by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment) (mostly related to the fact that GGUF is harder to parse than imatrix.dat files).

More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread).

I'll need some days to think about how to go further with this.

ggerganov (Member)

@compilade This is a good change and I think it would be useful to bring it to a completion.

In the future, we can extend libllama with an interface for saving/loading imatrix data. This way the implementation for reading and writing the imatrix data would be localized in libllama and can be kept in-sync more easily. This can be combined with the refactoring of llama_model_quantize_params to not pass C++ objects.

@compilade compilade marked this pull request as ready for review June 23, 2025 18:45
Comment on lines 38 to 44
static bool str_remove_suffix(std::string & str, const std::string & suffix) {
    bool has_suffix = str_has_suffix(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}
Collaborator
Suggested change
static bool str_remove_suffix(std::string & str, const std::string & suffix) {
    bool has_suffix = str_has_suffix(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}
static bool str_remove_suffix(std::string & str, const std::string_view & suffix) {
    bool has_suffix = string_ends_with(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}

This is a nice complement to string_ends_with, should be moved to common.cpp as string_remove_suffix.

compilade (Collaborator, Author)
Shouldn't std::string_view be passed by value instead of by const reference in these functions?

For example, https://quuxplusone.github.io/blog/2021/11/09/pass-string-view-by-value/ suggests that std::string_view should always be passed by value.

Collaborator
I did wonder about that, but didn't know enough of the details behind it. Reading that, it certainly seems so, and it would be worth going over the usage of std::string_view elsewhere.

CISC (Collaborator) left a comment:
LGTM, I did some light testing with old and new formats, converting them back and forth, checking that quantization was unchanged, etc, all seems good.

The only slightly annoying thing is the garbage output to the console (by gguf_init_from_file) when using the old format. Resolved in #14381

Will probably benefit from testing by more people before merging, @bartowski1182 @danielhanchen ?

JohannesGaessler (Collaborator)

Thank you for working on this. I've been thinking that storing the imatrix as GGUF would be nice for investigating the use of gradients instead of activations.

CISC (Collaborator) commented Jul 5, 2025

I guess unless there are any objections from @bartowski1182 or @danielhanchen we can merge this after the weekend?

bartowski1182 (Contributor)

Objections? I've been looking forward to this change for months haha

danielhanchen (Contributor)

Oh, this looks fantastic, great work! Re using bigger batch sizes - does this mean that, if memory allows, imatrix should in fact be faster to process via PP?

I'll try this out over the month, but maybe in the meantime, I'll temporarily use the legacy format. Overall this change is very welcome! Great work @compilade!

compilade (Collaborator, Author) commented Jul 7, 2025

Re using bigger batch sizes - does this mean that, if memory allows, imatrix should in fact be faster to process via PP?

@danielhanchen

Currently, with llama-imatrix from the master branch, the chunk size is tied to the ubatch size, which means setting a -ub different from the chunk size leads to broken behavior.

This PR makes it possible to process multiple chunks in a single ubatch, or to use multiple ubatches per chunk; the two are no longer tied together. It also technically allows variable chunk sizes (which will be useful to eventually implement proper chat dataset handling).
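For example, using the numbers from the checksum tests in the TODO list above: with a chunk size of 512 (-c 512), -ub 512 -b 2048 covers four chunks per logical batch, while -ub 64 -b 512 splits each chunk across eight ubatches, and both produced identical checksums.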

Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway).

Not sure how the GPU backends would handle that situation (with very large models), although when the model does fit, this still allows benefiting from pipeline parallelism (as llama-perplexity already does) by calculating multiple ubatches per chunk or multiple chunks per (u?)batch.

but maybe in the meantime, I'll temporarily use the legacy format

You'll still get most of the advantages described above even with the old format; both are saved from the same internal data (which was changed to better fit the new format).

The main benefits of the new GGUF-based imatrix format (which for now is only used when the output imatrix file has a .gguf suffix) are saner handling of MoE models, especially when merging imatrix files with different chunk sizes, and readability by GGUF tooling (e.g. HF previews, gguf-dump, etc.).

While a round-trip conversion is possible, the legacy format contains insufficient shape and counts information for correctly merging imatrix data from MoE models. That's not really a problem when using one-off imatrix files on a single chunk size, though.

(llama-quantize should be able to read both formats)

Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when, say, 95% of the experts were solicited), while the new format keeps it. The zeros are then handled in llama-quantize by using imatrix weights of 1 for that sub-matrix, which, from my experiments around #12557, seems reasonable and mostly matches what the quantization functions do without an imatrix. This should help avoid problems like #12913.

Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE imatrix files (by adding a 1 value to the squared activations of unused experts, even when those experts are used in another merged imatrix file), although arguably the current behavior (dropping the data for the entire tensor) is also wrong. Both problems cannot be fixed simultaneously without a different format than the legacy one.)

const float count = ((const float *) counts->data)[j];
if (count > 0.0f) {
for (int64_t i = 0; i < ne0; ++i) {
e[j*ne0 + i] = ((const float *) sums->data)[j*ne0 + i] / count;
compilade (Collaborator, Author)

The last paragraphs of my previous comment (in #9400 (comment)) made me think of #12557 (comment) by @jukofyork which suggested the imatrix weights calculated from the dataset(s) should gradually replace the prior of "equal weights" (aka using 1 for the imatrix weights).

Now I wonder if always using an extra 1 in the sum of squared activations would be sufficient to make the formula closer to act as a regulariser (I might be misusing the term), at least near the lower extreme.

Not sure what relative weight the prior should have, though. Maybe (1.0f)/(1.0f) would be too little, since it becomes 1/512 after only one chunk.

(Obviously, this is out of scope for this PR (and will not be included here), but it's still somewhat related, because the new format stores per-matrix token counts instead of per-tensor chunk counts, which would make the above possible sanely for stacked MoE tensors (which may not be used equally even within a chunk))

E.g.

Suggested change
e[j*ne0 + i] = ((const float *) sums->data)[j*ne0 + i] / count;
e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + 1.0f) / (count + 1.0f);

jukofyork (Collaborator) commented Jul 9, 2025

The last paragraphs of my previous comment (in #9400 (comment)) made me think of #12557 (comment) by @jukofyork which suggested the imatrix weights calculated from the dataset(s) should gradually replace the prior of "equal weights" (aka using 1 for the imatrix weights).

Now I wonder if always using an extra 1 in the sum of squared activations would be sufficient to make the formula closer to act as a regulariser (I might be misusing the term), at least near the lower extreme.

This idea actually goes right back to the roots of Bayesianism:

https://en.wikipedia.org/wiki/Sunrise_problem

or more formally/recently:

https://en.wikipedia.org/wiki/Rule_of_succession

This sort of "pseudocount" was used extensively in the NLP literature in the pre-LLM days:

https://en.wikipedia.org/wiki/Additive_smoothing

(as often you ended up with very small samples in some buckets when performing n-gram analysis, etc)

More formally, it's the formula for the posterior predictive of a Bernoulli likelihood, see the first row of the first table here:

https://en.wikipedia.org/wiki/Conjugate_prior

Not sure what relative weight the prior should have, though. Maybe (1.0f)/(1.0f) would be too little, since it becomes 1/512 after only one chunk.

Since the imatrix values are non-negative weights centered around 1, a:

https://en.wikipedia.org/wiki/Log-normal_distribution

prior would likely be the best place to start (and is very easy to deal with / program up as it's just the Normal posterior predictive formula [which has a nice closed form solution], but using log-transformed data - the Wikipedia page above links to a paper that shows this).
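To make the pseudocount idea concrete, a minimal sketch (illustrative only, not something this PR implements; prior_mean and prior_weight are hypothetical knobs):

import numpy as np

def smoothed_mean(in_sum2: np.ndarray, count: float,
                  prior_mean: float = 1.0, prior_weight: float = 1.0) -> np.ndarray:
    # Additive-smoothing (posterior-predictive style) estimate of the
    # per-channel mean squared activations. With count == 0 this returns
    # prior_mean (the "equal weights" prior that llama-quantize falls back to);
    # as count grows, the observed data dominates the prior.
    return (in_sum2 + prior_weight * prior_mean) / (count + prior_weight)

The +1.0f suggestion above is the special case prior_mean = prior_weight = 1; the log-normal variant discussed here would apply the same kind of update to log-transformed values instead.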

slaren (Member) commented Jul 7, 2025

Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway).

Not sure how the GPU backends would handle that situation (with very large models), although when the model does fit, this still allows benefiting from pipeline parallelism (as llama-perplexity already does) by calculating multiple ubatches per chunk or multiple chunks per (u?)batch.

Using a larger batch size will also help on GPU backends for models that don't fit in VRAM, since it reduces the number of times that the weights have to be copied to VRAM. However, usage of the eval callback prevents taking advantage of pipeline parallelism, since after every matrix multiplication there is a full synchronization to copy the results of the operation to the CPU.

nicoboss (Contributor) commented Jul 7, 2025

Thanks a lot for creating this amazing new imatrix file format and generally improving imatrix computation by a lot. I'm very excited that partial data caused by missing expert activation is now handled properly, thanks to the new file format.

One of the most impactful changes of this PR seems to be imatrix support for 3D tensors. This finally allows generating imatrix quants for models using MLA such as DeepSeek (V2, V2-Lite, V2.5, V3, R1), MiniCPM3-4B, PLM-1.8B, KwaiCoder-DS-V2-Lite, hpc-coder-v2-6b and whale-v3-base-marged, without the "Fix imatrix calculation for MLA models" patch. This change surprisingly wasn't even mentioned in the PR description.

TODO: 4d? (is that even used in practice?)

No, there is currently no practical use case for 4D tensors, nor do I think there will ever be one. The most dimensions currently required are the 3D tensors used for MLA.

saood06 commented Jul 8, 2025

Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when, say, 95% of the experts were solicited), while the new format keeps it. The zeros are then handled in llama-quantize by using imatrix weights of 1 for that sub-matrix, which, from my experiments around #12557, seems reasonable and mostly matches what the quantization functions do without an imatrix. This should help avoid problems like #12913.

Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE imatrix files (by adding a 1 value to the squared activations of unused experts, even when those experts are used in another merged imatrix file), although arguably the current behavior (dropping the data for the entire tensor) is also wrong. Both problems cannot be fixed simultaneously without a different format than the legacy one.)

The solution in @nicoboss's fork was inspired by ikawrakow/ik_llama.cpp#202 which does mention this concern (and to me seems to agree with the approach taken here):

Strictly speaking it would be better to leave the zeros in the imatrix data of experts that have never been activated. But this would require to go and add proper protection against all-zeros imatrices, along with the appropriate corrective action, for all quants, and not just for IQ1_S_R4 as I did in ikawrakow/ik_llama.cpp#191. So, for now we go with same-importance columns for never activated experts.
