# imatrix: add option to display importance score statistics for a given imatrix file #12718


Status: **Open**. EAddario wants to merge 53 commits into `master`.

## Commits
All commits are by EAddario.

- `d8e902e` Add --show-statistics option (Apr 1, 2025)
- `f46693b` Add --show-statistics logic (Apr 1, 2025)
- `b3ac78b` Merge branch 'master' into imatrix (Apr 1, 2025)
- `dc3373e` Add tensor name parsing (Apr 2, 2025)
- `0589c3e` Tidy output format (Apr 2, 2025)
- `e1fd1af` Fix typo in title (Apr 2, 2025)
- `490a8fe` Merge branch 'master' into imatrix (Apr 7, 2025)
- `62ac268` Improve tensor influence ranking (Apr 8, 2025)
- `73d8ecb` Add better statistics (Apr 13, 2025)
- `200d88c` Merge branch 'master' into imatrix (Apr 13, 2025)
- `0b7f9c4` Change statistics' sort order (Apr 15, 2025)
- `52e86e2` Merge branch 'master' into imatrix (Apr 15, 2025)
- `91d48da` Merge branch 'master' into imatrix (Apr 19, 2025)
- `755c1ef` Add Cosine Similarity (Apr 22, 2025)
- `72a5ec1` Merge branch 'master' into imatrix (May 3, 2025)
- `5cd20e4` Add header search path (May 3, 2025)
- `1dbe6c3` Change header search path to private (May 3, 2025)
- `bb47f0d` Merge branch 'master' into imatrix (May 11, 2025)
- `a3ac66c` Merge branch 'master' into imatrix (May 25, 2025)
- `3eb556e` Add weighted statistics per layer (May 25, 2025)
- `0276d71` Merge branch 'master' into imatrix (Jun 3, 2025)
- `1f8dc23` Merge branch 'master' into imatrix (Jun 13, 2025)
- `8ecd5fa` Merge branch 'master' into imatrix (Jun 14, 2025)
- `8302a8a` Merge branch 'master' into imatrix (Jun 15, 2025)
- `bfc0dfc` Merge branch 'master' into imatrix (Jun 21, 2025)
- `5cfc443` Update report title (Jun 21, 2025)
- `280dfdd` Merge branch 'master' into imatrix (Jun 22, 2025)
- `235442a` Refactor compute_statistics out of main (Jun 22, 2025)
- `c823d16` Refactor compute_cossim out of load_imatrix (Jun 22, 2025)
- `a5c4640` Refactor compute_statistics out of load_imatrix (Jun 22, 2025)
- `655be19` Move imatrix statistics calculation into its own functions (Jun 22, 2025)
- `23ecca8` Add checks and validations (Jun 22, 2025)
- `a4166a8` Remove unnecessary include directory (Jun 22, 2025)
- `ed4ba31` Merge branch 'master' into imatrix (Jun 23, 2025)
- `19f8e15` Rename labels (Jun 24, 2025)
- `f5fd2b7` Add m_stats getter and refactor compute_statistics out of load_imatrix (Jun 24, 2025)
- `bc3bd57` Refactor variable names (Jun 24, 2025)
- `c3ede42` Merge branch 'master' into imatrix (Jun 24, 2025)
- `1389753` Merge branch 'master' into imatrix (Jun 29, 2025)
- `fde3089` Minor cosmetic change (Jun 29, 2025)
- `c5a3d0a` Retrigger checks (empty commit) (Jul 1, 2025)
- `688d0c2` Merge branch 'master' into imatrix (Jul 5, 2025)
- `b1c481a` Rerun checks (empty commit) (Jul 5, 2025)
- `dd13175` Fix unnecessary type promotion (Jul 7, 2025)
- `0cd8e67` Reverting change to improve code readability (Jul 7, 2025)
- `6c72d8e` Merge branch 'master' into imatrix (Jul 7, 2025)
- `6826341` Rerun checks (empty commit) (Jul 7, 2025)
- `432650b` Rerun checks (empty commit) (Jul 8, 2025)
- `61a21a4` Rerun checks - third time's the Charm 🤞 (empty commit) (Jul 9, 2025)
- `1a43247` Merge branch 'master' into imatrix (Jul 11, 2025)
- `a3fdb2b` Minor cosmetic change (Jul 12, 2025)
- `f9391bd` Update README (Jul 12, 2025)
- `98bcd3e` Fix typo (Jul 12, 2025)
## Changes
### common/arg.cpp (7 additions, 0 deletions)

```diff
@@ -2647,6 +2647,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.i_chunk = value;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
+    add_opt(common_arg(
+        {"--show-statistics"},
+        string_format("show imatrix statistics and then exit (default: %s)", params.show_statistics ? "true" : "false"),
+        [](common_params & params) {
+            params.show_statistics = true;
+        }
+    ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(common_arg(
         {"--parse-special"},
         string_format("parse special tokens (chat, tool, etc) (default: %s)", params.parse_special ? "true" : "false"),
```
### common/common.h (4 additions, 3 deletions)

```diff
@@ -420,9 +420,10 @@ struct common_params {
     int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
     int32_t i_chunk     = 0; // start processing from this chunk
 
-    bool process_output = false; // collect data for the output tensor
-    bool compute_ppl    = true;  // whether to compute perplexity
-    bool parse_special  = false; // whether to parse special tokens during imatrix tokenization
+    bool process_output  = false; // collect data for the output tensor
+    bool compute_ppl     = true;  // whether to compute perplexity
+    bool show_statistics = false; // show imatrix statistics per tensor
+    bool parse_special   = false; // whether to parse special tokens during imatrix tokenization
 
     // cvector-generator params
     int n_pca_batch = 100;
```
### tools/imatrix/README.md (60 additions, 13 deletions)

The updated README reads as follows.
# llama.cpp/tools/imatrix

Compute an importance matrix for a model and given text dataset. Can be used during quantization to enhance the quality of the quantized models.
More information is [available here](https://github.com/ggml-org/llama.cpp/pull/4861).

## Usage

```
./llama-imatrix \
    -m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] \
    [--chunk 123] [--output-frequency 10] [--save-frequency 0] [--show-statistics] \
    [--no-ppl] [--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...] \
    [--parse-special] [...]
```

Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (such as `wiki.train.raw`) are mandatory.
The parameters in square brackets are optional and have the following meaning:
* `-h | --help` shows usage information and exits.
* `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
* `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing `imatrix.dat` is used.
* `-ofreq | --output-frequency` specifies how often the results computed so far are saved to disk. Default is 10 (i.e., every 10 chunks).
* `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never).
* `--process-output` specifies if data will be collected for the `output.weight` tensor. My experience is that it is better to not utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--process-output` specifies if data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--in-file` one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
* `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
* `--chunk` skips the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
* `-n | --n-chunks` maximum number of chunks to process. Default is -1 for all available chunks.
* `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up the processing and do not care about perplexity.
* `--show-statistics` displays the imatrix file's statistics and then exits.

For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.

## Examples

```bash
# generate importance matrix using default filename (imatrix.dat), offloading 99 layers to GPU
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99

# use the imatrix to perform a Q4_K_M quantization
./llama-quantize --imatrix imatrix.dat ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
```

```bash
# combine existing imatrices
./llama-imatrix --in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat -o imatrix-combined.dat
```

```bash
# skip first 5 chunks, save intermediates every 20 chunks and snapshots every 50, parsing special tokens
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special
```

```bash
# analyse an imatrix file and display summary statistics instead of running inference
./llama-imatrix --in-file imatrix.dat --show-statistics
```

`--show-statistics` will display the following statistics:

#### Per tensor

* Σ(Act²): sum of all squared activations (the importance scores)
* Min & Max: minimum and maximum squared activation values
* μ & σ: mean and standard deviation of the squared activations
* % Active: proportion of elements whose average squared activation exceeds a small threshold (1e-5). Helpful to determine how alive/dormant the tensor is during inference
* N: number of squared activations
* Entropy: entropy of the squared activation distribution, in bits (standard Shannon entropy measurement) $S = -\sum_{i=1}^N p_i \log_2 p_i$
* E (norm): normalized entropy, $E(\text{norm})=\frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. These two metrics can be used to determine how well a prompt "exercises" the model's capabilities
* ZD Score: z-score distribution as described in _3.1 Layer Importance Scores_ of [Layer-Wise Quantization](https://arxiv.org/abs/2406.17415)
* CosSim: cosine similarity between the current layer's squared activations and the previous layer's. Useful to gauge how similar the importance patterns of consecutive layers are (see the sketch below this list)
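
As a rough sketch of how metrics like these can be derived from a tensor's vector of importance scores, consider the minimal, self-contained C++ example below. It is illustrative only, not the PR's implementation: the struct and function names are invented, and the ZD threshold of one standard deviation is an assumption based on the referenced paper; the 1e-5 activity threshold is the one quoted above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Summary statistics for one tensor's importance scores (squared activations).
struct tensor_stats {
    float sum          = 0.0f; // Σ(Act²)
    float min          = 0.0f;
    float max          = 0.0f;
    float mean         = 0.0f; // μ
    float stddev       = 0.0f; // σ
    float active_pct   = 0.0f; // % of scores above the 1e-5 threshold
    float entropy      = 0.0f; // Shannon entropy in bits
    float entropy_norm = 0.0f; // entropy / log2(N)
};

// Assumes `scores` is non-empty; the scores are non-negative by construction
// since they are sums of squared activations.
static tensor_stats compute_stats(const std::vector<float> & scores) {
    tensor_stats s;
    const size_t n = scores.size();

    s.sum  = std::accumulate(scores.begin(), scores.end(), 0.0f);
    s.min  = *std::min_element(scores.begin(), scores.end());
    s.max  = *std::max_element(scores.begin(), scores.end());
    s.mean = s.sum / n;

    float  var    = 0.0f;
    size_t active = 0;
    for (const float v : scores) {
        var += (v - s.mean) * (v - s.mean);
        if (v > 1e-5f) { active++; }
    }
    s.stddev     = std::sqrt(var / n);
    s.active_pct = 100.0f * active / n;

    // Entropy treats the scores as a probability distribution: p_i = score_i / Σ.
    for (const float v : scores) {
        const float p = v / s.sum;
        if (p > 0.0f) { s.entropy -= p * std::log2(p); }
    }
    s.entropy_norm = s.entropy / std::log2((float) n);

    return s;
}

// ZD stand-in: share of scores whose z-score exceeds 1. The threshold is an
// assumption here; see section 3.1 of arXiv:2406.17415 for the definition.
static float zd_score(const std::vector<float> & scores, const tensor_stats & s) {
    size_t over = 0;
    for (const float v : scores) {
        if (s.stddev > 0.0f && (v - s.mean) / s.stddev > 1.0f) { over++; }
    }
    return (float) over / scores.size();
}

// CosSim: cosine similarity between this layer's scores and the previous
// layer's (both vectors assumed to have the same length).
static float cosine_sim(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```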

#### Per layer

Weighted averages of Σ(Act²), ZD Score and CosSim are also calculated.
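
In general form, with per-tensor values $x_t$ and weights $w_t$ for the tensors $t$ in a layer (the exact choice of weights is defined by the implementation), the weighted average is:

$$\bar{x}_{\text{layer}} = \frac{\sum_{t \in \text{layer}} w_t\, x_t}{\sum_{t \in \text{layer}} w_t}$$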

#### Important note on the computed statistics

When using these statistics, please note that they are computed on the squared activations, **not on the actual (raw) activations**.
Whilst the results are still useful, they are less accurate than using the raw values and, in the case of cosine similarity, could be misleading if the tensor contains opposite vectors.
This limitation is due to the current implementation of the importance matrix, but a pull request ([use GGUF to store importance matrices](https://github.com/ggml-org/llama.cpp/pull/9400)) aims to address this.
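
As a concrete illustration of the cosine-similarity caveat: for any raw activation vector $x$ and its opposite $-x$, the raw vectors are perfectly anti-aligned, yet their element-wise squares are identical:

$$\cos(x, -x) = \frac{x \cdot (-x)}{\lVert x \rVert \, \lVert -x \rVert} = -1, \qquad \cos(x \odot x, (-x) \odot (-x)) = +1$$

where $\odot$ denotes the element-wise product. Statistics over squared values would therefore report maximal similarity precisely where the raw activations are maximally dissimilar.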
**@compilade** (Collaborator) commented on lines +78 to +80 (Jul 12, 2025):

> @EAddario
> Note that #9400 does not change what is stored, only how it's stored. It's still the sums of squared activations.
>
> It will make it easier to store other things, but that was not changed for now.
Loading