imatrix: add option to display importance score statistics for a given imatrix file #12718

Open · EAddario wants to merge 53 commits into ggml-org:master from EAddario:imatrix (+326 −30)
Commits (53, all by EAddario):

- d8e902e Add --show-statistics option
- f46693b Add --show-statistics logic
- b3ac78b Merge branch 'master' into imatrix
- dc3373e Add tensor name parsing
- 0589c3e Tidy output format
- e1fd1af Fix typo in title
- 490a8fe Merge branch 'master' into imatrix
- 62ac268 Improve tensor influence ranking
- 73d8ecb Add better statistics
- 200d88c Merge branch 'master' into imatrix
- 0b7f9c4 Change statistics' sort order
- 52e86e2 Merge branch 'master' into imatrix
- 91d48da Merge branch 'master' into imatrix
- 755c1ef Add Cosine Similarity
- 72a5ec1 Merge branch 'master' into imatrix
- 5cd20e4 Add header search path
- 1dbe6c3 Change header search path to private
- bb47f0d Merge branch 'master' into imatrix
- a3ac66c Merge branch 'master' into imatrix
- 3eb556e Add weighted statistics per layer
- 0276d71 Merge branch 'master' into imatrix
- 1f8dc23 Merge branch 'master' into imatrix
- 8ecd5fa Merge branch 'master' into imatrix
- 8302a8a Merge branch 'master' into imatrix
- bfc0dfc Merge branch 'master' into imatrix
- 5cfc443 Update report title
- 280dfdd Merge branch 'master' into imatrix
- 235442a Refactor compute_statistics out of main
- c823d16 Refactor compute_cossim out of load_imatrix
- a5c4640 Refactor compute_statistics out of load_imatrix
- 655be19 Move imatrix statistics calculation into its own functions
- 23ecca8 Add checks and validations
- a4166a8 Remove unnecessary include directory
- ed4ba31 Merge branch 'master' into imatrix
- 19f8e15 Rename labels
- f5fd2b7 Add m_stats getter and refactor compute_statistics out of load_imatrix
- bc3bd57 Refactor variable names
- c3ede42 Merge branch 'master' into imatrix
- 1389753 Merge branch 'master' into imatrix
- fde3089 Minor cosmetic change
- c5a3d0a Retrigger checks (empty commit)
- 688d0c2 Merge branch 'master' into imatrix
- b1c481a Rerun checks (empty commit)
- dd13175 Fix unnecessary type promotion
- 0cd8e67 Reverting change to improve code readability
- 6c72d8e Merge branch 'master' into imatrix
- 6826341 Rerun checks (empty commit)
- 432650b Rerun checks (empty commit)
- 61a21a4 Rerun checks - third time's the Charm 🤞 (empty commit)
- 1a43247 Merge branch 'master' into imatrix
- a3fdb2b Minor cosmetic change
- f9391bd Update README
- 98bcd3e Fix typo
# llama.cpp/tools/imatrix

Compute an importance matrix for a model and given text dataset. Can be used during quantization to enhance the quality of the quantized models.
More information is [available here](https://github.com/ggml-org/llama.cpp/pull/4861)

## Usage

```
./llama-imatrix \
    -m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] \
    [--chunk 123] [--output-frequency 10] [--save-frequency 0] [--show-statistics] \
    [--no-ppl] [--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...] \
    [--parse-special] [...]
```

Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (e.g. `wiki.train.raw`) are mandatory.
The parameters in square brackets are optional and have the following meaning:
* `-h | --help` shows usage information and exits.
* `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. The default verbosity level is `1`.
* `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing, `imatrix.dat` is used.
* `-ofreq | --output-frequency` specifies how often the results computed so far are saved to disk. The default is 10 (i.e., every 10 chunks).
* `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. The default is 0 (i.e., never).
* `--process-output` specifies whether data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--in-file` one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
* `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
* `--chunk` skips the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
* `-n | --n-chunks` maximum number of chunks to process. The default is -1, i.e. all available chunks.
* `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up processing and do not care about perplexity.
* `--show-statistics` displays the imatrix file's statistics instead of running inference.
For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.

## Examples

```bash
# generate an importance matrix using the default filename (imatrix.dat), offloading 99 layers to the GPU
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99

# use the imatrix to perform a Q4_K_M quantization
./llama-quantize --imatrix imatrix.dat ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
```

```bash
# combine existing imatrices
./llama-imatrix --in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat -o imatrix-combined.dat
```

```bash
# skip the first 5 chunks, save intermediate results every 20 chunks and snapshots every 50, parsing special tokens
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special
```

```bash
# analyse an imatrix file and display summary statistics instead of running inference
./llama-imatrix --in-file imatrix.dat --show-statistics
```
`--show-statistics` will display the following statistics:

#### Per tensor

* Σ(Act²): sum of all squared activations (the importance scores)
* Min & Max: minimum and maximum squared activation values
* μ & σ: mean and standard deviation of the squared activations
* % Active: proportion of elements whose average squared activation exceeds a small threshold (1e-5). Helpful to determine how alive/dormant the tensor is during inference
* N: number of squared activations
* Entropy: entropy of the squared activation distribution, in bits (standard Shannon entropy) $S = -\sum_{i=1}^N p_i \log_2 p_i$
* E (norm): normalized entropy, $E_{norm}=\frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. These two metrics can be used to determine how well a prompt "exercises" the model's capabilities
* ZD Score: z-score distribution as described in _3.1 Layer Importance Scores_ of [Layer-Wise Quantization](https://arxiv.org/abs/2406.17415)
* CosSim: cosine similarity with respect to the previous layer's tensor. Useful to determine how similar the squared activations of the current layer are to those of the previous layer
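The per-tensor metrics above can be sketched in a few lines of code. The following is a hypothetical Python illustration, not the tool's actual C++ implementation; the helper name `tensor_statistics` is invented, and the ZD line is just one plausible reading (fraction of z-scores above 1) — see the linked paper for the exact definition.

```python
import math

def tensor_statistics(sq_act, eps=1e-5):
    """Summarize a tensor's squared activations (hypothetical sketch)."""
    n = len(sq_act)
    total = sum(sq_act)
    mean = total / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sq_act) / n)
    # % Active: share of values above a small threshold
    active = sum(1 for x in sq_act if x > eps) / n
    # Shannon entropy (bits) of the distribution obtained by normalizing
    # the squared activations so they sum to 1
    probs = [x / total for x in sq_act if x > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    e_norm = entropy / math.log2(n) if n > 1 else 0.0
    # One plausible ZD reading: fraction of values with z-score > 1
    zd = sum(1 for x in sq_act if sigma > 0 and (x - mean) / sigma > 1.0) / n
    return {"sum": total, "min": min(sq_act), "max": max(sq_act),
            "mean": mean, "std": sigma, "active": active, "n": n,
            "entropy": entropy, "entropy_norm": e_norm, "zd": zd}
```

A perfectly uniform tensor of four equal values yields the maximum entropy of 2 bits and a normalized entropy of 1.0, matching the intuition that E (norm) near 1 means the importance is spread evenly across elements.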
||
#### Per layer | ||
|
||
Weighted averages of Σ(Act²), ZD Score and CosSim are also calculated. | ||
|
||
#### Important note on the computed Statistics | ||
|
||
When using these statistics, please note that they are computed on the squared activations, **not on the actual (raw) activations**. | ||
Whilst the results are still useful, they're less accurate than using the raw values, and in the case of the cosine similarity, could be misleading if the tensor contains opposite vectors. | ||
This limitation is due to the current implementation of the importance matrix, but a pull request ([use GGUF to store importance matrices](https://github.com/ggml-org/llama.cpp/pull/9400)) aims to address this. | ||
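The caveat about opposite vectors is easy to demonstrate: squaring discards sign, so two vectors pointing in exactly opposite directions become indistinguishable. A small illustrative snippet (the `cosine` helper is ad hoc, not part of the tool):

```python
import math

def cosine(u, v):
    # plain cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

v = [1.0, -2.0, 3.0]
w = [-1.0, 2.0, -3.0]            # exactly opposite direction
print(cosine(v, w))              # -1.0: raw activations fully disagree
sq_v = [x * x for x in v]
sq_w = [x * x for x in w]
print(cosine(sq_v, sq_w))        # 1.0: squared values look identical
```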
@EAddario
Note that #9400 does not change what is stored, only how it's stored. It's still the sums of squared activations.
It will make it easier to store other things, but that was not changed for now.