imatrix : use GGUF to store importance matrices #9400
Conversation
* perplexity : simplify filling the batch

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <[email protected]>
I'm setting this to "draft", because of concerns by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment), mostly related to the fact that GGUF is harder to parse than the previous format. More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread). I'll need some days to think about how to go further with this.
@compilade This is a good change and I think it would be useful to bring it to completion. In the future, we can extend it further.
This should make comparisons between the formats easier, because it matches the behavior of the previous version.
tools/imatrix/imatrix.cpp
```cpp
static bool str_remove_suffix(std::string & str, const std::string & suffix) {
    bool has_suffix = str_has_suffix(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}
```
Suggested change:

```cpp
static bool str_remove_suffix(std::string & str, const std::string_view & suffix) {
    bool has_suffix = string_ends_with(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}
```
This is a nice complement to string_ends_with, and should be moved to common.cpp as string_remove_suffix.
Shouldn't std::string_view be passed by value instead of by const reference in these functions? For example, https://quuxplusone.github.io/blog/2021/11/09/pass-string-view-by-value/ suggests that std::string_view should always be passed by value.
I did wonder about that, but didn't know enough of the details behind it; reading that, it certainly seems so, and it would be worth going over the usage of std::string_view elsewhere.
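For illustration, a minimal sketch (not the code in this PR) of what the common.cpp helper could look like with std::string_view taken by value, plus a hypothetical use for detecting a .gguf output name:

```cpp
#include <string>
#include <string_view>

// assumed stand-in for the existing common helper, repeated here for self-containedness
static bool string_ends_with(std::string_view str, std::string_view suffix) {
    return str.size() >= suffix.size() &&
           str.compare(str.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// suffix passed by value, as suggested in the linked article
static bool string_remove_suffix(std::string & str, std::string_view suffix) {
    const bool has_suffix = string_ends_with(str, suffix);
    if (has_suffix) {
        str.resize(str.size() - suffix.size());
    }
    return has_suffix;
}

// hypothetical usage: decide the output format from the file name
// std::string fname = "imatrix.gguf";
// const bool to_gguf = string_remove_suffix(fname, ".gguf"); // true, fname == "imatrix"
```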
Co-authored-by: Sigbjørn Skjæret <[email protected]>
LGTM, I did some light testing with old and new formats, converting them back and forth, checking that quantization was unchanged, etc., all seems good.
The only slightly annoying thing is the garbage output to console (by gguf_init_from_file) when using the old format (resolved in #14381).
Will probably benefit from testing by more people before merging, @bartowski1182 @danielhanchen ?
Thank you for working on this. I've been thinking that storing imatrix as gguf would be nice for investigating the use of gradients instead of activations.
I guess unless there are any objections from @bartowski1182 or @danielhanchen, we can merge this after the weekend?
Objections? I've been looking forward to this change for months haha
Oh this looks fantastic, great work! Re using bigger batch sizes - does this mean that, if memory allows, imatrix should in fact be faster to process via PP? I'll try this out over the month, but maybe in the meantime I'll temporarily use the legacy format - overall this change is very welcome! Great work @compilade!
Currently (on master), chunks and (u)batches are tied together in llama-imatrix. This PR makes it possible to process multiple chunks in a single ubatch, or use multiple ubatches per chunk; they are not tied together anymore. It also technically allows variable chunk sizes (which will be useful to eventually implement proper chat dataset handling). Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway). Not sure how the GPU backends would handle that situation (with very large models), although when it does fit, this still allows benefiting from pipeline parallelism.
You'll still get most of the advantages described above even with the old format; both are saved from the same internal data (which was changed to better fit the new format). The main benefits of the new GGUF-based format show up when merging imatrix files and when dealing with partial data. While a round-trip conversion is possible, the legacy format contains insufficient shape and counts information for correctly merging imatrix files.

Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when e.g. 95% of the experts were solicited), while the new format keeps it, and the partial data is handled at quantization time. Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE imatrix data.)
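As a rough sketch of the merging point (hypothetical names, not the PR's actual code): because the new format stores raw sums and per-row counts, merging is just element-wise accumulation, and the mean used by quantization is only formed at the end, so rows (e.g. experts) seen a different number of times in each file keep their correct weight:

```cpp
#include <cstddef>
#include <vector>

// hypothetical in-memory view of one tensor's imatrix data
struct imatrix_entry {
    std::vector<float> in_sum2; // per-channel sums of squared activations, size ne1*ne0
    std::vector<float> counts;  // per-row token counts, size ne1
};

// merging two files: accumulate sums and counts, never the means
// (assumes both entries describe the same tensor shape)
static void imatrix_merge_into(imatrix_entry & dst, const imatrix_entry & src) {
    for (size_t i = 0; i < dst.in_sum2.size(); ++i) { dst.in_sum2[i] += src.in_sum2[i]; }
    for (size_t j = 0; j < dst.counts.size();  ++j) { dst.counts[j]  += src.counts[j];  }
}

// the mean squared activation for channel i of row j is then
//   dst.in_sum2[j*ne0 + i] / dst.counts[j]   (only when dst.counts[j] > 0)
```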
```cpp
const float count = ((const float *) counts->data)[j];
if (count > 0.0f) {
    for (int64_t i = 0; i < ne0; ++i) {
        e[j*ne0 + i] = ((const float *) sums->data)[j*ne0 + i] / count;
    }
}
```
The last paragraphs of my previous comment (in #9400 (comment)) made me think of #12557 (comment) by @jukofyork, which suggested the imatrix weights calculated from the dataset(s) should gradually replace the prior of "equal weights" (aka using 1 for the imatrix weights).

Now I wonder if always using an extra 1 in the sum of squared activations would be sufficient to make the formula act closer to a regulariser (I might be misusing the term), at least near the lower extreme.

Not sure what relative weight the prior should have, though. Maybe (1.0f)/(1.0f) would be too little, since it becomes 1/512 after only one chunk.

(Obviously, this is out of scope for this PR (and will not be included here), but it's still somewhat related, because the new format stores per-matrix token counts instead of per-tensor chunk counts, which would make the above possible sanely for stacked MoE tensors (which may not be used equally even within a chunk).)
E.g.

```cpp
// before
e[j*ne0 + i] = ((const float *) sums->data)[j*ne0 + i] / count;
// after
e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + 1.0f) / (count + 1.0f);
```
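Written out (loose notation, with $x_{t,i}$ the activation of channel $i$ at token $t$ and $n_j$ the token count for row $j$), the suggested change amounts to adding one pseudo-observation:

$$e_{j,i} = \frac{1 + \sum_{t=1}^{n_j} x_{t,i}^2}{1 + n_j}$$

so the prior's relative weight is $1/(n_j + 1)$, which already shrinks to roughly $1/500$ after a single 512-token chunk.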
> The last paragraphs of my previous comment (in #9400 (comment)) made me think of #12557 (comment) by @jukofyork which suggested the imatrix weights calculated from the dataset(s) should gradually replace the prior of "equal weights" (aka using 1 for the imatrix weights).
>
> Now I wonder if always using an extra 1 in the sum of squared activations would be sufficient to make the formula act closer to a regulariser (I might be misusing the term), at least near the lower extreme.
This idea actually goes right back to the roots of Bayesianism:
https://en.wikipedia.org/wiki/Sunrise_problem
or more formally/recently:
https://en.wikipedia.org/wiki/Rule_of_succession
This sort of "pseudocount" was used extensively in the NLP literature in the pre-LLM days:
https://en.wikipedia.org/wiki/Additive_smoothing
(as often you ended up with very small samples in some buckets when performing n-gram analysis, etc)
More formally, it's the formula for the posterior predictive of a Bernoulli likelihood, see the first row of the first table here:
https://en.wikipedia.org/wiki/Conjugate_prior
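Concretely, that first table row (Beta($\alpha, \beta$) prior with a Bernoulli likelihood) gives the posterior predictive

$$P(x_{n+1} = 1 \mid s \text{ successes in } n \text{ trials}) = \frac{s + \alpha}{n + \alpha + \beta}$$

which with $\alpha = \beta = 1$ is Laplace's rule of succession $(s+1)/(n+2)$, i.e. additive smoothing with a pseudocount of 1 per outcome.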
> Not sure what relative weight the prior should have, though. Maybe (1.0f)/(1.0f) would be too little, since it becomes 1/512 after only one chunk.
Since the imatrix values are non-negative weights centered around 1, a log-normal prior (https://en.wikipedia.org/wiki/Log-normal_distribution) would likely be the best place to start (and it is very easy to deal with / program up, as it's just the Normal posterior predictive formula [which has a nice closed-form solution] applied to log-transformed data - the Wikipedia page above links to a paper that shows this).
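As a minimal sketch of that update (assuming, for simplicity, a known variance $\sigma^2$ for the log-weights): with $y_t = \log w_t$ and a Normal($\mu_0, \sigma_0^2$) prior on the mean of $y$, the posterior mean is

$$\mu_n = \frac{\mu_0/\sigma_0^2 + \sum_{t=1}^{n} y_t/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}$$

and $\exp(\mu_n)$ is the corresponding (median) estimate back on the weight scale; the unknown-variance case swaps in the full Normal posterior predictive.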
Using a larger batch size will also help on GPU backends for models that don't fit in VRAM, since it reduces the number of times that the weights have to be copied to VRAM. However, usage of the eval callback prevents taking advantage of pipeline parallelism, since after every matrix multiplication there is a full synchronization to copy the results of the operation to the CPU.
Thanks a lot for creating this amazing new imatrix file format and generally improving imatrix computation by a lot. I'm very excited that partial data caused by missing expert activations is now handled properly, thanks to the new file format. One of the most impactful changes of this PR seems to be imatrix support for 3D tensors. This finally allows generating imatrix quants for models using MLA such as DeepSeek (V2, V2-Lite, V2.5, V3, R1), MiniCPM3-4B, PLM-1.8B, KwaiCoder-DS-V2-Lite, hpc-coder-v2-6b, and whale-v3-base-marged without the "Fix imatrix calculation for MLA models" patch. Surprisingly, this change wasn't even mentioned in the PR description.
No, there is not currently any practical use case for 4D tensors, nor do I think there will ever be one. The most dimensions currently required is 3, for the MLA tensors.
The solution in @nicoboss's fork was inspired by ikawrakow/ik_llama.cpp#202, which does mention this concern (and to me seems to agree with the approach taken here).
Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

There are multiple problems with imatrix which this is addressing:

* Dependence on unordered_map iteration order (makes sha256sum useless to compare imatrix files made on the same dataset)
* Dependence on -ub (intermediate saves happen waaay too often)

Summary of changes

* Use GGUF to store imatrix data.
  * general.type is imatrix
  * general.architecture is not included in the imatrix files.
  * Store *.in_sum2 and *.counts for each tensor with imatrix data (see the inspection sketch after the TODO list).
    * *.in_sum2 are the per-channel sums of squared activations, stored in F32, like before.
    * *.counts are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which is used by llama-quantize).
      * Allows merging imatrix files together with --in-file.
      * Stored in F32 even though it's integer values, because when calculating the mean it would be converted to F32 anyway to perform the division.
* Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf.
  * Old files still work (llama-quantize can still read the old format (with a warning)) or can be converted with llama-imatrix directly (when the output file has the .gguf suffix).
* Like llama-perplexity since perplexity : support using multiple sequences to allow larger batch sizes #5946, allow computing multiple chunks per batch with llama-imatrix.
* Use fused-multiply-add (with std::fma) when accumulating the sums of activations.
  * (different results than llama-imatrix from master)
  * (f64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing only one line in IMatrixStats: vector<float> values to vector<double> values.)
* Sort the tensor names before writing, instead of depending on unordered_map iteration order.
  * sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.

TODO

* Compare old llama-quantize with old imatrix.dat vs new llama-quantize using converted imatrix.gguf, with sha256sum.
* Test llama-imatrix at different batch sizes, e.g. -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512).
* Compare results from the old llama-imatrix vs the new llama-imatrix.
* Test --in-file with llama-imatrix, with both .imatrix and .gguf imatrix files (for round-trip conversions).
* Decide on the general.architecture exclusion. Currently making self.add_architecture() a no-op, but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* (in GGUFWriter) which expect self.arch to be something.
* Should convert_legacy_imatrix_to_gguf.py be kept, since llama-imatrix can already convert old imatrix.dat files?
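To make the layout described above concrete, here is a small, hedged sketch (not part of this PR) of how such a file could be inspected with ggml's existing gguf API; the general.type value and the *.in_sum2/*.counts naming are taken from the description above:

```cpp
#include "gguf.h"   // ggml's GGUF reader (exposed via ggml.h in older versions)
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s imatrix.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (ctx == nullptr) {
        return 1;
    }

    // the file type is stored as metadata; for these files it should be "imatrix"
    const int64_t type_key = gguf_find_key(ctx, "general.type");
    if (type_key >= 0) {
        printf("general.type = %s\n", gguf_get_val_str(ctx, type_key));
    }

    // each tensor with imatrix data contributes a *.in_sum2 and a *.counts entry
    for (int64_t i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        printf("%s\n", gguf_get_tensor_name(ctx, i));
    }

    gguf_free(ctx);
    return 0;
}
```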