Granite Four #13550
Conversation
This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads. * llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators * llama : make recurrent state slot allocation contiguous * llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running hellaswag on thousands of sequences with very small 100k-parameter Mamba models.
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed. * ggml : make ggml_ssm_scan not modify its source tensors * llama : fix shared recurrent tail cell count for small ubatch sizes Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors The implementation already supported it, and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway) Still, I'm open to better suggestions.
* origin/master: llama : remove llm_graph_input_one (ggml-org#14603) Signed-off-by: Gabe Goodhart <[email protected]>
I've removed the virtual inheritance now and collapsed
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
This matches how recurrent vs attention heads are identified for Jamba. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Can merge after @compilade approves.
ml.get_key(LLM_KV_ROPE_SCALING_FINETUNED, rope_finetuned, false);
hparams.rope_finetuned = rope_finetuned;

// A layer is recurrent IFF the n_head_kv value is set to 0
Suggested change:
// A layer is recurrent IFF the n_head_kv value is set to 0
// A layer is recurrent IF the n_head_kv value is set to 0
I actually meant IFF as in if and only if. Happy to change it if that's too obscure though
Heh, never heard of that abbreviation, one lives and learns... :)
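For anyone skimming this thread, here is a minimal sketch (Python, with a made-up layer layout; not code from this PR) of the convention under discussion: a layer is recurrent if and only if its per-layer KV head count is 0, which is how the head_count_kv array distinguishes SSM layers from attention layers.

```python
# Illustrative sketch only -- the layer layout below is invented for the example.
def recurrent_layer_mask(head_count_kv_per_layer: list[int]) -> list[bool]:
    """True for layers treated as recurrent (SSM), False for attention layers."""
    return [n_kv == 0 for n_kv in head_count_kv_per_layer]

# 8 hypothetical layers where only layers 3 and 7 are attention (4 KV heads each)
head_count_kv = [0, 0, 0, 4, 0, 0, 0, 4]
print(recurrent_layer_mask(head_count_kv))
# [True, True, True, False, True, True, True, False]
```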
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: cmake : do not search for curl libraries by ourselves (ggml-org#14613) SYCL: Initial set_rows kernel implementation (ggml-org#14562) llama : minor coding style fix for smollm3 (ggml-org#14605) cmake : bump llguidance version to v1.0.1 (ggml-org#14609) cmake : llguidance build parser library only (ggml-org#14608) cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602) Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: Smoldocling support (ggml-org#14597) Docs: script to auto-generate ggml operations docs (ggml-org#14598)
@compilade You have the final say. :)
        ]

        return super().modify_tensors(data_torch, name, bid)


@ModelBase.register("GraniteMoeHybridForCausalLM", "BambaForCausalLM")
class GraniteHybridModel(Mamba2Model, GraniteMoeModel):
Multiple inheritance in Python works by using methods from the first class in the resolution order where they're present (at least according to https://stackoverflow.com/questions/3277367/how-does-pythons-super-work-with-multiple-inheritance).
In this case, it means methods from Mamba2Model will be used for mostly everything, and GraniteMoeModel will be used for its prepare_tensors override (from LlamaModel somewhere in its parent hierarchy), unless I'm misunderstanding the order.
The resolution order seems to be
$ python3
>>> import convert_hf_to_gguf
>>> convert_hf_to_gguf.GraniteHybridModel.__mro__
(<class 'convert_hf_to_gguf.GraniteHybridModel'>, <class 'convert_hf_to_gguf.Mamba2Model'>, <class 'convert_hf_to_gguf.GraniteMoeModel'>, <class 'convert_hf_to_gguf.GraniteModel'>, <class 'convert_hf_to_gguf.LlamaModel'>, <class 'convert_hf_to_gguf.TextModel'>, <class 'convert_hf_to_gguf.ModelBase'>, <class 'object'>)
(Noting this here, because I had to check how that works, not because there's a problem).
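A tiny standalone illustration of that lookup order (hypothetical classes, not the converter code): the first base listed wins for any method both parents define, and the second base is only consulted for methods the first one lacks.

```python
# Standalone MRO illustration -- A/B/C are hypothetical, not classes from this PR.
class A:
    def greet(self):
        return "A.greet"

class B:
    def greet(self):
        return "B.greet"

    def only_b(self):
        return "B.only_b"

class C(A, B):  # analogous to GraniteHybridModel(Mamba2Model, GraniteMoeModel)
    pass

c = C()
print(c.greet())   # A.greet  -- A comes first in the MRO
print(c.only_b())  # B.only_b -- falls back to B when A lacks the method
print([cls.__name__ for cls in C.__mro__])  # ['C', 'A', 'B', 'object']
```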
Yep, that's right. I like the suggestions below to be more explicit.
convert_hf_to_gguf.py (Outdated)
        return [(self.map_tensor_name(name), data_torch)]

    def set_gguf_parameters(self):
        GraniteMoeModel.set_gguf_parameters(self)
If all the key-values are overwritten below (which might not be the case, I did not verify), then it could be simpler to not call the parent set_gguf_parameters.
There's at least one part of GraniteMoeModel that should be kept, so I'm inclined to keep this as is.
The gist is to be explicit about which base class is being used with the multiple inheritance setup. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
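As an aside, a minimal sketch of that pattern (hypothetical classes, not the converter code): calling a specific parent's method by name instead of relying on super(), so it is unambiguous which base contributes which parameters.

```python
# Sketch of being explicit about the base class under multiple inheritance.
# MambaLike/GraniteLike/HybridLike are invented for illustration only.
class MambaLike:
    def set_params(self, out: dict):
        out["ssm_state_size"] = 128

class GraniteLike:
    def set_params(self, out: dict):
        out["rope_theta"] = 10000.0

class HybridLike(MambaLike, GraniteLike):
    def set_params(self, out: dict):
        # Explicit call: we want GraniteLike's parameters here, even though
        # MambaLike comes first in the MRO and would win with super().
        GraniteLike.set_params(self, out)
        # ...then layer the SSM-specific parameters on top.
        out["ssm_conv_kernel"] = 4

params: dict = {}
HybridLike().set_params(params)
print(params)  # {'rope_theta': 10000.0, 'ssm_conv_kernel': 4}
```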
@compilade thanks for catching my slop! 🏓 back to you (fixed all except the question about
gguf-py/gguf/gguf_writer.py (Outdated)
if any(key in kv_data for kv_data in self.kv_data):
    # Warn about duplicate keys if they differ by value or type
    if any(
        (
            key in kv_data
            and (kv_data[key].value != val or kv_data[key].type != vtype)
        )
        for kv_data in self.kv_data
    ):
        logger.warning(f'Duplicated key name {key!r}, overwriting it with new value {val!r} of type {vtype.name}')
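To spell out the intended behavior of that guard (a standalone toy, not the real GGUFWriter internals): re-adding a key with an identical value and type stays silent, while a conflicting re-add still warns before overwriting.

```python
# Toy version of the duplicate-key policy discussed above (not the gguf-py API).
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("gguf.sketch")

def add_key_value(kv_data: dict, key: str, val, vtype: str) -> None:
    if key in kv_data and kv_data[key] != (val, vtype):
        logger.warning("Duplicated key name %r, overwriting it with new value %r of type %s", key, val, vtype)
    kv_data[key] = (val, vtype)

kv: dict = {}
add_key_value(kv, "granitehybrid.block_count", 40, "UINT32")
add_key_value(kv, "granitehybrid.block_count", 40, "UINT32")  # identical: silent
add_key_value(kv, "granitehybrid.block_count", 48, "UINT32")  # conflicting: warns
```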
My concern with this is how this would hide redundant overrides (meaning they might not get noticed to be removed).
In a way, the warnings kind of encourage making set_gguf_parameters easier to follow (so that the overrides don't happen). I could be wrong, though.
For example, these seem to be the duplicate keys for which warnings are hidden by this section:
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 2df43ba11..28188b43d 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -6550,13 +6550,6 @@ class GraniteHybridModel(Mamba2Model, GraniteMoeModel):
def set_gguf_parameters(self):
GraniteMoeModel.set_gguf_parameters(self)
- ## General Params ##
- self.gguf_writer.add_embedding_length(self.d_model)
- self.gguf_writer.add_block_count(self.block_count)
- self.gguf_writer.add_context_length(self.hparams.get("max_position_embeddings", 0))
- self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
- self.gguf_writer.add_feed_forward_length(self.hparams["intermediate_size"])
-
## Mamba mixer params ##
self.gguf_writer.add_ssm_conv_kernel(self.find_hparam(["conv_kernel", "d_conv"]))
self.gguf_writer.add_ssm_state_size(self.find_hparam(["state_size", "d_state"]))
@@ -6573,14 +6566,8 @@ class GraniteHybridModel(Mamba2Model, GraniteMoeModel):
]
if rope_dim := self.hparams.get("attn_rotary_emb"):
self.gguf_writer.add_rope_dimension_count(rope_dim)
- self.gguf_writer.add_head_count(self.hparams["num_attention_heads"])
self.gguf_writer.add_head_count_kv(head_count_kv_vec)
- ## Feed Forward Params ##
- self.gguf_writer.add_layer_norm_rms_eps(
- self.find_hparam(["layer_norm_epsilon", "rms_norm_eps"], optional=True) or 1e-5
- )
-
## If Bamba, use rope, otherwise don't
use_rope = "BambaForCausalLM" in self.hparams["architectures"]
self.gguf_writer.add_rope_scaling_finetuned(use_rope)
Yeah, that's a good point. I think, given that we've moved pretty significantly away from the complex inheritance on the c++ side, it might make sense to see about doing something similar here, since the multiple inheritance is definitely causing confusing code here as well. Let me take a look at how to simplify this.
This part isn't really made confusing by multiple inheritance, but by the long (but linear) family tree of GraniteMoeModel(GraniteModel(LlamaModel(TextModel))) for the inheritance of the set_gguf_parameters method.
It's a double-edged sword: since we are getting more layers of inheritance, it's easy to add duplicates below without noticing, and then we end up adding lots of noise above about something that isn't really an issue. Noise leads to complacency.
Yeah, agreed. I guess the real question is whether we want to support using parents directly for gguf params in a multiple-inheritance situation. If it's single inheritance, there should be no reason to overwrite what a parent did. It's possible that a child uses a different key for the same value as a parent, but that would cause the parent's lookup to not find the key and the child's lookup would have a different value (which I think is actually happening here).
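A contrived sketch of that situation (hypothetical classes and keys, not the actual converter): the parent reads one hparam name and the child reads another for the same GGUF field, so the same key ends up written twice, possibly with different values.

```python
# Contrived example of a parent and child writing the same GGUF key from
# different source hparams -- classes and keys invented for illustration.
class ParentConverter:
    def __init__(self, hparams: dict):
        self.hparams = hparams
        self.written: dict = {}

    def set_gguf_parameters(self):
        # Parent looks for "layer_norm_epsilon", falling back to a default.
        self.written["attention.layer_norm_rms_epsilon"] = self.hparams.get("layer_norm_epsilon", 1e-5)

class ChildConverter(ParentConverter):
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # Child reads "rms_norm_eps" -- same GGUF key, different source field.
        self.written["attention.layer_norm_rms_epsilon"] = self.hparams["rms_norm_eps"]

c = ChildConverter({"rms_norm_eps": 1e-6})
c.set_gguf_parameters()
print(c.written)  # the parent's 1e-5 default is silently overwritten with 1e-6
```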
Heh, very good point! Scope creep for sure
Ok removed. That leaves a few known override warnings:
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.embedding_length', overwriting it with new value 1536 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.block_count', overwriting it with new value 40 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.vocab_size', overwriting it with new value 49160 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.feed_forward_length', overwriting it with new value 512 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.attention.head_count', overwriting it with new value 12 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.attention.head_count_kv', overwriting it with new value [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0] of type ARRAY
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.attention.layer_norm_rms_epsilon', overwriting it with new value 1e-05 of type FLOAT32
WARNING:gguf.gguf_writer:Duplicated key name 'granitehybrid.context_length', overwriting it with new value 1048576 of type UINT32
I think this can be resolved here with a comment in the converter and the warnings reinstated?
Ok removed. That leaves a few known override warnings
All of these (except the head_count_kv one (and the context length, I forgot that)) can be avoided by applying the patch from #13550 (comment).
The source hparams fields seem to be mostly the same both times they are set (small difference with layer_norm_epsilon, but both times the actually-used field is rms_norm_eps (for granite-4.0-tiny-random at least)).
If you think it's clearer to keep it this way, this is fine with me too.
🤦 Nope, you're totally right. I'm concurrently trying to update my draft of bumping llama.cpp in ollama and multitasking poorly. I'll push with those removed.
…alue After further discussion, this encourages sloppy overwriting in the model converters Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Francis Couture-Harpin <[email protected]> (thanks for the sharp eyes and patience!) Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Ok, thanks for the patience @compilade, I think it should be good-to-go once CI is green!
@compilade @ggerganov @CISC Thank you so much for all the help and discussion in getting this merged! It's really great to have it officially in, and even more so for all the other great hybrid recurrent models that have come in during the process.
Description
This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:
Additionally, this PR replaces some work done on other PRs / branches:
- Bamba support: Bamba architecture #10810
- Bamba support: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
- Granite 4.0 support: https://github.com/gabe-l-hart/llama.cpp/tree/GraniteFourDraft
  - Like the Bamba work, this will also be abandoned in favor of this PR
- Jamba: llama : support Jamba hybrid Transformer-Mamba models #7531
  - […] master. I had originally planned to include Jamba support in this branch, but on further inspection, it looks like the Jamba architecture has some additional bells-and-whistles (eg sliding-window-attention) that would need further work, so my plan is to leave Jamba off for now and possibly tackle it later (hopefully it's much easier than the original branch!)
Outstanding Questions
Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:
- There are some changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
- Is there a better way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right.
- I'm seeing differences for Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine if this is due to changes in the attention implementation (ie "working as expected") or a bug somewhere.
- The use of dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph?
- The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
Testing
To test out this branch, I've been using the following models:
- granite-4.0-tiny-preview: https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
- Bamba-9B-v1: https://huggingface.co/ibm-ai-platform/Bamba-9B-v1
- mamba2-370m-hf: https://huggingface.co/AntonV/mamba2-370m-hf
Details
This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:
python side
- Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
- Add logic in gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
- Add a HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?
c++ side
- Add llama_model_is_hybrid akin to llama_model_is_recurrent
- Split llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
- Add hparams.recurrent_layer_arr and support parsing it
- […] hparams.n_embd_k_s / hparams.n_embd_v_s […] 0. This should be fine since none of those places interact with the hybrid caching.
- Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
- Add bamba and granitemoeshared in llama-arch.* (the boring part!)
- Pass hparams as an additional argument to the llama_model.create_memory method
- In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches.
- […] llama-model.cpp
- Add bamba and granitemoehybrid in llama-model
- Split build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static