Granite Four #13550

Open
wants to merge 167 commits into
base: master

gabe-l-hart
Contributor

@gabe-l-hart gabe-l-hart commented May 14, 2025

Description

This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:

Additionally, this PR replaces some work done on other PRs / branches:

Outstanding Questions

Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:

  • This PR contains several changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
  • Is there a more efficient way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right.
  • There are still some numerical differences between the attention outputs when running Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine if this is due to changes in the attention implementation (i.e. "working as expected") or a bug somewhere.
  • The use of dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph?
  • The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
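
To make the last question concrete, here is a minimal sketch of choosing the cache type from architecture-level predicates instead of a separate switch. Every name in it (arch_sketch, cache_kind, arch_is_recurrent, arch_is_hybrid, model_sketch, select_cache_kind) is a hypothetical stand-in for illustration, not the real llm_arch_is_* / llama_model_is_* code.

```cpp
// Hypothetical architecture enum; the real llm_arch enum has many more entries.
enum arch_sketch { ARCH_LLAMA, ARCH_MAMBA, ARCH_GRANITE_HYBRID };

// Hypothetical tag for which kind of cache to allocate.
enum class cache_kind { unified, recurrent, hybrid };

// Architecture-level predicates: usable before a model object exists.
bool arch_is_recurrent(arch_sketch a) { return a == ARCH_MAMBA; }
bool arch_is_hybrid   (arch_sketch a) { return a == ARCH_GRANITE_HYBRID; }

// Model-level predicates simply delegate to the architecture-level ones.
struct model_sketch { arch_sketch arch; };
bool model_is_recurrent(const model_sketch & m) { return arch_is_recurrent(m.arch); }
bool model_is_hybrid   (const model_sketch & m) { return arch_is_hybrid(m.arch); }

// Cache selection reuses the predicates, so a new recurrent or hybrid
// architecture only needs to be registered in one place.
cache_kind select_cache_kind(arch_sketch a) {
    if (arch_is_hybrid(a))    return cache_kind::hybrid;
    if (arch_is_recurrent(a)) return cache_kind::recurrent;
    return cache_kind::unified;
}
```

With this shape, the only per-architecture knowledge lives in the two arch_is_* functions, which is the duplication the question is asking about.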

Testing

To test out this branch, I've been using the following models:

Details

This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:

python side

  • Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
    • This includes one small tweak to gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
    • This also adds the new HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?
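
The duplicate-key rule described above can be sketched as follows. The actual change is in the Python gguf_writer.py; this restates only the rule in C++ for illustration, and kv_value and kv_writer are hypothetical types, not part of gguf-py or llama.cpp.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical value type: the active alternative plays the role of the value's type tag.
using kv_value = std::variant<std::int64_t, double, bool, std::string>;

struct kv_writer {
    std::map<std::string, kv_value> kv;

    // A repeated key is accepted if (and only if) it matches the stored entry
    // in both type and value; any other duplicate is rejected.
    void add_key_value(const std::string & key, const kv_value & val) {
        auto it = kv.find(key);
        if (it == kv.end()) {
            kv.emplace(key, val);
            return;
        }
        // std::variant's operator== is true only when both hold the same
        // alternative (same type) and the held values compare equal.
        if (it->second == val) {
            return; // harmless duplicate, e.g. written once per parent converter
        }
        throw std::invalid_argument("duplicate key with different value or type: " + key);
    }
};
```

Writing the same key twice with an identical value is a no-op, while writing it with a different value or type still fails loudly.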

c++ side

  • Add a new public API function llama_model_is_hybrid akin to llama_model_is_recurrent
    • I also split up both this function and llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
  • Add hparams.recurrent_layer_arr and support parsing it
    • The current implementation pre-allocates it as a fixed-length array which doesn't feel quite right.
  • Add an optional layer id to hparams.n_embd_k_s / hparams.n_embd_v_s
    • This is done because for hybrid models, the values may be different by layer.
    • I plumbed through as many usages of these methods as I could find to properly pass the layer index, but there are some places where it's not available which default to layer 0. This should be fine since none of those places interact with the hybrid caching.
  • Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
  • Model name/param/arch plumbing for bamba and granitemoeshared in llama-arch.* (the boring part!)
  • (possibly breaking) Add hparams as an additional argument to the llama_model.create_memory method
    • This is done so the hparams can be given to the cache construction and used to determine which layers are recurrent for hybrid cache creation
  • In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches.
  • Add support for instantiating the hybrid cache in llama-model.cpp
  • Add model support for bamba and granitemoehybrid in llama-model
    • Most of this is "business as usual," but that breaks down when trying to avoid code duplication for the hybrid architecture
    • To avoid code duplication, I hoisted build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
    • I tried an alternative route using diamond inheritance, but this would have required some kind of "don't actually initialize the graph" switch in the parent model builders' constructors to avoid trying to build the parent model graphs during initialization of the hybrid class.
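
As a rough illustration of the get_recurrent_cache / get_unified_cache idea, the sketch below resolves the right cache from either a plain or a hybrid cache with dynamic_cast. The class names (memory_i, unified_cache, recurrent_cache, hybrid_cache) and the free-function form are assumptions made for this example; the real methods live in llama-graph and may look different.

```cpp
// Hypothetical stand-ins for the cache class hierarchy.
struct memory_i {
    virtual ~memory_i() = default; // polymorphic base, required for dynamic_cast
};
struct unified_cache   : memory_i {};
struct recurrent_cache : memory_i {};

// A hybrid cache owns one child cache of each kind.
struct hybrid_cache : memory_i {
    unified_cache   attn;
    recurrent_cache recr;
};

// Returns the recurrent cache whether `mem` is itself recurrent or a hybrid
// wrapper; returns nullptr when no recurrent cache is available.
recurrent_cache * get_recurrent_cache(memory_i * mem) {
    if (auto * r = dynamic_cast<recurrent_cache *>(mem)) return r;
    if (auto * h = dynamic_cast<hybrid_cache *>(mem))    return &h->recr;
    return nullptr;
}

// Same idea for the unified (attention) cache.
unified_cache * get_unified_cache(memory_i * mem) {
    if (auto * u = dynamic_cast<unified_cache *>(mem)) return u;
    if (auto * h = dynamic_cast<hybrid_cache *>(mem))  return &h->attn;
    return nullptr;
}
```

Each lookup costs at most two dynamic_cast calls, which is the overhead the outstanding question above refers to.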

compilade added 30 commits April 3, 2024 20:47
This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
This removes the need for ggml_ssm_conv!!!
But performance seems slightly worse on my system,
especially for prompt processing.
Maybe ggml_mul_mat isn't optimized for small row sizes?
More performance testing is necessary until GGML_OP_SSM_CONV is removed.

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong.
I was renaming the functions anyway to generalize kv-cache-related
functions to hybrid and recurrent model architectures.
I think llama_past is a better name than llama_cache for a combined
kv cache and recurrent state cache, because the states it contains
pretty much always come before the newly-added ones for any particular
sequence. Also 'llama_past_clear' sounds more obvious in what it does
than 'llama_kv_cache_clear'. The future is what the models generate.
(For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.
gabe-l-hart and others added 2 commits July 9, 2025 07:57
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Sigbjørn Skjæret <[email protected]>
compilade and others added 7 commits July 9, 2025 10:06
…o GraniteFour

* origin/compilade/refactor-kv-cache:
memory : avoid referring to KV in recurrent cache logs
model : make falcon-h1 use shared mamba2 layer builder

Signed-off-by: Gabe Goodhart <[email protected]>
Some of the tensor names are common with Llama4
…o GraniteFour

* origin/compilade/refactor-kv-cache:
gguf-py : avoid adding duplicate tensor mappings for Jamba
* origin/master:
llama : support Jamba hybrid Transformer-Mamba models (ggml-org#7531)
ggml : add ggml_scale_bias (ggml-org#14417)
@gabe-l-hart gabe-l-hart marked this pull request as ready for review July 9, 2025 19:09
@gabe-l-hart
Contributor Author

@compilade @ggerganov @CISC I think this one is ready to go now!

The main outstanding questions I have with the changes here are:

  1. Since GRANITE_MOE_HYBRID is two characters longer than the longest previous enum key, all the vertical alignment changed in constants.py and llama-arch.cpp. Would it be better to not do this and just have one that doesn't vertically align to avoid the bigger diff?
  2. How do you feel about the use of virtual inheritance to implement the diamond pattern with llm_build_granite_hybrid -> (llm_graph_context_mamba, llm_graph_context_granite) -> llm_graph_context?

If we prefer to focus on composition over inheritance, we could move to a has-a relationship and remove llm_graph_context from the inheritance tree of the llm_graph_context_mamba and llm_graph_context_granite classes. Those would then need to be renamed and take a llm_graph_context & (or llm_graph_context *) as a member, then use it in their layer builder methods (e.g. graph_ctx->build_attn(...)). Then, the top-level llm_build_granite_hybrid (and llm_build_mamba / llm_build_jamba / llm_build_falcon_h1) would have member instances instantiated with this.
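
A minimal sketch of that has-a alternative, under heavy simplification: graph_context, mamba_layer_builder, granite_layer_builder and build_granite_hybrid are hypothetical stand-ins, and the "graph" is just a list of strings so the control flow is visible.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for llm_graph_context.
struct graph_context {
    std::vector<std::string> ops; // records which blocks were "built"
    void build_attn(int il) { ops.push_back("attn:" + std::to_string(il)); }
    void build_ssm (int il) { ops.push_back("ssm:"  + std::to_string(il)); }
};

// The helpers hold a reference to the context (has-a) instead of inheriting from it.
struct mamba_layer_builder {
    graph_context & ctx;
    void build_mamba2_layer(int il) { ctx.build_ssm(il); }
};
struct granite_layer_builder {
    graph_context & ctx;
    void build_attention_layer(int il) { ctx.build_attn(il); }
};

// Only the top-level builder inherits from the context; its helper members
// are instantiated with *this, so the inheritance chain stays linear.
struct build_granite_hybrid : graph_context {
    mamba_layer_builder   mamba  {*this};
    granite_layer_builder granite{*this};

    explicit build_granite_hybrid(const std::vector<bool> & is_recurrent) {
        for (int il = 0; il < (int) is_recurrent.size(); ++il) {
            if (is_recurrent[il]) mamba.build_mamba2_layer(il);
            else                  granite.build_attention_layer(il);
        }
    }

    // The helpers store references to this object, so copying would dangle.
    build_granite_hybrid(const build_granite_hybrid &) = delete;
    build_granite_hybrid & operator=(const build_granite_hybrid &) = delete;
};
```

The trade-off is that every helper call goes through ctx, which is the explicit member-passing the composition approach would require.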

src/llama-arch.h Outdated
@@ -52,6 +52,7 @@ enum llm_arch {
LLM_ARCH_MAMBA2,
LLM_ARCH_JAMBA,
LLM_ARCH_FALCON_H1,
LLM_ARCH_BAMBA,
Contributor Author

Looking over the code again, I think we could probably collapse LLM_ARCH_BAMBA into LLM_ARCH_GRANITE_MOE_HYBRID

Contributor Author

This might then suggest we change GRANITE_MOE_HYBRID to simply GRANITE_HYBRID (which would have the nice benefit of removing the extra indentation!)

Contributor Author

The only hang-up with collapsing these is how to determine when to use rope. Right now, the only way to tell is by the architecture name (bamba uses it, granitehybrid doesn't). This could be handled with an hparam, but I'm not clear exactly which one to use. I see rope_finetuned, which doesn't appear to actually be used anywhere and seems like it might be a good option, but the default value is false, which is reported as "unknown", so it wouldn't be a perfect fit. I also see n_no_rope_layer_step, which is used for this in llama4, but from what I can tell, there's no corresponding constant for plumbing this through via conversion.

@ggerganov
Member

How do you feel about the use of virtual inheritance to implement the diamond pattern with llm_build_granite_hybrid -> (llm_graph_context_mamba, llm_graph_context_granite) -> llm_graph_context?

I was just looking at that - we should avoid this. I haven't used this pattern and I am not sure I understand how it works.

In general, code de-duplication is not an objective here. Even with composition. It's completely fine to copy paste the model architectures instead of reusing them with inheritance. Pretty much all the architectures in llama-model.cpp could be deduplicated, but we don't do it because it's much easier to understand the flow of the networks.

What we typically do for deduplication is to extract common blocks into llm_graph_context. But usually this is done when we see a good pattern.

The llm_graph_context_mamba and llm_build_rwkv6_base are temporary - I think we should eventually move the code of these helper classes into llm_graph_context, by passing all the necessary model tensors as arguments. This way all architectures will be consistently implemented as child of llm_graph_context. This is a well-defined pattern and it's easy to follow.

@gabe-l-hart
Contributor Author

That makes sense. For now, I think I'll remove the virtual inheritance and inherit only from llm_graph_context_mamba, then duplicate the granite helpers to keep inheritance linear.

I was just looking at that - we should avoid this. I haven't used this pattern and I am not sure I understand how it works.

I also did not understand this at all until I tried to get it working here. I didn't fully trust it until putting together a dummy version: #13550 (comment)
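
For readers unfamiliar with the pattern, here is a small self-contained sketch of how virtual inheritance resolves the diamond. It is not the dummy version linked above; graph_context, context_mamba, context_granite and build_hybrid are simplified hypothetical classes.

```cpp
// Shared base at the top of the diamond.
struct graph_context {
    int n_layer;
    explicit graph_context(int n_layer_) : n_layer(n_layer_) {}
};

// Both intermediate classes inherit virtually, so a class deriving from
// both of them contains exactly one graph_context subobject.
struct context_mamba : virtual graph_context {
    explicit context_mamba(int n) : graph_context(n) {}
};
struct context_granite : virtual graph_context {
    explicit context_granite(int n) : graph_context(n) {}
};

struct build_hybrid : context_mamba, context_granite {
    // The most-derived class initializes the virtual base itself. The
    // graph_context(...) initializers inside the two parents are ignored
    // when a build_hybrid is constructed, so the -1 / -2 never reach it.
    explicit build_hybrid(int n)
        : graph_context(n), context_mamba(-1), context_granite(-2) {}
};
```

After `build_hybrid h(8);`, `h.n_layer` is 8 and is unambiguous, because both parents refer to the same base subobject. The non-obvious rule that the most-derived class owns the base initialization is a large part of why the pattern is hard to follow.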

@CISC
Collaborator

CISC commented Jul 9, 2025

1. Since `GRANITE_MOE_HYBRID` is two characters longer than the longest previous enum key, all the vertical alignment changed in `constants.py` and `llama-arch.cpp`. Would it be better to not do this and just have one that doesn't vertically align to avoid the bigger diff?

It's not an issue, it's easy to filter out by hiding whitespace.

The only key difference is the use of rope which is now set via
rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Per PR discussion, it's simpler to keep this with basic inheritance and not
introduce the complexity of virtual inheritance and multiple inheritance

ggml-org#13550 (comment)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master:
llama : remove llm_graph_input_one (ggml-org#14603)

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Contributor Author

I've removed the virtual inheritance now and collapsed Bamba and GraniteMoeHybrid into simply GraniteHybrid which covers all permutations of hybrid architectures (w/ and w/out the granite multipliers on top of llama) and dense/MoE/MoE+shared.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Member

@ggerganov ggerganov left a comment

Can merge after @compilade approves.

@ggerganov ggerganov requested review from compilade and CISC July 10, 2025 06:14
ml.get_key(LLM_KV_ROPE_SCALING_FINETUNED, rope_finetuned, false);
hparams.rope_finetuned = rope_finetuned;

// A layer is recurrent IFF the n_head_kv value is set to 0
Collaborator

Suggested change
// A layer is recurrent IFF the n_head_kv value is set to 0
// A layer is recurrent IF the n_head_kv value is set to 0

Contributor Author

I actually meant IFF as in if and only if. Happy to change it if that's too obscure though

Collaborator

Heh, never heard of that abbreviation, one lives and learns... :)
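
The if-and-only-if rule from the code comment in this thread can be sketched as a small helper. The function name recurrent_layers_from_kv_heads and the use of std::vector are assumptions for illustration; the PR itself stores the flags in hparams.recurrent_layer_arr.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Derive per-layer recurrent flags from per-layer KV head counts:
// a layer is recurrent if and only if its n_head_kv value is 0.
std::vector<bool> recurrent_layers_from_kv_heads(const std::vector<std::uint32_t> & n_head_kv) {
    std::vector<bool> is_recurrent(n_head_kv.size());
    for (std::size_t il = 0; il < n_head_kv.size(); ++il) {
        is_recurrent[il] = (n_head_kv[il] == 0);
    }
    return is_recurrent;
}
```

Because the rule is an equivalence, any layer with a non-zero KV head count is treated as an attention layer, and no separate per-layer flag has to be stored in the model file.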

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master:
cmake : do not search for curl libraries by ourselves (ggml-org#14613)
SYCL: Initial set_rows kernel implementation (ggml-org#14562)
llama : minor coding style fix for smollm3 (ggml-org#14605)
cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
cmake : llguidance build parser library only (ggml-org#14608)
cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)

Signed-off-by: Gabe Goodhart <[email protected]>
Labels
Apple Metal, examples, ggml, Nvidia GPU, python, server, testing
4 participants