Granite Four #13550
Draft: gabe-l-hart wants to merge 149 commits into ggml-org:master from gabe-l-hart:GraniteFour
+1,565 −651
Conversation
This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
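To make the variable-GQA idea above concrete (per-layer KV head counts, with recurrent layers counted as having 0 KV heads), here is a minimal standalone C++ sketch; the struct and field names are hypothetical stand-ins, not the real llama.cpp hparams:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-layer hyperparameters: n_head_kv varies by layer, and
// recurrent (Mamba) layers are treated as having 0 KV heads, so no KV-cache
// space needs to be reserved for them.
struct hparams_sketch {
    uint32_t n_embd_head_k = 128;        // size of one K head
    std::vector<uint32_t> n_head_kv_arr; // per-layer KV head count

    // per-layer width of the K cache row
    uint32_t n_embd_k_gqa(uint32_t il) const {
        return n_embd_head_k * n_head_kv_arr[il];
    }
};

int main() {
    // layer 0: attention with 8 KV heads; layer 1: Mamba with 0 KV heads
    hparams_sketch hp;
    hp.n_head_kv_arr = {8, 0};
    return (hp.n_embd_k_gqa(0) == 1024 && hp.n_embd_k_gqa(1) == 0) ? 0 : 1;
}
```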
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
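For the conv-as-mul idea in this commit (and the related "SSM_CONV with SUM_ROWS and MUL" commit listed further below), here is a conceptual sketch of why the two formulations agree. This is plain standalone C++ over a single channel, not ggml code:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Mamba's depthwise causal conv can be rewritten as building a sliding window
// per position (im2col-style) and reducing it with a dot product, which is
// what an elementwise MUL + row sum (or a small mul_mat) computes.
int main() {
    const int d_conv = 4;
    const std::vector<float> x = {1, 2, 3, 4, 5};          // one channel of inputs over time
    const std::vector<float> w = {0.1f, 0.2f, 0.3f, 0.4f}; // per-channel conv weights

    for (int t = 0; t < (int) x.size(); ++t) {
        // (a) direct causal convolution with zero padding on the left
        float direct = 0.0f;
        for (int k = 0; k < d_conv; ++k) {
            const int idx = t - (d_conv - 1) + k;
            direct += w[k] * (idx >= 0 ? x[idx] : 0.0f);
        }

        // (b) im2col-style: materialize the window row, then dot it with the weights
        std::vector<float> window(d_conv, 0.0f);
        for (int k = 0; k < d_conv; ++k) {
            const int idx = t - (d_conv - 1) + k;
            window[k] = idx >= 0 ? x[idx] : 0.0f;
        }
        float via_dot = 0.0f;
        for (int k = 0; k < d_conv; ++k) {
            via_dot += w[k] * window[k];
        }

        assert(direct == via_dot);
        printf("t=%d y=%.3f\n", t, direct);
    }
    return 0;
}
```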
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors The implementation already supported it, and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also, 'llama_past_clear' is more obvious about what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway.) Still, I'm open to better suggestions.
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: gguf-py : add support for chat template jinja files (ggml-org#14508)
* origin/master:
  Fix conditional enabling following arch checks for ggml-sycl (ggml-org#14504)
  convert : correct gemma 3n conversion (ggml-org#14450)
  kv-cache : use ggml_set_rows (ggml-org#14285)
  ggml : fix FA mask dim 2 and 3 (ggml-org#14505)
  ggml : remove kompute backend (ggml-org#14501)
  CUDA: add dynamic shared mem to softmax, refactor general usage (ggml-org#14497)
…o GraniteFourWithJamba
* origin/compilade/refactor-kv-cache: (32 commits)
  convert : fix jamba conv1d shape squeezing
  llama : partially apply clang-format style
  llama : remove implicit recurrent state rollbacks
  llama : begin renaming llama_past back to llama_kv_cache
  llama : use unused n_embd_k_gqa in k_shift
  llama : fix mixed signedness comparison
  convert_hf : fix Jamba conversion
  llama : session saving and reloading for hybrid models
  mamba : fix non-contiguous usage of ggml_silu
  examples : replace llama_kv_cache_seq_* with llama_past_seq_*
  llama : rename llama_cache to llama_past
  llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
  llama : fix .base() compilation error on Windows
  llama : use im2col and mul_mat to perform convolution for Mamba
  llama : avoid copies for simple batch splits
  llama : fix edge case finding batch seq_id of split recurrent cell
  llama : minimize swaps when reordering logits
  llama : fix batch split output count for embeddings
  llama : use equal-sequence-length sub-batches for recurrent models
  llama : sequence-length-aware batch splitting
  ...
…id inputs Branch: GraniteFourWithJamba Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFourWithJamba Signed-off-by: Gabe Goodhart <[email protected]>
…as mixins
The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for a single mixin, 3x for two mixins, etc...). Branch: GraniteFourWithJamba Signed-off-by: Gabe Goodhart <[email protected]>
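A minimal sketch of the virtual-inheritance mixin layout described above; the llm_build_hybrid class, constructor arguments, and member functions here are hypothetical simplifications, not the actual llama.cpp definitions:

```cpp
#include <cstdio>

// Shared base that both mixins need access to.
struct llm_graph_context {
    int n_layer;
    explicit llm_graph_context(int n_layer) : n_layer(n_layer) {
        printf("llm_graph_context initialized (n_layer=%d)\n", n_layer);
    }
};

// Mixins use *virtual* inheritance so the common base exists only once in the
// final class, even when several mixins are combined.
struct llm_graph_context_mamba : virtual llm_graph_context {
    llm_graph_context_mamba() : llm_graph_context(0) {}
    void build_mamba_layer() { printf("mamba layer, n_layer=%d\n", n_layer); }
};

struct llm_graph_context_granite : virtual llm_graph_context {
    llm_graph_context_granite() : llm_graph_context(0) {}
    void build_attention_layer() { printf("attention layer, n_layer=%d\n", n_layer); }
};

// A hybrid builder combines both mixins; the most-derived class provides the
// initializer for the single shared llm_graph_context subobject.
struct llm_build_hybrid : llm_graph_context_mamba, llm_graph_context_granite {
    explicit llm_build_hybrid(int n_layer) : llm_graph_context(n_layer) {}
};

int main() {
    llm_build_hybrid model(40);
    model.build_mamba_layer();     // both mixins see the same shared base
    model.build_attention_layer();
    return 0;
}
```

Because the base is virtual, both mixins operate on the same n_layer; without `virtual`, llm_build_hybrid would contain two separate llm_graph_context subobjects and the shared member accesses would be ambiguous.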
But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually).
…o GraniteFour
* origin/compilade/refactor-kv-cache:
  model : add Jamba to Mamba-specific hparams printing
  graph : add back hybrid memory graph input
  opencl : broadcast for soft_max (ggml-org#14510)
  vulkan: support mixed/deepseekR1 FA head sizes (ggml-org#14509)
  ggml: backward pass for split swiglu (ggml-org#14483)
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master:
  CUDA: add bf16 and i32 to getrows (ggml-org#14529)
  vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (ggml-org#14485)
  vulkan: fix rms_norm+mul fusion (ggml-org#14545)
  vulkan: Handle updated FA dim2/3 definition (ggml-org#14518)
  server : fix assistant prefilling when content is an array (ggml-org#14360)
  opencl: add GELU_ERF (ggml-org#14476)
  eval-callback : check for empty input (ggml-org#14539)
  test-backend-ops: add support for specifying output format (ggml-org#14368)
  metal : disable fast math in all quantize kernels (ggml-org#14528)
  batch : add optional for sequential equal split (ggml-org#14511)
  graph : prepare for 4D mask (ggml-org#14515)
  batch : add n_used count (ggml-org#14512)
  CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (ggml-org#14002)
  ggml : implement GEGLU_ERF and GEGLU_QUICK ops (ggml-org#14445)
…o GraniteFour * origin/compilade/refactor-kv-cache:
* origin/master:
  model : fix hunyuan moe chat template (ggml-org#14584)
  model : add SmolLM3 (ggml-org#14581)
  memory : fix broken batch splits for recurrent cache (ggml-org#14575)
  vulkan : fix rope with partial rotation and non-cont src (ggml-org#14582)
  server: Add ability to mount server at prefix (ggml-org#14544)
  model : add hunyuan moe (ggml-org#14425)
  vulkan: increase timeout for CI (ggml-org#14574)
  cuda : fix rope with partial rotation and non-cont src (ggml-org#14580)
  CUDA: add bilinear interpolation for upscale (ggml-org#14563)
  musa: fix build warnings (unused variable) (ggml-org#14561)
  llama : fix incorrect minicpm3 v_states shape (ggml-org#14571)
  llama : remove ggml_cont where possible (ggml-org#14568)
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
CISC requested changes on Jul 8, 2025
This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN). Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
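To illustrate the "zero regular experts plus a single shared expert" point above: with no routed experts, the block's output is just the shared expert, i.e. a plain dense FFN. A toy standalone C++ sketch (scalar stand-ins for tensors, hypothetical names):

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// Toy MoE block: output = shared_expert(x) + contributions of the routed experts.
// With zero routed experts (and a single shared expert), this degenerates to a
// plain dense FFN, which is how the "shared-expert only" configuration behaves.
static float moe_block(float x,
                       const std::function<float(float)> & shared_expert,
                       const std::vector<std::function<float(float)>> & routed_experts) {
    float y = shared_expert(x);
    for (const auto & expert : routed_experts) {
        y += expert(x); // routing/weighting omitted for brevity
    }
    return y;
}

int main() {
    auto ffn = [](float x) { return 2.0f * x + 1.0f; }; // stand-in for the FFN

    // Zero routed experts + one shared expert == dense FFN.
    const float moe_out   = moe_block(3.0f, ffn, {});
    const float dense_out = ffn(3.0f);
    printf("moe=%f dense=%f\n", moe_out, dense_out);
    return moe_out == dense_out ? 0 : 1;
}
```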
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
…o GraniteFour
* origin/compilade/refactor-kv-cache:
  model : use ggml_swiglu_split for Mamba
  model : remove unnecessary prefix for tensor loading constants
  jamba : remove redundant nullptr initializations
  vulkan: optimize flash attention split_k_reduce (ggml-org#14554)
Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>
Labels
Apple Metal: https://en.wikipedia.org/wiki/Metal_(API)
examples
ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs
python: python script changes
server
testing: Everything test related
Description
This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:
Additionally, this PR replaces some work done on other PRs / branches:
- Bamba support: Bamba architecture #10810
- Bamba support (draft branch): https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
- Granite 4.0 support (draft branch): https://github.com/gabe-l-hart/llama.cpp/tree/GraniteFourDraft
  - Like the Bamba work, this will also be abandoned in favor of this PR
- Jamba: llama : support Jamba hybrid Transformer-Mamba models #7531, which has not yet landed on master. I had originally planned to include Jamba support in this branch, but on further inspection, it looks like the Jamba architecture has some additional bells-and-whistles (eg sliding-window-attention) that would need further work, so my plan is to leave Jamba off for now and possibly tackle it later (hopefully it's much easier than the original branch!)

Outstanding Questions
Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:
- This PR contains changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
- Is there a better way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right.
- I see worse performance for Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine if this is due to changes in the attention implementation (ie "working as expected") or a bug somewhere.
- Using dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph? (See the sketch after this list.)
- The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
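A minimal standalone C++ illustration of the dynamic_cast lookup pattern referenced in the question above; the class and function names here are simplified stand-ins, not the exact llama.cpp types:

```cpp
#include <cstdio>
#include <memory>

// Illustrative cache hierarchy: a common base plus unified (attention) and
// recurrent (SSM) caches. A hybrid model ends up needing both kinds.
struct llama_memory_i              { virtual ~llama_memory_i() = default; };
struct llama_kv_cache_unified   : llama_memory_i { /* attention KV cells */ };
struct llama_kv_cache_recurrent : llama_memory_i { /* per-seq SSM states  */ };

// The pattern under discussion: graph-building code holds a llama_memory_i*
// and uses dynamic_cast to recover the concrete cache type it needs.
static const llama_kv_cache_recurrent * get_recurrent_cache(const llama_memory_i * mem) {
    return dynamic_cast<const llama_kv_cache_recurrent *>(mem);
}

int main() {
    std::unique_ptr<llama_memory_i> mem = std::make_unique<llama_kv_cache_recurrent>();
    if (get_recurrent_cache(mem.get()) != nullptr) {
        // the dynamic_cast cost is incurred while building the graph, not per tensor op
        printf("recurrent cache available\n");
    }
    return 0;
}
```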
Testing

To test out this branch, I've been using the following models:

- granite-4.0-tiny-preview: https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
- Bamba-9B-v1: https://huggingface.co/ibm-ai-platform/Bamba-9B-v1
- mamba2-370m-hf: https://huggingface.co/AntonV/mamba2-370m-hf

Details
This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:

python side

- Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
- Add logic to gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
- Add a HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?

c++ side
- Add llama_model_is_hybrid akin to llama_model_is_recurrent
- Refactor llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
- Add hparams.recurrent_layer_arr and support parsing it (see the sketch after this list)
  - For non-recurrent layers, hparams.n_embd_k_s / hparams.n_embd_v_s are 0. This should be fine since none of those places interact with the hybrid caching.
- Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
- Add bamba and granitemoeshared in llama-arch.* (the boring part!)
- Add hparams as an additional argument to the llama_model.create_memory method
- In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches.
- Add support for bamba and granitemoehybrid in llama-model.cpp
- Refactor build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
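As a rough sketch of the hparams.recurrent_layer_arr / hparams.recurrent_layer(uint32_t) approach referenced in the list above (the max-layer-size std::array that the Outstanding Questions section is unsure about); the constant and struct names are placeholders, not the real llama.cpp definitions:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Simplified stand-in for llama_hparams: a fixed-size per-layer flag array,
// sized to a maximum layer count so it can live inside the hparams struct.
constexpr uint32_t MAX_LAYERS = 512; // placeholder for the real max-layer constant

struct hparams_sketch {
    uint32_t n_layer = 0;
    std::array<bool, MAX_LAYERS> recurrent_layer_arr = {}; // true for Mamba/SSM layers

    // Mirrors hparams.recurrent_layer(il): is layer `il` recurrent?
    bool recurrent_layer(uint32_t il) const {
        return il < n_layer && recurrent_layer_arr[il];
    }
};

int main() {
    // Example hybrid layout: every 4th layer is attention, the rest are recurrent.
    hparams_sketch hp;
    hp.n_layer = 8;
    for (uint32_t il = 0; il < hp.n_layer; ++il) {
        hp.recurrent_layer_arr[il] = (il % 4 != 3);
    }
    for (uint32_t il = 0; il < hp.n_layer; ++il) {
        printf("layer %u: %s\n", il, hp.recurrent_layer(il) ? "recurrent" : "attention");
    }
    return 0;
}
```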