[pull] master from ggml-org:master #205

pull · 2025-07-09T10:12:04Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.2)

Can you help keep this open source service alive? 💖 Please sponsor : )

* v1 * push more fixes * another fix * fix * more fixes * minor fix * more cleaning on python code * python fixes * changed precision for multipliers float 32->64 * fixes * another fix * fix * pre-norm -> norm * fix * Revert "fix" This reverts commit 243e4d1. * fix * small fix ffn_norm * try * mix instead of max * fix vocab size * conflict solve * fixed multipliers * falcon-h1 specefic vocab resolved * read arch from gguf.MODEL_ARCH * mamba_d_ssm added to d_inner find_hparam * remove unused functions from gguf_writer.py * override modify_tensors instead of get_tensors * fix conversion and d_inner * added some cb functions for debugging puposes * inp_out_ids moved outside of layers loop * mup_vec create as float64 * fix rope_theta * injected mup * clean ups * rm extra space * rm unused MAMBA_CHUNK_SIZE * rm unused key * add bos False * changed ROPE_TYPE * cleaning debugging stuff * cleaning debug quant * fix comment * some cleanups * some cleanups * Update src/llama-model-loader.cpp * more cleanups * moe cleanuips * d_ssm -> d_inner; * cleaning unused hparams * cleanup * more cleanups * more cleanups on python conversion; * minor cleanups * Apply suggestions from code review Co-authored-by: Georgi Gerganov <[email protected]> * remove todo * added falcon-h1 * tensor not required * clean * remove unneeded attributes * more cleanups and fixed conversion * remove final_norm * flake8 fixes * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * flake8 fixes * Update src/llama-hparams.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-arch.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * added hashes * Update src/llama-arch.cpp Co-authored-by: Georgi Gerganov <[email protected]> * Update src/llama-vocab.cpp Co-authored-by: Georgi Gerganov <[email protected]> * update the update file * Revert "update the update file" This reverts commit 082ab4a. * fix: address suggestions * fix: update convert_hf_to_gguf.py * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * d_inner fixed * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * reshaping ssm_norm for 34B * removing generate_mup * remove duplicates metadata keys * rm comment * final comment * fix unused args * fix constants * fix bad merge * Update src/llama-model.cpp Co-authored-by: compilade <[email protected]> * falcon-h1: remove unused ssm_in_b and bad merge * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * falcon-h1: fix last comment * Update convert_hf_to_gguf.py Co-authored-by: compilade <[email protected]> * falcon-h1: revert add_add_bos(False) * falcon-h1: fix tied weights * falcon-h1: remove whitespace * falcon-h1: fix wrong size param * falcon-h1: fix whitespace issues --------- Co-authored-by: younesbelkada <[email protected]> Co-authored-by: Younes B <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: compilade <[email protected]>

* ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code looks more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32

* wip: llama : separate recurrent states from the KV cache This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states. * llama : use std::find for seq_nodes in llama_rs_cache * llama : state checkpoints for recurrent models * llama : correctly handle more edge cases for the rs cache * llama : rename many llama_kv_cache_* functions * llama : remove useless return value for some llama_cache_* functions * llama : rethink recurrent state cell counts * llama : begin work on support for variable GQA This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads. * llama : gracefully fail when not finding hybrid slot * llama : support Jamba * llama : fix BERT inference without KV cache * convert-hf : check for unprocessed Jamba experts * convert-hf : support Mini-Jamba conversion * llama : fix Jamba quantization sanity checks * llama : sequence-length-aware batch splitting * llama : use equal-sequence-length sub-batches for recurrent models * ggml : simplify SSM-related operators * llama : make recurrent state slot allocation contiguous * llama : adapt internal uses of batches to llama_ubatch * llama : fix batch split output count for embeddings * llama : minimize swaps when reordering logits This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models. * llama : fix edge case finding batch seq_id of split recurrent cell This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash. * llama : avoid copies for simple batch splits * ggml : make ggml_ssm_scan not modify its source tensors * llama : fix shared recurrent tail cell count for small ubatch sizes Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model. * llama : fix .base() compilation error on Windows * llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL * ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors The implementation already supported it, and this makes Mamba's conv step slightly faster. * mamba : fix non-contiguous usage of ggml_silu * llama : session saving and reloading for hybrid models * convert_hf : fix Jamba conversion * llama : fix mixed signedness comparison * llama : use unused n_embd_k_gqa in k_shift This also slightly reduces the diff from the master branch * llama : begin renaming llama_past back to llama_kv_cache * llama : remove implicit recurrent state rollbacks * llama : partially apply clang-format style * convert : fix jamba conv1d shape squeezing * graph : add back hybrid memory graph input But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually). * model : add Jamba to Mamba-specific hparams printing * jamba : remove redundant nullptr initializations * model : remove unnecessary prefix for tensor loading constants Co-authored-by: Sigbjørn Skjæret <[email protected]> * model : use ggml_swiglu_split for Mamba Co-authored-by: Sigbjørn Skjæret <[email protected]> * model : make falcon-h1 use shared mamba2 layer builder * memory : avoid referring to KV in recurrent cache logs * gguf-py : avoid adding duplicate tensor mappings for Jamba Some of the tensor names are common with Llama4 --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>

* SYCL: Initial set_rows kernel implementation * Revert max_threads to 256 * Refactor set_rows and address review comments * Deduplicate conversion function * Remove guard before kernel launch and refactor * Fix and add back SFINAE

* cmake : do not search for curl libraries by ourselves * run : do not search for curl libraries by ourselves

* Docs: script to auto-generate ggml operations docs * Review: formatting changes + change github action * Use built-in types instead of typing * docs : add BLAS and Metal ops --------- Co-authored-by: Georgi Gerganov <[email protected]>

* support for smoldocling * fixed merge conflicts * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Gabe Goodhart <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Gabe Goodhart <[email protected]> * merge conflicts * pre tokenizer merge fix * convert : fix smollm3 jinja template (#14586) Signed-off-by: ryan-mangeno <[email protected]> * support for smoldocling Signed-off-by: ryan-mangeno <[email protected]> * fixed merge conflicts Signed-off-by: ryan-mangeno <[email protected]> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-model.h Co-authored-by: Sigbjørn Skjæret <[email protected]> * safetensors tensor mapping Signed-off-by: ryan-mangeno <[email protected]> * added back accidental removal of clean spaces for hunyuan * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * updated hash and reordererd model list * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update include/llama.h Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update convert_hf_to_gguf_update.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * removed old tensor name * removed tensor mappings -> handled by smolvlm * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> --------- Signed-off-by: ryan-mangeno <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> Co-authored-by: Xuan-Son Nguyen <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: compilade <[email protected]>

* opencl: add `set_rows` for `f16` and `f32` * opencl: better choose workgroup size for `set_rows`

* add tiled mul_mat_f16_f32 * fix trailing whitespace * add insightful comments

* wip: llama : separate recurrent states from the KV cache This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states. * llama : use std::find for seq_nodes in llama_rs_cache * llama : state checkpoints for recurrent models * llama : correctly handle more edge cases for the rs cache * llama : rename many llama_kv_cache_* functions * llama : remove useless return value for some llama_cache_* functions * llama : rethink recurrent state cell counts * llama : begin work on support for variable GQA This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads. * llama : gracefully fail when not finding hybrid slot * llama : support Jamba * llama : fix BERT inference without KV cache * convert-hf : check for unprocessed Jamba experts * convert-hf : support Mini-Jamba conversion * llama : fix Jamba quantization sanity checks * llama : sequence-length-aware batch splitting * llama : use equal-sequence-length sub-batches for recurrent models * ggml : simplify SSM-related operators * llama : make recurrent state slot allocation contiguous * llama : adapt internal uses of batches to llama_ubatch * llama : fix batch split output count for embeddings * llama : minimize swaps when reordering logits This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models. * llama : fix edge case finding batch seq_id of split recurrent cell This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash. * llama : avoid copies for simple batch splits * llama : use im2col and mul_mat to perform convolution for Mamba This removes the need for ggml_ssm_conv!!! But performance seems slighly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed. * ggml : make ggml_ssm_scan not modify its source tensors * llama : fix shared recurrent tail cell count for small ubatch sizes Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model. * llama : fix .base() compilation error on Windows * llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL * ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors The implementation already supported it, and this makes Mamba's conv step slightly faster. * llama : rename llama_cache to llama_past This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway) Still, I'm open to better suggestions. * examples : replace llama_kv_cache_seq_* with llama_past_seq_* * mamba : fix non-contiguous usage of ggml_silu * llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : session saving and reloading for hybrid models * convert_hf : fix Jamba conversion * llama : fix mixed signedness comparison * llama : use unused n_embd_k_gqa in k_shift This also slightly reduces the diff from the master branch * llama : begin renaming llama_past back to llama_kv_cache * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner. * llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * llama : remove implicit recurrent state rollbacks * llama : partially apply clang-format style * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * feat: Add conversion for Bamba models This is borrowed and adapted from the original implementation #10810 Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add Granite 4 conversion This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076 Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Plumb bamba through llama-arch Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add bamba to llama_arch_is_hybrid_recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add optional mamba ssm_in bias tensor Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add template specialization for get_arr to load a vector<uint32_t> for layer index arr in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Use an explicit bool to determine mamaba vs mamba2 This allows other architectures like bamba and granitemoehybrid to use mamab2 without a growing architecture `if` statement inside the mamba implementation. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Isolate mamba(2) and granite attention layer building in static methods This will allow these layer-builder methods to be used from other build structs without complex inheritance. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use per-layer sizes in granite build_attention_layer Also no need to pass in kv cache since it's already in the inp_attn Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: First (broken) pass at end-to-end Bamba implementation It generates (garbage) tokens! Still lots of debugging to do. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Only do Granite multipliers if set Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Pull granite ffn portion into a static function and reuse in hybrid Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat(py): Allow gguf duplicate keys if they match by value and type This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor(py): Simplify granitemoehybrid conversion to use parents better Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add GRANITE_MOE_HYBRID through llama-arch Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Support GRANITE_MOE_HYBRID in llama-model This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * style: Fix flake8 errors Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Fix recurrent cache get after rebase Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins The challenge here is to give both the non-hybrid classes (llm_build_mamba and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access to the same intermediate "base class" functionality (build_mamba*_layer, build_granite_attention_layer) without running into trouble with diamond inheritance of llm_graph_context. Due to the non-trivial initialization that happens in llm_graph_context, diamond inheritance results in multiple initializations of the common base which cause problems around the unique ptrs. I wanted to get away from `self->` everywhere, but this is still a bit cleaner than making those methods static I think. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Implement the full copy-paste version to duplicate the layer builders This follows the pattern where the type of input is pinned to the type of memory and that is used to dispatch to the correct version of `build_rs` / `build_attn`. There's a lot of code duplication that can hopefully be pulled into common functions in the graph later. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid I've got back-and-forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite- specific hybrid model builder that can do MoE (granite 4) or dense (bamba). As part if this, I also cleaned up dangling comments from previous attempts at using static methods for reusability. Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when runnning Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON * memory : correctly handle failure in apply() ggml-ci * style: Remove TODO for adding first hybrid models to the switch Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * docs: Fix comment about duplicate key check Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Conform to standard way of initializing inp_out_ids Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * convert : fix jamba conv1d shape squeezing * fix: Fix input initialization in granite_hybrid after removal of hybrid inputs Branch: GraniteFourWithJamba Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use llm_graph_context_mamba in llm_build_granite_hybrid Branch: GraniteFourWithJamba Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...). Branch: GraniteFourWithJamba Signed-off-by: Gabe Goodhart <[email protected]> * graph : add back hybrid memory graph input But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually). * model : add Jamba to Mamba-specific hparams printing * fix: Fix input setup after upstream merge Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * jamba : remove redundant nullptr initializations * model : remove unnecessary prefix for tensor loading constants Co-authored-by: Sigbjørn Skjæret <[email protected]> * model : use ggml_swiglu_split for Mamba Co-authored-by: Sigbjørn Skjæret <[email protected]> * feat: Add support for dense FFN in GraniteMoeHybrid This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN). Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add support for dense FFN tensor names on c++ side Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use child inputs for Falcon H1 after merge resolution Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove unnecessary prefix on tensor constants Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]> * model : make falcon-h1 use shared mamba2 layer builder * memory : avoid referring to KV in recurrent cache logs * fix: Revert order changes for Falcon H1 to stay consistent with upstream Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * gguf-py : avoid adding duplicate tensor mappings for Jamba Some of the tensor names are common with Llama4 * refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid The only key difference is the use of rope which is now set via rope_finetuned in the hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Remove use of diamond inheritance Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance #13550 (comment) Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * feat: Log mamba params for Granite Hybrid Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove unused ssm_in_b Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv This matches how recurrent vs attention heads are identified for Jamba Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove unused template expansion for get_arr Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Review cleanup in convert_hf_to_gguf The gist is to be explicit about which base class is being used with the multiple inheritance setup Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Undo hidden warnings about duplicate identical keys in add_key_value After further discussion, this encourages sloppy overwriting in the model converters Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: If not using ROPE, context is "infinite" Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * doc: Add a comment outlining expected duplicate key warnings Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove unnecessary duplicate keys in converter Co-authored-by: Francis Couture-Harpin <[email protected]> (thanks for the sharp eyes and patience!) Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Francis Couture-Harpin <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>

ggml-ci

* readme : add hot PRs * cont * readme : update title * readme : hot PRs links * cont

**Important** LFM2 was [merged ](huggingface/transformers#39340 transformers, but has not yet been released. To convert into gguf, install transformers from source ```shell pip install "transformers @ git+https://github.com/huggingface/transformers.git@main" ```

* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader * vulkan: increase coopmat2 mul_mat_id tile size * vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path * vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)

* vulkan: support SET_ROWS Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now. * vulkan: optimize set_rows Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

* vulkan: fix noncontig check for mat_mul_id splitting Remove supports_op check for > 4096 (splitting fixes this) * vulkan: fix batched matmul dequant for Q*_K

* Kimi-K2 conversion * add Kimi_K2 pre type * Kimi-K2 * Kimi-K2 unicode * Kimi-K2 * LLAMA_MAX_EXPERTS 384 * fix vocab iteration * regex space fix * add kimi-k2 to pre_computed_hashes * Updated with kimi-k2 get_vocab_base_pre hash * fix whitespaces * fix flake errors * remove more unicode.cpp whitespaces * change set_vocab() flow * add moonshotai-Kimi-K2.jinja to /models/templates/ * update moonshotai-Kimi-K2.jinja * add kimi-k2 chat template * add kimi-k2 * update NotImplementedError Co-authored-by: Sigbjørn Skjæret <[email protected]> * except Exception Co-authored-by: Sigbjørn Skjæret <[email protected]> * LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){} --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>

Add LLAMA_API to fix the run-time error with llama-cpp-python in Windows env: attributeError: function 'llama_kv_self_seq_div' not found. Did you mean: 'llama_kv_self_seq_add'? Although llama_kv_self_seq_div() has been marked deprecated but it is necessary to export it to make llama-cpp-python happy. Observed software version: OS: windows compiler: MSVC llama-cpp-python: tag: v0.3.12-cu124 llama.cpp: tag: b5833 Signed-off-by: Min-Hua Chen <[email protected]> Co-authored-by: Min-Hua Chen <[email protected]>

)

ggml-ci

* ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <[email protected]> --------- Co-authored-by: Diego Devesa <[email protected]>

* Support diffusion models: Add Dream 7B * Move diffusion to examples * Move stuff to examples. Add patch to not use kv-cache * Address review comments * Make sampling fast * llama: remove diffusion functions * Add basic timings + cleanup * More cleanup * Review comments: better formating, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length * fixup! * Review: move everything to diffusion-cli for now

* kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <[email protected]>

Co-authored-by: qwaqrm <[email protected]>

* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults * Initialize webgpu device * Making progress on setting up the backend * Finish more boilerplate/utility functions * Organize file and work on alloc buffer * Add webgpu_context to prepare for actually running some shaders * Work on memset and add shader loading * Work on memset polyfill * Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it * Implement get_tensor and buffer_clear * Finish rest of setup * Start work on compute graph * Basic mat mul working * Work on emscripten build * Basic WebGPU backend instructions * Use EMSCRIPTEN flag * Work on passing ci, implement 4d tensor multiplication * Pass thread safety test * Implement permuting for mul_mat and cpy * minor cleanups * Address feedback * Remove division by type size in cpy op * Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends * Fix name * Fix macos dawn prefix path

* make hf token optional * fail if we can't get necessary tokenizer config

ggml-ci

* llama : reuse compute graphs ggml-ci * llama-bench : add graph reuse parameter ggml-ci * cont : remove the parameter and the sched resets ggml-ci * graph : rename update() to can_reuse() ggml-ci * params : remove is_same() ggml-ci * graph : set res->params in llm_graph_context constructor ggml-ci * graph : avoid set_max_nodes in llm_graph_result ggml-ci * kv-cache : reuse llama_context's graph result instance ggml-ci * context : reset the previous graph result upon memory updates ggml-ci * batch : llama_ubatch now carries its data instead of pointing to balloc ggml-ci * merge : fix build ggml-ci * graph : fix can_reuse() checks when flash-attention is disabled * graph : move llm_graph_result impl in source file + debug env ggml-ci

ggml-ci

* Add Ernie4.5 MoE * Fix Flake errors. * Properly encode/decode MoE layer step * Correct tensor mappings (.weight) * Pass and read n_ff_exp * n_ff_shexp calculation and further minor changes * Rope fixes. * .gitignore fix * Add unit32 cast for Linux builds * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Further fixes from code review * Fix trailing whitespace * Reenable missing experts error * Code style from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Fix non-MoE regression Co-authored-by: Sigbjørn Skjæret <[email protected]> --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>

Without that condition, this debug log clutters the screen every batch treated in the prompt processing, or every token generated in Kobold.cpp.

ggml-ci

ngxson and others added 4 commits July 9, 2025 09:26

convert : fix smollm3 jinja template (#14586)

20b7bf8

llama : remove unintended whitespace (#14592)

1055545

model : add skt/A.X-4.0 model vocabulary (#14589)

ffd59e7

pull bot locked and limited conversation to collaborators Jul 9, 2025

pull bot added the ⤵️ pull label Jul 9, 2025

github-actions bot added the python label Jul 9, 2025

ggml : prevent integer overflow in gguf tensor size calculation (#14595)

26a48ad

github-actions bot added the ggml label Jul 9, 2025

ngxson and others added 21 commits July 9, 2025 18:16

llama : remove llm_graph_input_one (#14603)

cb9178f

cuda : support Falcon-H1 state size for SSM_SCAN (#14602)

a57d1bc

cmake : llguidance build parser library only (#14608)

ac44eb6

cmake : bump llguidance version to v1.0.1 (#14609)

f9a867f

llama : minor coding style fix for smollm3 (#14605)

435a6d1

SYCL: Initial set_rows kernel implementation (#14562)

704bb7a

* SYCL: Initial set_rows kernel implementation * Revert max_threads to 256 * Refactor set_rows and address review comments * Deduplicate conversion function * Remove guard before kernel launch and refactor * Fix and add back SFINAE

cmake : do not search for curl libraries by ourselves (#14613)

a457551

* cmake : do not search for curl libraries by ourselves * run : do not search for curl libraries by ourselves

opencl: add set_rows for f16 and f32 (#14547)

0b88557

* opencl: add `set_rows` for `f16` and `f32` * opencl: better choose workgroup size for `set_rows`

opencl: add tiled mul_mat_f16_f32 (#14535)

6bdda13

* add tiled mul_mat_f16_f32 * fix trailing whitespace * add insightful comments

vocab : add midm-2.0 model pre-tokenizer (#14626)

576c82e

llama : move enum llama_vocab_pre_type to implementation (#14631)

0d5375d

ggml-ci

readme : add hot PRs (#14636)

aaa088d

* readme : add hot PRs * cont * readme : update title * readme : hot PRs links * cont

HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)

756aa10

jeffbolznv and others added 30 commits July 15, 2025 21:32

vulkan: add RTE variants for glu/add/sub/mul/div (#14653)

10a0351

vulkan: fix noncontig check for mat_mul_id splitting (#14683)

ba1ceb3

* vulkan: fix noncontig check for mat_mul_id splitting Remove supports_op check for > 4096 (splitting fixes this) * vulkan: fix batched matmul dequant for Q*_K

gguf-py : dump bpw per layer and model in markdown mode (#14703)

c81f419

convert : add pre-computed hashes first to prevent order mishaps (#14701

cf91f21

)

convert : only check for tokenizer folder if we need it (#14704)

4b91d6f

scripts: synthetic prompt mode for server-bench.py (#14695)

5cae766

server : fix handling of the ignore_eos flag (#14710)

538cc77

ggml-ci

llama : fix parallel processing for plamo2 (#14716)

e4841d2

server : pre-calculate EOG logit biases (#14721)

6ffd4e9

ggml-ci

ggml : add asserts (#14720)

6497834

* ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <[email protected]> --------- Co-authored-by: Diego Devesa <[email protected]>

model : support output bias for qwen2 (#14711)

b0f0ecc

Co-authored-by: qwaqrm <[email protected]>

llama : fix parameter order for hybrid memory initialization (#14725)

496957e

convert : make hf token optional (#14717)

19e5943

* make hf token optional * fail if we can't get necessary tokenizer config

ci : disable failing vulkan crossbuilds (#14723)

1ba45d4

batch : fix uninitialized has_cpl flag (#14733)

ad57d3e

ggml-ci

kv-cache : opt mask set input (#14600)

d9b6910

ggml-ci

llama : fix parallel processing for lfm2 (#14705)

086cf81

kv-cache : fix k-shift for multiple streams (#14742)

d6fb3f6

ggml-ci

nix : use optionalAttrs for env mkDerivation attrset argument (#14726)

760b448

convert : fix Ernie4.5 MoE without shared experts (#14746)

670e136

use max work group size for device to replace the magic number (#14732)

349ea79

graph : Pass the graph placeholder message in debug mode (#14748)

09651d0

Without that condition, this debug log clutters the screen every batch treated in the prompt processing, or every token generated in Kobold.cpp.

graph : refactor context to not pass gf explicitly (#14629)

8f974bc

ggml-ci

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[pull] master from ggml-org:master #205

[pull] master from ggml-org:master #205

pull bot commented Jul 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

[pull] master from ggml-org:master #205

Are you sure you want to change the base?

[pull] master from ggml-org:master #205

Conversation

pull bot commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

pull bot commented Jul 9, 2025 •

edited

Loading