address #16574; fold CLI into mtmd-cli; use ggml_rope_ext + bicubic; switch to 'jinaclip2'; fix converter constants #5
Open
pockers21 wants to merge 191 commits into
Conversation
0eeb6fc to b50c9c8 (Compare)
a2fef90 to 6617024 (Compare)
I've had issues loading models with llama-server: `[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'`, and I was sure it could access the file. It seems `--models-dir` and `--models-presets` don't interact the way I thought they would, but I salvaged this snippet that helps with troubleshooting: `[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)`
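The errno detail is the useful part here. As a rough illustration (not the actual llama-server code; `check_model_path` is a hypothetical helper and the path is just the one from the log above), this is the kind of errno-reporting check that makes "file not found" vs. "permission denied" obvious:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>

// Illustrative only: open the model path the same way a loader would and
// report errno, so the actual failure reason is visible in the log.
static bool check_model_path(const char * path) {
    FILE * f = std::fopen(path, "rb");
    if (!f) {
        std::fprintf(stderr, "failed to open GGUF file '%s' (errno: %s)\n",
                     path, std::strerror(errno));
        return false;
    }
    std::fclose(f);
    return true;
}

int main() {
    // Hypothetical invocation; in practice pass the path the server resolved.
    check_model_path("mistral-7b-v0.1.Q8_0.gguf");
}
```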
* Fix GLM 4.7 MoE gating func
* Update src/models/deepseek2.cpp
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* memory : add llama_memory_hybrid_iswa
* Update src/llama-memory-hybrid-iswa.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and update the indexing calculations in get_offsets. Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this change the cost is linear in n, rather than n > 1 being far slower than n == 1.
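For readers unfamiliar with the trade-off, here is a rough CPU-side sketch of the idea (the actual change is in the Vulkan shader and its offset math; `mat_vec_batched` and the layouts are illustrative only): treating each batch column as a plain mat-vec keeps the cost roughly proportional to n, at the price of re-reading the weights for every column.

```cpp
#include <cstddef>
#include <vector>

// Illustrative reference only: y[:, j] = W * x[:, j], looping over the batch
// dimension j so each column is handled as an independent mat-vec.
static void mat_vec_batched(const std::vector<float> & W,   // rows x cols, row-major
                            const std::vector<float> & X,   // cols x n, column-major
                            std::vector<float>       & Y,   // rows x n, column-major
                            int rows, int cols, int n) {
    for (int j = 0; j < n; ++j) {                           // batch dimension
        const float * x = X.data() + (size_t) j * cols;
        float       * y = Y.data() + (size_t) j * rows;
        for (int r = 0; r < rows; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) {
                acc += W[(size_t) r * cols + c] * x[c];      // weights re-read per column
            }
            y[r] = acc;
        }
    }
}
```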
… subgroup crash (ggml-org#17356)" (ggml-org#18831) This reverts commit 980b7cd.
* from previous PR
* Make instruction (system) the first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state (also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out
Co-authored-by: openingnow <>
…ml-org#18987) Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx. We had a bug where a set_tensor_async (using transfer_ctx) didn't get submitted before the graph_compute (using compute_ctx) that came after it. To avoid this sort of issue, just do everything in compute_ctx. Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
…8997) This commit removes the mention of RoPE in the comment for the Q and K computation as RoPE is not applied.
* fix: Use `tabular-nums` for chat message statistics * fix: Rebuild WebUI
d2b726f to 778d0a0 (Compare)
* vulkan: make FA mask/softcap enables spec constants * don't specialize for sinks * bump timeout a little bit
…gml-org#19376) The cpu and cuda backends use fp16 for the VKQ accumulator type, this change does the same for vulkan. This helps particularly with large head sizes which are very register-limited. I tried this for the coopmat1 path and it slowed down a bit. I didn't try for scalar. I applied the softmax bias that the cuda backend uses to avoid overflow, although I was not able to reproduce the original bug without it.
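To illustrate why a low-precision accumulator needs care (this is a generic online-softmax sketch, not the Vulkan or CUDA kernel, and it does not reproduce the specific bias mentioned above): fp16 overflows past roughly 65504, so the running maximum is subtracted before exponentiation and the partial sums are rescaled whenever the maximum grows, which keeps every exp() value in [0, 1].

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Generic online-softmax accumulation over attention scores s[i] and scalar values v[i].
// Keeping exp(s - m) <= 1 is what makes a low-precision (e.g. fp16) accumulator viable.
static float online_softmax_accumulate(const std::vector<float> & s, const std::vector<float> & v) {
    float m   = -INFINITY; // running max of scores
    float den = 0.0f;      // running sum of exp(s - m)
    float num = 0.0f;      // running sum of exp(s - m) * v
    for (size_t i = 0; i < s.size(); ++i) {
        const float m_new = std::max(m, s[i]);
        const float scale = std::exp(m - m_new); // rescale previous partial sums
        const float p     = std::exp(s[i] - m_new);
        den = den * scale + p;
        num = num * scale + p * v[i];
        m   = m_new;
    }
    return num / den;
}
```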
778d0a0 to a0b9c48 (Compare)
* kimi linear model implementation
* kimi linear convert_hf_to_gguf
* kimi linear constants.py tensor_mapping.py
* Kimi Linear ggml.h
* kimi linear ggml-cpu
* Kimi Linear ggml-cuda
* Kimi Linear ggml.c
* kimi linear src/llama
* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning
* remove type mismatch warning
* read MoE params
* removed some hard coded code
* removed all hard code
* use DeepseekV2 tokenizer
* removed unnecessary internal methods called by the old set_vocab of KimiLinear
* rewrite get_vocab for KimiLinear. Removed all kda_scan code
* removed all traces of kda_scan
* reduce OP count by 1 due to removal of kda_scan
* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache
* set n_embd_head_k/v to ensure kv cache works
* don't quantize conv1d of Kimi Linear
* Kimi Linear backend agnostic
* removed LOG_INFO
* naive chunking form implemented
* fixed some comments
* add Kimi-K2 specific tokens to be recognized as EOG
* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682
* replaced Akk and Aqk with mul_mat and clamp
* no clamp version
* Moved Aqk computation out of the loop
* fixed typo and split wkv_b into wk_b and wv_b
* MLA KV cache support
* fix trailing spaces
* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error
* fix trailing whitespace
* removed trailing whitespaces in empty line + make sure indentation is multiple of 4
* try to make lint happy
* remove blank lines to make lint happy
* removed at least one blank line containing whitespace
* fixed flake8 complaints locally
* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement
* removed Kimi-Linear specific change that causes failure at server-windows
* removed private: from kimi_linear to make build checks happy
* removed unnecessary ggml_cont before ggml_reshape
* created static function causal_conv1d to abstract similar code for q/k/v
* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.
* reverted to original
* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.
* remove DT_B from constants.py. remove one comment line in llama-model.cpp
* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight
* remove ssm_o_norm_b
* remove ssm_o_norm_b
* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k
* removed all ggml_cont before ggml_reshape_4d
* Whitespace
* replaced all hparams.get with find_hparams
* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp
* use is_mla to switch between different mem_hybrid types
* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC
* removed if else for required parameters kv_lora_rank and qk_rope_head_dim
* add back ggml_cont for Vcur
* minor changes
* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp
* f16 gguf cannot run without context length
* made a mistake of adding back n_ctx parsing
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* Fix model loading regex error * Change comments * Use const_iterator and remove specializations --------- Co-authored-by: Alde Rojas <hello@alde.dev>
* llama : add llama_memory_can_rm_suffix() * Revert "llama : add llama_memory_can_rm_suffix()" This reverts commit d30e59b. * spec : check if the target context is compatible for spec decoding
Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not).
* Fix SYCL CEIL operator * sycl: implement GGML_OP_CEIL
…ggml-org#19310)
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases. Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
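For context on the three overlap cases named in the list above, a minimal sketch of the selection logic (the enum, function, and names are illustrative, not the actual ggml-webgpu code): for a binary op `dst = op(src0, src1)`, the binding setup has to distinguish whether dst aliases src0, aliases src1, or aliases neither.

```cpp
// Illustrative only: how a binary op might pick a shader variant based on
// buffer aliasing. Names are hypothetical, not taken from ggml-webgpu.
enum class binary_variant {
    INPLACE, // src0 == dst : read and write through the same binding
    OVERLAP, // src1 == dst : src1 must be bound read-write alongside src0
    DEFAULT, // no aliasing : three distinct bindings
};

static binary_variant select_binary_variant(const void * src0, const void * src1, const void * dst) {
    if (src0 == dst) {
        return binary_variant::INPLACE;
    }
    if (src1 == dst) {
        return binary_variant::OVERLAP;
    }
    return binary_variant::DEFAULT;
}
```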
* gguf-py: Bump sentencepiece version. There's a new version that's been out for a while that addresses the issues mentioned in ggml-org#14200. There's a long chain of reasons I would like this change, but the short version is that it allows people who use both `sentencepiece` and `gguf` to take advantage of these fixes. On conda-forge, currently, it locks the version (since there is no notion of optional dependencies). Regardless, I don't think this should be too controversial.
* review feedback
* Support Step3.5-Flash
* fix: norm.weight + 1 (HF zero_centered=true)
* step35: simplify GGUF conversion + drop redundant rope KVs
* Address review feedback
* rename limits -> clamp
* Apply suggestions from code review
* Apply suggestion from @CISC
* rename swiglu limits -> swiglu clamp in LLM_KV
* avoid CI fail
* Apply suggestions from code review
* Apply suggestions from code review
* disabled KV shifting for LLM_ARCH_STEP35
* Apply suggestions from code review
* mistakenly removed cmath
* add model size && apply missed suggestion
* assert partial_rotary_factors
* fix CI errors
* load freq_base_swa
Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* metal : refactor bin kernels * cont * cont : fix cv
…9411) * ci : use less jobs when building with sanitizers * cont : fix nproc * cont : fix the fix * cont : simplify
* remove server job from webui and move slow test * use pip-install option
* cleanup `llama-quantize --help` output some much needed TLC * remove future argument oops, spoiler * cleanup of cleanup
a0b9c48 to 5e3f111 (Compare)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Rename variables + fix rope_neox. Seems memory layout is shared with Vulkan so we can port fix from ggml-org#19299
* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Unified delta net handling
* Remove old methods
* Refactor and optimize
* Adapt autoregressive version from @ymcki
* Change to decay mask approach
* Fix bad permute
* Qwen 3.5 support
* Apply suggestions from code review
* Further fixes
* Use inheritance, remove unneeded conts
* Not like this!
* Remove ggml.h explicit import
* Remove transformers, fix the views
* ACTUALLY fix views, make super calls explicit in conversion
* Fix conversion again
* Remove extra ggml.h imports
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
address #16574; fold CLI into mtmd-cli; use ggml_rope_ext + bicubic; switch to 'jinaclip2'; fix converter constants
Remove unnecessary try/except for Jina text hparams. Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
7a459cc to c47ad9f (Compare)
c47ad9f to 5926d82 (Compare)
Make sure to read the contributing guidelines before submitting a PR