Releases · teleprint-me/llama.cpp
b5936
b5849
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
  k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
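The reduction this release note describes combines per-split running max/sum statistics into one normalized output. A minimal numpy sketch of that math (not the actual Vulkan shader; array shapes and names are illustrative assumptions) might look like:

```python
# Illustrative sketch of a flash-attention split-k reduction: each of
# k_num splits has produced an unnormalized partial output acc_i (length
# HSV), a running max m_i, and a running sum l_i over its KV slice.
# The reduce step rescales the partials to a common max and normalizes.
import numpy as np

def split_k_reduce(accs: np.ndarray, ms: np.ndarray, ls: np.ndarray) -> np.ndarray:
    """accs: (k_num, HSV) partial outputs; ms/ls: (k_num,) partial max/sum."""
    m = ms.max()                      # global running max over all splits
    scale = np.exp(ms - m)            # per-split rescale factor
    l = (ls * scale).sum()            # combined softmax denominator
    # In the shader, this per-element accumulation is what gets spread
    # across the workgroup: one thread per element of the HSV dimension.
    out = (accs * scale[:, None]).sum(axis=0)
    return out / l

if __name__ == "__main__":
    k_num, hsv = 4, 8
    rng = np.random.default_rng(0)
    accs = rng.standard_normal((k_num, hsv))
    ms = rng.standard_normal(k_num)
    ls = rng.random(k_num) + 1.0
    print(split_k_reduce(accs, ms, ls))
```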
b5813
gguf-py : add support for chat template jinja files (#14508)
* add support for chat template jinja files
* remove gemma3n hack
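A hedged sketch of what embedding a standalone Jinja chat template into GGUF metadata with gguf-py can look like; the file name, output path, and architecture are assumptions for illustration, and the converter normally handles this step itself:

```python
# Minimal, untested sketch: store the contents of a chat_template.jinja
# file in a GGUF's metadata via the gguf-py writer.
from pathlib import Path
import gguf

template = Path("chat_template.jinja").read_text(encoding="utf-8")

writer = gguf.GGUFWriter("model.gguf", arch="llama")
writer.add_chat_template(template)   # stored as tokenizer chat template metadata

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```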
b5787
Add Conv2d for CPU (#14388)
* Conv2D: Add CPU version
* Half decent
* Tiled approach for F32
* remove file
* Fix tests
* Support F16 operations
* add assert about size
* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
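For the "tiled approach for F32" mentioned above, a small Python sketch of the general idea (walking the output in fixed-size tiles so the working set stays cache-friendly) is shown here; it is not the ggml CPU kernel, and the tile size, shapes, and stride-1 valid padding are illustrative assumptions:

```python
# Illustrative tiled direct 2D convolution over F32 data.
import numpy as np

def conv2d_tiled(img: np.ndarray, ker: np.ndarray, tile: int = 16) -> np.ndarray:
    """img: (H, W) input, ker: (KH, KW) kernel, valid padding, stride 1."""
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for ty in range(0, oh, tile):            # walk output rows tile by tile
        for tx in range(0, ow, tile):        # walk output cols tile by tile
            for y in range(ty, min(ty + tile, oh)):
                for x in range(tx, min(tx + tile, ow)):
                    out[y, x] = np.sum(img[y:y + kh, x:x + kw] * ker, dtype=np.float32)
    return out

if __name__ == "__main__":
    img = np.random.rand(32, 32).astype(np.float32)
    ker = np.random.rand(3, 3).astype(np.float32)
    print(conv2d_tiled(img, ker, tile=8).shape)  # (30, 30)
```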
b5648
sycl: Remove not needed copy f16->f32 for dnnl mul mat (#14125)
b5599
gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels
* use add_classifier_output_labels
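A hedged usage sketch of the new writer method named above, attaching human-readable class labels to a classifier model's GGUF metadata; the label names, output path, and architecture are illustrative assumptions:

```python
# Minimal sketch: write classifier output labels with gguf-py.
import gguf

writer = gguf.GGUFWriter("classifier.gguf", arch="bert")
writer.add_classifier_output_labels(["negative", "positive"])  # labels are made up

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```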
b5575
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
b5572
gguf: fix failure on version == 0 (#13956)
b5548
CUDA: fix typo in FlashAttention code (#13926)
b5538
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)