Releases · teleprint-me/llama.cpp
b5936
b5849
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
  k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
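The reduction this release note describes combines per-split running max/sum statistics into one normalized output. A minimal numpy sketch of that math (not the actual Vulkan shader; array shapes and names are illustrative assumptions) might look like:

```python
# Illustrative sketch of a flash-attention split-k reduction: each of
# k_num splits has produced an unnormalized partial output acc_i (length
# HSV), a running max m_i, and a running sum l_i over its KV slice.
# The reduce step rescales the partials to a common max and normalizes.
import numpy as np

def split_k_reduce(accs: np.ndarray, ms: np.ndarray, ls: np.ndarray) -> np.ndarray:
    """accs: (k_num, HSV) partial outputs; ms/ls: (k_num,) partial max/sum."""
    m = ms.max()                      # global running max over all splits
    scale = np.exp(ms - m)            # per-split rescale factor
    l = (ls * scale).sum()            # combined softmax denominator
    # In the shader, this per-element accumulation is what gets spread
    # across the workgroup: one thread per element of the HSV dimension.
    out = (accs * scale[:, None]).sum(axis=0)
    return out / l

if __name__ == "__main__":
    k_num, hsv = 4, 8
    rng = np.random.default_rng(0)
    accs = rng.standard_normal((k_num, hsv))
    ms = rng.standard_normal(k_num)
    ls = rng.random(k_num) + 1.0
    print(split_k_reduce(accs, ms, ls))
```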
b5813
gguf-py : add support for chat template jinja files (#14508)
* add support for chat template jinja files
* remove gemma3n hack
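A hedged sketch of what embedding a standalone Jinja chat template into GGUF metadata with gguf-py can look like; the file name, output path, and architecture are assumptions for illustration, and the converter normally handles this step itself:

```python
# Minimal, untested sketch: store the contents of a chat_template.jinja
# file in a GGUF's metadata via the gguf-py writer.
from pathlib import Path
import gguf

template = Path("chat_template.jinja").read_text(encoding="utf-8")

writer = gguf.GGUFWriter("model.gguf", arch="llama")
writer.add_chat_template(template)   # stored as tokenizer chat template metadata

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```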
b5787
Add Conv2d for CPU (#14388)
* Conv2D: Add CPU version
* Half decent
* Tiled approach for F32
* remove file
* Fix tests
* Support F16 operations
* add assert about size
* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
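For the "tiled approach for F32" mentioned above, a small Python sketch of the general idea (walking the output in fixed-size tiles so the working set stays cache-friendly) is shown here; it is not the ggml CPU kernel, and the tile size, shapes, and stride-1 valid padding are illustrative assumptions:

```python
# Illustrative tiled direct 2D convolution over F32 data.
import numpy as np

def conv2d_tiled(img: np.ndarray, ker: np.ndarray, tile: int = 16) -> np.ndarray:
    """img: (H, W) input, ker: (KH, KW) kernel, valid padding, stride 1."""
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for ty in range(0, oh, tile):            # walk output rows tile by tile
        for tx in range(0, ow, tile):        # walk output cols tile by tile
            for y in range(ty, min(ty + tile, oh)):
                for x in range(tx, min(tx + tile, ow)):
                    out[y, x] = np.sum(img[y:y + kh, x:x + kw] * ker, dtype=np.float32)
    return out

if __name__ == "__main__":
    img = np.random.rand(32, 32).astype(np.float32)
    ker = np.random.rand(3, 3).astype(np.float32)
    print(conv2d_tiled(img, ker, tile=8).shape)  # (30, 30)
```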
b5648
sycl: Remove not needed copy f16->f32 for dnnl mul mat (#14125)
b5599
gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels
* use add_classifier_output_labels
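A hedged usage sketch of the new writer method named above, attaching human-readable class labels to a classifier model's GGUF metadata; the label names, output path, and architecture are illustrative assumptions:

```python
# Minimal sketch: write classifier output labels with gguf-py.
import gguf

writer = gguf.GGUFWriter("classifier.gguf", arch="bert")
writer.add_classifier_output_labels(["negative", "positive"])  # labels are made up

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```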
b5575
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
b5572
gguf: fix failure on version == 0 (#13956)
b5548
CUDA: fix typo in FlashAttention code (#13926)
b5538
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)