Releases: CodeLinaro/llama.cpp

b3614

22 Aug 02:40
a1631e5
llama : simplify Mamba with advanced batch splits (#8526)

* llama : advanced batch splits

This includes equal-sequence-length batch splits, which help simplify
the recurrent model operators.
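
As an illustration of what an equal-sequence-length split does, here is a
minimal sketch; the types and helper below are invented for the example
and are not the actual llama.cpp internals.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch only: split per-sequence token runs into "ubatches"
// where every sequence contributes the same number of tokens. Equal lengths
// let a recurrent operator treat the ubatch as one rectangular tensor
// instead of looping over ragged rows.
struct seq_run { int32_t seq_id; int32_t n_tokens; };

static std::vector<std::vector<seq_run>> split_equal(std::vector<seq_run> seqs) {
    std::vector<std::vector<seq_run>> ubatches;
    while (!seqs.empty()) {
        // the shortest remaining run sets the common length of this ubatch
        int32_t n = seqs[0].n_tokens;
        for (const auto & s : seqs) { n = std::min(n, s.n_tokens); }

        std::vector<seq_run> ub;
        for (auto & s : seqs) {
            ub.push_back({s.seq_id, n});
            s.n_tokens -= n;
        }
        ubatches.push_back(std::move(ub));

        // drop sequences that are fully consumed
        std::erase_if(seqs, [](const seq_run & s) { return s.n_tokens == 0; });
    }
    return ubatches;
}
```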

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.
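
In sketch form, the priority rule reads roughly as follows; batch.logits
and n_tokens come from the public llama_batch struct, while the logits_all
parameter stands in for the context-level flag and is an assumption about
the internal plumbing.

```cpp
#include "llama.h"

// Sketch: decide whether token i of the batch needs an output.
// The context-wide logits_all flag takes priority over the per-token
// batch.logits array.
static bool needs_output(bool logits_all, const llama_batch & batch, int32_t i) {
    if (logits_all) {
        return true;                    // context flag wins
    }
    if (batch.logits) {
        return batch.logits[i] != 0;    // per-token flags
    }
    return i == batch.n_tokens - 1;     // default: last token only
}
```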

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <[email protected]>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to cleanly support RWKV-v6 and Mamba-2.
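
llama_model_is_recurrent is part of the public llama.h API; a hedged usage
sketch, with the surrounding helper invented for illustration (the pooling
workaround mirrors the limitation described above):

```cpp
#include "llama.h"

// Usage sketch for the new public predicate. The helper itself is
// invented for this example.
static void configure_split(const llama_model * model, bool pooled,
                            uint32_t & n_seqs_per_ubatch) {
    if (llama_model_is_recurrent(model) && pooled) {
        // recurrent models: one sequence per ubatch for pooled embeddings
        n_seqs_per_ubatch = 1;
    }
}
```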

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <[email protected]>

b3608

20 Aug 17:12
50addec
[SYCL] fallback mmvq (#9088)

* fall back from mmvq to mul_mat (see the sketch after this entry)

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
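
The fallback amounts to a guard in the dispatch path. A schematic version,
with all names invented for illustration (the real ggml-sycl logic is more
involved):

```cpp
// Schematic dispatch: prefer the mul-mat-vec-q (mmvq) kernel, but fall
// back to the generic mul_mat path when mmvq does not apply.
struct mm_args { int type; long ne0; long ne1; };

static bool mmvq_supported(const mm_args & args) {
    return args.ne1 == 1; // e.g. mmvq only handles the mat-vec case
}

static void mul_mat_vec_q(const mm_args &) { /* fast quantized mat-vec kernel */ }
static void mul_mat(const mm_args &)       { /* generic matrix multiply */ }

static void dispatch(const mm_args & args) {
    if (mmvq_supported(args)) {
        mul_mat_vec_q(args); // fast path
    } else {
        mul_mat(args);       // fallback
    }
}
```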

b3605

19 Aug 18:44
cfac111
cann: add doc for cann backend (#8867)

Co-authored-by: xuedinge233 <[email protected]>
Co-authored-by: hipudding <[email protected]>

b3599

17 Aug 01:23
8b3befc
server : refactor middleware and /health endpoint (#9056)

* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <[email protected]>

b3590

15 Aug 20:10
4b9afbb
retrieval : fix memory leak in retrieval query handling (#8955)

* retrieval

* Reuse querybatch to avoid frequent memory allocations

* delete unused whitespace
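
The reuse pattern is the usual lifecycle for a llama_batch: llama_batch_init
once, reset n_tokens per query, llama_batch_free at the end. A minimal sketch:

```cpp
#include "llama.h"

// Sketch: allocate the batch once, reset it per query, free it once,
// instead of leaking a fresh allocation for every query.
static void process_queries(int n_queries, int32_t max_tokens) {
    llama_batch batch = llama_batch_init(max_tokens, /*embd =*/ 0, /*n_seq_max =*/ 1);
    for (int q = 0; q < n_queries; q++) {
        batch.n_tokens = 0; // clear the previous query's tokens
        // ... append the tokens of query q, then decode ...
    }
    llama_batch_free(batch); // one allocation, one free
}
```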

b3584

15 Aug 06:15
5fd89a7
Vulkan Optimizations and Fixes (#8959)

* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespaces

b3581

14 Aug 02:14
06943a6
ggml : move rope type enum to ggml.h (#8949)

* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and adds them back to the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.
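
Concretely, the check reduces to a bit test against the define; a short
sketch (GGML_ROPE_TYPE_NEOX is defined as 2 in ggml.h):

```cpp
#include "ggml.h"

// After the change, ggml.h carries a define instead of an enum:
//     #define GGML_ROPE_TYPE_NEOX 2
// and rope code tests the mode bit field against it:
static bool rope_is_neox(int mode) {
    return (mode & GGML_ROPE_TYPE_NEOX) != 0;
}
```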

* squash! ggml : move rope type enum to ggml.h

This commit applies a suggestion to pass the GGML_ROPE_TYPE_NEOX
macro/define to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <[email protected]>

b3580

13 Aug 18:48
828d6ff
export-lora : throw error if lora is quantized (#9002)
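
A hedged sketch of what such a guard can look like; the helper name and
error message are illustrative, not the exact export-lora code:

```cpp
#include <stdexcept>
#include "ggml.h"

// Sketch: merging adapters is done in floating point, so quantized
// lora tensors are rejected up front.
static void check_lora_tensor(const ggml_tensor * t) {
    if (t->type != GGML_TYPE_F32 && t->type != GGML_TYPE_F16) {
        throw std::runtime_error("lora tensors must be F32 or F16, not quantized");
    }
}
```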

b3570

12 Aug 05:01
4134999
gguf-py : Numpy dequantization for most types (#8939)

* gguf-py : Numpy dequantization for most types

* gguf-py : Numpy dequantization for grid-based i-quants

b3541

07 Aug 15:22
be55695
ggml-backend : fix async copy from CPU (#8897)

* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
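
For context, a hedged usage sketch of the affected public entry point (the
wrapper function here is invented for illustration):

```cpp
#include "ggml-backend.h"

// Sketch of the affected path: stage host data into a backend tensor
// asynchronously, then synchronize before reusing the host buffer.
static void upload(ggml_backend_t backend, struct ggml_tensor * tensor,
                   const void * host_data, size_t size) {
    ggml_backend_tensor_set_async(backend, tensor, host_data, /*offset =*/ 0, size);
    ggml_backend_synchronize(backend); // wait for the async copy to complete
}
```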