
llama : private llama_batch #11875

Open

wants to merge 4 commits into master

Conversation


@ngxson ngxson commented Feb 14, 2025

Ref comment: #11292 (comment)

Closes #10381

Current status:

  • This PR currently contains a first proposal for a public API that hides llama_batch from the public header --> To be discussed
  • Only llama-server works for now
  • TODO: the members of llama_batch can be migrated to C++ types (a rough sketch of what this could look like follows below)
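A minimal sketch of that migration, assuming the internal batch simply mirrors the current fields with STL containers (the name llama_batch_impl and the exact layout are illustrative, not part of this PR):

#include <vector>

#include "llama.h" // llama_token, llama_pos, llama_seq_id

// illustrative only: a C++-side batch where every raw pointer of the current
// struct becomes an owning STL container
struct llama_batch_impl {
    std::vector<llama_token>               token;
    std::vector<float>                     embd;   // flattened, n_tokens * n_embd floats
    std::vector<llama_pos>                 pos;
    std::vector<std::vector<llama_seq_id>> seq_id; // per-token sequence ids
    std::vector<int8_t>                    output; // formerly "logits"
};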


ngxson commented Feb 14, 2025

@ggerganov Would you mind having a look at this initial proposal? Thank you!

Comment on lines 21 to 33
struct llama_batch {
    int32_t n_tokens;
    int32_t max_tokens;
    bool    is_view;

    llama_token  * token;
    float        * embd;
    llama_pos    * pos;
    int32_t      * n_seq_id;
    llama_seq_id ** seq_id;
    int8_t       * logits; // TODO: rename this to "output"
};

Member

This is overall looking good. Two things that come to mind:

  • Should we try to make the change backwards compatible by adding llama_batch_ext and keeping llama_batch + deprecation notices? Later on, we remove llama_batch, but we have given developers time to adapt to llama_batch_ext. If we decide to do that, we would need to implement a small shim for llama_batch_ext <-> llama_batch (a rough sketch is included after this list).

  • The new llama_batch, now that it is in the C++ world, can become more general-purpose and utilize STL containers. Of course, we should keep this PR simple and do this later - the main goal for now is just to remove the definition from the llama.h header so that we can later refactor more easily.
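A minimal sketch of such a shim, assuming llama_batch_ext is simply the struct from this diff under a new name while the legacy llama_batch keeps its current public fields (the helper name and details are illustrative, not part of this PR):

#include "llama.h" // legacy llama_batch and the token/pos/seq_id typedefs

// assumed here: llama_batch_ext is the struct shown in the diff above, renamed
struct llama_batch_ext {
    int32_t n_tokens;
    int32_t max_tokens;
    bool    is_view;

    llama_token  * token;
    float        * embd;
    llama_pos    * pos;
    int32_t      * n_seq_id;
    llama_seq_id ** seq_id;
    int8_t       * logits;
};

// illustrative only: wrap a legacy llama_batch in a llama_batch_ext view
// without copying any buffers
static llama_batch_ext llama_batch_ext_from_legacy(const llama_batch & batch) {
    llama_batch_ext res;
    res.n_tokens   = batch.n_tokens;
    res.max_tokens = batch.n_tokens; // the legacy struct carries no capacity info
    res.is_view    = true;           // the view does not own the buffers
    res.token      = batch.token;
    res.embd       = batch.embd;
    res.pos        = batch.pos;
    res.n_seq_id   = batch.n_seq_id;
    res.seq_id     = batch.seq_id;
    res.logits     = batch.logits;
    return res;
}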

Collaborator Author

OK, so I tried adding the private batch API as llama_batch_ext while keeping the old llama_batch as-is. It's still quite messy and not yet working, but it should show you how it looks.

One small change that I'm not 100% sure about: the llama_encode and llama_decode functions both take a llama_batch, and when moving to _ext I need to change the function names to avoid "error: conflicting types". Here I decided to change them to llama_text_(en|de)code to align better with llama_vision_encode. WDYT?
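For illustration, the renamed entry points described above might be declared roughly like this, with llama_batch_ext assumed to be the new opaque batch type (signatures are a sketch, not what this PR actually ships):

// illustrative declarations only, not the PR's actual header
int32_t llama_text_encode(struct llama_context * ctx, struct llama_batch_ext * batch);
int32_t llama_text_decode(struct llama_context * ctx, struct llama_batch_ext * batch);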

Member

> One small change that I'm not 100% sure about: the llama_encode and llama_decode functions both take a llama_batch, and when moving to _ext I need to change the function names to avoid "error: conflicting types". Here I decided to change them to llama_text_(en|de)code to align better with llama_vision_encode. WDYT?

Not convinced this is the best way. IMO, llama_encode / llama_decode should be oblivious to the contents of the batch. I imagine that in the future we will add multi-modal llama_contexts that support audio, vision, text, etc. simultaneously. So we should be able to add all of these types of inputs into a batch and call llama_encode. Internally, the context will create the necessary sub-batches and process them with the respective compute graphs.

I think we can add llama_encode_ext / llama_decode_ext instead.
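Under that alternative, the new functions would sit next to the existing ones rather than replace their names, roughly as follows (the _ext signature is an assumption, not final):

// existing API, kept for backwards compatibility (eventually deprecated)
int32_t llama_decode    (struct llama_context * ctx, struct llama_batch       batch);
// illustrative _ext variant taking the new opaque batch type
int32_t llama_decode_ext(struct llama_context * ctx, struct llama_batch_ext * batch);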

Successfully merging this pull request may close these issues.

Refactor: Allow adding both tokens and embeddings to llama_batch