
llama : private llama_batch #11875

Open

wants to merge 4 commits into master

Conversation


@ngxson ngxson commented Feb 14, 2025

Ref comment: #11292 (comment)

Closes #10381

Current status:

  • This PR currently contains a first proposal for a public API that hides llama_batch from the public header --> To be discussed
  • Only llama-server works for now
  • TODO: the members of llama_batch can be migrated to C++ types (a rough sketch of what this could look like follows below)
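A minimal sketch of that migration, assuming the internal batch simply mirrors the current fields with STL containers (the name llama_batch_impl and the exact layout are illustrative, not part of this PR):

#include <vector>

#include "llama.h" // llama_token, llama_pos, llama_seq_id

// illustrative only: a C++-side batch where every raw pointer of the current
// struct becomes an owning STL container
struct llama_batch_impl {
    std::vector<llama_token>               token;
    std::vector<float>                     embd;   // flattened, n_tokens * n_embd floats
    std::vector<llama_pos>                 pos;
    std::vector<std::vector<llama_seq_id>> seq_id; // per-token sequence ids
    std::vector<int8_t>                    output; // formerly "logits"
};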


ngxson commented Feb 14, 2025

@ggerganov Would you mind having a look at this initial proposal? Thank you!

Comment on lines 21 to 33
struct llama_batch {
    int32_t n_tokens;
    int32_t max_tokens;
    bool    is_view;

    llama_token  * token;
    float        * embd;
    llama_pos    * pos;
    int32_t      * n_seq_id;
    llama_seq_id ** seq_id;
    int8_t       * logits; // TODO: rename this to "output"
};

Member

This is overall looking good. Two things that come to mind:

  • Should we try to make the change backwards compatible by adding llama_batch_ext and keeping llama_batch + deprecation notices? Later on, we remove llama_batch, but we have given developers time to adapt to llama_batch_ext. If we decide to do that, we would need to implement a small shim for llama_batch_ext <-> llama_batch (a rough sketch is included after this list).

  • The new llama_batch, now that it is in the C++ world, can become more general-purpose and utilize STL containers. Of course, we should keep this PR simple and do this later - the main goal for now is just to remove the definition from the llama.h header so that we can later refactor more easily.
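A minimal sketch of such a shim, assuming llama_batch_ext is simply the struct from this diff under a new name while the legacy llama_batch keeps its current public fields (the helper name and details are illustrative, not part of this PR):

#include "llama.h" // legacy llama_batch and the token/pos/seq_id typedefs

// assumed here: llama_batch_ext is the struct shown in the diff above, renamed
struct llama_batch_ext {
    int32_t n_tokens;
    int32_t max_tokens;
    bool    is_view;

    llama_token  * token;
    float        * embd;
    llama_pos    * pos;
    int32_t      * n_seq_id;
    llama_seq_id ** seq_id;
    int8_t       * logits;
};

// illustrative only: wrap a legacy llama_batch in a llama_batch_ext view
// without copying any buffers
static llama_batch_ext llama_batch_ext_from_legacy(const llama_batch & batch) {
    llama_batch_ext res;
    res.n_tokens   = batch.n_tokens;
    res.max_tokens = batch.n_tokens; // the legacy struct carries no capacity info
    res.is_view    = true;           // the view does not own the buffers
    res.token      = batch.token;
    res.embd       = batch.embd;
    res.pos        = batch.pos;
    res.n_seq_id   = batch.n_seq_id;
    res.seq_id     = batch.seq_id;
    res.logits     = batch.logits;
    return res;
}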

Collaborator Author

OK, so I tried adding the private batch API as llama_batch_ext while keeping the old llama_batch as-is. It's still quite messy and not yet working, but it should show you how it looks.

One small change that I'm not 100% sure about: the llama_encode and llama_decode functions both take a llama_batch, and when moving to _ext I need to change the function names to avoid "error: conflicting types". Here I decided to change them to llama_text_(en|de)code to align better with llama_vision_encode. WDYT?
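For illustration, the renamed entry points described above might be declared roughly like this, with llama_batch_ext assumed to be the new opaque batch type (signatures are a sketch, not what this PR actually ships):

// illustrative declarations only, not the PR's actual header
int32_t llama_text_encode(struct llama_context * ctx, struct llama_batch_ext * batch);
int32_t llama_text_decode(struct llama_context * ctx, struct llama_batch_ext * batch);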

Member

> One small change that I'm not 100% sure about: the llama_encode and llama_decode functions both take a llama_batch, and when moving to _ext I need to change the function names to avoid "error: conflicting types". Here I decided to change them to llama_text_(en|de)code to align better with llama_vision_encode. WDYT?

Not convinced this is the best way. IMO, llama_encode / llama_decode should be oblivious to the contents of the batch. I imagine that in the future we will add multi-modal llama_contexts that support audio, vision, text, etc. simultaneously. So we should be able to add all of these types of inputs into a batch and call llama_encode. Internally, the context will create the necessary sub-batches and process them with the respective compute graphs.

I think we can add llama_encode_ext / llama_decode_ext instead.
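Under that alternative, the new functions would sit next to the existing ones rather than replace their names, roughly as follows (the _ext signature is an assumption, not final):

// existing API, kept for backwards compatibility (eventually deprecated)
int32_t llama_decode    (struct llama_context * ctx, struct llama_batch       batch);
// illustrative _ext variant taking the new opaque batch type
int32_t llama_decode_ext(struct llama_context * ctx, struct llama_batch_ext * batch);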

Successfully merging this pull request may close these issues.

Refactor: Allow adding both tokens and embeddings to llama_batch