llama : private llama_batch #11875
base: master
Conversation
@ggerganov Would you mind having a look at this initial proposal? Thank you!
src/llama-batch.h (outdated)
```cpp
struct llama_batch {
    int32_t n_tokens;
    int32_t max_tokens;
    bool is_view;

    llama_token  *  token;
    float        *  embd;
    llama_pos    *  pos;
    int32_t      *  n_seq_id;
    llama_seq_id ** seq_id;
    int8_t       *  logits; // TODO: rename this to "output"
};
```
This is overall looking good. Two things that come to mind:

- Should we try to make the change backwards compatible by adding `llama_batch_ext` and keeping `llama_batch` + deprecation notices? Later on, we remove `llama_batch`, but by then we will have given developers time to adapt to `llama_batch_ext`. If we decide to do that, we would need to implement a small shim for `llama_batch_ext <-> llama_batch` (a rough sketch follows after this list).
- The new `llama_batch`, now that it is in the C++ world, can become more general-purpose and utilize STL containers. Of course, we should keep this PR simple and do this later - the main goal for now is just to remove the definition from the `llama.h` header so that we can later refactor more easily.
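Purely for illustration, a minimal sketch of what such a shim could look like, assuming `llama_batch_ext` ends up with the fields of the struct shown in this diff; the function name and details are hypothetical, not part of this PR:

```cpp
// Hypothetical shim: expose the public llama_batch as a non-owning
// llama_batch_ext view, so existing call sites keep working during the
// deprecation period. Field names follow the struct shown above.
static llama_batch_ext llama_batch_ext_from_batch(const struct llama_batch & batch) {
    llama_batch_ext ext;

    ext.n_tokens   = batch.n_tokens;
    ext.max_tokens = batch.n_tokens; // the public struct carries no capacity field
    ext.is_view    = true;           // the caller keeps ownership of the arrays

    ext.token    = batch.token;
    ext.embd     = batch.embd;
    ext.pos      = batch.pos;
    ext.n_seq_id = batch.n_seq_id;
    ext.seq_id   = batch.seq_id;
    ext.logits   = batch.logits;

    return ext;
}
```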
Ok so I tried adding the private batch API as `llama_batch_ext` and keeping the old `llama_batch` as-is. It's still quite messy for now, not yet working, but just to show you what it looks like.

One small change that I'm not 100% sure about: the `llama_encode` and `llama_decode` functions both use `llama_batch`, and when moving to `_ext`, I need to change the function name to prevent `error: conflicting types`. Here I decided to change it to `llama_text_(en|de)code` to make it more aligned with `llama_vision_encode`. WDYT?
> One small change that I'm not 100% sure about: the `llama_encode` and `llama_decode` functions both use `llama_batch`, and when moving to `_ext`, I need to change the function name to prevent `error: conflicting types`. Here I decided to change it to `llama_text_(en|de)code` to make it more aligned with `llama_vision_encode`. WDYT?
Not convinced this is the best way. IMO `llama_encode` / `llama_decode` should be oblivious to the contents of the batch. I imagine that in the future we will add multi-modal `llama_context`s that simultaneously support audio, vision, text, etc. So we should be able to add all of these types of inputs into a batch and call `llama_encode`. Internally, the context will create the necessary sub-batches and process them with the respective compute graphs.

I think we can add `llama_encode_ext` / `llama_decode_ext` instead.
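Just to illustrate that direction, a usage sketch of the idea; every name below is hypothetical and does not exist in `llama.h` today, so this is not an agreed API:

```cpp
// Hypothetical usage sketch: one opaque batch carries inputs of several
// modalities, and a single decode call lets the context split it into
// per-modality sub-batches internally. None of these functions exist yet.
void decode_mixed_inputs(llama_context * ctx,
                         const llama_token * tokens, int32_t n_tokens,
                         const float * image_embd, int32_t n_image_tokens) {
    llama_batch_ext * batch = llama_batch_ext_init(/*n_tokens_max =*/ 512, /*n_seq_max =*/ 1);

    llama_batch_ext_add_text (batch, tokens,     n_tokens,       /*seq_id =*/ 0);
    llama_batch_ext_add_image(batch, image_embd, n_image_tokens, /*seq_id =*/ 0);

    llama_decode_ext(ctx, batch); // context builds the per-modality sub-batches

    llama_batch_ext_free(batch);
}
```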
Ref comment: #11292 (comment)
Closes #10381
Current status:
- `llama_batch` removed from the public API --> to be discussed
- `llama-server` works for now
- `llama_batch` can be migrated to cpp types (rough sketch below)
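As a rough idea of what migrating `llama_batch` to cpp types could mean once it is private; a sketch only, assuming the `int32_t` typedefs from `llama.h` are in scope, not what this PR implements:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical C++-only batch: the same information as the struct in this PR,
// but with STL containers owning the storage instead of raw pointers.
// llama_token, llama_pos and llama_seq_id are the int32_t typedefs from llama.h.
struct llama_batch {
    std::vector<llama_token>               token;
    std::vector<float>                     embd;
    std::vector<llama_pos>                 pos;
    std::vector<std::vector<llama_seq_id>> seq_id; // per-token sequence ids
    std::vector<int8_t>                    output; // former "logits" flag

    int32_t n_tokens() const { return (int32_t) token.size(); }
};
```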