# Refactor: Allow adding both tokens and embeddings to llama_batch (#10381)

Status: closed after 14 days of inactivity following being marked as stale.
From the comments — a sequence-based `llama_batch` layout (cf. Proposal 4 below):

```c
struct llama_batch_seq {
    int           n_tokens;
    llama_token * token; // only one of token and embd can be non-null
    float       * embd;
    enum { none, all, last } logits;
    int           seq_id;
    int           pos;   // -1 = end
};

struct llama_batch {
    int               n_seqs;
    llama_batch_seq * seqs;
};
```
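A hypothetical usage sketch of that layout — one token sequence plus one run of raw embeddings targeting the same sequence (all values are illustrative, not from the issue):

```c
// Hypothetical usage: a text prefix as tokens, followed by two embedding
// vectors (e.g. image embeddings), both on sequence 0. pos = -1 appends
// at the end of the sequence.
llama_token toks[3] = { 15043, 3186, 29889 };
float       img_embd[2 * 4096]; // 2 embedding vectors, assuming n_embd = 4096

struct llama_batch_seq seqs[2] = {
    { /*n_tokens=*/ 3, /*token=*/ toks, /*embd=*/ NULL,     /*logits=*/ none, /*seq_id=*/ 0, /*pos=*/ 0  },
    { /*n_tokens=*/ 2, /*token=*/ NULL, /*embd=*/ img_embd, /*logits=*/ last, /*seq_id=*/ 0, /*pos=*/ -1 },
};
struct llama_batch batch = { /*n_seqs=*/ 2, /*seqs=*/ seqs };
```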
## Background Description
Ref: #7553; required for supporting future vision models (#8010).

I initially planned to make a proposal PR for this, but it turns out to be quite a bit more complicated than I thought. Therefore, I'm creating this issue for further discussion before actually implementing it.
## Possible Refactor Approaches
The problem can be divided into 2 parts:

1. How can the `llama_batch` be constructed?
2. How should the cgraph be modified?

For the second part (how the cgraph should be modified), it should be simple: `llm_build_inp_embd` can be modified to concat the tensors from the "learned" embd and the input embd. The attention mask also needs to be updated accordingly.
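A minimal sketch of that concat in ggml (the helper name and exact shapes are illustrative, not the actual `llm_build_inp_embd` code):

```c
// Hypothetical helper: build the input embeddings by looking up the learned
// table for the token part of the batch, then concatenating the raw
// embeddings from the batch after it.
static struct ggml_tensor * build_inp_embd_mixed(
        struct ggml_context * ctx,
        struct ggml_tensor  * tok_embd,    // [n_embd, n_vocab] learned embedding table
        struct ggml_tensor  * inp_tokens,  // [n_tok]           token IDs in the batch
        struct ggml_tensor  * inp_embd) {  // [n_embd, n_emb]   raw embeddings in the batch
    // learned embeddings for the token inputs: [n_embd, n_tok]
    struct ggml_tensor * tok_part = ggml_get_rows(ctx, tok_embd, inp_tokens);
    // concat along dim 1, the token/position dimension: [n_embd, n_tok + n_emb]
    return ggml_concat(ctx, tok_part, inp_embd, 1);
}
```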
For the first part (how the `llama_batch` can be constructed), the problem is that there are many different possible approaches:

### Proposal 1: Add `n_embds` to `llama_batch`
The downside of this approach is that it's quite messy to keep track of `n_seq_id`, `seq_id`, and `logits`.
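A sketch of what this could look like — the `n_embds` field is the new part; the remaining fields mirror the existing `llama_batch`:

```c
// Hypothetical layout: tokens and embeddings carried side by side in one batch.
struct llama_batch {
    int32_t         n_tokens;
    int32_t         n_embds;  // number of embedding inputs, appended after the tokens
    llama_token   * token;    // [n_tokens]
    float         * embd;     // [n_embds * n_embd]
    // bookkeeping must now cover n_tokens + n_embds entries, which is where
    // tracking n_seq_id/seq_id/logits gets messy:
    llama_pos     * pos;
    int32_t       * n_seq_id;
    llama_seq_id ** seq_id;
    int8_t        * logits;
};
```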
### Proposal 2: Add an overload version of `llama_decode`/`llama_encode`

The downside would be that this is kind of "hacky" (not intuitive for developers), because one batch is now represented by 2 `llama_batch` objects.
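The shape of such an overload might be as follows (hypothetical signature, not an existing llama.cpp API):

```c
// Hypothetical: one logical batch passed as two llama_batch objects,
// one carrying only tokens and one carrying only embeddings.
int32_t llama_decode_ext(
        struct llama_context * ctx,
        struct llama_batch     batch_tokens, // token != NULL, embd == NULL
        struct llama_batch     batch_embd);  // embd  != NULL, token == NULL
```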
### Proposal 3: Keep `llama_batch` the same, but token IDs < 0 are embeddings

For example:
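(A sketch of the idea; the exact encoding of negative IDs is a guess, not spelled out in the issue.)

```c
// Hypothetical encoding: IDs >= 0 are normal tokens; a negative ID at slot i
// means "use embedding row (-id - 1) from batch.embd for this position".
llama_token tokens[4] = {
    15043,  // a normal text token
    -1,     // -> batch.embd row 0 (e.g. an image patch embedding)
    -2,     // -> batch.embd row 1
    29889,  // another normal text token
};
```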
This seems to be easier to implement than all the other proposals. The only thing I'm not sure about is: do we expect negative token IDs to be a reserved use case?
### Proposal 4: Completely refactor `llama_batch` to accept a sequence list instead of a token list

This is actually proposed by @slaren, but I'm not sure what it should look like in the real world. Could you please explain it further? (The sequence-based `llama_batch_seq` layout quoted at the top of this page appears to be this idea.)
I'm also tagging @ggerganov and @abetlen for further discussion. Thank you!