model : add PLaMo-2 model #14560
Conversation
This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running hellaswag on thousands of sequences with very small 100k-parameter Mamba models.
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also, 'llama_past_clear' is more obvious about what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway.) Still, I'm open to better suggestions.
This also slightly reduces the diff from the master branch
Also begin reverting some implicit state rollback code.
But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually).
```diff
@@ -293,6 +294,7 @@ enum llm_tensor {
     LLM_TENSOR_SSM_IN,
     LLM_TENSOR_SSM_CONV1D,
     LLM_TENSOR_SSM_X,
+    LLM_TENSOR_SSM_BCDT,
```
Isn't `SSM_BCDT` the same as `SSM_X`?
```cpp
struct ggml_tensor * ssm_out = nullptr;
struct ggml_tensor * ssm_in = nullptr;
struct ggml_tensor * ssm_x = nullptr;
struct ggml_tensor * ssm_bcdt = nullptr; // PLaMo-2
```
`ssm_bcdt` doesn't seem to be used in this implementation of PLaMo-2 (unless I missed it). It looks like `ssm_x` plays the same role?
```python
# If there is no lm_head, we need to map the token embedding to the output layer
assert self.tensor_names is not None
if all(['lm_head' not in name for name in self.tensor_names]):
    name_base = name.replace(".embed_tokens.weight", "")
    output_name = "lm_head"

    embed_tokens_mapped = self.map_tensor_name(name)
    output_mapped = self.map_tensor_name(output_name) + ".weight"

    return [(embed_tokens_mapped, data_torch), (output_mapped, data_torch)]
```
There's no need to duplicate the token embeddings tensor in the model file; it can be done at load time in the model (this is what is done for most of the other arches). Search for `TENSOR_DUPLICATED` and `TENSOR_NOT_REQUIRED` in `src/llama-model.cpp`.
This PR adds support for the PLaMo-2 model in llama.cpp, which was also requested in a related discussion: #13874. The model uses a custom-implemented tokenizer, so this PR includes both the model itself (an architecture combining Mamba and Attention, similar to Jamba) and an implementation of the new custom tokenizer.
Based on #7531
How to check that plamo-2-translate works with this PR: first, retrieve the model itself.
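For example, something like the following should work (the Hugging Face repo id here is an assumption; substitute the actual repository you want to test):

```bash
# Download the PLaMo-2 translate checkpoint (repo id assumed; adjust as needed).
huggingface-cli download pfnet/plamo-2-translate --local-dir plamo-2-translate
```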
Then I needed to modify tokenizer.jsonl, padding it with some meaningless vocabulary entries so that the vocabulary size matches the value specified in config.json (namely 100032); I used a small script for this.

Next, convert the model to GGUF with the following command:
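A typical invocation of the conversion script would look roughly like this (the input directory, output file name, and dtype are assumptions, so adjust them to your setup):

```bash
# Convert the downloaded Hugging Face checkpoint (with the padded tokenizer) to GGUF.
python convert_hf_to_gguf.py ./plamo-2-translate --outfile plamo-2-translate.gguf --outtype f16
```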
Then build the binaries as follows:
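For example, a standard CMake release build; the build directory is named release here only so that the llama-cli path used in the next step matches:

```bash
# Configure and build llama.cpp in release mode; binaries land in release/bin/.
cmake -B release -DCMAKE_BUILD_TYPE=Release
cmake --build release --config Release -j
```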
Finally, I was able to run the plamo-2-translate model as follows:
```bash
./release/bin/llama-cli -m plamo-2-translate.gguf -p "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=English\nHello, how are you?\n<|plamo:op|>output\n" -no-cnv --verbose-prompt --no-warmup -sp
```
intermediate outputs
Output:
Seems to be working correctly!