model : add PLaMo-2 model #14560
Conversation
This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running hellaswag on thousands of sequences with very small 100k-parameter Mamba models.
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also, 'llama_past_clear' is more obvious about what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway.) Still, I'm open to better suggestions.
This also slightly reduces the diff from the master branch
Also begin reverting some implicit state rollback code.
But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually).
```diff
@@ -293,6 +294,7 @@ enum llm_tensor {
     LLM_TENSOR_SSM_IN,
     LLM_TENSOR_SSM_CONV1D,
     LLM_TENSOR_SSM_X,
+    LLM_TENSOR_SSM_BCDT,
```
Isn't `SSM_BCDT` the same as `SSM_X`?
```cpp
struct ggml_tensor * ssm_out = nullptr;
struct ggml_tensor * ssm_in = nullptr;
struct ggml_tensor * ssm_x = nullptr;
struct ggml_tensor * ssm_bcdt = nullptr; // PLaMo-2
```
`ssm_bcdt` doesn't seem to be used in this implementation of PLaMo-2 (unless I missed it). It looks like `ssm_x` plays the same role?
```python
# If there is no lm_head, we need to map the token embedding to the output layer
assert self.tensor_names is not None
if all(['lm_head' not in name for name in self.tensor_names]):
    name_base = name.replace(".embed_tokens.weight", "")
    output_name = "lm_head"

    embed_tokens_mapped = self.map_tensor_name(name)
    output_mapped = self.map_tensor_name(output_name) + ".weight"

    return [(embed_tokens_mapped, data_torch), (output_mapped, data_torch)]
```
There's no need to duplicate the token embeddings tensor in the model file; it can be done at load time in the model (this is what is done for most of the other arches). Search for `TENSOR_DUPLICATED` and `TENSOR_NOT_REQUIRED` in `src/llama-model.cpp`.
This PR adds support for the PLaMo-2 model in llama.cpp, which was also requested in a related discussion: #13874. The model uses a custom-implemented tokenizer, so this PR includes both the model itself (an architecture combining Mamba and Attention, similar to Jamba) and an implementation of the new custom tokenizer.
Based on #7531
How to check that plamo-2-translate works with this PR: first, retrieve the model itself.
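For example, something like the following should work (the Hugging Face repo id here is an assumption; substitute the actual repository you want to test):

```bash
# Download the PLaMo-2 translate checkpoint (repo id assumed; adjust as needed).
huggingface-cli download pfnet/plamo-2-translate --local-dir plamo-2-translate
```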
Then I needed to modify tokenizer.jsonl, padding it with some meaningless vocabulary entries so that the vocabulary size matches the value specified in config.json (namely 100032); I used a small script for this.

Next, convert the model to GGUF with the following command:
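A typical invocation of the conversion script would look roughly like this (the input directory, output file name, and dtype are assumptions, so adjust them to your setup):

```bash
# Convert the downloaded Hugging Face checkpoint (with the padded tokenizer) to GGUF.
python convert_hf_to_gguf.py ./plamo-2-translate --outfile plamo-2-translate.gguf --outtype f16
```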
Then build the binaries as follows:
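For example, a standard CMake release build; the build directory is named release here only so that the llama-cli path used in the next step matches:

```bash
# Configure and build llama.cpp in release mode; binaries land in release/bin/.
cmake -B release -DCMAKE_BUILD_TYPE=Release
cmake --build release --config Release -j
```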
Finally, I was able to run the plamo-2-translate model as follows:
```bash
./release/bin/llama-cli -m plamo-2-translate.gguf -p "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=English\nHello, how are you?\n<|plamo:op|>output\n" -no-cnv --verbose-prompt --no-warmup -sp
```
intermediate outputs
Output:
Seems to be working correctly!