model : add hunyuan moe #14425

Merged · 27 commits · Jul 8, 2025

Conversation

ngxson
Collaborator

@ngxson ngxson commented Jun 27, 2025

Fix #14415

TODO:

@github-actions github-actions bot added the python python script changes label Jun 27, 2025
@ngxson
Collaborator Author

ngxson commented Jun 27, 2025

OK, getting somewhere now. The model runs, but outputs gibberish:

[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]

@ubergarm

Thanks for working on this!

I got the same-looking output trying llama-server on ngxson/xsn/hunyuan-moe@51886a47a with the freshly converted bf16.

The only odd things I noticed were:

  1. I had to pip install tiktoken to get it to convert
  2. Conversion had an odd warning WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
  3. On startup llama-server printed this warning:
load: control-looking token: 127957 '<|endoftext|>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

Tested on an AMD 7965WX 24-core rig with 256GB DDR5-4800 and dual RTX A6000 (96GB total VRAM).

👈 a few more commands and logs fwiw

convert

python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/ \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct/

...

llama-server

model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf

./build/bin/llama-server \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 8192 \
  -ts 48,48 \
  -ngl 10 \
  --threads 24 \
  --host 127.0.0.1 \
  --port 8080

...

client

>>> User:

Tell a funny joke in English.

>>> Assistant:

[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]

@arch-btw
Contributor

arch-btw commented Jun 27, 2025

I don't know as much about this as you guys, but could it be that the tokenizer is splitting characters like 新 ("new") into raw bytes?

So the UTF-8 sequence 0xe696b0 becomes 3 separate bytes (e6, 96, b0). And the other character 旧 ("old") splits into 3 bytes as well (e6, 97, a7).

And so the fragments get wrapped in [UNK_BYTE_] prefixes. The token stream becomes corrupt in the output and sort of traps the model in a "new --> old" loop, which then blocks normal text generation?

Because common Chinese characters always use 3 bytes in UTF-8:

  • 新 converts to b'\xe6\x96\xb0' (3 bytes)
  • 旧 converts to b'\xe6\x97\xa7' (3 bytes)

It matches the error: [UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧]
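
(To double-check those byte values: a quick Python one-off, not tied to the tokenizer at all, just showing the UTF-8 encodings that line up with the error above.)

>>> "新".encode("utf-8").hex()
'e696b0'
>>> "旧".encode("utf-8").hex()
'e697a7'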

@ngxson
Collaborator Author

ngxson commented Jun 27, 2025

The cgraph is still not correct. Testing with this tiny random weight: https://huggingface.co/ngxson/hunyuan-moe-tiny-random/tree/main

Seems like the problem is in the self-attention block.

@kooshi
Contributor

kooshi commented Jun 28, 2025

I don't know if the improvements I am seeing are from your last wip commit, or from my edits to the convert script, but I currently get almost intelligible responses.

The changes I made were:

  • specify the BOS token explicitly with self.gguf_writer.add_bos_token_id(127959), as it is incorrect in Hunyuan's config.json
  • use tokenizer.special_tokens.values() instead of tokenizer.get_added_vocab() to determine control tokens
  • skip lm_head.weight as the embedding weights are tied
  • changed the base model from LlamaModel to TextModel for a more generic foundation

My edits are here: https://github.com/kooshi/llama.cpp/tree/hunyuan
Full disclaimer: I have no idea what I'm doing, but the BOS token was definitely broken.
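
(For readers following along, here is a rough sketch of what those bullet points translate to inside convert_hf_to_gguf.py. It is pieced together from the comments in this thread rather than taken from the actual diff (see the linked branch), and the method bodies are abbreviated.)

@ModelBase.register("HunYuanMoEV1ForCausalLM")
class HunYuanMoEModel(TextModel):  # based on TextModel rather than LlamaModel
    model_arch = gguf.MODEL_ARCH.HUNYUAN_MOE

    def set_vocab(self):
        # ... build the BPE vocab, marking control tokens from tokenizer.special_tokens.values()
        # instead of tokenizer.get_added_vocab() ...
        # config.json carries the wrong BOS id, so write it explicitly:
        self.gguf_writer.add_bos_token_id(127959)

    def modify_tensors(self, data_torch, name, bid):
        # the embedding weights are tied, so the separate lm_head tensor can be dropped
        if name == "lm_head.weight":
            return []
        return [(self.map_tensor_name(name), data_torch)]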

> hello
<think>[UNK_BYTE_0x0a>
]Okay,[UNK_BYTE_0x20 the]the[UNK_BYTE_0x20 user]user[UNK_BYTE_0x20 said]said[UNK_BYTE_0x20 "]"hello".[UNK_BYTE_0x20 I]I[UNK_BYTE_0x20 need]need[UNK_BYTE_0x20 to]to[UNK_BYTE_0x20 respond]respond[UNK_BYTE_0x20 appropriately]appropriately.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]First,[UNK_BYTE_0x20 hello]hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello[UNK_BYTE_0x20 there]there![UNK_BYTE_0x0a!

][UNK_BYTE_0x0a!

]Hi[UNK_BYTE_0x20 there]there.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hi.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hi.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hey.[UNK_BYTE_0x0a.

(continues forever)

@ngxson
Collaborator Author

ngxson commented Jun 28, 2025

The more I look at the upstream implementation, the more I wonder whether it actually works.

My Mac M3 Ultra can't load the original model even though it has 512GB of RAM.

Now testing with the tiny weight: switching between eager and sdpa gives different output, which indicates that one of the two attention implementations is buggy.

Also, flash_attn does not work at all; they haven't even verified that code path before shipping (NameError: name 'flash_attn_func' is not defined).

And more importantly, attention_mask is None everywhere, even when using the example code provided on HF.

If that is true, it means they messed up badly this time.

@Downtown-Case

modeling_hunyuan.py is basically identical to the file for the old hunyuan-large, with 1 changed line:

https://www.diffchecker.com/P3e0hQM5/

https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Instruct/

And hunyuan.py (the actual model class here) is largely copied from modeling_hunyuan.py, including unused features like CLA:

https://www.diffchecker.com/P9FIR5OD/

In other words, it's almost Hunyuan-Large? I'm not sure why the HF attention implementations would be bugged. But other reimplementations like vLLM's seem to work, so maybe they can shed some light on this:

quinnrong94/vllm@5302fbf

@Downtown-Case

Downtown-Case commented Jun 28, 2025

I take that back; apparently vLLM only sometimes works with A13B, heh:

ikawrakow/ik_llama.cpp#561 (comment)

vllm-project/vllm#20183

vllm-project/vllm#20114

@Noeda
Contributor

Noeda commented Jun 28, 2025

I got the original model from Hugging Face working coherently on pure CPU. It uses the HunYuanSdpaAttention code path.

This is all tentative as I just got it running at all:

If I compare logits for a single-token prompt, I get a very similar logit distribution from both llama.cpp and HF. With more than one token, things look different. I'm going purely by numerical token IDs for llama.cpp, as the tokenizer is messed up as observed (I tried 'a', token 64, as the single-token prompt and '12', tokens (16, 17), as the two-token test, e.g. llama-eval-callback --no-escape --model hunyuan-q8.gguf -n 1 -c 512 -p '12').

This is with the combined code from @ngxson and @kooshi, with the .gguf made with @kooshi's code (I took the latest efforts I saw in this discussion as a starting point).
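
(As a hedged sketch of the HF side of such a comparison: something like the following dumps the logits for the two-token prompt (16, 17) so they can be eyeballed against the llama-eval-callback output. The local model path is a placeholder.)

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/tencent_Hunyuan-A13B-Instruct",   # placeholder path
    torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True)
with torch.no_grad():
    out = model(torch.tensor([[16, 17]]))       # the '12' prompt mentioned above
print(out.logits[0, -1].float().topk(10))       # top-10 logits at the last position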


Below in the dropdown is the transformers test program that produces coherent text for me (up to 100 tokens, because I was too impatient to try longer prompts). I think installing accelerate and asking it to use bfloat16 really helps with memory. I think that would make it run on the 512GB M3 machine too; IIRC when I did this for dots.llm1 I really had to use bfloat16 to not run out of memory.

My machine has 256GB of memory, a Hetzner server with a modern AMD EPYC CPU. I do have a Mac Studio (M2, 192GB) as well but for CPU work this Hetzner is usually much faster.

(I don't know why asking it to use bfloat16 helps; maybe it avoids making giant copies of tensors when you ask it to do that. It's just something I observed and I never checked what it's doing behind the scenes.)

test.py

This is a version of the example code from the Huggingface page that I modified a bit.

#!/usr/bin/env python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re

def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'

        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                          enable_thinking=True # Toggle thinking mode (default: True)
                                                      )

        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=20)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
stdout of test.py

The output includes both the token IDs and the decoded text (two print()s). To run this, you need to install accelerate into your Python environment for the device_map argument.

(hunyuan) shannon@soga ~/hunyuan_llama.cpp/hf> ./test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.09it/s]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let

I'm on and off this weekend also trying to figure out where exactly the computation graph is off. If I find out before someone else does, I'll let you all know.

(Runs surprisingly fast on transformers+CPU, I'm used to that combo being extraordinarily slow. It is still very slow, just not like "it will take 30 minutes to make 10 tokens" slow).

@jacekpoplawski

Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I’m limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won’t work for me.

@ubergarm

ubergarm commented Jun 28, 2025

@jacekpoplawski

Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I’m limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won’t work for me.

Their official inference script for running the int4 quant on vLLM uses --dtype bfloat16

(still didn't work for me though)

@Noeda
Contributor

Noeda commented Jun 28, 2025

To add to @ubergarm's options, I did notice there are some quantized versions like https://huggingface.co/tencent/Hunyuan-A13B-Instruct-FP8 or https://huggingface.co/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 (at first glance they look like they are designed to work with transformers; I've never in my entire life run vLLM or sglang even once).

The GPTQ-Int4 one has a single model.safetensors at 43.7GB which maybe works. One would hope 😉

Haven't tried any of them. For computation graph work it feels better to use the highest precision I can run conveniently.
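
(For the earlier 4-bit question: with plain Transformers, on-the-fly 4-bit loading usually goes through bitsandbytes, roughly as sketched below. Whether the custom Hunyuan remote code tolerates it is untested here, so treat this purely as an illustration.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",            # spread layers across the available GPUs
    trust_remote_code=True,
)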

@ngxson
Collaborator Author

ngxson commented Jun 28, 2025

If someone can run it, could you please verify whether attention_mask inside HunYuanDecoderLayer has a non-None value? Thanks.
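
(A minimal way to probe this without editing modeling_hunyuan.py is a forward pre-hook. This is only a sketch; it assumes the decoder layer class is named HunYuanDecoderLayer, as in the remote code, and that the mask is passed as a keyword argument.)

def install_mask_probe(model):
    # print whether attention_mask is None every time a decoder layer runs
    def probe(module, args, kwargs):
        print("attention_mask is None:", kwargs.get("attention_mask") is None)
    for module in model.modules():
        if module.__class__.__name__ == "HunYuanDecoderLayer":
            module.register_forward_pre_hook(probe, with_kwargs=True)

# usage: call install_mask_probe(model) right after from_pretrained(), then run generate()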

@Noeda
Contributor

Noeda commented Jun 28, 2025

(hunyuan) shannon@soga ~/hunyuan_llama.cpp/hf> ./test2.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.91it/s]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None

@ngxson is this the part you wanted to see if it's None or not? Argument to the forward()?

[Screenshot 2025-06-28 at 12:28:33: the print() added inside HunYuanDecoderLayer's forward()]

Edit: took a bigger screenshot to show more clearly where I put that: HunYuanDecoderLayer's forward(). The line numbers you see won't match the original because I have more print() debugging at the top of the file and other hacky stuff I added.

Stdout tail, since that first paste is cut off; I see None throughout the entire run. Output looks coherent.

Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let

Edit2: I'm going to let this thing generate a full response which might take a while. But I feel this might be a bit short as a test; it almost verbatim mentions the prompt in the <think> so maybe it's about to repeat itself or something. I'll paste as a new comment when it's done. Just want to get more confirmation the HF implementation itself works beyond very short generations.

@Noeda
Contributor

Noeda commented Jun 28, 2025

Full response example of the transformers version; I gave it 5000 token max:

stdout from test2.py (I cut off all the parts that said attention mask is None)
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914,    757,   1212,    555,  89746,
           1148,    358,   1440,     13,   5629,     11,   7106,   2890,   7720,
             25,  96931,    279,   4851,     11,  36050,  35855,     11,   8779,
            449,   4785,   6373,     13,   5112,  10723,   2890,     25,  26338,
           8631,     11,  18547,     11,  18710,     13,  10926,   1101,  67232,
           4907,   5990,     13,   8840,     11,    323,   1317,   9860,   6392,
           1093,  18189,   5326,    315,  21249,  19338,   2345,   8747,  16629,
             11,  63308,     11,   1063,  51423,     13,   8840,     11,    323,
          25702,   7720,     11,   1093,   2731,   5044,    477,   8271,    734,
             13,   6914,    757,  31335,   1521,   3585,    382,   3563,    449,
            459,  17219,    430,   5415,   5912,  10368,    706,  12387,   7720,
             13,   5112,   1464,   1523,   1139,   7106,     11,  10723,     11,
            323,   7344,   1023,  11306,     13,   1789,   7106,     25,   4851,
           2890,    320,   4620,    261,   4851,     11,   4827,   6680,   7410,
            705,   4785,   6373,    320,  22464,     82,  25247,     11,  22890,
          16124,    705,  22852,   1887,    320,  37860,    570,  38895,     25,
            842,  16751,   1354,     11,  26338,   8631,  56592,  16708,     11,
           3698,   1900,  18710,     13,  73235,     25,  57924,   5357,     11,
           5044,     11,   7344,  32174,  25702,  18174,     13,   7429,     11,
           3674,   7720,    422,   1912,  23783,     11,    719,   7344,    430,
            596,  10309,     13,  14998,    311,   2567,    433,  64694,     11,
            779,   7344,    220,     19,     12,     20,   1401,   3585,     13,
          35106,    503,  71921,     13,   7557,   2771,    433,  28555,     13,
           6914,    757,   1817,    422,    358,  13942,   4205,     13,   8840,
             11,   4907,   5990,   2345,  64562,    649,   5376,  61784,     11,
            539,   1120,   8395,  25247,     13,  22335,     11,    430,    596,
           3062,     13,   2100,  63179,    682,   1521,   1139,    264,  56887,
          14646,     13,   6914,    757,  10165,   1473,  31504,  10368,   6209,
            264,   7029,   2134,    315,   7720,    369,   8244,   1664,  33851,
             13,  13101,   2740,     11,    433,  96931,    279,   4851,     11,
          18899,  35855,    323,  46301,    279,   5326,    315,   4787,   1093,
          63308,    323,   4851,   8624,     11,   1418,  86387,    304,   4785,
           6373,   1555,  52703,  20252,    323,  16124,   4857,     13,  49693,
            750,     11,    433,  31854,    279,   4984,    315,    842,  16751,
           1354,     11,  18189,   8631,     11,  18547,     11,    323,  13803,
            315,  18710,     11,    323,  57924,  25702,    734,     11,  56028,
           5357,     11,   5044,     11,    323,  13893,  80430,   4325,  14228,
          10723,  18174,     13,  23212,     11,   5912,   5820,  12231,    988,
           4907,   5990,    555,  18899,  61784,    323,  11815,    264,  16643,
          22852,   1887,     11,  18189,  17563,   5326,     13,  32255,     11,
           1521,   6372,  17210,    311,    264,   5129,     11,  39345,     11,
            323,    810,  24770,   2324,    382,  14524,     11,    374,    430,
           2288,   1317,     30,  10926,  74481,     13,   6914,    757,   1518,
             13,    330,  31504,  10368,   5825,  62387,    582,  25489,   7720,
             13,  13101,   2740,     11,    433,  96931,    279,   4851,     11,
          73115,   6680,   7410,     11,  52797,   4785,   6373,     11,    323,
          67232,  40368,     13,  49693,    750,     11,    433,  19786,    842,
          16751,   1354,     11,  18189,   8631,     11,  18547,     11,    323,
          18710,     11,   1418,  47594,   5357,    323,   5044,     13,   1102,
           1101,  12992,   4907,   5990,    323,   1253,   7781,  25702,  18174,
             13,  28993,     11,    433,  39990,    264,   5129,     11,  39345,
           2324,   1210,   3011,    596,   2731,     13,   4497,  64694,     13,
           4343,    369,  32373,     13,  22335,     11,    430,   4375,     13,
           7557,   2771,    311,   6420,   1401,   5789,   2085,   3794,   2288,
          11944,     13,   3011,   1288,   3504,    433,    627,    524,  27963,
            397,     27,   9399,    397,  31504,  10368,  28421,  28254,   7720,
           4028,   7106,     11,  10723,     11,    323,  25702,  31576,     13,
          13101,   2740,     11,    433,  96931,    279,   4851,     11,  36050,
          35855,     11,    323,  73115,   6680,   7410,     11,  18189,    279,
           5326,    315,   4851,   8624,     11,  63308,     11,    323,  12943,
             13,   1102,  52797,   4785,   6373,    555,  20252,  25247,    323,
           4857,  16025,  16124,     11,   1418,   1101,  47594,  22852,    734,
             13,  49693,    750,     11,  10368,  31854,    842,  16751,    258,
           4984,     11,  46649,  23747,   8631,     11,  18547,     11,    323,
          13803,    315,  18710,     11,    323,  67232,   5357,     11,   5044,
             11,    323,  14604,  56062,     13,   1102,   4726,  12231,    988,
           4907,   5990,    555,  18899,  61784,    323,   1253,   7781,   4325,
          14228,  25702,  18174,     13,  21153,   3210,     11,   1521,   6372,
          12192,    264,   5129,     11,  39345,     11,    323,    810,  24770,
           2324,    627,    524,   9399,     29, 127960]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let me start by recalling what I know. First, physical health benefits: strengthens the heart, improves circulation, helps with weight management. Then mental health: reduces stress, anxiety, depression. Maybe also boosts energy levels. Oh, and long-term stuff like reducing risk of chronic diseases—diabetes, hypertension, some cancers. Oh, and cognitive benefits, like better memory or brain function. Let me organize these points.

Start with an introduction that states regular exercise has numerous benefits. Then break down into physical, mental, and maybe other categories. For physical: heart health (stronger heart, lower blood pressure), weight management (burns calories, builds muscle), immune system (maybe). Mental: endorphins, reduces stress/anxiety, combats depression. Cognitive: enhances focus, memory, maybe delays cognitive decline. Also, social benefits if group exercises, but maybe that's optional. Need to keep it concise, so maybe 4-5 key points. Avoid jargon. Make sure it flows. Let me check if I missed anything. Oh, energy levels—exercise can increase stamina, not just burn calories. Yeah, that's important. So summarize all these into a coherent paragraph. Let me draft:

Regular exercise offers a wide range of benefits for overall well-being. Physically, it strengthens the heart, improving circulation and lowering the risk of conditions like hypertension and heart disease, while aiding in weight management through calorie burning and muscle building. Mentally, it triggers the release of endorphins, reducing stress, anxiety, and symptoms of depression, and enhances cognitive function, boosting focus, memory, and potentially delaying age-related mental decline. Additionally, regular activity elevates energy levels by improving stamina and supports a stronger immune system, reducing illness risk. Together, these effects contribute to a longer, healthier, and more balanced life.

Wait, is that too long? Maybe shorten. Let me see. "Regular exercise provides multifaceted benefits. Physically, it strengthens the heart, lowers blood pressure, aids weight management, and boosts immunity. Mentally, it releases endorphins, reducing stress, anxiety, and depression, while enhancing focus and memory. It also increases energy levels and may delay cognitive decline. Overall, it promotes a longer, healthier life." That's better. More concise. Check for clarity. Yeah, that works. Make sure to mention key areas without getting too detailed. That should cover it.
</think>
<answer>
Regular exercise delivers profound benefits across physical, mental, and cognitive domains. Physically, it strengthens the heart, improves circulation, and lowers blood pressure, reducing the risk of heart disease, hypertension, and stroke. It aids weight management by burning calories and building lean muscle, while also enhancing immune function. Mentally, exercise triggers endorphin release, alleviating stress, anxiety, and symptoms of depression, and boosts focus, memory, and emotional resilience. It further elevates energy levels by improving stamina and may delay age-related cognitive decline. Collectively, these effects promote a longer, healthier, and more balanced life.
</answer><|eos|>

Code is almost the same as before; pasting for reproducibility:

test2.py
#!/usr/bin/env python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re

def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'

        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                          enable_thinking=True # Toggle thinking mode (default: True)
                                                      )

        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=5000)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()

The output looks normal to me and it answered the prompt. It does look to me like it works.

CPU-only, 256GB Hetzner server.

@ggerganov
Member

Here is the PPL with the pretrain model from https://huggingface.co/tencent/Hunyuan-A13B-Pretrain:

make -j && ./bin/llama-perplexity -m ../models/hunyuan-a13b-pt/ggml-model-q8_0.gguf -f wikitext-2-raw/wiki.test.raw -fa

Final estimate: PPL = 5.2861 +/- 0.03234

The logits still don't match 100% due to the problem with the router algorithm that I pointed out in #14425 (comment), but I think we can have a look at this afterwards.

@ngxson Do the logits match if this new expert router algorithm is disabled in the reference implementation?

@fernandaspets

I'm running the polyglot aider test with the Q8 GGUF from Bullerwins. It's not passing any tests. The responses are well-formed with thinking ON, but with thinking OFF it just misses everything.

- dirname: 2025-07-03-08-17-48--Hunyuan-A13B-Instruct-q8_0-5
test_cases: 10
model: openai/Hunyuan-A13B-Instruct-GGUF
edit_format: diff
commit_hash: 3db4d37
pass_rate_1: 0.0
pass_rate_2: 0.0
pass_num_1: 0
pass_num_2: 0
percent_cases_well_formed: 80.0
error_outputs: 5
num_malformed_responses: 5
num_with_malformed_responses: 2
user_asks: 2
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 119378
completion_tokens: 109052
test_timeouts: 0
total_tests: 225
command: aider --model openai/Hunyuan-A13B-Instruct-GGUF
date: 2025-07-03
versions: 0.85.2.dev
seconds_per_case: 226.9
total_cost: 0.0000

@kzjeef kzjeef left a comment

Great work!

I've pulled the code and tested it on an x86 CPU server; fp16 and int8 inference works, but the results don't seem quite as accurate as when running on vLLM.

Just giving some comments about the model version and also the chat template.

@@ -6436,6 +6439,155 @@ def set_gguf_parameters(self):
super().set_gguf_parameters()
self.gguf_writer.add_audio_stack_factor(self.global_config["stack_factor"])


@ModelBase.register("HunYuanMoEV1ForCausalLM")
class HunYuanMoEModel(TextModel):

Could you align with Hunyuan's naming, with a V1 version suffix?


@ModelBase.register("HunYuanMoEV1ForCausalLM")
class HunYuanMoEModel(TextModel):
model_arch = gguf.MODEL_ARCH.HUNYUAN_MOE

Also, could you add the version suffix to the arch name, like the arch name in the model's config.json?

@@ -656,6 +657,7 @@ class MODEL_TENSOR(IntEnum):
MODEL_ARCH.DOTS1: "dots1",
MODEL_ARCH.ARCEE: "arcee",
MODEL_ARCH.ERNIE4_5: "ernie4_5",
MODEL_ARCH.HUNYUAN_MOE: "hunyuan-moe",

hunyuan-moe-v1 would be a better name for future model updates.

@@ -117,6 +117,7 @@ extern "C" {
LLAMA_VOCAB_PRE_TYPE_LLAMA4 = 33,
LLAMA_VOCAB_PRE_TYPE_PIXTRAL = 34,
LLAMA_VOCAB_PRE_TYPE_SEED_CODER = 35,
LLAMA_VOCAB_PRE_TYPE_HUNYUAN = 36,

Adding a version suffix to the vocab type would be better.

@@ -77,6 +77,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_DOTS1, "dots1" },
{ LLM_ARCH_ARCEE, "arcee" },
{ LLM_ARCH_ERNIE4_5, "ernie4_5" },
{ LLM_ARCH_HUNYUAN_MOE, "hunyuan-moe" },

Also applies here.

@@ -665,6 +668,21 @@ int32_t llm_chat_apply_template(
if (add_ass) {
ss << "<|response|>";
}
} else if (tmpl == LLM_CHAT_TEMPLATE_HUNYUAN_MOE) {
// tencent/Hunyuan-A13B-Instruct

Shouldn't the chat template of Hunyuan A13B be a much more complex one, with a quick/slow think option?

Also, the model enables slow think by default.

Does llama.cpp have some option like enable_thinking in the Hugging Face example?

@@ -1656,6 +1657,10 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
tokenizer_pre == "seed-coder") {
pre_type = LLAMA_VOCAB_PRE_TYPE_SEED_CODER;
clean_spaces = false;
} else if (
tokenizer_pre == "hunyuan") {

the tokenizer version

@@ -815,6 +815,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
if chkhsh == "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35":
# ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0
res = "minerva-7b"
if chkhsh == "7e57df22b1fe23a7b1e1c7f3dc4e3f96d43a4eb0836d0c6bdc3436d7b2f1c664":
# ref: https://huggingface.co/tencent/Hunyuan-A13B-Instruct
res = "hunyuan"

The model name would be better as hunyuan A13B.

@@ -137,6 +137,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "chatglm-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-chat", "chkhsh": "81d72c7348a9f0ebe86f23298d37debe0a5e71149e29bd283904c02262b27516"},
{"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", "chkhsh": "a1336059768a55c99a734006ffb02203cd450fed003e9a71886c88acf24fdbc2"},
{"name": "minerva-7b", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0", "chkhsh": "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35"},
{"name": "hunyuan", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tencent/Hunyuan-A13B-Instruct", "chkhsh": "7e57df22b1fe23a7b1e1c7f3dc4e3f96d43a4eb0836d0c6bdc3436d7b2f1c664"},

The model name should be hunyuan a13b; from my sources, they will release more LLM models soon, so we'd better add some identifier for the model.

Collaborator Author

This is the tokenizer name, not the model name.

@BahamutRU

Perfectly working commit, can you review and approve this, pls? @ggerganov 🥺🙏

@qingy1337

Perfectly working commit, can you review and approve this, pls? @ggerganov 🥺🙏

I think the logits still have to be verified between the GGUF and the original model implementation (disabling the custom expert router mechanism) first. There hasn't been an update yet from @ngxson as to whether it does match.

@bennmann

bennmann commented Jul 8, 2025

Based on community testing, these changes produce coherent output:

#14425 (comment)

Investigating the router block of code from #14425 (comment) is just a small future improvement.

I encourage merging based on the evidence so far. Great-looking model.

@kooshi
Contributor

kooshi commented Jul 8, 2025

For the record, when I skimmed the vllm PR that added the "inference only" model code, it did not appear to implement the custom expert selection either.

I would also vote to merge as is, unless someone with the time and hardware can do some deeper comparisons with vllm at f16.

In the meantime, it's quite usable.

@qingy1337

qingy1337 commented Jul 8, 2025

Just adding my +1 for merge; I went and tested the latest code with Q6_K from bullerwins/Hunyuan-A13B-Instruct-GGUF:

./llama-server -m ~/Hunyuan-A13B-Instruct-Q6_K-00001-of-00002.gguf -ngl 99 -c 16384 --host 0.0.0.0 --port 8181 --jinja

On H100 it looks really nice in terms of speed:

slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 179
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 179, n_tokens = 179, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 179, n_tokens = 179
slot      release: id  0 | task 0 | stop processing: n_past = 3211, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     450.27 ms /   179 tokens (    2.52 ms per token,   397.54 tokens per second)
       eval time =   36626.59 ms /  3033 tokens (   12.08 ms per token,    82.81 tokens per second)

Also llama-bench just for completeness:

ubuntu@lumpy-iris-fox-65d7c85d9b-vp97w:~/llama.cpp/build/bin$ ./llama-bench -m ~/Hunyuan-A13B-Instruct-Q6_K-00001-of-00002.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-moe A13B Q6_K          |  61.44 GiB |    80.39 B | CUDA       |  99 |           pp512 |       1956.58 ± 8.77 |
| hunyuan-moe A13B Q6_K          |  61.44 GiB |    80.39 B | CUDA       |  99 |           tg128 |         87.45 ± 1.40 |

build: e5fe0892 (5813)

Notes:

  • Nothing noticeably wrong with the model; I tested with a couple of MATH-500 Level 5 questions and it got them all right.
  • No weird formatting issues in outputs.
  • /no_think and /think work as expected.

It looks good!

@ggerganov
Member

For the record, when I skimmed the vllm PR that added the "inference only" model code, it did not appear to implement the custom expert selection either.

Ok, that sounds like a good explanation.

I would also vote to merge as is, unless someone with the time and hardware can do some deeper comparisons with vllm at f16.

@kooshi Earlier you said that the model behaves weird. Did something change?

@ggerganov ggerganov merged commit 8f22dc0 into ggml-org:master Jul 8, 2025
51 checks passed
@kooshi
Contributor

kooshi commented Jul 8, 2025

@kooshi Earlier you said that the model behaves weird. Did something change?

The weirdness I was seeing may have been from my settings, or perhaps inherent to the model. It was quite smart, it just stumbled over its own <answer> formatting in multiturn chats sometimes. I can't run the vLLM version to compare (3 GPUs, and the model doesn't yet support pipeline parallel), so I'm not sure where the issue lies, if there is any.

Edit: thinking back, I was running it with --presence-penalty, just because I was using my Qwen settings. That could have thrown it off. @ubergarm also reported multiturn issues, but was also using my settings IIRC.

@Downtown-Case

Downtown-Case commented Jul 8, 2025

@kooshi In my testing, it's extremely sensitive to sampling. The model is very prone to looping, very sensitive to prompt formatting, yet "uncertain" about its own think formatting. The think formatting is also multiple tokens (i.e., not a single token), which gives it more opportunity to mess up.

A relatively high min-p seems to help it behave, but the default sampling in some UIs would definitely trip it up.
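
(For reference, a hedged example of what that looks like with llama-server's sampling flags; the file name and values are illustrative, not tuned recommendations:)

./build/bin/llama-server -m Hunyuan-A13B-Instruct-Q6_K.gguf -ngl 99 --jinja \
  --temp 0.7 --min-p 0.1 --presence-penalty 0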

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 8, 2025
* origin/master:
model : fix hunyuan moe chat template (ggml-org#14584)
model : add SmolLM3 (ggml-org#14581)
memory : fix broken batch splits for recurrent cache (ggml-org#14575)
vulkan : fix rope with partial rotation and non-cont src (ggml-org#14582)
server: Add ability to mount server at prefix (ggml-org#14544)
model : add hunyuan moe (ggml-org#14425)
vulkan: increase timeout for CI (ggml-org#14574)
cuda : fix rope with partial rotation and non-cont src (ggml-org#14580)
CUDA: add bilinear interpolation for upscale (ggml-org#14563)
musa: fix build warnings (unused variable) (ggml-org#14561)
llama : fix incorrect minicpm3 v_states shape (ggml-org#14571)
llama : remove ggml_cont where possible (ggml-org#14568)
@ddh0
Contributor

ddh0 commented Jul 8, 2025

This model is broken for me. I converted the HF weights to GGUF this morning after the PR was merged and made a fresh Q4_K_M quantization. I'm getting lots of broken output and, as @Downtown-Case mentioned, the model doesn't seem to know how to format its own messages. It will close and open the <think> and <answer> blocks at random and often generates EOS early. I suspect a RoPE issue but I haven't been able to find it yet.

@ubergarm

ubergarm commented Jul 8, 2025

@ddh0 did you try the very latest version from a few hours ago with the chat template fix: #14584?

I'm re-testing perplexity with that now

@ddh0
Contributor

ddh0 commented Jul 8, 2025

Oh let me see. I'll try that now.

@ddh0
Contributor

ddh0 commented Jul 8, 2025

It's working! I fed the model the entire llama.h file as it currently appears:

↕️ Click to expand llama-server console output ...

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 5849 (6efcd659) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 192.168.68.66, port: 20480, http threads: 15
main: loading model
srv    load_model: loading model '/opt/workspace/gguf/Hunyuan-A13B-Instruct-Q4_K_X.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15956 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 482 tensors from /opt/workspace/gguf/Hunyuan-A13B-Instruct-Q4_K_X.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = hunyuan-moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Hunyuan-A13B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Hunyuan
llama_model_loader: - kv   5:                         general.size_label str              = A13B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = tencent-hunyuan-a13b
llama_model_loader: - kv   8:                       general.license.link str              = https://github.com/Tencent-Hunyuan/Hu...
llama_model_loader: - kv   9:                    hunyuan-moe.block_count u32              = 32
llama_model_loader: - kv  10:                 hunyuan-moe.context_length u32              = 262144
llama_model_loader: - kv  11:               hunyuan-moe.embedding_length u32              = 4096
llama_model_loader: - kv  12:            hunyuan-moe.feed_forward_length u32              = 3072
llama_model_loader: - kv  13:           hunyuan-moe.attention.head_count u32              = 32
llama_model_loader: - kv  14:        hunyuan-moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                 hunyuan-moe.rope.freq_base f32              = 11158840.000000
llama_model_loader: - kv  16: hunyuan-moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                   hunyuan-moe.expert_count u32              = 64
llama_model_loader: - kv  18: hunyuan-moe.expert_shared_feed_forward_length u32              = 3072
llama_model_loader: - kv  19:     hunyuan-moe.expert_feed_forward_length u32              = 3072
llama_model_loader: - kv  20:              hunyuan-moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:            hunyuan-moe.expert_shared_count u32              = 1
llama_model_loader: - kv  22:              hunyuan-moe.rope.scaling.type str              = none
llama_model_loader: - kv  23:            hunyuan-moe.rope.scaling.factor f32              = 1.000000
llama_model_loader: - kv  24: hunyuan-moe.rope.scaling.original_context_length u32              = 262144
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = hunyuan
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,128167]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,128167]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,127698]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 127959
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 127960
llama_model_loader: - kv  32:          tokenizer.ggml.seperator_token_id u32              = 127962
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 127961
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% set loop_messages = messages %}\n{%...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 15
llama_model_loader: - kv  37:                      quantize.imatrix.file str              = /opt/workspace/imatrices/Hunyuan-A13B...
llama_model_loader: - kv  38:                   quantize.imatrix.dataset str              = imatrix-training-full-3
llama_model_loader: - kv  39:             quantize.imatrix.entries_count u32              = 352
llama_model_loader: - kv  40:              quantize.imatrix.chunks_count u32              = 320
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q8_0:   64 tensors
llama_model_loader: - type q4_K:  161 tensors
llama_model_loader: - type q5_K:   96 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 45.38 GiB (4.85 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 210
load: token to piece cache size = 0.7868 MB
print_info: arch             = hunyuan-moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 64
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = none
print_info: freq_base_train  = 11158840.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = A13B
print_info: model params     = 80.39 B
print_info: general.name     = Hunyuan-A13B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128167
print_info: n_merges         = 127698
print_info: BOS token        = 127959 '<|bos|>'
print_info: EOS token        = 127960 '<|eos|>'
print_info: EOT token        = 127957 '<|endoftext|>'
print_info: SEP token        = 127962 '<|extra_0|>'
print_info: PAD token        = 127961 '<|pad|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 127957 '<|endoftext|>'
print_info: EOG token        = 127960 '<|eos|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        CUDA0 model buffer size =  1922.66 MiB
load_tensors:   CPU_Mapped model buffer size = 46459.90 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 98304
llama_context: n_ctx_per_seq = 98304
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 1024
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 11158840.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (98304) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size = 12288.00 MiB
llama_kv_cache_unified: size = 12288.00 MiB ( 98304 cells,  32 layers,  1 seqs), K (f16): 6144.00 MiB, V (f16): 6144.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      CUDA0 compute buffer size =  1168.00 MiB
llama_context:  CUDA_Host compute buffer size =   400.01 MiB
llama_context: graph nodes  = 2183
llama_context: graph splits = 98 (with bs=1024), 66 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 98304
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 98304
main: model loaded
main: chat template, chat_template: {% set loop_messages = messages %}
{% if tools %}
    {% set weekday_map = {'Monday': '星期一', 'Tuesday': '星期二', 'Wednesday': '星期三', 'Thursday': '星期四', 'Friday': '星期五', 'Saturday': '星期六', 'Sunday': '星期日'} %}
    {% set weekday_cn = weekday_map[strftime_now('%A')] %}
    {% set datetime_str = strftime_now('%Y-%m-%d %H:%M:%S') %}
    {% set datetime_str = datetime_str + ' ' + weekday_cn %}
    {% for message in loop_messages %}
        {% if 'content' in message %}
            {% set content = message['content'] %}
        {% else %}
            {% set content = '' %}
        {% endif %}
        {% if loop.index0 == 0 %}
            {% set content_tmp = '你是一位函数组合专家。你会得到一个问题和一组可能的函数。根据问题,你需要进行一个或多个函数/工具调用以实现目的。
如果没有一个函数可以使用,请直接使用自然语言回复用户,以助手:开头。
如果给定的问题缺少函数所需的参数,请使用自然语言进行提问,向用户询问必要信息,以助手:开头。
如果调用结果已经足够回答用户问题,请对历史结果进行总结,使用自然语言回复用户,以助手:开头。
你应该只在工具调用部分返回函数调用。如果你决定调用任何函数,你必须将其格式化为<tool_calls>[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]</tool_calls>。你不应该在回复中包含任何其他文本。以下是你可以调用的函数列表,格式为JSON。
' %}
            {% set content_tmp = content_tmp + '
' + tools | tojson + '
' %}
            {% if message['role'] == 'system' %}
                {% set content_tmp = content_tmp + '
额外要求:
' + content + '

如果你决定返回函数调用,请将其格式化为<tool_calls>[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]</tool_calls>,不得包含其他文本。如果额外要求里有格式要求,请忽略,以此处为准。
否则,请参考开头说的三种情况,以助手:开头进行回复。

如果额外要求里有时间信息,就以额外要求里的时间为准,否则,参考当前时间:' + datetime_str %}
                {% set content = '<|startoftext|>' + content_tmp + '<|extra_4|>' %}
            {% elif message['role'] == 'user' %}
                {% set content_tmp = content_tmp + '
如果你决定返回函数调用,请将其格式化为<tool_calls>[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]</tool_calls>,不得包含其他文本。
否则,请参考开头说的三种情况,以助手:开头进行回复。

当前时间:' + datetime_str %}
                {% set content_tmp = '<|startoftext|>' + content_tmp + '<|extra_4|>'%}
                {% set content = content_tmp + '用户:' + content + '<|extra_0|>' %}
            {% endif %}
        {% else %}
            {% if message['role'] == 'user' %}
                {% set content = '用户:' + content + '<|extra_0|>' %}
            {% elif message['role'] == 'assistant' %}
                {% if 'tool_calls' in message %}
                    {% set tool_calls = message['tool_calls'] %}
                    {% set ns = namespace(tool_calls="[") %}
                    {% for tool_call in tool_calls %}
                        {% set function = tool_call['function'] %}
                        {% set name = function['name'] %}
                        {% set ns.tool_calls = ns.tool_calls + '{"name": "' + name + '", '%}
                        {% set arguments = function['arguments'] %}
                        {% if arguments is not string %}
                            {% set arguments = arguments | tojson %}
                        {% endif %}
                        {% set ns.tool_calls = ns.tool_calls + '"arguments": ' + arguments + '}' %}
                        {% if not loop.last %}
                            {% set ns.tool_calls = ns.tool_calls + ', '%}
                        {% endif %}
                    {% endfor %}
                    {% set ns.tool_calls = ns.tool_calls + ']' %}
                    {% set content = content + '<tool_calls>' + ns.tool_calls + '</tool_calls>' %}
                {% else %}
                    {% set content = '助手:' + content %}
                {% endif %}
                {% set content = content + '<|eos|>' %}
            {% elif message['role'] == 'tool' %}
                {% if content is not string %}
                    {set content = content | tojson }
                {% endif %}
                {% set content = '<tool_response>' + content + '</tool_response>' %}
                {% set content = content + '<|extra_0|>' %}
            {% endif %}
        {% endif %}
    {{- content -}}
    {% endfor %}
{% else %}
    {% set context = {'has_head': true} %}
    {% for message in loop_messages %}
        {% if 'content' in message %}
            {% set content = message['content'] %}
        {% else %}
            {% set content = '' %}
        {% endif %}
        {% if loop.index0 == 0 %}
            {% if content == '' %}
                {% set _ = context.update({'has_head': false}) %}
            {% elif message['role'] == 'system' %}
                {% set content = '<|startoftext|>' + content + '<|extra_4|>' %}
            {% endif %}
        {% endif %}
        {% if message['role'] == 'user' %}
            {% if loop.index0 == 1 and not context.has_head %}
                {% set content = '<|startoftext|>' + content %}
            {% endif %}
            {% if loop.index0 == 1 and context.has_head %}
                {% set content = content + '<|extra_0|>' %}
            {% else %}
                {% set content = '<|startoftext|>' + content + '<|extra_0|>' %}
            {% endif %}
        {% elif message['role'] == 'assistant' %}
            {% set content = content + '<|eos|>' %}
        {% elif message['role'] == 'tool' %}
            {% set content = content + '<|extra_0|>' %}
        {% endif %}
        {{- content -}}
    {% endfor %}
{% endif %}
{%- if enable_thinking is defined and enable_thinking is false %}
    {{- '<think>\n\n</think>\n' }}
{%- endif %}, example_format: '<|startoftext|>You are a helpful assistant<|extra_4|><|startoftext|>Hello<|extra_0|><|startoftext|>Hi there<|eos|><|startoftext|>How are you?<|extra_0|>'
main: server is listening on http://192.168.68.66:20480 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 98304, n_keep = 0, n_prompt_tokens = 17856
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 4096, progress = 0.229391
slot update_slots: id  0 | task 0 | kv cache rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 4096, progress = 0.458781
slot update_slots: id  0 | task 0 | kv cache rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 12288, n_tokens = 4096, progress = 0.688172
slot update_slots: id  0 | task 0 | kv cache rm [12288, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16384, n_tokens = 4096, progress = 0.917563
slot update_slots: id  0 | task 0 | kv cache rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 17856, n_tokens = 1472, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 17856, n_tokens = 1472
slot      release: id  0 | task 0 | stop processing: n_past = 22788, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =   79030.58 ms / 17856 tokens (    4.43 ms per token,   225.94 tokens per second)
       eval time =  575503.98 ms /  4933 tokens (  116.66 ms per token,     8.57 tokens per second)
      total time =  654534.55 ms / 22789 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.68.58 200


↕️ Click to expand model response ...

<think>
Alright, here's my breakdown of the `llama.h` file. I'm seeing a header file for the `llama.cpp` library, which is pretty popular for running LLaMA and other large language models on consumer-grade hardware. My goal is to give a concise summary of its structure, purpose, and key components.

First, the file begins with include guards, which is good practice to prevent multiple inclusions. It's pulling in some GGML-related headers – `ggml.h`, `ggml-cpu.h`, `ggml-backend.h`, and `ggml-opt.h` – which tells me this library is built on top of GGML, a GPU-accelerated math library. I also see standard C libraries like `stddef.h`, `stdint.h`, `stdio.h`, and `stdbool.h`.

Then, there's some conditional compilation for defining `LLAMA_SHARED` and `LLAMA_API`. This is typical for library development, allowing for export of symbols on Windows when building as a DLL and import otherwise. Next, a bunch of constants and defines, like `LLAMA_DEFAULT_SEED`, `LLAMA_TOKEN_NULL`, and magic numbers like `LLAMA_FILE_MAGIC_GGLA`, which seem to be used for identifying file formats. I see version numbers too, like `LLAMA_SESSION_VERSION` and `LLAMA_STATE_SEQ_VERSION`.

Now the real meat begins – the C interface declarations. I see structs like `llama_vocab`, `llama_model`, `llama_context`, and `llama_sampler`. `llama_vocab` likely handles the vocabulary, which is crucial for tokenization. `llama_model` probably holds the loaded model parameters. `llama_context` seems to manage the state during inference, like the current sequence and KV cache. `llama_sampler` is for the sampling strategies used during text generation, like greedy or top-p.

There are several typedefs for common types like `llama_token`, `llama_seq_id`, and pointers like `llama_memory_t`. Enums are everywhere, defining types like `llama_vocab_type`, `llama_rope_type`, `llama_token_type`, etc. These enums provide a clean way to specify different options for things like tokenization methods, attention types, and memory management.

I see a struct `llama_model_params` which looks like the configuration for loading a model. It has info about devices, tensor types, offloading options, progress callbacks, etc. `llama_context_params` seems similar but for the context itself, defining things like batch sizes, thread counts, and attention settings. `llama_model_quantize_params` is probably for the quantization process, allowing models to run on devices with less memory.

The API functions are the heart of the library. I see functions for initializing the backend (`llama_backend_init`), loading models (`llama_model_load_from_file`, `llama_model_load_from_splits`), managing memory (`llama_memory_clear`, `llama_memory_seq_rm`), and running inference (`llama_decode`, `llama_encode`). There are also functions for tokenization (`llama_tokenize`, `llama_detokenize`), chat templates (`llama_chat_apply_template`), and sampling (`llama_sampler_chain_init`, `llama_sampler_sample`).

I see some deprecated functions, indicated by the `DEPRECATED` macro. These are old interfaces that have been replaced by newer ones. For example, `llama_free_model` is deprecated in favor of `llama_model_free`.

There's also a section for "Adapters", which suggests a way to modify models, likely through LoRA (Low-Rank Adaptation). `llama_adapter_lora` and related functions handle loading, applying, and managing these adapters.

Finally, there are functions for performance measurement (`llama_perf_context_print`), logging (`llama_log_set`), and training-related features (`llama_opt_init`, `llama_opt_epoch`).

Overall, this `llama.h` file provides a comprehensive C API for interacting with the `llama.cpp` library. It covers everything from loading models, managing memory and context, tokenizing text, running inference with different sampling strategies, and even fine-tuning through adapters. The use of structs and enums for configuration makes the API expressive and flexible. The comments suggest that some parts are still evolving, like the sampling API and memory management features. It's clear that this library aims to be efficient, with support for GPU acceleration and various quantization strategies, making large language models accessible on a wide range of hardware.
</think>
<answer>
Okay, here's a detailed analysis of the `llama.h` header file, focusing on its purpose, structure, and key components.

**Overall Purpose:**

This header file defines the public C interface for the `llama.cpp` library. `llama.cpp` is an open-source C/C++ library that implements the LLaMA (Large Language Model Meta AI) architecture, enabling efficient inference of large language models on consumer-grade GPUs and CPUs. The header declares the data structures (structs, enums, typedefs) and function signatures that external applications would use to interact with the library.

**Key Components:**

1.  **Includes and Basic Definitions:**
    *   `#ifndef LLAMA_H ... #endif`: Standard include guard to prevent multiple inclusions of the header.
    *   `#include "ggml.h"`, `#include "ggml-cpu.h"`, `#include "ggml-backend.h"`, `#include "ggml-opt.h"`: Includes the core GGML library headers. GGML is a GPU-accelerated math library, which `llama.cpp` leverages for performance. `ggml-cpu.h` provides CPU-specific backend implementations, while `ggml-backend.h` and `ggml-opt.h` offer backend abstraction and optimization capabilities.
    *   `#include <stddef.h>`, `#include <stdint.h>`, `#include <stdio.h>`, `#include <stdbool.h>`: Standard C library headers for common types (`size_t`, `int32_t`, etc.), I/O (`stdio.h`), and boolean values (`stdbool.h`).
    *   **Macros:**
        *   `LLAMA_SHARED`, `LLAMA_API`: Conditional compilation to define `LLAMA_API`. This is crucial for library distribution. On Windows (`_WIN32`), if `LLAMA_SHARED` is defined and the library is being built (`LLAMA_BUILD`), it exports symbols using `__declspec(dllexport)`. If included by an application, it imports symbols using `__declspec(dllimport)`. On other platforms (Linux, macOS), it uses `__attribute__((visibility ("default")))` to make symbols visible by default. If `LLAMA_SHARED` is not defined, `LLAMA_API` is empty.
        *   `DEPRECATED(func, hint)`: A macro to mark functions as deprecated, providing a compiler warning with a specific hint (the second argument). This helps users migrate code to newer API versions.
        *   `LLAMA_DEFAULT_SEED`: Default value for random number generation seeds.
        *   `LLAMA_TOKEN_NULL`: A special value (`-1`) to represent a null or invalid token.
        *   **File Magic Numbers:** Constants like `LLAMA_FILE_MAGIC_GGLA`, `LLAMA_FILE_MAGIC_GGSN`, `LLAMA_FILE_MAGIC_GGSQ` (0x67676c61u, 0x6767736eu, 0x67677371u) are used as file signatures to identify different types of GGUF model files. GGLA might relate to a specific GGUF variant or internal format. GGSN and GGSQ are used for GGUF state/session files.
        *   **Version Numbers:** `LLAMA_SESSION_MAGIC` (using GGSN magic), `LLAMA_SESSION_VERSION` (9), `LLAMA_STATE_SEQ_MAGIC` (using GGSQ magic), and `LLAMA_STATE_SEQ_VERSION` (2) are used to structure session/state files written by the library.

2.  **C Interface:** The `extern "C"` block indicates that the following C++ code (if present) should be compiled as C, ensuring compatibility with C applications.
    *   **Struct Definitions:** These are the core data structures users interact with.
        *   `struct llama_vocab;`: Represents the vocabulary used by the model (token mapping, special tokens, tokenizer type).
        *   `struct llama_model;`: Holds the loaded model parameters (weights, configurations).
        *   `struct llama_context;`: Represents an inference session. It holds the current state, KV cache, memory manager, and parameters for this specific generation run.
        *   `struct llama_sampler;`: Manages sampling strategies for token selection during generation.
        *   `typedef struct llama_memory_i * llama_memory_t;`: A forward declaration for a pointer to an internal memory management structure. (Note: `struct llama_kv_cache` is deprecated in favor of `llama_memory_t`).
        *   **Token Types:** `llama_pos` (position), `llama_token` (integer ID), `llama_seq_id` (ID for a sequence within a multi-sequence context).
        *   **Enums:** Extensive enums provide type-safe options for various configurations.
            *   `llama_vocab_type`: Defines how tokens are represented (SPM, BPE, WPM, UGM, RWKV). This affects tokenizer and decoder behavior.
            *   `llama_vocab_pre_type`: Specifies preprocessing methods for special tokens, potentially tied to specific model families or tokenizers.
            *   `llama_rope_type`: Defines Rotary Position Embedding (RoPE) types (None, Normal, NeoX, MROPE, Vision). RoPE is crucial for capturing relative positions in long sequences.
            *   `llama_token_type`: (TODO: remove) Enum for token attributes (e.g., normal, control, user-defined). The comment indicates these are temporary until per-token attributes are natively supported.
            *   `llama_token_attr`: Detailed attributes for individual tokens (e.g., normalization, stripping whitespace, single-word mode).
            *   `llama_ftype`: Enum for model weight formats (F32, F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, IQ2_XXS, IQ3_XS, IQ4_NL, BF16, etc.). This dictates how model weights are stored and used, impacting memory usage and precision.
            *   `llama_rope_scaling_type`: Enum for scaling factors in RoPE (None, Linear, YARN, LongRoPE).
            *   `llama_pooling_type`: Enum for output embedding aggregation methods (None, Mean, CLS, Rank).
            *   `llama_attention_type`: Enum for attention mechanisms (Causal, Non-Causal).
            *   `llama_split_mode`: Enum for model parallelism strategies (None, Layer-wise, Row-wise tensor parallelism).
            *   `llama_model_kv_override_type`, `llama_model_tensor_buft_override`: Structs for overriding model KV cache types or tensor buffer types, useful for advanced use cases like custom memory allocation.
    *   **Typedefs:** Simplifies usage of common types like `int32_t`, `int64_t`, `size_t`.
    *   **Token Data Structures:**
        *   `llama_token_data`: A struct holding a token's ID, log-odds (raw score), and probability.
        *   `llama_token_data_array`: Holds an array of `llama_token_data`, along with metadata like sorting status. Used by samplers.

3.  **Helper Functions:**
    *   `llama_progress_callback`, `llama_abort_callback`: Function pointer types for progress and abort notifications.
    *   `llama_batch`: A struct to pass input data (tokens, embeddings, positions) and parameters (logits output flag) to `llama_decode`. It supports multiple sequences (multi-prompt).
    *   `llama_model_kv_override`, `llama_model_tensor_buft_override`: Structs for specifying overrides for KV cache types and tensor buffer types during model loading.

4.  **Model Parameters and Context Defaults:**
    *   `llama_model_default_params()`, `llama_context_default_params()`, `llama_sampler_chain_default_params()`, `llama_model_quantize_default_params()`: Functions that return instances of the parameter structs with sensible default values. These are useful for quick initialization.

5.  **Core Library Functions:**
    *   **Initialization/Finalization:**
        *   `llama_backend_init()`: Initializes the GGML backend (e.g., CUDA, Metal, OpenCL, CPU). Call once at the start.
        *   `llama_backend_free()`: Cleans up the backend resources. Call once at the end.
        *   `llama_numa_init()`: (Optional) Sets NUMA awareness for memory allocation.
        *   `llama_attach_threadpool()`, `llama_detach_threadpool()`: Manages threadpools for computation and batching.
    *   **Model Loading/Saving:**
        *   `llama_model_load_from_file()`, `llama_model_load_from_splits()`: Loads a `.gguf` model file (potentially split into multiple parts).
        *   `llama_model_save_to_file()`: Saves the model to a file.
        *   `llama_model_free()`: Frees memory associated with a loaded `llama_model` object.
    *   **Context Management:**
        *   `llama_init_from_model()`, `llama_new_context_with_model()`: Creates an inference context from a loaded model.
        *   `llama_free()`: Frees all resources associated with a `llama_context`.
        *   `llama_get_model()`, `llama_get_memory()`: Accessors for the model and memory within a context.
        *   `llama_time_us()`: Gets current time in microseconds.
        *   `llama_max_devices()`, `llama_max_parallel_sequences()`: Utility functions.
        *   `llama_supports_mmap()`, `llama_supports_mlock()`, `llama_supports_gpu_offload()`, `llama_supports_rpc()`: Check for feature support.
        *   `llama_n_ctx()`, `llama_n_batch()`, `llama_n_ubatch()`, `llama_n_seq_max()`: Get context configuration.
        *   `llama_model_n_ctx_train()`, `llama_model_n_embd()`, `llama_model_n_layer()`, `llama_model_n_head()`, `llama_model_n_head_kv()`, `llama_model_n_swa()`: Get model architecture details.
        *   `llama_model_rope_freq_scale_train()`: Gets RoPE frequency scaling factor.
        *   `llama_model_n_cls_out()`, `llama_model_cls_label()`: Get classifier head details.
        *   `llama_vocab_type()`, `llama_vocab_n_tokens()`: Query vocabulary information.
        *   `llama_model_meta_val_str()`, `llama_model_meta_count()`, `llama_model_meta_key_by_index()`, `llama_model_meta_val_str_by_index()`: Functions to inspect GGUF metadata.
        *   `llama_model_desc()`: Gets a human-readable description of the model.
        *   `llama_model_size()`: Gets the total size of model parameters in bytes.
        *   `llama_model_chat_template()`: Retrieves the chat template.
        *   `llama_model_n_params()`: Gets the total number of parameters.
        *   `llama_model_has_encoder()`, `llama_model_has_decoder()`: Check model type.
        *   `llama_model_decoder_start_token()`: Gets the token ID to start decoder generation.
        *   `llama_model_is_recurrent()`: Checks if the model is recurrent (e.g., RWKV).
    *   **Memory Management (Modern KV Cache):**
        *   `llama_memory_clear()`: Clears memory contents and optionally data buffers.
        *   `llama_memory_seq_rm()`, `llama_memory_seq_cp()`, `llama_memory_seq_keep()`, `llama_memory_seq_add()`, `llama_memory_seq_div()`: Functions to manipulate token sequences within the memory manager (e.g., removing a sequence, copying between sequences, adding offsets to positions). These replace the older KV cache deprecation methods.
        *   `llama_memory_can_shift()`: Checks if memory supports shifting operations.
    *   **KV Cache (Deprecated):** Functions like `llama_kv_self_...` and `llama_kv_...` (clear, seq_rm, seq_cp, etc.) are marked as deprecated, indicating the shift to `llama_memory_t`. These old functions likely operated directly on the context's KV cache structure.
    *   **State Management (Sessions/Checkpoints):**
        *   Functions like `llama_state_get_size()`, `llama_state_get_data()`, `llama_state_set_data()` allow saving and restoring the entire inference state (logits, embeddings, memory) to/from disk. This is useful for resuming generation or sharing checkpoints.
        *   Functions like `llama_state_seq_get_size()`, `llama_state_seq_get_data()`, `llama_state_seq_set_data()` handle state for individual sequences, crucial for multi-turn conversations or stateful generation using memory.
        *   `llama_state_load_file()`, `llama_state_save_file()`, `llama_state_seq_load_file()`, `llama_state_seq_save_file()` are convenience wrappers for these functions.
    *   **Decoding (Inference):**
        *   `llama_encode()`: Processes input tokens using the encoder part of the model (if present in an encoder-decoder model).
        *   `llama_decode()`: The core function for generating text. It processes the input batch using the decoder, utilizes the KV cache, and updates the context state. It returns status codes and allows for error handling.
        *   `llama_set_n_threads()`, `llama_n_threads()`, `llama_n_threads_batch()`: Configure and query the number of threads for generation and batching.
        *   `llama_set_embeddings()`: Controls whether the model should compute and return output embeddings.
        *   `llama_set_causal_attn()`: Enables/disables causal attention masking.
        *   `llama_set_warmup()`: (Likely experimental) Activates warmup strategies where all model tensors are loaded into VRAM.
        *   `llama_set_abort_callback()`: Sets a callback to allow user intervention to abort generation.
        *   `llama_synchronize()`: Ensures all pending GPU/CPU operations are complete.
    *   **Accessing Results:**
        *   `llama_get_logits()`: Retrieves the raw logits from the last `llama_decode` call.
        *   `llama_get_logits_ith()`: Gets logits for a specific token index within the logits array.
        *   `llama_get_embeddings()`, `llama_get_embeddings_ith()`: Retrieves output token embeddings.
        *   `llama_get_embeddings_seq()`: Retrieves embeddings for a specific sequence ID.
    *   **Vocabulary Access:**
        *   `llama_vocab_get_text()`, `llama_vocab_get_score()`, `llama_vocab_get_attr()`: Get information about a specific token.
        *   `llama_vocab_is_eog()`, `llama_vocab_is_control()`: Check token properties.
        *   `llama_vocab_bos()`, `llama_vocab_eos()`, etc.: Access special tokens (Beginning-of-Sentence, End-of-Sentence, etc.).
    *   **Tokenization:**
        *   `llama_tokenize()`: Converts text input into a sequence of token IDs.
        *   `llama_token_to_piece()`: Converts an integer token ID back into its string representation (piece).
        *   `llama_detokenize()`: Converts an array of token IDs back into text.
    *   **Chat Templates:**
        *   `llama_chat_apply_template()`: Formats a conversation (array of `llama_chat_message` structs) into a single prompt string using a specified template. It supports built-in templates.
        *   `llama_chat_builtin_templates()`: Retrieves a list of available built-in template names.
    *   **Sampling API:**
        *   `llama_sampler_chain_init()`: Creates a sampling chain, which allows chaining multiple sampling methods (e.g., top-k followed by temperature).
        *   `llama_sampler_chain_add()`: Adds a sampler (like greedy, top-p, mirostat) to the chain.
        *   `llama_sampler_chain_get()`, `llama_sampler_chain_n()`, `llama_sampler_chain_remove()`: Manage the samplers within a chain.
        *   `llama_sampler_free()`: Frees memory associated with a sampler chain.
        *   Various `llama_sampler_init_...` functions define different sampling strategies (Greedy, Top-k, Top-p, Minimum-p, Typical, Temp, Mirostat, Grammar-based, etc.). Each returns a `llama_sampler` pointer.
        *   `llama_sampler_name()`, `llama_sampler_accept()`, `llama_sampler_apply()`, `llama_sampler_reset()`, `llama_sampler_clone()`, `llama_sampler_free()`: Functions to operate on a `llama_sampler` object once obtained.
        *   `llama_sampler_sample()`: A helper to perform sampling from the last token's logits using a specified sampler in a chain and accept the chosen token.
    *   **Performance Utilities:**
        *   `llama_perf_context_data`, `llama_perf_sampler_data`: Structs to store performance metrics for context-level and sampler-level operations.
        *   `llama_perf_context()`, `llama_perf_context_print()`, `llama_perf_context_reset()`: Functions to measure and report context-based timing.
        *   `llama_perf_sampler()`, `llama_perf_sampler_print()`, `llama_perf_sampler_reset()`: Functions to measure and report sampler-based timing.
    *   **Training (Experimental):**
        *   `llama_opt_init()`, `llama_opt_epoch()`: Functions related to fine-tuning/optimization passes, using callback functions for parameter filtering, optimizer parameter calculation, and epoch callbacks. This part seems less mature or feature-complete than the inference part.

**Summary:**

`llama.h` provides a comprehensive C API for interacting with the `llama.cpp` library. It abstracts the underlying complexity of model loading, memory management, GPU acceleration (via GGML), tokenization, and diverse sampling strategies. The use of structs and enums for configuration, clear function naming, and detailed comments make it relatively easy to integrate `llama.cpp` into applications for tasks like text generation, chatbots, and other NLP applications. The API supports various quantization levels, model splitting, multi-turn generation via memory management, and customizable sampling, catering to a wide range of hardware capabilities and user needs. The presence of deprecated functions and experimental APIs (like the training part) suggests the library is actively developed and evolving.
</answer>

Very impressive!

@danielhanchen
Copy link
Contributor

Re perplexity values - I'm getting PPL increasing from 1 to 3 to 5 to 7 to 27 to 37 and now 227 :(
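A rough sketch of how such a perplexity run is typically done, for anyone reproducing the numbers (model path and test file are placeholders):

./llama.cpp/llama-perplexity \
  -m Hunyuan-A13B-Instruct-Q4_K_M.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -ngl 99 -c 2048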

@kzjeef
Copy link

kzjeef commented Jul 9, 2025

Re perplexity values - I'm getting PPL increasing from 1 to 3 to 5 to 7 to 27 to 37 and now 227 :(

Hi @danielhanchen

This doesn't sound good. Did you apply this MR when testing?

The chat template should be fixed by this PR:
#14584

@danielhanchen
Copy link
Contributor

@kzjeef Yes I recompiled from source - I'll see how high the PPL goes - I'll still try to make some quants!

@kooshi
Copy link
Contributor

kooshi commented Jul 9, 2025

The tested PPL has been absurdly high in every test of the Instruct model, including the official implementation, despite it being coherent in chats. The base model gives a perfectly reasonable score: #14425 (comment)

If anyone can verify what it's actually predicting it might help (probably trying to start with <answer> or something).

I hope it doesn't get in the way of the heuristics for the dynamic quants. I always look forward to them.

@kzjeef
Copy link

kzjeef commented Jul 9, 2025

The tested PPL has been absurdly high in every test of the Instruct model, including the official implementation, despite it being coherent in chats. The base model gives a perfectly reasonable score: #14425 (comment)

If anyone can verify what it's actually predicting it might help (probably trying to start with <answer> or something).

About the reasoning parser: where does it live in llama.cpp? I'm working on vLLM's reasoning parser (vllm-project/vllm#20625); maybe someone, or I, could port it to llama.cpp.

Actually, we have tested some complex math cases internally after this PR: #14584, and it looks good.

I hope it doesn't get in the way of the heuristics for the dynamic quants. I always look forward to them.

@danielhanchen
Copy link
Contributor

danielhanchen commented Jul 9, 2025

So my imatrix run gets Final estimate: PPL = 188.6129 +/- 1.33950, which is very high. I also used the chat template directly, and it increases over time.

However, as someone mentioned, I think it's due to <answer></answer>, which makes the PPL shoot up. I'm not 100% sure, but it's very likely.

Quants at https://huggingface.co/unsloth/Hunyuan-A13B-Instruct-GGUF

Usage:

./llama.cpp/llama-cli -hf unsloth/Hunyuan-A13B-Instruct-GGUF:Q4_K_XL -ngl 99 --jinja --temp 0.7 --top-k 20 --top-p 0.8 --repeat-penalty 1.05
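For completeness, a rough sketch of the kind of imatrix run behind quants like these (calibration file and output path are placeholders):

./llama.cpp/llama-imatrix \
  -m Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
  -f calibration_data.txt \
  -o imatrix-hunyuan.dat \
  -ngl 99 -c 2048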

@jukofyork
Copy link
Collaborator

jukofyork commented Jul 9, 2025

The tested PPL has been absurdly high in every test of the Instruct model, including the official implementation, despite it being coherent in chats. The base model gives a perfectly reasonable score: #14425 (comment)

If anyone can verify what it's actually predicting it might help (probably trying to start with <answer> or something).

I hope it doesn't get in the way of the heuristics for the dynamic quants. I always look forward to them.

Probably the easiest way to see what is causing this is to start generation from a single BOS token and see what it generates with temperature = 1.

EDIT: I think this also shows it might be time to consider letting llama-imatrix use data inside chat templates (even if the data isn't really chat data). Some of the newer models seem to be using crazy detailed templates now and using way more fine-tuning data than they used to, so this sort of problem is only likely to get worse in the future!
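Going back to the BOS-token idea above, a rough sketch of that probe (model path is a placeholder; -no-cnv keeps llama-cli out of chat mode, and an empty -p leaves only the automatically added BOS token):

./llama.cpp/llama-cli \
  -m Hunyuan-A13B-Instruct-Q4_K_XL.gguf \
  -p "" -n 128 --temp 1.0 -ngl 99 -no-cnv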

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Jul 10, 2025
* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <[email protected]>
@ggerganov ggerganov added the hot Something that is hot label Jul 11, 2025

Successfully merging this pull request may close these issues.

Feature Request: Hunyuan-A13B model support