@amirai21 amirai21 commented Oct 8, 2025

The JambaModel implementation in convert_hf_to_gguf.py incorrectly built its vocab with the gpt-2 tokenizer logic when no SentencePiece model was present (i.e., the tokenizer.json path). Jamba actually uses a llama tokenizer, not gpt-2.

This change updates the vocab build path to use the correct llama tokenizer for non-SentencePiece Jamba models. It also includes several small adjustments within the Jamba llama-based tokenizer construction.

No changes are expected for other model types.

Testing
Verified with a local conversion of a Jamba GGUF model (tokenizer.json mode) and confirmed that the generated vocab matches the llama tokenizer layout. The SentencePiece-mode GGUF was also verified and remains unaffected.

@amirai21 amirai21 requested a review from CISC as a code owner October 8, 2025 10:20
@github-actions github-actions bot added the python python script changes label Oct 8, 2025
@CISC CISC left a comment

I don't think this works correctly; basically you're recreating an SPM vocab without scores. Have you checked that tokenization is identical to AutoTokenizer's (you can use convert_hf_to_gguf_update.py to generate test files and test with test-tokenizer-0)?
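To illustrate the kind of parity check being asked for, here is a minimal sketch that compares two encoders over a set of sample strings and reports the inputs where the token IDs differ. The function name `find_mismatches` and both toy encoders are hypothetical and exist only for this example; the actual workflow suggested above relies on convert_hf_to_gguf_update.py and test-tokenizer-0, which this sketch does not reproduce.

```python
from typing import Callable, Iterable, List

# An encoder maps a string to a list of token IDs.
Encoder = Callable[[str], List[int]]


def find_mismatches(reference: Encoder, candidate: Encoder, texts: Iterable[str]) -> List[str]:
    """Return every text for which the two encoders produce different token IDs."""
    return [text for text in texts if reference(text) != candidate(text)]


# Toy encoders (assumptions for illustration, not real tokenizers).
def byte_encoder(text: str) -> List[int]:
    # Stand-in for the reference tokenizer: one ID per UTF-8 byte.
    return list(text.encode("utf-8"))


def lossy_encoder(text: str) -> List[int]:
    # Stand-in for a faulty conversion: drops leading and trailing whitespace.
    return list(text.strip().encode("utf-8"))


samples = ["hello world", " leading space", "tabs\tand\nnewlines"]
mismatches = find_mismatches(byte_encoder, lossy_encoder, samples)
print(mismatches)
```

An empty result over a broad set of samples (whitespace, Unicode, special tokens) is the outcome one would hope for; any non-empty result points at inputs worth inspecting by hand.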

amirai21 commented Oct 11, 2025

@CISC Thanks for reviewing! The initial implementation was incorrect. Following your feedback, we realized there is already an existing function, _set_vocab_llama_hf in the TextModel class, that correctly handles LLaMA tokenizers. Since Jamba uses a LLaMA tokenizer (see the jamba-3b-reasoning tokenizer.config), we switched to using that implementation instead.
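A rough sketch of the resulting selection logic, under stated assumptions: the class `JambaVocabSketch` is a hypothetical stand-in rather than the converter's real class, both vocab methods are stubs that only record which branch ran, and the check for a `tokenizer.model` file is an assumed way of detecting SentencePiece mode. It is not the merged diff.

```python
from pathlib import Path


class JambaVocabSketch:
    """Hypothetical stand-in for the Jamba model class in the converter."""

    def __init__(self, dir_model: Path):
        self.dir_model = dir_model
        self.vocab_path = None  # records which branch ran (illustration only)

    def _set_vocab_sentencepiece(self) -> None:
        # Stub: the real method would build the vocab from the SentencePiece model.
        self.vocab_path = "sentencepiece"

    def _set_vocab_llama_hf(self) -> None:
        # Stub: the real method would build a llama vocab from tokenizer.json.
        self.vocab_path = "llama_hf"

    def set_vocab(self) -> None:
        # Assumption: SentencePiece mode is detected by a tokenizer.model file.
        if (self.dir_model / "tokenizer.model").is_file():
            self._set_vocab_sentencepiece()
        else:
            # Previously this branch used gpt-2 logic; it now uses the llama path.
            self._set_vocab_llama_hf()
```

With this shape, a model directory containing only tokenizer.json takes the llama path, while one with tokenizer.model keeps its existing SentencePiece behaviour.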

@CISC CISC merged commit 477a66b into ggml-org:master Oct 11, 2025
7 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* fix: convert_hf_to_gguf - change Jamba non-sentencepiece mode (tokenizer.json) vocab construction

* fix: convert_hf_to_gguf - jamba non-sentencepiece tokenizer to use _set_vocab_llama_hf func

* fix: convert_hf_to_gguf - removed get_vocab_base_pre from jamba
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025