Use a trie of all tokens to speed up grammar sampling. #14166
CharlesChen888
started this conversation in Ideas
-
I am new to llama.cpp, and I am trying to make models output JSON with a specific structure. I convert a JSON schema to GBNF and use a grammar sampler to achieve this, and it works. However, I found that the grammar sampler significantly reduces sampling efficiency.
I read the code briefly, and if my understanding of the grammar stack is correct, it seems every sampling step just goes through all the tokens in the vocabulary: it checks the first byte of each token and keeps the tokens that fit, then checks the second byte of the remaining tokens, and iterates this recursively.
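A minimal sketch of that per-step scan, assuming a hypothetical `accepts` predicate that stands in for the real grammar-stack matching in `llama-grammar.cpp` (the names here are illustrative, not the actual API):

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for the grammar stacks: can this byte prefix still be matched?
using AcceptFn = std::function<bool(const std::string &)>;

// Walk every token byte by byte, dropping it as soon as the grammar rejects
// the prefix. Worst case is O(vocab_size * token_length) predicate calls,
// repeated for every sampling step.
std::vector<int> filter_tokens_linear(const std::vector<std::string> & vocab,
                                      const AcceptFn & accepts) {
    std::vector<int> allowed;
    for (int id = 0; id < (int) vocab.size(); ++id) {
        std::string prefix;
        bool ok = true;
        for (char c : vocab[id]) {
            prefix.push_back(c);
            if (!accepts(prefix)) { ok = false; break; }
        }
        if (ok) {
            allowed.push_back(id);
        }
    }
    return allowed;
}
```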
So if we build a trie of all tokens, whenever the root of a subtrie does not fit, the whole subtrie can be skipped. I believe this can speed up grammar sampling a bit, especially for some multilingual models with very large vocabularies of more than 100,000 tokens. It might need a few hundred MB of memory, though.
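A sketch of the proposed trie, reusing the same hypothetical `accepts` predicate (illustrative names only; the child layout chosen here is also what the memory estimate would depend on):

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

using AcceptFn = std::function<bool(const std::string &)>;

struct TrieNode {
    // Sparse children keep memory down; a flat 256-slot array per node would
    // be faster to index but much larger.
    std::map<unsigned char, std::unique_ptr<TrieNode>> child;
    std::vector<int> token_ids; // tokens whose byte sequence ends at this node
};

// Built once, at model load time, from the whole vocabulary.
std::unique_ptr<TrieNode> build_trie(const std::vector<std::string> & vocab) {
    auto root = std::make_unique<TrieNode>();
    for (int id = 0; id < (int) vocab.size(); ++id) {
        TrieNode * node = root.get();
        for (const char c : vocab[id]) {
            auto & next = node->child[(unsigned char) c];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->token_ids.push_back(id);
    }
    return root;
}

// Depth-first walk: each distinct prefix is checked once, and a rejected
// prefix prunes every token in the subtrie below it in a single step.
void collect_allowed(const TrieNode * node, std::string & prefix,
                     const AcceptFn & accepts, std::vector<int> & allowed) {
    allowed.insert(allowed.end(), node->token_ids.begin(), node->token_ids.end());
    for (const auto & [b, next] : node->child) {
        prefix.push_back((char) b);
        if (accepts(prefix)) {
            collect_allowed(next.get(), prefix, accepts, allowed);
        }
        prefix.pop_back();
    }
}

int main() {
    const std::vector<std::string> vocab = { "{", "{\"", "}", "a" };
    const auto trie = build_trie(vocab);
    // Toy predicate: pretend the grammar only accepts text starting with '{'.
    const AcceptFn accepts = [](const std::string & p) { return p[0] == '{'; };
    std::string prefix;
    std::vector<int> allowed;
    collect_allowed(trie.get(), prefix, accepts, allowed);
    std::cout << allowed.size() << " tokens allowed\n"; // prints: 2
}
```

The win comes from shared prefixes: a byte that thousands of tokens start with is checked once instead of once per token, and a single rejection prunes all of them at once.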
Replies: 1 comment
-
Sounds about right (see #1773 (comment)). PRs welcome. Make sure to print the required memory somewhere in the logs. If we decide it is too much, there should be an option to disable it.