Use a trie of all tokens to speed up grammar sampling. #14166
CharlesChen888
started this conversation in Ideas
-
I am new to llama.cpp, and I am trying to make models output JSON with a specific structure. I convert a JSON schema to GBNF and use a grammar sampler to achieve this, and it works. However, I found that the grammar sampler significantly reduces sampling efficiency.
I read the code briefly, and if my understanding of the grammar stack is correct, it seems every sampling step just goes through all the tokens in the vocabulary: it checks the first byte of each token and keeps the tokens that fit, then checks the second byte of the remaining tokens, and iterates this recursively.
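A minimal sketch of that per-step scan, assuming a hypothetical `accepts` predicate that stands in for the real grammar-stack matching in `llama-grammar.cpp` (the names here are illustrative, not the actual API):

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for the grammar stacks: can this byte prefix still be matched?
using AcceptFn = std::function<bool(const std::string &)>;

// Walk every token byte by byte, dropping it as soon as the grammar rejects
// the prefix. Worst case is O(vocab_size * token_length) predicate calls,
// repeated for every sampling step.
std::vector<int> filter_tokens_linear(const std::vector<std::string> & vocab,
                                      const AcceptFn & accepts) {
    std::vector<int> allowed;
    for (int id = 0; id < (int) vocab.size(); ++id) {
        std::string prefix;
        bool ok = true;
        for (char c : vocab[id]) {
            prefix.push_back(c);
            if (!accepts(prefix)) { ok = false; break; }
        }
        if (ok) {
            allowed.push_back(id);
        }
    }
    return allowed;
}
```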
So if we build a trie of all tokens, whenever the root of a subtrie does not fit, the whole subtrie can be skipped. I believe this can speed up grammar sampling a bit, especially for some multilingual models with very large vocabularies of more than 100,000 tokens. It might need a few hundred MB of memory, though.
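A sketch of the proposed trie, reusing the same hypothetical `accepts` predicate (illustrative names only; the child layout chosen here is also what the memory estimate would depend on):

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

using AcceptFn = std::function<bool(const std::string &)>;

struct TrieNode {
    // Sparse children keep memory down; a flat 256-slot array per node would
    // be faster to index but much larger.
    std::map<unsigned char, std::unique_ptr<TrieNode>> child;
    std::vector<int> token_ids; // tokens whose byte sequence ends at this node
};

// Built once, at model load time, from the whole vocabulary.
std::unique_ptr<TrieNode> build_trie(const std::vector<std::string> & vocab) {
    auto root = std::make_unique<TrieNode>();
    for (int id = 0; id < (int) vocab.size(); ++id) {
        TrieNode * node = root.get();
        for (const char c : vocab[id]) {
            auto & next = node->child[(unsigned char) c];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->token_ids.push_back(id);
    }
    return root;
}

// Depth-first walk: each distinct prefix is checked once, and a rejected
// prefix prunes every token in the subtrie below it in a single step.
void collect_allowed(const TrieNode * node, std::string & prefix,
                     const AcceptFn & accepts, std::vector<int> & allowed) {
    allowed.insert(allowed.end(), node->token_ids.begin(), node->token_ids.end());
    for (const auto & [b, next] : node->child) {
        prefix.push_back((char) b);
        if (accepts(prefix)) {
            collect_allowed(next.get(), prefix, accepts, allowed);
        }
        prefix.pop_back();
    }
}

int main() {
    const std::vector<std::string> vocab = { "{", "{\"", "}", "a" };
    const auto trie = build_trie(vocab);
    // Toy predicate: pretend the grammar only accepts text starting with '{'.
    const AcceptFn accepts = [](const std::string & p) { return p[0] == '{'; };
    std::string prefix;
    std::vector<int> allowed;
    collect_allowed(trie.get(), prefix, accepts, allowed);
    std::cout << allowed.size() << " tokens allowed\n"; // prints: 2
}
```

The win comes from shared prefixes: a byte that thousands of tokens start with is checked once instead of once per token, and a single rejection prunes all of them at once.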
Replies: 1 comment
-
Sounds about right (see #1773 (comment)). PRs welcome. Make sure to print the required memory somewhere in the logs. If we decide it is too much, there should be an option to disable it.