@Majdoddin Majdoddin commented Aug 5, 2024

This PR fulfills the wish, expressed in the current code, to use the faster Regex.

Before tokenization, the text is split into pieces according to regular-expression patterns. This PR drops the lookahead part of the pattern (the part that catches whitespace) and handles whitespace in code instead, with provably identical output.
Because Regex does not support lookahead, dropping it makes it possible to use the linear-time Regex instead of fancy-regex, which gives a 14x speedup of pattern matching. As pattern matching currently accounts for 90% of the encoding runtime, the total runtime improves about 6x (1 / (0.10 + 0.90 / 14) ≈ 6.1).
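
As an illustration of the idea (not the PR's Rust code), here is a minimal Python sketch. It assumes a simplified, ASCII-only stand-in for the split pattern; the function names `split_reference` and `split_scripted` are hypothetical. Python's `re` supports lookahead, so the lookahead version can serve as a reference to compare against the lookahead-free version plus a small whitespace fix-up.

```python
import re

# Simplified ASCII-only stand-ins for the real patterns (an assumption for
# illustration; the real patterns use Unicode classes and more alternatives).
WITH_LOOKAHEAD = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")
NO_LOOKAHEAD = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_reference(text):
    """Split using the pattern that still contains the lookahead."""
    return WITH_LOOKAHEAD.findall(text)


def split_scripted(text):
    """Split using the lookahead-free pattern, fixing up whitespace in code."""
    pieces = []
    pos = 0
    while pos < len(text):
        m = NO_LOOKAHEAD.match(text, pos)
        if m is None:  # not expected: every character fits some alternative
            break
        end = m.end()
        # Emulate \s+(?!\S): a whitespace run of two or more characters that is
        # followed by non-whitespace gives up its last character, which is then
        # matched again on the next iteration.
        if m.group().isspace() and end - pos > 1 and end < len(text):
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces


samples = ["hello  world", "a\n\n0", "  leading", "trailing   ", "x \t y", ""]
for s in samples:
    assert split_scripted(s) == split_reference(s)
```

The fix-up only fires for pure-whitespace matches, so pieces such as `" world"` that start with a single space are left to the earlier alternatives, just as in the lookahead version.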

Although fancy_regex delegates to Regex when the pattern has no special features, it is still about 10% slower in tests, so we use Regex directly.
This improvement applies to pattern matching on ordinary text. Special tokens are still matched with fancy_regex.

Tests
For encoding o200k_base (used by model GPT-4o)

| Text | Number of tokens | Current runtime | PR runtime |
|---|---|---|---|
| wikitext-103 (100 MB) | 22138325 | 18.94s | 4.94s |
| Linux code (100 MB) | 36119543 | 30.28s | 4.59s |

…a 6x speedup. To make the regex patterns compatible with Regex, drops the part of the
patterns that matches whitespace, and handles whitespace in code instead of regex, still
with exactly the same output. _encode_native calls _encode_ordinary_native_impl directly
(_encode_ordinary_native is now a wrapper of _encode_ordinary_native_impl).

Bigheem commented Aug 31, 2024

@bigheemseafood

tmm1 pushed a commit to anysphere/tiktoken-rs that referenced this pull request Nov 9, 2024
Based on openai#331

Uses Regex in _encode_ordinary_native instead of fancy-regex, to get a 6x speedup. To make the
regex patterns compatible with Regex, drops the part of the patterns that matches whitespace,
and handles whitespace in code instead of regex, still with exactly the same output.
_encode_native calls _encode_ordinary_native_impl directly (_encode_ordinary_native is now a
wrapper of _encode_ordinary_native_impl).

tmm1 commented Nov 9, 2024

Thanks for your work on this!

I noticed this code block which sounds like it would need to change along with these regexes?

tiktoken/src/lib.rs

Lines 405 to 409 in 6352764

// For example, with gpt2, the use of \s+(?!\S) means that "\n\n" could
// develop a split, e.g. "\n\n0" splits into "\n"+"\n"+"0", making "\n" a possible token.
// Here is a quick and dirty fix:
// This isn't right if we ever remove \s+(?!\S)
if unstable_bytes.len() > 1 {
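
The split described in that comment can be reproduced with a small Python sketch. The two patterns below are reduced stand-ins (whitespace and digits only), an assumption made for illustration rather than the real gpt2 pattern:

```python
import re

# Reduced stand-in patterns (assumption): only the whitespace and digit parts.
with_lookahead = re.compile(r"\s+(?!\S)|\s+|\d+")
without_lookahead = re.compile(r"\s+|\d+")

text = "\n\n0"
# With the lookahead, the run "\n\n" gives up its last character because a
# non-whitespace character follows, so "\n" appears as a piece on its own.
print(with_lookahead.findall(text))     # ['\n', '\n', '0']
# Without it (and without any fix-up in code), the run stays whole.
print(without_lookahead.findall(text))  # ['\n\n', '0']
```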

@Majdoddin (Author)

@tmm1 I've implemented it for encode_ordinary(). That part is for unstable encoding.
By the way, I'd really appreciate it if you submitted a review for this PR.
