A test suite comparing Node.js BPE tokenizers for use with AI models.

DoubleTechnologies/compare-tokenizers

Compare Tokenizers

Intro

This repo contains a small test suite for comparing the results of different Node.js BPE tokenizers for use with LLMs like GPT-3.

Check out OpenAI's tiktoken Rust / Python lib for reference and OpenAI's Tokenizer Playground to experiment with different inputs.

This repo only tests tokenizers aimed at text, not code-specific tokenizers like the ones used by Codex.

Benchmark

Package / encoder                  Average Time (ms)   Variance (ms)
gpt3-tokenizer                     56132               334621
gpt-3-encoder                      31148               333120
@dqbd/tiktoken gpt2                9267                1490
@dqbd/tiktoken text-davinci-003    9078                733

(lower times are better)

@dqbd/tiktoken, a wasm port of the official Rust tiktoken, is ~3-6x faster than the pure-JS packages (about 9.1-9.3 s versus 31-56 s average time in the table above), with significantly less memory overhead and variance. 🔥
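The ~3-6x figure can be checked directly against the table. The snippet below is only a worked calculation on the published averages; the `speedup` helper and variable names are illustrative and are not part of this repo.

```typescript
// Average times (ms) copied from the benchmark table above.
const averageMs: Record<string, number> = {
  'gpt3-tokenizer': 56132,
  'gpt-3-encoder': 31148,
  '@dqbd/tiktoken gpt2': 9267,
  '@dqbd/tiktoken text-davinci-003': 9078
}

// Hypothetical helper: how many times faster `candidate` is than `baseline`.
function speedup(baseline: string, candidate: string): number {
  return averageMs[baseline] / averageMs[candidate]
}

// Smallest gap: fastest JS package vs. slowest wasm encoder.
const low = speedup('gpt-3-encoder', '@dqbd/tiktoken gpt2')
// Largest gap: slowest JS package vs. fastest wasm encoder.
const high = speedup('gpt3-tokenizer', '@dqbd/tiktoken text-davinci-003')

console.log(`${low.toFixed(1)}x to ${high.toFixed(1)}x`) // prints "3.4x to 6.2x"
```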

To reproduce:

npx tsx src/bench.ts
# or
pnpm build
node build/bench.mjs
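
The actual benchmark lives in src/bench.ts and is not reproduced here. As a rough sketch of how an average time and variance like those in the table might be measured, the following times repeated runs of an encoder and summarizes the samples. Everything in it is an assumption for illustration: the `Encode` type, `timeEncoder`, `meanAndVariance`, the use of population variance, and the `fakeEncode` stand-in, which is not a real BPE tokenizer.

```typescript
// Assumed shape of an encoder: text in, token ids out.
type Encode = (text: string) => ArrayLike<number>

// Mean and population variance of a list of timings (ms).
// Whether the repo uses population or sample variance is not stated; this is a guess.
function meanAndVariance(samples: number[]): { mean: number; variance: number } {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length
  return { mean, variance }
}

// Encode every input `runs` times and record one elapsed-time sample per run.
function timeEncoder(encode: Encode, inputs: string[], runs: number): number[] {
  const samples: number[] = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    for (const input of inputs) encode(input)
    samples.push(performance.now() - start)
  }
  return samples
}

// Hypothetical stand-in (NOT a BPE encoder): one "token" per Unicode code point.
const fakeEncode: Encode = (text) =>
  Array.from(text, (ch) => ch.codePointAt(0) ?? 0)

const samples = timeEncoder(fakeEncode, ['hello', 'hello 👋 world 🌍'], 5)
console.log(meanAndVariance(samples))
```

To compare real packages, each one would be wrapped in a function matching `Encode` and passed to `timeEncoder` with the same inputs and run count.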

Tokenization Tests

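This maps over an array of test fixtures in different languages and prints the number of tokens each tokenizer generates. In the output below, gpt3-tokenizer, gpt-3-encoder, and @dqbd/tiktoken's gpt2 encoding agree on every fixture, while @dqbd/tiktoken's text-davinci-003 encoding produces fewer tokens on fixtures 3 and 6-9, presumably because that model uses a different encoding than GPT-2.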

0) 5 chars "hello" ⇒ {
  'gpt3-tokenizer': 1,
  'gpt-3-encoder': 1,
  '@dqbd/tiktoken gpt2': 1,
  '@dqbd/tiktoken text-davinci-003': 1
}
1) 17 chars "hello 👋 world 🌍" ⇒ {
  'gpt3-tokenizer': 7,
  'gpt-3-encoder': 7,
  '@dqbd/tiktoken gpt2': 7,
  '@dqbd/tiktoken text-davinci-003': 7
}
2) 445 chars "Lorem ipsum dolor si..." ⇒ {
  'gpt3-tokenizer': 153,
  'gpt-3-encoder': 153,
  '@dqbd/tiktoken gpt2': 153,
  '@dqbd/tiktoken text-davinci-003': 153
}
3) 2636 chars "Lorem ipsum dolor si..." ⇒ {
  'gpt3-tokenizer': 939,
  'gpt-3-encoder': 939,
  '@dqbd/tiktoken gpt2': 939,
  '@dqbd/tiktoken text-davinci-003': 922
}
4) 246 chars "也称乱数假文或者哑元文本, 是印刷及排版..." ⇒ {
  'gpt3-tokenizer': 402,
  'gpt-3-encoder': 402,
  '@dqbd/tiktoken gpt2': 402,
  '@dqbd/tiktoken text-davinci-003': 402
}
5) 359 chars "利ヘオヒヲ特逆もか意書購サ米公え出主トほ..." ⇒ {
  'gpt3-tokenizer': 621,
  'gpt-3-encoder': 621,
  '@dqbd/tiktoken gpt2': 621,
  '@dqbd/tiktoken text-davinci-003': 621
}
6) 2799 chars "это текст-"рыба", ча..." ⇒ {
  'gpt3-tokenizer': 2813,
  'gpt-3-encoder': 2813,
  '@dqbd/tiktoken gpt2': 2813,
  '@dqbd/tiktoken text-davinci-003': 2811
}
7) 658 chars "If the dull substanc..." ⇒ {
  'gpt3-tokenizer': 175,
  'gpt-3-encoder': 175,
  '@dqbd/tiktoken gpt2': 175,
  '@dqbd/tiktoken text-davinci-003': 170
}
8) 3189 chars "Enter [two Players a..." ⇒ {
  'gpt3-tokenizer': 876,
  'gpt-3-encoder': 876,
  '@dqbd/tiktoken gpt2': 876,
  '@dqbd/tiktoken text-davinci-003': 872
}
9) 17170 chars "ANTONY. [To CAESAR] ..." ⇒ {
  'gpt3-tokenizer': 5801,
  'gpt-3-encoder': 5801,
  '@dqbd/tiktoken gpt2': 5801,
  '@dqbd/tiktoken text-davinci-003': 5306
}

To reproduce:

npx tsx src/index.ts
# or
pnpm build
node build/index.mjs
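
The real test script is src/index.ts. A minimal sketch of the same idea, mapping fixtures over a set of named tokenizers and printing the token counts, could look like the following. The two tokenizers here are hypothetical stand-ins (whitespace splitting and code-point splitting), not BPE encoders, so their counts will not match the output above; `Tokenizer`, `tokenizers`, and `countTokens` are names invented for this example.

```typescript
// Assumed shape of a tokenizer: text in, token ids out.
type Tokenizer = (text: string) => ArrayLike<number>

// Hypothetical stand-ins, NOT real BPE encoders.
const tokenizers: Record<string, Tokenizer> = {
  'whitespace (stand-in)': (text) =>
    text.split(/\s+/).filter(Boolean).map((_, i) => i),
  'code points (stand-in)': (text) =>
    Array.from(text, (ch) => ch.codePointAt(0) ?? 0)
}

// Two of the shorter fixtures shown in the output above.
const fixtures = ['hello', 'hello 👋 world 🌍']

// Run every tokenizer on one text and collect the token counts by name.
function countTokens(text: string): Record<string, number> {
  const counts: Record<string, number> = {}
  for (const [name, tokenize] of Object.entries(tokenizers)) {
    counts[name] = tokenize(text).length
  }
  return counts
}

fixtures.forEach((text, index) => {
  // `text.length` counts UTF-16 code units, so each emoji contributes 2.
  const preview = text.length > 20 ? text.slice(0, 20) + '...' : text
  console.log(`${index}) ${text.length} chars "${preview}" ⇒`, countTokens(text))
})
```

Swapping the stand-ins for the real packages would mean wrapping each library's encode call in a function of type `Tokenizer`, with the exact call depending on each package's API.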

License

MIT © Travis Fischer

If you found this project interesting, please consider sponsoring me or following me on Twitter.
