v2.0.0 🤗tokenizers integration and TokSequence
TL;DR
This major update brings:
- The integration of the Hugging Face 🤗tokenizers library as the Byte Pair Encoding (BPE) backend. BPE is now 30 to 50 times faster, for both training and encoding! 🙌
- A new `TokSequence` object to represent tokens! This object holds tokens as tokens (strings), ids (integers to pass to models), Events and bytes (used internally for BPE).
- Many internal changes, with methods and variables renamed, that require you to update some of your code (details below).
Changes
- a9b82e4 The `Vocabulary` class is replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
- a9b82e4 New `special_tokens` constructor argument for all tokenizers, in place of the previous `pad`, `mask`, `sos_eos` and `sep` arguments. It is a list of tokens (str) for more versatility. By default, special tokens are `["PAD", "BOS", "EOS", "MASK"]`;
- a9b82e4 `__getitem__` now handles both ids (int) and tokens (str), with multi-vocab;
- 36bf0f6 Some methods of `MIDITokenizer` meant to be used internally are now protected;
- a2db7b9 New training method with 🤗tokenizers BPE model;
- 9befb8d `TokSequence` object, used as the input and output object of the `midi_to_tokens` and `tokens_to_midi` methods, thanks to the `_in_as_seq` and `_out_as_complete_seq` decorators;
- 9befb8d `complete_sequence` method to automatically fill in the uninitialized attributes of a `TokSequence` (ids, tokens);
- 9befb8d `tokens_to_events` renamed `_ids_to_tokens`, and new recursive id / token / byte conversion methods;
- 9befb8d Tokens are now saved and loaded with the `ids` key (previously `tokens`);
- cddd29c Tokenization files moved to a dedicated tokenizations module;
- cddd29c `decompose_bpe` method renamed `decode_bpe`;
- d520128 `tokenize_dataset` allows applying BPE afterwards.
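
To make the dictionary-based vocabulary and the idea behind `complete_sequence` more concrete, here is a toy sketch. It is not the library's code: `ToySequence`, `ToyTokenizer`, the private `_id_to_token` dictionary and the five-entry vocabulary are all hypothetical stand-ins, assuming only what the list above states (a dictionary vocabulary, `__getitem__` accepting ids or tokens, and a method that fills in whichever of ids / tokens is missing).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class ToySequence:
    """Toy stand-in for a token sequence; NOT the library's TokSequence."""

    tokens: Optional[List[str]] = None
    ids: Optional[List[int]] = None


class ToyTokenizer:
    """Hypothetical tokenizer whose vocabulary is a plain dictionary."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab  # token (str) -> id (int)
        self._id_to_token = {i: t for t, i in vocab.items()}  # id -> token

    def __getitem__(self, item: Union[int, str]) -> Union[int, str]:
        # A str is looked up as a token, an int as an id.
        if isinstance(item, str):
            return self.vocab[item]
        return self._id_to_token[item]

    def complete_sequence(self, seq: ToySequence) -> None:
        # Fill in whichever of ids / tokens is missing from the other one.
        if seq.ids is None and seq.tokens is not None:
            seq.ids = [self.vocab[t] for t in seq.tokens]
        elif seq.tokens is None and seq.ids is not None:
            seq.tokens = [self._id_to_token[i] for i in seq.ids]


# Made-up vocabulary, for illustration only.
tokenizer = ToyTokenizer({"PAD": 0, "BOS": 1, "EOS": 2, "MASK": 3, "Pitch_60": 4})
seq = ToySequence(ids=[1, 4, 2])
tokenizer.complete_sequence(seq)
print(seq.tokens)  # ['BOS', 'Pitch_60', 'EOS']
```

The real object also carries Events and bytes, which this sketch leaves out.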
Compatibility
Tokens and tokenizers from v1.4.3 and before remain compatible: this update does not change the tokenizations themselves.
However, you will need to adapt your saved files to load them, and update some of your code for the following changes:
- Tokens are now saved and loaded with the `ids` key (previously `tokens`). To adapt your previously saved tokens, open them with json and rewrite them with the `ids` key instead;
- `midi_to_tokens` (also called with `tokenizer(midi)`) now outputs a list of `TokSequence`s, each holding tokens as tokens (str) and their ids (int). It previously returned token ids. You can now get them by accessing the `.ids` attribute, as `tokseq.ids`;
- The `Vocabulary` class is deleted. You can still access the vocabulary with `tokenizer.vocab`, but it is now a dictionary. The methods of the `Vocabulary` class are now directly integrated in `MIDITokenizer`;
- For all tokenizers, the `pad`, `mask`, `sos_eos` and `sep` constructor arguments need to be replaced with the new `special_tokens` argument;
- `decompose_bpe` method renamed `decode_bpe`.
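
The first point, rewriting saved token files with the new key, can be sketched with the standard library alone. This is a minimal sketch under assumptions: the helper name `migrate_token_file` is hypothetical, and it assumes each saved file is a JSON object with a top-level `tokens` key; all other keys are kept untouched.

```python
import json
from pathlib import Path


def migrate_token_file(path: Path) -> bool:
    """Rename the top-level "tokens" key to "ids" in one saved token file.

    Returns True if the file was rewritten, False if it was left as-is.
    Hypothetical helper: assumes the file holds a JSON object.
    """
    with path.open() as f:
        data = json.load(f)

    # Skip files that are not JSON objects, or that are already migrated.
    if not isinstance(data, dict) or "tokens" not in data or "ids" in data:
        return False

    # Rebuild the dictionary so every other key is preserved in place.
    migrated = {("ids" if key == "tokens" else key): value for key, value in data.items()}
    with path.open("w") as f:
        json.dump(migrated, f)
    return True
```

It could then be applied to every file of a dataset, for instance with `for p in Path("my_tokens").glob("**/*.json"): migrate_token_file(p)`, where `my_tokens` is a placeholder directory name. Since the files are rewritten in place, keeping a backup first is prudent.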
Bug reports
Big changes can bring hidden bugs. We carefully checked that all methods pass the previous tests, and assessed the robustness of the new methods. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to solve them as quickly as possible.