v1.3.0 Special tokens update 🛠
## Highlight

Version 1.3.0 changes the way the vocabulary, and by extension tokenizers, handle special tokens: `PAD`, `SOS`, `EOS` and `MASK`. It brings a cleaner way to instantiate these classes.
It might bring incompatibilities with data and models used with previous MidiTok versions.
## Changes
- b9218bf The `Vocabulary` class now takes a `pad` argument specifying whether to include the special padding token. This option is set to True by default, as it is more common to train networks with batches of unequal sequence lengths.
- b9218bf `Vocabulary` class: the `event_to_token` argument of the constructor is renamed `events` and has to be given as a list of events.
- b9218bf `Vocabulary` class: when adding a token to the vocabulary, its index is now set automatically. The index argument is removed, as it could cause issues / confusion when mapping indexes with models.
- b9218bf The `Event` class now takes the `value` argument in second position.
- b9218bf Fix when learning BPE if `files_lim` was higher than the number of files itself.
- f9cb109 For all tokenizers, a new constructor argument `pad` specifies whether to use the padding token, and the `sos_eos_tokens` argument is renamed `sos_eos`.
- f9cb109 When creating a `Vocabulary`, the SOS and EOS tokens are now registered before the MASK token. This change is motivated so the order matches that of the special-token arguments in tokenizer constructors, and because the SOS and EOS tokens are more commonly used in symbolic music applications.
- 84db19d The dummy `StructuredEncoding`, `MuMIDIEncoding`, `OctupleEncoding` and `OctupleMonoEncoding` classes are removed from `__init__.py`. These classes from early versions had no record of being used. Other dummy classes (REMI, MIDILike and CPWord) remain.
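To make the new registration order concrete, here is a minimal sketch of how the special tokens end up ordered in v1.3.0: PAD first (so padding gets index 0), then SOS/EOS, then MASK. This is an illustration only, not MidiTok's actual `Vocabulary` implementation, and the plain token names used here are placeholders.

```python
def build_special_tokens(pad: bool = True, sos_eos: bool = False,
                         mask: bool = False) -> list:
    """Return special tokens in the v1.3.0 registration order:
    PAD first (index 0 when enabled), then SOS/EOS, then MASK."""
    tokens = []
    if pad:
        tokens.append("PAD")
    if sos_eos:
        tokens += ["SOS", "EOS"]
    if mask:
        tokens.append("MASK")
    return tokens

# With every option enabled, indices are PAD=0, SOS=1, EOS=2, MASK=3
token_to_index = {tok: i for i, tok in
                  enumerate(build_special_tokens(pad=True, sos_eos=True, mask=True))}
```

Note that with `pad=True` (the new default), the padding token always sits at index 0, which is the convention most deep-learning frameworks expect for padding.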
## Compatibility
- You might need to update your code when creating your tokenizer to handle the new `pad` argument.
- Data tokenized with REMI, and models trained with it, will be incompatible with v1.3.0 if you used special tokens. The BAR token was previously at index 1, and is now added after the special tokens.
- If you created a custom tokenizer inheriting from `MIDITokenizer`, make sure to update the calls to `super().__init__` with the new `pad` argument and the renamed `sos_eos` argument (example for MIDILike: f9cb109).
- If you used both SOS/EOS and MASK special tokens, their order (indexes) is now swapped, as SOS/EOS are now registered before MASK. As these tokens are not used during tokenization, your previously tokenized datasets remain compatible, unless you intentionally inserted SOS/EOS/MASK tokens. Trained models will however be incompatible, as the indices are swapped. If you want to use v1.3.0 with a previously trained model, you can manually invert the predictions of these tokens.
- No incompatibilities outside of these cases.
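For the last point, inverting the predictions amounts to remapping the swapped indices. The sketch below assumes a vocabulary built with `pad`, SOS/EOS and MASK all enabled, so the old order was PAD=0, MASK=1, SOS=2, EOS=3 and the new order is PAD=0, SOS=1, EOS=2, MASK=3; the exact indices in your own vocabulary may differ, so adapt the mapping accordingly.

```python
# Hypothetical index values: old ordering registered MASK before SOS/EOS,
# the v1.3.0 ordering registers SOS/EOS before MASK (PAD stays at 0).
OLD_TO_NEW = {1: 3,   # old MASK -> new MASK
              2: 1,   # old SOS  -> new SOS
              3: 2}   # old EOS  -> new EOS

def remap_prediction(old_index: int) -> int:
    """Map a token index predicted by a pre-v1.3.0 model
    to its position in the v1.3.0 vocabulary."""
    return OLD_TO_NEW.get(old_index, old_index)
```

All non-special tokens pass through unchanged, since only the special-token indices moved.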
Please reach out if you have any issue / question! 🙌