Releases: Natooz/MidiTok
v2.0.0 🤗tokenizers integration and TokSequence
TL;DR
This major update brings:
- The integration of the Hugging Face 🤗tokenizers library as the Byte Pair Encoding (BPE) backend. BPE is now 30 to 50 times faster, for both training and encoding! 🙌
- A new `TokSequence` object to represent tokens! This object holds tokens as tokens (strings), ids (integers to pass to models), Events, and bytes (used internally for BPE). A minimal usage sketch follows this list.
- Many internal changes; methods and variables have been renamed, which will require you to update some of your code (details below).
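As a quick illustration, here is a minimal sketch of what working with a `TokSequence` looks like. It assumes a REMI tokenizer with default parameters and a placeholder MIDI path; the attributes used are those listed above.

```python
from miditok import REMI
from miditoolkit import MidiFile

tokenizer = REMI()                   # any tokenizer works, REMI is just an example
midi = MidiFile("path/to/file.mid")  # placeholder path

# Tokenizing now returns TokSequence objects (one per track here)
tok_sequences = tokenizer(midi)
seq = tok_sequences[0]

print(seq.tokens[:5])  # tokens as strings
print(seq.ids[:5])     # integer ids to feed to a model
# seq also carries the Events and, when BPE is used, the byte representation
```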
Changes
- a9b82e4 `Vocabulary` class is replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
- a9b82e4 New `special_tokens` constructor argument for all tokenizers, in place of the previous `pad`, `mask`, `sos_eos` and `sep` arguments. It is a list of tokens (str) for more versatility. By default, the special tokens are `["PAD", "BOS", "EOS", "MASK"]`;
- a9b82e4 `__getitem__` now handles both ids (int) and tokens (str), with multi-vocab;
- 36bf0f6 Some methods of `MIDITokenizer` meant to be used internally are now protected;
- a2db7b9 New training method with the 🤗tokenizers BPE model (see the sketch after this list);
- 9befb8d `TokSequence` object, used as input and output of the `midi_to_tokens` and `tokens_to_midi` methods, thanks to the `_in_as_seq` and `_out_as_complete_seq` decorators;
- 9befb8d `complete_sequence` method allowing to automatically fill in the uninitialized attributes of a `TokSequence` (ids, tokens);
- 9befb8d `tokens_to_events` renamed `_ids_to_tokens`, and new id / token / byte conversion methods with recursivity;
- 9befb8d Tokens are now saved and loaded with the `ids` key (previously `tokens`);
- cddd29c Tokenization files moved to a dedicated tokenizations module;
- cddd29c `decompose_bpe` method renamed `decode_bpe`;
- d520128 `tokenize_dataset` allows to apply BPE afterwards.
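Below is a hedged sketch of training BPE with the new 🤗tokenizers backend. The method names (`tokenize_midi_dataset`, `learn_bpe`, `apply_bpe_to_dataset`) and their arguments are assumptions based on the notes above; check the documentation for the exact signatures.

```python
from pathlib import Path
from miditok import REMI

tokenizer = REMI()

# 1. Tokenize a MIDI dataset without BPE (paths are placeholders)
midi_paths = list(Path("dataset", "midis").glob("**/*.mid"))
tokenizer.tokenize_midi_dataset(midi_paths, Path("dataset", "tokens_no_bpe"))

# 2. Train BPE with the 🤗tokenizers backend on the saved token files
tokenizer.learn_bpe(
    vocab_size=1000,  # arbitrary target vocabulary size
    tokens_paths=list(Path("dataset", "tokens_no_bpe").glob("**/*.json")),
)

# 3. Apply BPE to the whole tokenized dataset
tokenizer.apply_bpe_to_dataset(
    Path("dataset", "tokens_no_bpe"), Path("dataset", "tokens_bpe")
)
```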
Compatibility
Tokens and tokenizers from v1.4.3 and before are compatible; this update does not change anything about the specific tokenizations.
However, you will need to adapt your files to load them, and to update some of your code to the new changes:
- Tokens are now saved and loaded with the `ids` key (previously `tokens`). To adapt your previously saved tokens, open them with json and rewrite them with the `ids` key instead (a migration sketch follows this list);
- `midi_to_tokens` (also called with `tokenizer(midi)`) now outputs a list of `TokSequence`s, each holding tokens as tokens (str) and their ids (int). It previously returned token ids. You can now get them by accessing the `.ids` attribute, as `tokseq.ids`;
- `Vocabulary` class deleted. You can still access the vocabulary with `tokenizer.vocab`, but it is now a dictionary. The methods of the `Vocabulary` class are now directly integrated in `MIDITokenizer`;
- For all tokenizers, the `pad`, `mask`, `sos_eos` and `sep` constructor arguments need to be replaced with the new `special_tokens` argument;
- `decompose_bpe` method renamed `decode_bpe`.
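For the first point, a minimal migration sketch (the directory is a placeholder) could look like this; it simply renames the `tokens` key to `ids` in every saved json file and keeps everything else untouched:

```python
import json
from pathlib import Path

for file_path in Path("dataset", "tokens").glob("**/*.json"):  # placeholder directory
    with open(file_path) as file:
        data = json.load(file)
    if "tokens" in data:
        data["ids"] = data.pop("tokens")  # rename the key, keep the rest as-is
        with open(file_path, "w") as file:
            json.dump(data, file)
```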
Bug reports
As with all big changes, hidden bugs may slip through. We carefully tested that all methods pass the previous tests, while assessing the robustness of the new methods. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to solve them as quickly as possible.
v1.4.3 BPE fix & Documentation
Changes
- 77f7c53 @dinhviettoanle (#24) Fixed a bug skipping token repetitions with BPE
- New documentation: miditok.readthedocs.io. We finally have a proper documentation website! 🙌 With it come many improvements and fixes in the docstrings.
- 201c9b7 Legacy `REMIEncoding`, `MIDILikeEncoding` and `CPWordEncoding` classes removed.
- e92a414 `token_types_errors` of the `MIDITokenizer` class now handles basic / common error cases
- Small minor code improvements
- 1486204 Use of `dataclasses`. This means that Python 3.6 (and earlier) is no longer compatible. Python 3.6 was compatible but not supported (tested) up to v1.4.2.
v1.4.2 SEP token & data augmentation offset combinations argument
Changes
- f6225a1 Added the option to have a `SEP` special token, which can be used to train models on tasks such as "next sequence prediction"
- bb24512 Data augmentation can now receive the `all_offset_combinations` argument, which performs augmentation with all the combinations of offsets. With the offsets $\left( x_1, x_2, x_3 \right)$, it will perform a total of $\prod_i x_i$ combinations ($\prod_i (x_i \times 2)$ if going both up and down). This is disabled by default to save you from hundreds of augmentations 🤓 (and it is not chained with `tokenize_midi_dataset`); by default, augmentations are done on the original input only. A small worked count follows this list.
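To make the count concrete, here is a small illustrative computation with arbitrary offsets, following the formulas above:

```python
# Hypothetical maximum offsets for pitch, velocity and duration
offsets = (2, 2, 2)

one_direction = 1
both_directions = 1
for x in offsets:
    one_direction *= x          # prod_i x_i
    both_directions *= x * 2    # prod_i (x_i * 2), shifting both up and down

print(one_direction)    # 8 augmented combinations
print(both_directions)  # 64 augmented combinations
```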
v1.4.1 Bugfix tokenize_midi_dataset
Changes
- 0e9131d Bugfix in the `tokenize_midi_dataset` method when directly performing data augmentation: a block was not indented as it should have been
v1.4.0 Data augmentation and optimization
This pretty big update brings data augmentation, some bug fixes, and optimizations that let you write more elegant code.
Changes
- 8f201e0 308fb27 Data augmentation methods! 🙌 They can be applied to both MIDIs and tokens, to augment data by shifting the pitch, velocity and duration values.
- 1d8e903 You can perform data augmentation while tokenizing a dataset (`tokenize_midi_dataset` method) with the `data_augment_offsets` argument. This is done at the token level, as it is faster than augmenting MIDI objects.
- 0634ade BPE is now implemented in the main tokenizer class! This means all tokenizers can benefit from it in a much prettier way!
- 0634ade `bpe` method renamed to `learn_bpe`, which now returns metrics (also shown in the progress bar during learning) on the number of token combinations and the sequence length reduction
- 7b8c977 Retrocompatibility when loading tokenizer config files with BPE from older versions
- 3cea9aa @nturusin Example notebook of a GPT2 Hugging Face music transformer: fixes in training
- 65afa6b The `tokens_to_midi` and `save_tokens` methods can now receive tokens as tensors and numpy arrays. PyTorch, TensorFlow and Jax (numpy) tensors are supported. The `convert_tokens_tensors_to_list` decorator converts them to lists; you can use it on your own custom methods.
- aab64aa The `__call__` magic method now automatically routes to `midi_to_tokens` or `tokens_to_midi` depending on what you give it. You can now use tokenizers more elegantly, as `tokenizer(midi_obj)` or `tokenizer(generated_tokens)` (a sketch follows this list).
- e90b20a Bugfix in `Structured` causing a possible infinite while loop with illegal token type successions
- 947af8c Big refactor of MuMIDI, which now has fixed vocab / type indices. It is easier to handle and use. (thanks @gonzaloarca)
- 947af8c CPWord "Ignore" tokens are all renamed `Ignore_None` by convention, making operations easier in data augmentation and other methods.
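A minimal sketch of the new `__call__` routing, assuming a REMI tokenizer and placeholder file paths:

```python
from miditok import REMI
from miditoolkit import MidiFile

tokenizer = REMI()

midi = MidiFile("path/to/file.mid")  # placeholder path
tokens = tokenizer(midi)             # routes to midi_to_tokens

# ... feed the tokens to a model, generate new ones, etc. ...

generated_midi = tokenizer(tokens)   # routes to tokens_to_midi
generated_midi.dump("path/to/generated.mid")  # placeholder path
```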
Compatibility
- Code using BPE will have to be updated: remove `bpe(tokenizer)` and just declare tokenizers normally, and rename the `bpe` method to `learn_bpe`
- MuMIDI tokens and tokenizers from previous versions will be incompatible with v1.4.0
v1.3.3 Minor bugfixes
Changes
- 4f4e49e Magic method `len` bugfix with multi-vocab tokenizers; `len` is now also a property
- 925c7ae & 5b4f410 Bugfix of token type initialization when loading a tokenizer from a params file
- c873456 Removed hyphens from token type names, for better visibility. By convention, token types are all written in CamelCase.
- 5e51e84 New `multi_voc` property
- b3b0cc7 `tokenize_dataset`: the progress bar now shows the saving directory name
Compatibility
- All good 🙌
v1.3.2 Bugfix
v1.3.1 unique_track parameter & minor fixes / changes
Highlights
This version uniformly cleans how `save_params` is called, and brings related minor fixes and new features.
Changes
- 3c4adf8 Tokenizers now take a `unique_track` argument at creation. This parameter specifies whether the tokenizer represents and handles music as a single track or stream of tokens. This is the case for Octuple and MuMIDI, and probably for most representations that natively support multitrack music. If given True, the tokens will be saved in json files as a single track. This parameter can then help when loading tokenized datasets.
- 3c4adf8 `save_params` method: `out_dir` argument renamed to `out_path`
- 3c4adf8 `save_params` method: `out_path` can now specify the full path and name of the saved config file (a sketch follows this list)
- 3c4adf8 Fixes in the `save_params` method for MuMIDI
- 3c4adf8 The current version number is fixed (it was 1.2.9 instead of 1.3.0 for v1.3.0)
- 4be897b The `bpe` method (learning the BPE vocabulary) now has a `print_seq_len_variation` argument, to optionally print the mean sequence length before and after BPE, and the variation in % (default: True)
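A hedged sketch of the renamed `save_params` argument (passing `out_path` as a keyword is an assumption, and the path is a placeholder):

```python
from miditok import REMI

tokenizer = REMI()

# out_dir was renamed to out_path, and out_path can now be the full path
# and name of the saved config file (placeholder path below).
tokenizer.save_params(out_path="config/tokenizer_params.json")
```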
Compatibility
- You might need to update your code when:
  - creating a tokenizer, to handle the new `unique_track` argument;
  - saving a tokenizer's config, to handle the `out_dir` argument renamed to `out_path`.
- Datasets tokenized with BPE will need the `token_to_event` key changed to `vocab` in the associated tokenizer configuration file.
v1.3.0 Special tokens update 🛠
Highlight
Version 1.3.0 changes the way the vocabulary, and by extension tokenizers, handle special tokens: `PAD`, `SOS`, `EOS` and `MASK`. It brings a cleaner way to instantiate these classes.
It might bring incompatibilities with data and models used with previous MidiTok versions.
Changes
- b9218bf `Vocabulary` class now takes a `pad` argument to specify whether to include the special padding token. This option is set to True by default, as it is more common to train networks with batches of unequal sequence lengths.
- b9218bf `Vocabulary` class: the `event_to_token` argument of the constructor is renamed `events` and has to be given as a list of events.
- b9218bf `Vocabulary` class: when adding a token to the vocabulary, the index is automatically set. The index argument is removed, as it could cause issues / confusion when mapping indexes with models.
- b9218bf The `Event` class now takes the `value` argument in second position
- b9218bf Fix when learning BPE if `files_lim` was higher than the number of files itself
- f9cb109 For all tokenizers, a new constructor argument `pad` specifies whether to use the padding token, and the `sos_eos_tokens` argument is renamed to `sos_eos` (a sketch follows this list)
- f9cb109 When creating a `Vocabulary`, the SOS and EOS tokens are now registered before the MASK token. This change was made so that the order matches the order of the special token arguments in the tokenizers' constructors, and because the SOS and EOS tokens are more commonly used in symbolic music applications.
- 84db19d The dummy `StructuredEncoding`, `MuMIDIEncoding`, `OctupleEncoding` and `OctupleMonoEncoding` classes are removed from `init.py`. These classes from early versions had no record of being used. Other dummy classes (REMI, MIDILike and CPWord) remain.
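A minimal sketch of the updated constructor arguments, assuming a REMI tokenizer; only the special-token arguments are shown, and the `mask` argument name is an assumption (check the docstrings for the exact signature):

```python
from miditok import REMI

# pad: include the padding token (True by default)
# sos_eos: previously named sos_eos_tokens, include the SOS/EOS tokens
# mask: include the MASK token (argument name assumed here)
tokenizer = REMI(pad=True, sos_eos=True, mask=True)
```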
Compatibility
- You might need to update your code when creating your tokenizer to handle the new `pad` argument.
- Data tokenized with REMI, and models trained on it, will be incompatible with v1.3.0 if you used special tokens. The BAR token was previously at index 1, and is now added after the special tokens.
- If you created a custom tokenizer inheriting `MIDITokenizer`, make sure to update the calls to `super().__init__` with the new `pad` arg and the renamed `sos_eos` arg (example for MIDILike: f9cb109)
- If you used both SOS/EOS and MASK special tokens, their order (indexes) is now swapped, as SOS/EOS are now registered before MASK. As these tokens should not be used during the tokenization itself, your previously tokenized datasets remain compatible, unless you intentionally inserted SOS/EOS/MASK tokens. Trained models will however be incompatible, as the indices are swapped. If you want to use v1.3.0 with a previously trained model, you can manually invert the predictions of these tokens.
- No incompatibilities outside of these cases
Please reach out if you have any issue / question! 🙌
v1.2.9 BPE speed boost & small improvements
Changes
- 212a943 BPE: speed boost in the `apply_bpe` method, about 1.5 times faster 🚀
- 4b8ccb9 BPE: the `tokens_to_events` method is no longer inplace
- be3e244 The `save_tokens` method now takes `**kwargs` arguments to save additional information in json files (a sketch follows this list)
- b690cab Fix when computing the `max_tick` attribute of a MIDI that has tracks with no notes
- f1855b6 The MidiTok package version is now saved with the tokenizer parameters. This allows keeping track of the version used.
- Lint and coverage improvements ✨
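A small sketch of passing extra information through the new `**kwargs` of `save_tokens`; the extra keys and the paths are placeholders:

```python
from miditok import REMI
from miditoolkit import MidiFile

tokenizer = REMI()
midi = MidiFile("path/to/file.mid")      # placeholder path
tokens = tokenizer.midi_to_tokens(midi)

tokenizer.save_tokens(
    tokens,
    "path/to/tokens.json",               # placeholder output path
    original_midi="file.mid",            # arbitrary extra info saved via **kwargs
    nb_tempo_changes=len(midi.tempo_changes),
)
```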
Compatibility
- If you explicitly used `tokens_to_events`, you might need to adapt your code, as it is no longer inplace.