Rework Pretrained Tokenizer and Indexer #3453
Conversation
This one is ready to be reviewed.
Thanks, this looks much better. There's just one small design issue I'm still concerned about, mentioned in the comments.
This looks good to me at this point, but it would be nice to hear from @DeNeutoy or @brendan-ai2 on the bigger question about `tokenize_sentences` before merging, if you have any opinion.
I get the idea here of wanting to be able to switch between two types of tokenizers for sentence pairs, but I'm not sure you actually get the desired behaviour. For example, suppose you are doing sentence pair classification: with BERT, you use a single `TextField` containing both sentences and the separators, whereas with another tokenizer you use two `TextField`s, one for each sentence. So being able to switch tokenizers in the two-sentence case is not particularly useful, because you have to change your input representation anyway.

It is clearly useful to be able to switch the tokenizer for single sentences, which this PR achieves nicely. I just think we can remove the `tokenize_sentences` method entirely.
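For concreteness, here is a minimal sketch of the two input representations being contrasted. The field names and indexer setup are illustrative only (and it assumes the then-current `TextField` API that takes token indexers at construction), not the PR's actual code:

```python
from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.tokenizers import Token
from allennlp.data.token_indexers import SingleIdTokenIndexer

indexers = {"tokens": SingleIdTokenIndexer()}

# BERT-style: one TextField holding both sentences plus separator tokens.
bert_tokens = [Token("[CLS]"), Token("the"), Token("cat"), Token("[SEP]"),
               Token("a"), Token("feline"), Token("[SEP]")]
bert_instance = Instance({"tokens": TextField(bert_tokens, indexers)})

# Non-BERT style: two TextFields, one per sentence.
premise = TextField([Token("the"), Token("cat")], indexers)
hypothesis = TextField([Token("a"), Token("feline")], indexers)
pair_instance = Instance({"premise": premise, "hypothesis": hypothesis})
```

The point is that switching from one representation to the other changes the model's inputs, not just the tokenizer.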
I addressed the feedback and also moved BERT et al. vocab creation inside the indexer.

There are 2 tests failing that do not seem to be related to the code I changed (please tell me if they are, since all my local pytest checks pass). So I believe this PR is ready for review.
Looking very close, thanks for all of your work here!
The failing tests look like they come from a spacy update. I'm not sure how to handle that failure, though. @DeNeutoy, any ideas?
I addressed the feedback. All the relevant tests pass!
Thanks @maksym-del!
Thanks!
The idea is to both tokenize and index in the tokenizer, adding the token ids to the `Token` data structure. In this setup, `PretrainedTransformerIndexer` only needs to pick up the correct token id and create an AllenNLP vocabulary from the pretrained one, so we subclass `SingleIdTokenIndexer`.
To handle special tokens, we first index the string with the huggingface method that also adds the special token ids, and then convert those ids back to string tokens (to be used by the AllenNLP `Token`s).
This feels to me like the cleanest overall approach.
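A rough sketch of that flow, assuming the huggingface `transformers` API; the helper name and the exact `Token` fields used are illustrative, not necessarily what the PR implements:

```python
from allennlp.data.tokenizers import Token
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_index(text: str):
    # Let huggingface add the special tokens ([CLS], [SEP], ...) while indexing.
    token_ids = tokenizer.encode(text, add_special_tokens=True)
    # Convert the ids back to string tokens so each AllenNLP Token carries both.
    token_strings = tokenizer.convert_ids_to_tokens(token_ids)
    # Each Token stores its pretrained id in text_id; the indexer just reads it.
    return [Token(text=t, text_id=i) for t, i in zip(token_strings, token_ids)]

tokens = tokenize_and_index("AllenNLP is great")
# e.g. [CLS] allen ##nl ##p is great [SEP], each with its wordpiece id attached
```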
We also now have a token_type_id field added to the `Token`. It can be computed automatically with huggingface's code. I do not use it anywhere yet, since it is not used by recent transformers like XLM or RoBERTa, but a future PR can make better use of it.
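For reference, a hedged sketch of how huggingface can supply those type ids for a sentence pair; whether the PR uses `encode_plus` or another call is an assumption here:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode_plus("the cat", "a feline", add_special_tokens=True)
# input_ids are the wordpiece ids; token_type_ids mark which segment
# (0 = first sentence, 1 = second) each position belongs to.
print(encoded["input_ids"], encoded["token_type_ids"])
```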
Fixes #3356 and answers the question in #3224.