
Rework Pretrained Tokenizer and Indexer #3453

Merged: 17 commits into allenai:master on Nov 19, 2019

Conversation

@MaksymDel (Contributor) commented Nov 14, 2019

The idea is to both tokenize and index tokens inside the tokenizer, storing the token ids on the Token data structure. In this setup, PretrainedTokenIndexer only needs to pick up the correct token id and create the AllenNLP vocabulary from the pretrained one, so we subclass SingleIdTokenIndexer.
To handle special tokens, we first index the string with the huggingface method that also adds the special token ids, and then convert those indices back to string tokens (to be stored on the AllenNLP Tokens).
This feels like the cleanest overall approach to me.
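
A minimal sketch of that flow, assuming the huggingface `transformers` API; the helper name `tokenize_and_index` and the model name are illustrative, not the PR's actual code:

```python
# A minimal sketch (illustrative names, not the PR's actual code) of tokenizing
# and indexing in one pass with a huggingface tokenizer.
from transformers import AutoTokenizer

def tokenize_and_index(text: str, model_name: str = "bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Index the raw string first so huggingface also inserts the special token
    # ids ([CLS], [SEP], ...), then convert the ids back to string tokens.
    token_ids = tokenizer.encode(text, add_special_tokens=True)
    token_strings = tokenizer.convert_ids_to_tokens(token_ids)
    # Each (string, id) pair would be stored on a single AllenNLP Token, so the
    # indexer later only has to read the id back out.
    return list(zip(token_strings, token_ids))
```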

We also add a token_type_id field to the Token, which can now be computed automatically with huggingface's code. I do not use it anywhere yet, since it is not used by recent transformers like XLM or RoBERTa, but a future PR can make better use of it.
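
For illustration, this is roughly how token type ids can be obtained from a huggingface tokenizer for a sentence pair (a hedged sketch, not the code added in this PR):

```python
# Illustrative only: obtaining token_type_ids for a sentence pair from a
# huggingface tokenizer (not the exact code added in this PR).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode_plus("A cat sat.", "It was tired.", add_special_tokens=True)
# 0s mark tokens of the first segment, 1s mark tokens of the second segment.
print(encoded["token_type_ids"])
```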

Fixes #3356 and answers the question in #3224.

@MaksymDel (Contributor, Author)

This one is ready to be reviewed.

@matt-gardner (Contributor) left a comment


Thanks, this looks much better. There's just one small design issue I'm still concerned about, mentioned in the comments.

@matt-gardner (Contributor) left a comment


This looks good to me at this point, but it would be nice to hear from @DeNeutoy or @brendan-ai2 on the bigger question about tokenize_sentences before merging, if either of you has an opinion.

@DeNeutoy (Contributor) left a comment


I get the idea of wanting to be able to switch between two types of tokenizers for sentence pairs, but I'm not sure you actually get the desired behaviour. For example, suppose you are doing sentence-pair classification: with BERT you use a single TextField which contains both sentences and the separators, whereas with another tokenizer you use two TextFields, one for each sentence. So being able to switch tokenizers in the two-sentence case is not particularly useful, as you have to change your input representation anyway.

It is clearly useful to be able to switch the tokenizer for single sentences, which this PR achieves nicely. I just think we can remove the tokenize_sentences method entirely.
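
For context, here is a hypothetical sketch of the two input representations contrasted above; the field names and toy tokens are illustrative, not taken from this PR:

```python
# A hypothetical sketch of the two input representations for sentence pairs;
# field names and toy tokens are illustrative, not taken from this PR.
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

indexers = {"tokens": SingleIdTokenIndexer()}

# BERT-style: one TextField that already holds both sentences plus separators.
joint = [Token(t) for t in ["[CLS]", "a", "cat", "[SEP]", "an", "animal", "[SEP]"]]
bert_instance = {"tokens": TextField(joint, indexers)}

# Other tokenizers: one TextField per sentence, combined later in the model.
premise = [Token(t) for t in ["a", "cat"]]
hypothesis = [Token(t) for t in ["an", "animal"]]
pair_instance = {"premise": TextField(premise, indexers),
                 "hypothesis": TextField(hypothesis, indexers)}
```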

(Resolved review comment on allennlp/data/tokenizers/tokenizer.py)
@MaksymDel (Contributor, Author) commented Nov 17, 2019

I addressed the feedback and also moved BERT et al. vocabulary creation inside the Vocabulary constructor, which fixes #3097 (a rough sketch of the idea is below).
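
A loose sketch, assuming a BERT-style huggingface tokenizer with a `.vocab` mapping; the namespace name is hypothetical and this is not the PR's actual implementation:

```python
# A loose sketch (not the PR's exact implementation) of filling an AllenNLP
# Vocabulary namespace so its ids match a pretrained transformer's vocabulary.
from allennlp.data import Vocabulary
from transformers import AutoTokenizer

transformer_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A non-padded namespace avoids AllenNLP inserting its own padding/OOV entries
# that would shift the pretrained ids; the namespace name is hypothetical.
vocab = Vocabulary(non_padded_namespaces=["transformer_tags"])
# `.vocab` is assumed here for BERT-style tokenizers; writing straight into the
# namespace dictionaries keeps the indices identical to the pretrained model's.
for token, index in transformer_tokenizer.vocab.items():
    vocab._token_to_index["transformer_tags"][token] = index
    vocab._index_to_token["transformer_tags"][index] = token
```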

There are 2 failing tests that do not appear to be related to the code I changed (please tell me if they are, since all my local pytest checks pass).

So I believe this PR is ready for review.

@matt-gardner (Contributor) left a comment


Looking very close, thanks for all of your work here!

The failing tests look like they come from a spacy update. I'm not sure how to handle that failure, though. @DeNeutoy, any ideas?

(Resolved review comment on allennlp/data/vocabulary.py)
@MaksymDel (Contributor, Author)

I addressed the feedback. All the relevant tests pass!

@matt-gardner (Contributor) left a comment


Thanks @maksym-del!

@matt-gardner merged commit bb1b1c6 into allenai:master on Nov 19, 2019
@MaksymDel (Contributor, Author)

Thanks!

Linked issue: Improve getting start, end and intermediate tokens in pretrained_transformer_tokenizer