
Rework Pretrained Tokenizer and Indexer #3453

Merged: 17 commits into allenai:master on Nov 19, 2019

Conversation

@MaksymDel (Contributor) commented Nov 14, 2019

The idea is to both tokenize and index tokens inside the tokenizer, storing the token ids on the Token data structure. In this setup, PretrainedTokenIndexer only needs to pick up the correct token id and create the AllenNLP vocabulary from the pretrained one, so we subclass SingleIdTokenIndexer.
To handle special tokens, we first index the string with the huggingface method that also adds the special token ids, and then convert those indices back to string tokens (to be stored on the AllenNLP Tokens).
This feels like the cleanest overall approach to me.
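
A minimal sketch of that flow, assuming the huggingface `transformers` API; the helper name `tokenize_and_index` and the model name are illustrative, not the PR's actual code:

```python
# A minimal sketch (illustrative names, not the PR's actual code) of tokenizing
# and indexing in one pass with a huggingface tokenizer.
from transformers import AutoTokenizer

def tokenize_and_index(text: str, model_name: str = "bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Index the raw string first so huggingface also inserts the special token
    # ids ([CLS], [SEP], ...), then convert the ids back to string tokens.
    token_ids = tokenizer.encode(text, add_special_tokens=True)
    token_strings = tokenizer.convert_ids_to_tokens(token_ids)
    # Each (string, id) pair would be stored on a single AllenNLP Token, so the
    # indexer later only has to read the id back out.
    return list(zip(token_strings, token_ids))
```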

We also add a token_type_id field to the Token, which can now be computed automatically with huggingface's code. I do not use it anywhere yet, since it is not used by recent transformers like XLM or RoBERTa, but a future PR can make better use of it.
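
For illustration, this is roughly how token type ids can be obtained from a huggingface tokenizer for a sentence pair (a hedged sketch, not the code added in this PR):

```python
# Illustrative only: obtaining token_type_ids for a sentence pair from a
# huggingface tokenizer (not the exact code added in this PR).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode_plus("A cat sat.", "It was tired.", add_special_tokens=True)
# 0s mark tokens of the first segment, 1s mark tokens of the second segment.
print(encoded["token_type_ids"])
```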

Fixes #3356 and answers the question in #3224.

@MaksymDel (Contributor, Author)

This one is ready to be reviewed.

@matt-gardner (Contributor) left a comment


Thanks, this looks much better. There's just one small design issue I'm still concerned about, mentioned in the comments.

@matt-gardner (Contributor) left a comment


This looks good to me at this point, but it would be nice to hear from @DeNeutoy or @brendan-ai2 on the bigger question about tokenize_sentences before merging, if either of you has an opinion.

@DeNeutoy (Contributor) left a comment


I get the idea of wanting to be able to switch between two types of tokenizers for sentence pairs, but I'm not sure you actually get the desired behaviour. For example, suppose you are doing sentence-pair classification: with BERT you use a single TextField which contains both sentences and the separators, whereas with another tokenizer you use two TextFields, one for each sentence. So being able to switch tokenizers in the two-sentence case is not particularly useful, as you have to change your input representation anyway.

It is clearly useful to be able to switch the tokenizer for single sentences, which this PR achieves nicely. I just think we can remove the tokenize_sentences method entirely.
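
For context, here is a hypothetical sketch of the two input representations contrasted above; the field names and toy tokens are illustrative, not taken from this PR:

```python
# A hypothetical sketch of the two input representations for sentence pairs;
# field names and toy tokens are illustrative, not taken from this PR.
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

indexers = {"tokens": SingleIdTokenIndexer()}

# BERT-style: one TextField that already holds both sentences plus separators.
joint = [Token(t) for t in ["[CLS]", "a", "cat", "[SEP]", "an", "animal", "[SEP]"]]
bert_instance = {"tokens": TextField(joint, indexers)}

# Other tokenizers: one TextField per sentence, combined later in the model.
premise = [Token(t) for t in ["a", "cat"]]
hypothesis = [Token(t) for t in ["an", "animal"]]
pair_instance = {"premise": TextField(premise, indexers),
                 "hypothesis": TextField(hypothesis, indexers)}
```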

(Resolved review comment on allennlp/data/tokenizers/tokenizer.py)
@MaksymDel (Contributor, Author) commented Nov 17, 2019

I addressed the feedback and also moved BERT et al. vocabulary creation inside the Vocabulary constructor, which fixes #3097 (a rough sketch of the idea is below).
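
A loose sketch, assuming a BERT-style huggingface tokenizer with a `.vocab` mapping; the namespace name is hypothetical and this is not the PR's actual implementation:

```python
# A loose sketch (not the PR's exact implementation) of filling an AllenNLP
# Vocabulary namespace so its ids match a pretrained transformer's vocabulary.
from allennlp.data import Vocabulary
from transformers import AutoTokenizer

transformer_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A non-padded namespace avoids AllenNLP inserting its own padding/OOV entries
# that would shift the pretrained ids; the namespace name is hypothetical.
vocab = Vocabulary(non_padded_namespaces=["transformer_tags"])
# `.vocab` is assumed here for BERT-style tokenizers; writing straight into the
# namespace dictionaries keeps the indices identical to the pretrained model's.
for token, index in transformer_tokenizer.vocab.items():
    vocab._token_to_index["transformer_tags"][token] = index
    vocab._index_to_token["transformer_tags"][index] = token
```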

There are 2 failing tests that do not appear to be related to the code I changed (please tell me if they are, since all my local pytest checks pass).

So I believe this PR is ready for review.

@matt-gardner (Contributor) left a comment


Looking very close, thanks for all of your work here!

The failing tests look like they come from a spacy update. I'm not sure how to handle that failure, though. @DeNeutoy, any ideas?

(Resolved review comment on allennlp/data/vocabulary.py)
@MaksymDel (Contributor, Author)

I addressed the feedback. All the relevant tests pass!

@matt-gardner (Contributor) left a comment


Thanks @maksym-del!

@matt-gardner merged commit bb1b1c6 into allenai:master on Nov 19, 2019
@MaksymDel (Contributor, Author)

Thanks!

Linked issue: Improve getting start, end and intermediate tokens in pretrained_transformer_tokenizer