
Pretrained vocabulary from transformers is sometimes not saved in our Vocabulary object #3456

Closed
MaksymDel opened this issue Nov 15, 2019 · 3 comments · Fixed by #4019

Comments

@MaksymDel
Contributor

We only add tokens from the pretrained vocabulary of the transformers library if the tokenizer has a vocab attribute:

if not self._added_to_vocabulary and hasattr(self.tokenizer, "vocab"):

This is not the case for all models in the library, and I did not find a unified get_vocab method in the huggingface code.
The tokens will still be indexed correctly; we just will not have them in our Vocabulary object, with all the consequences that follow.

In #3453 I will add support for RoBERTa and XLM, which have an encoder attribute instead of a vocab attribute, but I do not know about other models. A sketch of the fallback I have in mind is below.
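
Roughly something like this (a minimal sketch, not the exact code in #3453; it assumes an indexer method with self.tokenizer, self._namespace, and self._added_to_vocabulary attributes, as in the line quoted above):

```python
# Sketch: copy the pretrained vocabulary into our Vocabulary, falling back
# to `encoder` for models such as RoBERTa/XLM that have no `vocab` attribute.
def _add_encoding_to_vocabulary(self, vocabulary) -> None:
    if self._added_to_vocabulary:
        return
    if hasattr(self.tokenizer, "vocab"):
        pretrained_vocab = self.tokenizer.vocab      # e.g. BERT
    elif hasattr(self.tokenizer, "encoder"):
        pretrained_vocab = self.tokenizer.encoder    # e.g. RoBERTa, XLM
    else:
        return  # unknown tokenizer layout; tokens are still indexed correctly
    for token, index in pretrained_vocab.items():
        vocabulary._token_to_index[self._namespace][token] = index
        vocabulary._index_to_token[self._namespace][index] = token
    self._added_to_vocabulary = True
```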

@matt-gardner
Contributor

See also #3097 on issues about when vocabularies are actually written to disk. This is a separate issue, though, right?

@MaksymDel
Contributor Author

Yes, I think they are separate. This one is about correctly getting the BERT vocab into our Vocabulary object, while #3097 is about saving the vocab to disk.

@MaksymDel MaksymDel changed the title Pretrained vocabulary from transformers is sometimes not saved Pretrained vocabulary from transformers is sometimes not saved in our Vocabulary object Nov 17, 2019
@MaksymDel
Contributor Author

MaksymDel commented Mar 28, 2020

The newest version of transformers now includes a get_vocab method on the tokenizers, which can be used to retrieve the full vocabulary from any tokenizer. Using it should fix this issue.
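
For example, the indexer could then do something along these lines (a sketch, assuming a transformers version whose tokenizers expose get_vocab(), and the same indexer attributes as above):

```python
# Sketch of the fix using the unified API: get_vocab() returns a
# token -> index dict for every tokenizer, so no per-model
# `vocab` / `encoder` special-casing is needed anymore.
def _add_encoding_to_vocabulary(self, vocabulary) -> None:
    if self._added_to_vocabulary:
        return
    for token, index in self.tokenizer.get_vocab().items():
        vocabulary._token_to_index[self._namespace][token] = index
        vocabulary._index_to_token[self._namespace][index] = token
    self._added_to_vocabulary = True
```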

Edit: this is no longer blocked since #4018.
