
Pretrained vocabulary from transformers is sometimes not saved in our Vocabulary object #3456

Closed
MaksymDel opened this issue Nov 15, 2019 · 3 comments · Fixed by #4019

Comments

@MaksymDel
Contributor

We only add tokens from the pretrained vocabulary of the transformers library if the tokenizer has a vocab attribute:

if not self._added_to_vocabulary and hasattr(self.tokenizer, "vocab"):

This is not the case for all models in the library, and I did not find a unified get_vocab method in the huggingface code.
The tokens will still be indexed correctly; we just will not have them in our Vocabulary object, with all the consequences that follow.

In #3453 I will add support for RoBERTa and XLM, which have an encoder attribute instead of a vocab attribute, but I do not know about other models. A sketch of the fallback I have in mind is below.
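
Roughly something like this (a minimal sketch, not the exact code in #3453; it assumes an indexer method with self.tokenizer, self._namespace, and self._added_to_vocabulary attributes, as in the line quoted above):

```python
# Sketch: copy the pretrained vocabulary into our Vocabulary, falling back
# to `encoder` for models such as RoBERTa/XLM that have no `vocab` attribute.
def _add_encoding_to_vocabulary(self, vocabulary) -> None:
    if self._added_to_vocabulary:
        return
    if hasattr(self.tokenizer, "vocab"):
        pretrained_vocab = self.tokenizer.vocab      # e.g. BERT
    elif hasattr(self.tokenizer, "encoder"):
        pretrained_vocab = self.tokenizer.encoder    # e.g. RoBERTa, XLM
    else:
        return  # unknown tokenizer layout; tokens are still indexed correctly
    for token, index in pretrained_vocab.items():
        vocabulary._token_to_index[self._namespace][token] = index
        vocabulary._index_to_token[self._namespace][index] = token
    self._added_to_vocabulary = True
```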

@matt-gardner
Contributor

See also #3097 on issues about when vocabularies are actually written to disk. This is a separate issue, though, right?

@MaksymDel
Contributor Author

Yes, I think they are separate. This one is about correctly getting the BERT vocab into our Vocabulary object, while #3097 is about saving the vocab to disk.

@MaksymDel MaksymDel changed the title Pretrained vocabulary from transformers is sometimes not saved Pretrained vocabulary from transformers is sometimes not saved in our Vocabulary object Nov 17, 2019
@MaksymDel
Contributor Author

MaksymDel commented Mar 28, 2020

The newest version of transformers now includes a get_vocab method on the tokenizers, which can be used to retrieve the full vocabulary from any tokenizer. Using it should fix this issue.
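
For example, the indexer could then do something along these lines (a sketch, assuming a transformers version whose tokenizers expose get_vocab(), and the same indexer attributes as above):

```python
# Sketch of the fix using the unified API: get_vocab() returns a
# token -> index dict for every tokenizer, so no per-model
# `vocab` / `encoder` special-casing is needed anymore.
def _add_encoding_to_vocabulary(self, vocabulary) -> None:
    if self._added_to_vocabulary:
        return
    for token, index in self.tokenizer.get_vocab().items():
        vocabulary._token_to_index[self._namespace][token] = index
        vocabulary._index_to_token[self._namespace][index] = token
    self._added_to_vocabulary = True
```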

Edit: this is no longer blocked since #4018.
