
Vocabulary mismatch when using pre-trained transformers #3434

Closed
nicola-decao opened this issue Nov 6, 2019 · 2 comments · Fixed by #3446

Comments

@nicola-decao
Contributor

The default padding and OOV tokens are hardcoded in the Vocabulary class:

self._padding_token = DEFAULT_PADDING_TOKEN
self._oov_token = DEFAULT_OOV_TOKEN

so when using any pre-trained token embedder whose special tokens are defined differently (e.g., the OOV token is [UNK] instead of @@UNKNOWN@@), it breaks.

I would add padding_token and oov_token as parameters of the Vocabulary class initializer, but I am not sure whether this would introduce inconsistencies. A rough sketch of what I have in mind is below.
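Roughly like this, as a trimmed-down sketch (the real `__init__` takes many more parameters, which I've elided here, and the keyword names are only a proposal):

```python
from typing import Optional

# Existing module-level constants in allennlp/data/vocabulary.py
DEFAULT_PADDING_TOKEN = "@@PADDING@@"
DEFAULT_OOV_TOKEN = "@@UNKNOWN@@"


class Vocabulary:
    def __init__(
        self,
        # ... existing parameters elided ...
        padding_token: Optional[str] = None,
        oov_token: Optional[str] = None,
    ) -> None:
        # Fall back to the current defaults so existing configs and
        # serialized vocabularies keep working unchanged.
        self._padding_token = padding_token if padding_token is not None else DEFAULT_PADDING_TOKEN
        self._oov_token = oov_token if oov_token is not None else DEFAULT_OOV_TOKEN
```

Then something like `Vocabulary(padding_token="[PAD]", oov_token="[UNK]")` would line up with BERT's wordpiece vocabulary.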

@matt-gardner
Contributor

I think this would be ok. It might also help with #3097. PR welcome! Just be sure that it doesn't break any tests, and that you add sufficient tests for the new functionality.
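Something along these lines would be a reasonable starting point for the tests (just a sketch, assuming the keyword arguments proposed above; the exact assertions depend on how the parameters end up wired in):

```python
from allennlp.data import Vocabulary


def test_custom_padding_and_oov_tokens():
    # Assumes the proposed `padding_token` / `oov_token` keyword arguments exist.
    vocab = Vocabulary(padding_token="[PAD]", oov_token="[UNK]")
    vocab.add_token_to_namespace("the", namespace="tokens")

    # Unseen words should map to the custom OOV token's index.
    unk_index = vocab.get_token_index("[UNK]", namespace="tokens")
    assert vocab.get_token_index("never-seen-before", namespace="tokens") == unk_index

    # The custom padding token should be registered instead of @@PADDING@@.
    assert vocab.get_token_from_index(0, namespace="tokens") == "[PAD]"
```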

@nicola-decao
Contributor Author

I will look into this and make a PR over the weekend.
