
Vocabulary mismatch when using pre-trained transformers #3434

Closed
nicola-decao opened this issue Nov 6, 2019 · 2 comments · Fixed by #3446

Comments

@nicola-decao
Contributor

The default padding and OOV tokens are hardcoded in the Vocabulary class:

self._padding_token = DEFAULT_PADDING_TOKEN
self._oov_token = DEFAULT_OOV_TOKEN

so when using any pre-trained token embedder whose special tokens are defined differently (e.g., the OOV token is [UNK] instead of @@UNKNOWN@@), it breaks.

I would add padding_token and oov_token as parameters of the Vocabulary class initializer, but I am not sure whether this would introduce inconsistencies. A rough sketch of what I have in mind is below.
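Roughly like this, as a trimmed-down sketch (the real `__init__` takes many more parameters, which I've elided here, and the keyword names are only a proposal):

```python
from typing import Optional

# Existing module-level constants in allennlp/data/vocabulary.py
DEFAULT_PADDING_TOKEN = "@@PADDING@@"
DEFAULT_OOV_TOKEN = "@@UNKNOWN@@"


class Vocabulary:
    def __init__(
        self,
        # ... existing parameters elided ...
        padding_token: Optional[str] = None,
        oov_token: Optional[str] = None,
    ) -> None:
        # Fall back to the current defaults so existing configs and
        # serialized vocabularies keep working unchanged.
        self._padding_token = padding_token if padding_token is not None else DEFAULT_PADDING_TOKEN
        self._oov_token = oov_token if oov_token is not None else DEFAULT_OOV_TOKEN
```

Then something like `Vocabulary(padding_token="[PAD]", oov_token="[UNK]")` would line up with BERT's wordpiece vocabulary.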

@matt-gardner
Contributor

I think this would be ok. It might also help with #3097. PR welcome! Just be sure that it doesn't break any tests, and that you add sufficient tests for the new functionality.
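Something along these lines would be a reasonable starting point for the tests (just a sketch, assuming the keyword arguments proposed above; the exact assertions depend on how the parameters end up wired in):

```python
from allennlp.data import Vocabulary


def test_custom_padding_and_oov_tokens():
    # Assumes the proposed `padding_token` / `oov_token` keyword arguments exist.
    vocab = Vocabulary(padding_token="[PAD]", oov_token="[UNK]")
    vocab.add_token_to_namespace("the", namespace="tokens")

    # Unseen words should map to the custom OOV token's index.
    unk_index = vocab.get_token_index("[UNK]", namespace="tokens")
    assert vocab.get_token_index("never-seen-before", namespace="tokens") == unk_index

    # The custom padding token should be registered instead of @@PADDING@@.
    assert vocab.get_token_from_index(0, namespace="tokens") == "[PAD]"
```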

@nicola-decao
Contributor Author

I will look into this and make a PR over the weekend.
