
Adding padding_token and oov_token as parameters of Vocabulary #3446

Merged: 3 commits into allenai:master on Nov 13, 2019

Conversation

@nicola-decao (Contributor) commented on Nov 12, 2019

I added padding_token and oov_token as parameters of the Vocabulary class initialiser.

This is useful when using a pre-trained token embedder whose special tokens are defined differently (e.g., the OOV token is [UNK] instead of @@UNKNOWN@@).
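
A minimal sketch of the new parameters in use (assuming an AllenNLP version that includes this change; the example tokens are illustrative):

from allennlp.data import Vocabulary

# Build a vocabulary whose special tokens match a wordpiece-style vocabulary
# instead of the defaults ("@@PADDING@@" / "@@UNKNOWN@@").
vocab = Vocabulary(padding_token="[PAD]", oov_token="[UNK]")
vocab.add_token_to_namespace("hello", namespace="tokens")

# Unseen tokens now resolve to the index of "[UNK]".
unk_index = vocab.get_token_index("never-seen-before", namespace="tokens")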

Fixes #3434. Partial fix for #3097.

@nicola-decao changed the title from "Fix vocab" to "Adding padding_token and oov_token as parameters of Vocabulary" on Nov 12, 2019
@matt-gardner (Contributor) left a comment


Thanks, this is great! Can you add a test that reads a vocab from a file and makes sure that the OOV token is handled correctly?
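
A hedged sketch of the kind of test being requested (assuming Vocabulary.save_to_files / Vocabulary.from_files round-trip the vocabulary, and that from_files also accepts the new parameters):

import tempfile
from allennlp.data import Vocabulary

def test_oov_token_survives_file_round_trip():
    vocab = Vocabulary(padding_token="[PAD]", oov_token="[UNK]")
    vocab.add_token_to_namespace("hello", namespace="tokens")
    with tempfile.TemporaryDirectory() as vocab_dir:
        vocab.save_to_files(vocab_dir)
        loaded = Vocabulary.from_files(vocab_dir, padding_token="[PAD]", oov_token="[UNK]")
    # An unseen token should map to "[UNK]", not the default "@@UNKNOWN@@".
    assert loaded.get_token_index("never-seen", namespace="tokens") == loaded.get_token_index("[UNK]", namespace="tokens")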

@matt-gardner (Contributor) left a comment


Thanks again!

@matt-gardner merged commit 5238561 into allenai:master on Nov 13, 2019
@nicola-decao deleted the fix_vocab branch on Nov 18, 2019
@NicolasAG

Hi,
This is a great feature, thanks @nicola-decao for making this available.
However, is it possible to do that using a config file? My current setup is the following:

{
    "dataset_reader": {
        "type": "my-reader",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "t5-base",
        },
        "source_token_indexers": {
            "pretrained_transformer": {
                "type": "pretrained_transformer",
                "model_name": "t5-base",
            },
        },
    },
}

Then, in my DatasetReader class, I have access to all the special tokens of the pre-trained T5 tokenizer:

logger.info(f"tokenizer start token: {source_tokenizer.tokenizer.bos_token}")
logger.info(f"tokenizer end token: {source_tokenizer.tokenizer.eos_token}")
logger.info(f"tokenizer unk token: {source_tokenizer.tokenizer.unk_token}")
logger.info(f"tokenizer pad token: {source_tokenizer.tokenizer.pad_token}")

What is the best way from here to set the padding_token and oov_token of the Vocabulary?

Thanks.

@nicola-decao (Contributor, Author) commented on Jan 14, 2021

It has been a year since I last used this feature, but you should be able to specify it in the configuration when you define the vocabulary. Something like:

vocabulary: {
    type: "from_files",
    directory: "/some/dir/",
    oov_token: "[UNK]",
    padding_token: "[PAD]"
}
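
For reference, a hedged sketch (not from the thread) of the programmatic equivalent, reusing the special tokens exposed by the pre-trained tokenizer from the config above:

from allennlp.data import Vocabulary
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# The underlying HuggingFace tokenizer knows its own special tokens,
# so they can be passed straight through to the Vocabulary.
source_tokenizer = PretrainedTransformerTokenizer(model_name="t5-base")
vocab = Vocabulary(
    padding_token=source_tokenizer.tokenizer.pad_token,
    oov_token=source_tokenizer.tokenizer.unk_token,
)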

@NicolasAG

Ah, sorry, I'm still new to AllenNLP and didn't know we could specify "vocabulary": {...} in the config file... 🤦‍♂️
Thanks!
