
Adding padding_token and oov_token as parameters of Vocabulary #3446

Merged: 3 commits into allenai:master on Nov 13, 2019

Conversation

@nicola-decao (Contributor) commented on Nov 12, 2019

I added padding_token and oov_token as parameters of the Vocabulary class initialiser.

This is useful when using a pre-trained token embedder whose special tokens are defined differently (e.g., the OOV token is [UNK] instead of @@UNKNOWN@@).
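
A minimal sketch of the new parameters in use (assuming an AllenNLP version that includes this change; the example tokens are illustrative):

from allennlp.data import Vocabulary

# Build a vocabulary whose special tokens match a wordpiece-style vocabulary
# instead of the defaults ("@@PADDING@@" / "@@UNKNOWN@@").
vocab = Vocabulary(padding_token="[PAD]", oov_token="[UNK]")
vocab.add_token_to_namespace("hello", namespace="tokens")

# Unseen tokens now resolve to the index of "[UNK]".
unk_index = vocab.get_token_index("never-seen-before", namespace="tokens")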

Fixes #3434. Partial fix for #3097.

@nicola-decao changed the title from "Fix vocab" to "Adding padding_token and oov_token as parameters of Vocabulary" on Nov 12, 2019
@matt-gardner (Contributor) left a comment


Thanks, this is great! Can you add a test that reads a vocab from a file and makes sure that the OOV token is handled correctly?
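
A hedged sketch of the kind of test being requested (assuming Vocabulary.save_to_files / Vocabulary.from_files round-trip the vocabulary, and that from_files also accepts the new parameters):

import tempfile
from allennlp.data import Vocabulary

def test_oov_token_survives_file_round_trip():
    vocab = Vocabulary(padding_token="[PAD]", oov_token="[UNK]")
    vocab.add_token_to_namespace("hello", namespace="tokens")
    with tempfile.TemporaryDirectory() as vocab_dir:
        vocab.save_to_files(vocab_dir)
        loaded = Vocabulary.from_files(vocab_dir, padding_token="[PAD]", oov_token="[UNK]")
    # An unseen token should map to "[UNK]", not the default "@@UNKNOWN@@".
    assert loaded.get_token_index("never-seen", namespace="tokens") == loaded.get_token_index("[UNK]", namespace="tokens")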

@matt-gardner (Contributor) left a comment


Thanks again!

@matt-gardner merged commit 5238561 into allenai:master on Nov 13, 2019
@nicola-decao deleted the fix_vocab branch on Nov 18, 2019
@NicolasAG

Hi,
This is a great feature, thanks @nicola-decao for making this available.
However, is it possible to do that using a config file? My current setup is the following:

{
    "dataset_reader": {
        "type": "my-reader",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "t5-base",
        },
        "source_token_indexers": {
            "pretrained_transformer": {
                "type": "pretrained_transformer",
                "model_name": "t5-base",
            },
        },
    },
}

Then, in my DatasetReader class, I have access to all the special tokens of the pre-trained T5 tokenizer:

logger.info(f"tokenizer start token: {source_tokenizer.tokenizer.bos_token}")
logger.info(f"tokenizer end token: {source_tokenizer.tokenizer.eos_token}")
logger.info(f"tokenizer unk token: {source_tokenizer.tokenizer.unk_token}")
logger.info(f"tokenizer pad token: {source_tokenizer.tokenizer.pad_token}")

What is the best way from here to set the padding_token and oov_token of the Vocabulary?

Thanks.

@nicola-decao (Contributor, Author) commented on Jan 14, 2021

It has been a year since I last used this feature, but you should be able to specify it in the configuration when you define the vocabulary. Something like:

vocabulary: {
    type: "from_files",
    directory: "/some/dir/",
    oov_token: "[UNK]",
    padding_token: "[PAD]"
}
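
For reference, a hedged sketch (not from the thread) of the programmatic equivalent, reusing the special tokens exposed by the pre-trained tokenizer from the config above:

from allennlp.data import Vocabulary
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# The underlying HuggingFace tokenizer knows its own special tokens,
# so they can be passed straight through to the Vocabulary.
source_tokenizer = PretrainedTransformerTokenizer(model_name="t5-base")
vocab = Vocabulary(
    padding_token=source_tokenizer.tokenizer.pad_token,
    oov_token=source_tokenizer.tokenizer.unk_token,
)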

@NicolasAG

Ah, sorry, I'm still new to AllenNLP and didn't know we could specify "vocabulary": {...} in the config file... 🤦‍♂️
Thanks!
