
Option to extend the vocabulary of TokenEmbedder and Tokenizer. #4397

Open
dhruvdcoder opened this issue Jun 24, 2020 · 1 comment


@dhruvdcoder

Is your feature request related to a problem? Please describe.

I wish to use a pre-trained embedding layer in my model, with the option to extend its vocabulary before fine-tuning. Currently, there is no way to do this from a config file unless I write my own train command. Such a feature would be really useful when fine-tuning a general Transformer in a PretrainedTransformerEmbedder layer for a specific domain.

Ideally, this option should be available on any TokenEmbedder, with the details of how to extend the vocabulary left to the concrete class. Depending on the embedder, we might also need to pass the extra words to the tokenizer. As far as PretrainedTransformerEmbedder goes, the Huggingface APIs offer a way to extend the vocabulary of both the tokenizer and the model (ref: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.add_tokens ).
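
For reference, a minimal sketch of what that looks like with the Huggingface APIs (the model name and new tokens below are placeholders):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# add_tokens returns the number of tokens actually added
# (tokens already in the vocabulary are skipped).
num_added = tokenizer.add_tokens(["new-domain-word-1", "new-domain-word-2"])

# Grow the model's embedding matrix to match the new vocabulary size;
# the new rows are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
```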

Describe the solution you'd like
The easiest way I can think of to achieve this would be to add an extra parameter to the constructors of TokenEmbedder and Tokenizer that points to an extra vocab file (which could optionally also contain weights to initialize the new tokens' embeddings).
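
To make the idea concrete, here is a hypothetical helper illustrating the file format and what an embedder could do with it. Neither `extend_from_vocab_file` nor the format is an existing API; this is just a sketch built on the Huggingface calls above:

```python
import torch

def extend_from_vocab_file(tokenizer, model, extra_vocab_file: str) -> None:
    """Hypothetical: each line holds one token, optionally followed by
    whitespace-separated floats to initialize its embedding row."""
    tokens, init_weights = [], {}
    with open(extra_vocab_file) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            tokens.append(parts[0])
            if len(parts) > 1:
                init_weights[parts[0]] = [float(x) for x in parts[1:]]

    tokenizer.add_tokens(tokens)
    model.resize_token_embeddings(len(tokenizer))

    # Overwrite the randomly initialized rows wherever weights were given.
    embeddings = model.get_input_embeddings()
    with torch.no_grad():
        for token, vector in init_weights.items():
            idx = tokenizer.convert_tokens_to_ids(token)
            embeddings.weight[idx] = torch.tensor(vector)
```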

Describe alternatives you've considered
I have not really considered an alternative approach; I can come up with one if the proposed approach does not seem reasonable.

@epwalsh
Member

epwalsh commented Jun 26, 2020

We should try to figure this out when / after we rethink our vocab. See #3097.

But in the meantime if there's a quick fix for this, we are open to that.
