
Option to extend the vocabulary of TokenEmbedder and Tokenizer. #4397

Open
dhruvdcoder opened this issue Jun 24, 2020 · 1 comment


@dhruvdcoder

Is your feature request related to a problem? Please describe.

I wish to use a pre-trained embedding layer in my model, with the option to extend its vocabulary before fine-tuning. Currently, there is no way to do this from a config file unless I write my own train command. Such a feature would be really useful when fine-tuning a general Transformer in a PretrainedTransformerEmbedder layer for a specific domain.

Ideally, this option should be available on any TokenEmbedder, with the details of how to extend the vocabulary left to the concrete class. Depending on the embedder, we might also need to pass the extra words to the tokenizer. As far as PretrainedTransformerEmbedder goes, the Huggingface APIs offer a way to extend the vocabulary of both the tokenizer and the model (ref: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.add_tokens ).
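
For reference, a minimal sketch of what that looks like with the Huggingface APIs (the model name and new tokens below are placeholders):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# add_tokens returns the number of tokens actually added
# (tokens already in the vocabulary are skipped).
num_added = tokenizer.add_tokens(["new-domain-word-1", "new-domain-word-2"])

# Grow the model's embedding matrix to match the new vocabulary size;
# the new rows are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
```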

Describe the solution you'd like
The easiest way I can think of to achieve this would be to add an extra parameter to the constructors of TokenEmbedder and Tokenizer that points to an extra vocab file (which could optionally also contain weights to initialize the new tokens' embeddings).
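
To make the idea concrete, here is a hypothetical helper illustrating the file format and what an embedder could do with it. Neither `extend_from_vocab_file` nor the format is an existing API; this is just a sketch built on the Huggingface calls above:

```python
import torch

def extend_from_vocab_file(tokenizer, model, extra_vocab_file: str) -> None:
    """Hypothetical: each line holds one token, optionally followed by
    whitespace-separated floats to initialize its embedding row."""
    tokens, init_weights = [], {}
    with open(extra_vocab_file) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            tokens.append(parts[0])
            if len(parts) > 1:
                init_weights[parts[0]] = [float(x) for x in parts[1:]]

    tokenizer.add_tokens(tokens)
    model.resize_token_embeddings(len(tokenizer))

    # Overwrite the randomly initialized rows wherever weights were given.
    embeddings = model.get_input_embeddings()
    with torch.no_grad():
        for token, vector in init_weights.items():
            idx = tokenizer.convert_tokens_to_ids(token)
            embeddings.weight[idx] = torch.tensor(vector)
```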

Describe alternatives you've considered
I have not really considered an alternative approach; I can come up with one if the proposed approach does not seem reasonable.

@epwalsh
Member

epwalsh commented Jun 26, 2020

We should try to figure this out when / after we rethink our vocab. See #3097.

But in the meantime if there's a quick fix for this, we are open to that.
