Is your feature request related to a problem? Please describe.
I would like to use a pre-trained embedding layer in my model, with optional vocabulary extension, before fine-tuning. Currently, there is no way to do this from a config file unless I write my own train command. Such a feature would be very useful when fine-tuning a general Transformer in a PretrainedTransformerEmbedder layer for a specific domain.
Ideally, this should be available for any TokenEmbedder, with the details of how to extend the vocabulary left to the concrete class. Depending on the embedder, we might also want to pass the extra words to the tokenizer. As far as PretrainedTransformerEmbedder goes, the Hugging Face APIs offer a way to extend the vocabulary of both the tokenizer and the model (ref: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.add_tokens ).
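For concreteness, a minimal sketch of how this looks with the Hugging Face API alone (the model name and token list are placeholders for illustration; AllenNLP would still need to wire this into PretrainedTransformerEmbedder):

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder model name and domain-specific tokens, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# add_tokens skips tokens already in the vocabulary and returns
# the number of tokens that were actually added.
num_added = tokenizer.add_tokens(["myocarditis", "angioplasty"])

# Resize the embedding matrix so it covers the enlarged vocabulary;
# the rows for the new tokens are freshly initialized.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```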
Describe the solution you'd like
The easiest way I can think of to achieve this would be to add an extra parameter to the constructors of TokenEmbedder and Tokenizer that points to an extra vocab file (which could optionally also contain weights to initialize the new vocabulary tokens for the embedder).
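A rough sketch of what that could look like for the transformer case, assuming a hypothetical `extra_vocab_file` constructor argument (the class name and file format below are illustrative, not existing AllenNLP API; the file is assumed to hold one token per line):

```python
from typing import Optional

from transformers import AutoModel, AutoTokenizer


class PretrainedTransformerEmbedderSketch:
    """Illustrative only: extends the wrapped transformer's vocabulary
    from an extra vocab file before fine-tuning."""

    def __init__(self, model_name: str, extra_vocab_file: Optional[str] = None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.transformer_model = AutoModel.from_pretrained(model_name)

        if extra_vocab_file is not None:
            # Read the extra tokens and grow both the tokenizer vocabulary
            # and the model's embedding matrix accordingly.
            with open(extra_vocab_file) as f:
                extra_tokens = [line.strip() for line in f if line.strip()]
            num_added = self.tokenizer.add_tokens(extra_tokens)
            if num_added > 0:
                self.transformer_model.resize_token_embeddings(len(self.tokenizer))
```

In a config file, this would then just be one extra key on the embedder, e.g. `"extra_vocab_file": "data/domain_vocab.txt"` (again hypothetical).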
Describe alternatives you've considered
I have not really considered an alternative approach; I can come up with one if the proposed approach does not seem reasonable.