Train a model with transformer embeddings and additional_special_tokens #4690
@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜
@dirkgr ? 😅
@pvcastro I've managed to get something approximately working by writing a wrapper Model class for the HuggingFace model that calls the HF function model.resize_token_embeddings(new_vocab_size) at the end of the constructor, as well as forcibly creating a custom AllenNLP vocab and a HuggingFace vocab. My use case is BART, so I've modified the bart.py file in allennlp-models. I'm now verifying that this doesn't remove/break the pretrained weight initialisation. I'm not sure if this is useful for you - but let me know if it helps!
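A minimal sketch of the kind of wrapper described above, using plain HuggingFace transformers (the class name and structure are illustrative assumptions, not the actual bart.py changes):

```python
# Illustrative sketch only: a wrapper that resizes a HuggingFace model's
# embeddings after special tokens have been added to its tokenizer.
from transformers import AutoModel, AutoTokenizer


class ResizedTransformer:
    """Hypothetical wrapper, not the actual allennlp-models code."""

    def __init__(self, model_name: str, additional_special_tokens: list):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Register the new special tokens with the tokenizer.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": additional_special_tokens}
        )
        self.model = AutoModel.from_pretrained(model_name)
        # Grow the embedding matrix to cover the enlarged vocabulary;
        # pretrained rows are kept, new rows are freshly initialised.
        self.model.resize_token_embeddings(len(self.tokenizer))
```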
hi @tomsherborne !
@pvcastro yes, you could do that. I found it more stable/manageable to manually create an instance of the relevant HF tokenizer, use
I see. I guess I'll do the same then, until this is supported by the library.
@pvcastro just FYI, Dirk is currently on vacation and won't be back for another week. @tomsherborne it seems like it would be useful to have a vocabulary that can be configured like this:

```
"vocabulary": {
    "type": "from_pretrained_transformer",
    "model_name": "bert-base-cased"
}
```

This feature request has come up elsewhere as well (see here, for example).
@epwalsh I may have some aspects of the vocabulary API wrong here, but I understood that a specific type was unnecessary right now because the
@tomsherborne thanks for pointing that out, I actually wasn't aware of that. Hmm, do you need that for your use case? Right now the
Hi, I have the exact same issue as originally stated: trying to add special tokens to the T5 pretrained tokenizer with this extra argument in my tokenizer config:
but that fails because the list type is unhashable. However, I am not sure I understand the trick of creating a Model wrapper, a new Vocab file, etc., as suggested by @tomsherborne. I guess another solution could be to:
and then (2), when creating the model, resize the BasicTextFieldEmbedder.token_embedder dimension like this:
This is because I have the following config for my model, in which I use the pretrained T5 encoder as an embedder:
Would that work? Will the vocab passed to the model constructor have the additional tokens....?
Hi @NicolasAG. I ended up doing something similar. In my custom reader class I did this in init:
In the model I did this:
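The exact snippets aren't preserved above, but a hedged sketch of the two-step pattern being described (add the tokens on the reader side, then resize the transformer embeddings on the model side) could look like this. The attribute names tokenizer.tokenizer and embedder.transformer_model are assumptions about AllenNLP 1.x internals, and the helper function and token strings are hypothetical:

```python
# Hedged sketch of the pattern described above (not the commenter's actual code).
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# (1) In the DatasetReader's __init__: register the extra special tokens
#     on the underlying HuggingFace tokenizer.
tokenizer = PretrainedTransformerTokenizer("t5-base")
tokenizer.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<A>", "<B>"]}  # hypothetical tokens
)


# (2) In the Model's __init__: grow the transformer's embedding matrix so it
#     covers the enlarged tokenizer vocabulary.
def resize_embedder(embedder, new_vocab_size: int) -> None:
    # `transformer_model` is assumed to be the wrapped HuggingFace model
    # inside PretrainedTransformerEmbedder.
    embedder.transformer_model.resize_token_embeddings(new_vocab_size)
```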
Thanks for sharing! I'll try it out :)
Can you see if #4781 fixes your original issue?
Hi @dirkgr, thanks. I'll give it a try and get back to you.
@dirkgr I think it would still be necessary to do something like what was done here,
otherwise we'll end up with other errors due to the transformer model having a different dimension, not adjusted to the new vocabulary.
I did confirm that #4781 fixes the parsing of additional tokens from config, but, as I stated in the previous comment, it's still necessary to resize the embeddings layer to accommodate the new vocab.
@dirkgr would a PR which handles the special case of extending a pretrained HuggingFace model embedder be useful? The AllenNLP model class already does this (here, I think), but it doesn't handle the HuggingFace case, since the HF embeddings don't have the "extend_vocab" attribute. Could this function call
@tomsherborne, yes, that would be a welcome addition. I'd be happy to review that PR. I think that's not enough by itself, since we still have to call those new methods, but that would be easy to do.
This is not closed yet. #4781 only goes part of the way.
@dirkgr just to let you know that I will have time for this after EMNLP this week. To get ahead of things - where should I be writing tests for an addition to the Model class?
I think the best place is here: https://github.com/allenai/allennlp/blob/master/tests/models/model_test.py. There is already one test in there about extending vocabs!
@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜
@tomsherborne, did you get anywhere with this?
@dirkgr Yes, apologies - it's in the works and I'll open a PR when it's done.
Hi,

```python
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2": probably just the oov_token and pad_token

test1 = "a test string that uses some additional_special_tokens..."
tokenized_test1 = tokenizer.tokenize(test1)
logger.info([t.text for t in tokenized_test1])  # shows the correct tokenization
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2": probably just the oov_token and pad_token

tokenids_test1 = indexers["pretrained_transformer"].tokens_to_indices(tokenized_test1, vocab)
logger.info(tokenids_test1)  # shows the correct tokenization with token_ids > 32102 (the initial vocab size of t5-base)
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows 32102 --> why not 32102+n ??

detokenized1 = indexers["pretrained_transformer"].indices_to_tokens(tokenids_test1, vocab)  # throws a KeyError on special tokens that have an id >= 32102 :(
```
I'm curious to know where you're at, @tomsherborne, and whether you observe something similar?
@NicolasAG doesn't this work if you do the resize?
@pvcastro Thanks! You reminded me that I forgot to use
and it works fine without doing the resize ✔️
Now that #4946 got merged, even though I was the one that opened this issue, I'd like your input before closing it. What do you think, @dirkgr and @tomsherborne?
I think both sub-issues have been resolved, but I never had a local repro of the problem. Does it work now end-to-end?
For me it's all good now. I was just wondering because @tomsherborne was preparing an additional contribution.
@pvcastro I think your solution is good. I was working on something, but other commitments have ended up taking higher priority. Apologies that this never came to fruition.
Thanks for the effort, @tomsherborne!
Checklist
- Verified that the issue exists against the master branch of AllenNLP.
- Output of pip freeze included in the "Environment" section below.

Description
Hi there! I'm trying to train a transformer-based text classifier in AllenNLP, but I need to add 5 additional special tokens in a way that is compatible with the tokenizers lib. I tried adding them to the jsonnet AllenNLP config file, and then to the transformer's model path, but neither worked; each approach fails with a different problem, described below.
Python traceback:
Related issues or possible duplicates
Environment
OS: Linux
Python version: 3.7.7
Output of pip freeze:
Steps to reproduce
First I tried adding the 5 additional special tokens directly in the jsonnet model config, like this:
But I ran into a problem at allennlp.common.cached_transformers.get_tokenizer, because cache_key = (model_name, frozenset(kwargs.items())) tries to use the "tokenizer_kwargs" values as part of the cache key, but the additional_special_tokens list is not hashable, which throws the following exception: TypeError: unhashable type: 'list'
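To see why that cache-key construction fails, here is a minimal, self-contained reproduction of just the hashing step (plain Python, independent of AllenNLP; the tuple-conversion at the end is only one possible way a fix could normalise the kwargs, not necessarily what the library does):

```python
# Minimal reproduction of the failing cache-key construction:
# frozenset() must hash each (key, value) pair, and a list value is unhashable.
model_name = "bert-base-cased"
kwargs = {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}

try:
    cache_key = (model_name, frozenset(kwargs.items()))
except TypeError as e:
    print(e)  # prints: unhashable type: 'list'

# Converting list values to tuples makes them hashable, so the cache key
# can be built without an error.
hashable_kwargs = {k: tuple(v) if isinstance(v, list) else v for k, v in kwargs.items()}
cache_key = (model_name, frozenset(hashable_kwargs.items()))
```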
I couldn't find a way to make passing the tokens this way work, so I ended up downloading the bert model to my local disk and adding the tokenizer config files to the same path (the vocab size of my bert model is 29794, so the last index is 29793). The file contents I changed are in the "Example source" section below.
After debugging, it looks like this config at least was enough to get the bert tokenizer to recognize the 5 tokens and tokenize the training data accordingly, but then I ran into another issue once training actually began (the one pasted in the "Python traceback" section of this issue).
It looks like this error is due to the transformer model's embeddings layer not being resized according to the new vocabulary size, which would be accomplished with code like this: model.resize_token_embeddings(len(tokenizer)). I didn't find any code in the AllenNLP lib that does something like this, so I'm thinking this is the cause of the issue. Is there another way to accomplish this using AllenNLP that I'm not aware of? It looks like both ways of expanding the vocab size should be possible.
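For context, the plain HuggingFace transformers pattern being asked about is sketched below (standard library usage, not AllenNLP code; the model name and tokens mirror the example files further down):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Register the extra special tokens; returns how many were actually added.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
)

# Resize the embedding matrix so the new token ids have embedding rows;
# pretrained rows are preserved, new rows are freshly initialised.
model.resize_token_embeddings(len(tokenizer))
```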
Example source:
added_tokens.json:

```json
{"<REL_SEP>": 29794, "[[": 29795, "]]": 29796, "<<": 29797, ">>": 29798}
```

special_tokens_map.json:

```json
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
```

tokenizer_config.json:

```json
{"do_lower_case": false, "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
```
Thanks!