This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Train a model with transformer embeddings and additional_special_tokens #4690

Closed
10 tasks done
pvcastro opened this issue Oct 1, 2020 · 32 comments · Fixed by #4781

Comments

@pvcastro
Contributor

pvcastro commented Oct 1, 2020

Checklist

  • I have verified that the issue exists against the master branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

Hi there! I'm trying to train a transformer-based text classifier model in AllenNLP, but I need to add 5 additional special tokens in a way that's compatible with the tokenizers lib. I tried adding them to the jsonnet AllenNLP config file, and then to the transformer model's path on disk, but neither worked; each approach hit a different problem, described below.

Python traceback:

2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Epoch 0/9
2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 10065.304
2020-09-30 23:56:17,484 - WARNING - allennlp.common.util - unable to check gpu_memory_mb() due to occasional failure, continuing
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/common/util.py", line 415, in gpu_memory_mb
    encoding="utf-8",
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_monkey.py", line 526, in new_fork_exec
    return getattr(_posixsubprocess, original_name)(args, *patch_fork_exec_executable_list(args, other_args))
OSError: [Errno 12] Cannot allocate memory
2020-09-30 23:56:17,489 - INFO - allennlp.training.trainer - Training
  0%|          | 0/11817 [00:00<?, ?it/s]/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
...
...
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [102,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [102,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  0%|          | 0/11817 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 872, in train
    train_metrics = self._train_epoch(epoch)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 594, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 479, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/models/basic_classifier.py", line 121, in forward
    embedded_text = self._text_field_embedder(tokens)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 184, in forward
    transformer_output = self.transformer_model(**parameters)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 221, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
python-BaseException
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered

Related issues or possible duplicates

  • None

Environment

OS: Linux

Python version: 3.7.7

Output of pip freeze:

allennlp==1.1.0
allennlp-models==1.1.0
-e [email protected]:allenai/allennlp-server.git@bc56288b9295391051f7b7b042fe34219bfe33ab#egg=allennlp_server
attrs==19.3.0
backcall==0.2.0
bleach==3.1.5
blis==0.4.1
boto3==1.14.31
botocore==1.17.31
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.1
cycler==0.10.0
cymem==2.0.3
cytoolz==0.10.1
decorator==4.4.2
defusedxml==0.6.0
docutils==0.15.2
eland==7.7.0a1
elasticsearch-dsl==7.2.1
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
filelock==3.0.12
fire==0.3.1
Flask==1.1.2
Flask-Cors==3.0.8
ftfy==5.8
future==0.18.2
gevent==20.6.2
greenlet==0.4.16
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
iniconfig==1.0.1
ipykernel==5.3.4
ipython==7.16.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.17.2
jellyfish==0.8.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==0.16.0
jsonnet==0.16.0
jsonpickle==1.4.1
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
Keras==2.4.3
kiwisolver==1.2.0
MarkupSafe==1.1.1
matplotlib==3.3.0
mistune==0.8.4
mkl-fft==1.1.0
mkl-random==1.1.1
mkl-service==2.3.0
more-itertools==8.4.0
murmurhash==1.0.2
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.4
nltk==3.5
notebook==6.0.3
numpy==1.18.5
olefile==0.46
overrides==3.1.0
packaging==20.4
pandas==1.1.0
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.2.0
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prometheus-client==0.8.0
prompt-toolkit==3.0.5
protobuf==3.12.4
ptyprocess==0.6.0
py==1.9.0
py-rouge==1.1
pydot==1.4.1
pyemd==0.5.1
Pygments==2.6.1
pyparsing==2.4.7
Pyphen==0.9.5
pyrsistent==0.16.0
pytest==6.0.1
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.1
regex==2020.7.14
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.2
seaborn==0.11.0
Send2Trash==1.5.0
sentencepiece==0.1.91
seqeval==0.0.12
six==1.15.0
spacy==2.3.2
srsly==1.0.2
tensorboardX==2.1
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
thinc==7.4.1
threadpoolctl==2.1.0
tokenizers==0.8.1rc1
toml==0.10.1
toolz==0.10.0
torch==1.6.0+cu101
torchvision==0.7.0+cu101
tornado==6.0.4
tqdm==4.48.0
traitlets==4.3.3
transformers==3.0.2
urllib3==1.25.10
visualise-spacy-tree==0.0.6
wasabi==0.7.1
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
word2number==1.1
zipp==3.1.0
zope.event==4.4
zope.interface==5.1.0

Steps to reproduce

First I tried adding the 5 additional special tokens directly in the jsonnet model config, like this:

    "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": transformer_model,
                "max_length": transformer_dim,
                "tokenizer_kwargs": {"additional_special_tokens": [['<REL_SEP>'], ['[['], [']]'], ['<<'], ['>>']], "max_len": transformer_dim}
            }
     },

But I ran into a problem at allennlp.common.cached_transformers.get_tokenizer, because cache_key = (model_name, frozenset(kwargs.items())) tries to use the "tokenizer_kwargs" value as part of the cache key, and the additional_special_tokens list isn't hashable, so it throws the following exception:

TypeError: unhashable type: 'list'

Traceback (most recent call last):
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1465, in _exec
    runpy._run_module_as_main(module_name, alter_argv=False)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 38, in <module>
    run()
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/media/discoD/repositorios/allennlp/allennlp/commands/__init__.py", line 94, in main
    args.func(args)
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 238, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 433, in _train_worker
    local_rank=process_rank,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 387, in construct_arg
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 628, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/media/discoD/repositorios/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
    model_name, tokenizer_kwargs=tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 71, in __init__
    model_name, add_special_tokens=False, **tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/common/cached_transformers.py", line 101, in get_tokenizer
    cache_key = (model_name, frozenset(kwargs.items()))
TypeError: unhashable type: 'list'
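
For reference, the failure boils down to frozenset() needing hashable values, and a plain list isn't one. A minimal sketch outside AllenNLP (the model name is just a placeholder):

    # Minimal reproduction of the cache-key problem, independent of AllenNLP:
    kwargs = {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
    cache_key = ("any-model-name", frozenset(kwargs.items()))
    # -> TypeError: unhashable type: 'list' (the list value can't be hashed)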

I couldn't find a way to make it work by passing the tokens this way, so I ended up downloading the BERT model to my local disk and adding the tokenizer config files to the same path (the vocab size of my BERT model is 29794, so the last index is 29793). The file contents I changed are in the "Example source" section below.

After debugging, it looks like this config was at least enough to get the BERT tokenizer to recognize the 5 tokens and tokenize the training data accordingly, but then I ran into another issue once training actually began (the one pasted in the "Python traceback" section of this issue).

It looks like this error is due to the transformer model's embedding layer not being resized to match the new vocabulary size, which would normally be accomplished with code like model.resize_token_embeddings(len(tokenizer)). I didn't find any code in the AllenNLP lib that does something like this, so I think this is the cause of the issue.
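
For comparison, doing this directly with the transformers lib would look roughly like the sketch below (bert-base-cased is just a stand-in for my local model):

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # stand-in for my local BERT
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
    )
    model = AutoModel.from_pretrained("bert-base-cased")
    # Grow the embedding matrix so the new token ids have rows to index into;
    # without this, lookups for ids >= the original vocab size hit the CUDA
    # `srcIndex < srcSelectDimSize` assert shown in the traceback above.
    model.resize_token_embeddings(len(tokenizer))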

Is there another way to accomplish this with AllenNLP that I'm not aware of? It seems like both ways of expanding the vocab size should be possible.

Example source:

added_tokens.json:

{"<REL_SEP>": 29794, "[[": 29795, "]]": 29796, "<<": 29797, ">>": 29798}

special_tokens_map.json:

{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}

tokenizer_config.json:

{"do_lower_case": false, "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}

Thanks!

@github-actions

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

@pvcastro
Contributor Author

@dirkgr ? 😅

@tomsherborne
Contributor

tomsherborne commented Oct 27, 2020

@pvcastro I've managed to get something approximately like this working by writing a wrapper Model class for the HuggingFace model that calls the HF function model.resize_token_embeddings(new_vocab_size) at the end of the constructor, as well as forcibly creating a custom AllenNLP vocab and a HuggingFace vocab. My use case is BART, so I've modified the bart.py file in allennlp-models. I'm now verifying that this doesn't remove/break the pretrained weight initialisation. I'm not sure if this is useful for you, but let me know if it helps!

@pvcastro
Contributor Author

Hi @tomsherborne!
So you had to hardcode tokenizer_kwargs directly in a .py file to pass additional_special_tokens?

@tomsherborne
Contributor

tomsherborne commented Oct 27, 2020

@pvcastro Yes, you could do that. I found it more stable/manageable to manually create an instance of the relevant HF tokenizer, use add_tokens to add each extra token I needed, and save it to disk. Then, in the ANLP config file, I reference a local path under the "model_name" argument for the tokenizer/indexer. I also converted this HF vocabulary into an ANLP vocabulary and reference it in the config as:

	"vocabulary": {
		"type": "from_files", 
		"directory": "./path/to/allennlp/version/of/extended/vocab",
		"oov_token": "<unk>",
		"padding_token": "<pad>"
	}
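
For concreteness, the tokenizer side of that workflow is roughly the sketch below (the model name, tokens, and path are just examples):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # example model
    tokenizer.add_tokens(["<extra_token_1>", "<extra_token_2>"])     # add each extra token
    tokenizer.save_pretrained("./extended_tokenizer")                # pass this path as "model_name" in the config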

@pvcastro
Contributor Author

I see. I guess I'll do the same then, until this is supported by the library.
Thanks, @tomsherborne!

@epwalsh
Member

epwalsh commented Oct 27, 2020

@pvcastro just FYI, Dirk is currently on vacation and won't be back for another week.

@tomsherborne it seems like it would be useful to have a from_pretrained_transformer vocab type? Then you could just give it the model name or path:

"vocabulary": {
    "type": "from_pretrained_transformer",
    "model_name": "bert-base-cased"
}

This feature request has come up elsewhere as well (see here, for example).

@tomsherborne
Contributor

@epwalsh I may have some aspects of the vocabulary API wrong here, but I understood that a specific type was unnecessary right now because the PretrainedTransformerIndexer will force a copy of the pretrained vocabulary into the AllenNLP vocab namespace here. Would you also need a way of forcing the correct OOV/pad tokens from the pretrained version when they aren't the default AllenNLP strings?

@epwalsh
Member

epwalsh commented Oct 27, 2020

@tomsherborne thanks for pointing that out, I actually wasn't aware of that.

Would you also need a way of forcing the correct OOV/Pad tokens from the pretrained version when they aren't the default AllenNLP strings?

Hmm, do you need that for your use case? Right now the Vocabulary object just has a single padding_token field, but we would really need to be able to specify a different padding_token for each namespace in the Vocabulary. I think that's very doable. More generally, we could have optional namespace-specific settings that override the defaults. I'm just wondering whether there's a need for that.

@NicolasAG

Hi,
I just want to add my interest in seeing this issue resolved :)

I have the exact same issue as originally stated: trying to add special tokens to the T5 pretrained tokenizer with this extra argument in my tokenizer config:

"tokenizer_kwargs": { "additional_special_tokens": ["##START##", "##END##", ...] }

but that fails because the list type is unhashable.

However, I am not sure I understand the trick of creating a Model wrapper, a new Vocab file, etc., as suggested by @tomsherborne.

I guess another solution could be to:
(1) manually add the special tokens when creating the data reader:

special_tokens_dict = {'additional_special_tokens': ['##START##', '##END##', ...]}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
logger.info(f'We have added {num_added_toks} tokens')

and then (2) when creating the model, resize the BasicTextFieldEmbedder.token_embedder dimension like this:

text_embedder.t5.resize_token_embeddings(len(vocab))

This is because I have the following config for my model, in which I use the pretrained T5 encoder as an embedder:

"source_text_embedder": {
    "type": "basic",
    "token_embedders": {
        "t5": { "type": "pretrained_transformer", "model_name": model_name, .... }
    }
}

Would that work? Will the vocab passed to the model constructor have the additional tokens...?
Anyway, I'll try it out and see how I can work around this issue, but I'm also really looking forward to this being properly supported by the ANLP library :)

@pvcastro
Contributor Author

pvcastro commented Nov 6, 2020

Hi @NicolasAG. I ended up doing something similar. In my custom reader class I did this in __init__:

        self._tokenizer = PretrainedTransformerTokenizer(
            model_name=transformer_model_name,
            tokenizer_kwargs=tokenizer_kwargs,
        )
        self._token_indexers = {
            "tokens": PretrainedTransformerIndexer(
                transformer_model_name, tokenizer_kwargs=tokenizer_kwargs
            )
        }
        special_tokens_dict = {'additional_special_tokens': ['<REL_SEP>']}
        self._tokenizer.tokenizer.add_special_tokens(special_tokens_dict)
        self._token_indexers["tokens"]._allennlp_tokenizer.tokenizer.add_special_tokens(special_tokens_dict)

In the model I did this:

        default_vocab_size = self._embedder.token_embedder_tokens.config.vocab_size
        self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings(default_vocab_size + 1)

@NicolasAG

Thanks for sharing! I'll try it out :)

@dirkgr
Member

dirkgr commented Nov 11, 2020

Can you see if #4781 fixes your original issue?

@pvcastro
Contributor Author

Hi @dirkgr , thanks. I'll give it a try and get back to you.

@pvcastro
Contributor Author

@dirkgr I think it would still be necessary to do something like what was done here

self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings

otherwise we'll end up with other errors, since the transformer model's embedding layer still has the old size and isn't adjusted to the new vocabulary.

@pvcastro
Contributor Author

I did confirm that #4781 fixes the parsing of additional tokens from the config, but, as I stated in the previous comment, it's still necessary to resize the embedding layer to accommodate the new vocab.

@tomsherborne
Contributor

@dirkgr would a PR which handles the special case of extending a pretrained HuggingFace model embedder be useful? The AllenNLP Model class already does this (here, I think), but it doesn't handle the HuggingFace case, since the HF embeddings don't have the "extend_vocab" attribute. Could this function call pretrained_model.resize_token_embeddings(new_vocab_size) as a special case?
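
Roughly, what I have in mind is a sketch like the one below (hypothetical helper name, not the actual AllenNLP code):

    def maybe_resize_transformer_embeddings(token_embedder, new_vocab_size: int) -> None:
        # Hypothetical helper: when the embedder wraps a HuggingFace model (which has no
        # extend_vocab), fall back to HF's own resize_token_embeddings instead.
        hf_model = getattr(token_embedder, "transformer_model", None)
        if hf_model is not None and new_vocab_size != hf_model.config.vocab_size:
            hf_model.resize_token_embeddings(new_vocab_size)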

@dirkgr
Member

dirkgr commented Nov 11, 2020

@tomsherborne, yes, that would be a welcome addition. I'd be happy to review that PR. I think that's not enough by itself, since we still have to call those new methods then, but it would be easy to do so.

@dirkgr
Member

dirkgr commented Nov 12, 2020

This is not closed yet. #4781 only goes part of the way.

@dirkgr dirkgr reopened this Nov 12, 2020
@tomsherborne
Contributor

@dirkgr just to let you know that I will have time for this after EMNLP this week. To get ahead of things: where should I be writing tests for an addition to the Model class?

@dirkgr
Member

dirkgr commented Nov 25, 2020

I think the best place is here: https://github.com/allenai/allennlp/blob/master/tests/models/model_test.py

There is already one test in there about extending vocabs!

@github-actions

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

@dirkgr
Member

dirkgr commented Dec 10, 2020

@tomsherborne, did you get anywhere with this?

@dirkgr dirkgr removed their assignment Dec 10, 2020
@tomsherborne
Contributor

@dirkgr Yes, apologies - it's in the works and I'll open a PR when it's done.

@NicolasAG

NicolasAG commented Jan 13, 2021

Hi,
I am also trying to figure out a way to work around this issue.
I recently updated my allennlp library to version 1.3 and I can also confirm that #4781 fixes the parsing of additional tokens from the config, but I am not sure whether we still need to call the pretrained_model.resize_token_embeddings(new_vocab_size) method.
Indeed, when I print (from my model class constructor) text_field_embedder.token_embedder_pretrained_transformer.config.vocab_size, I get the correct number of tokens, that is, 32102 (from t5-base) + n, where n is the number of tokens in my "additional_special_tokens" config param. So the text field embedder seems to know that I have extra tokens ✔️

However, the issue I find is the following:
After initializing my model (initializer(self) in the constructor) I have the following debug lines to make sure everything is ok:

logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2": probably just the oov_token and pad_token
test1 = "a test string that uses some additional_special_tokens..."
tokenized_test1 = tokenizer.tokenize(test1)
logger.info([t.text for t in tokenized_test1])  # shows the correct tokenization
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2" probably just the oov_token and pad_token
tokenids_test1 = indexers["pretrained_transformer"].tokens_to_indices(tokenized_test1, vocab)
logger.info(tokenids_test1)  # shows the correct tokenization with token_ids > 32102 (the initial vocab size of t5-base)
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows 32102 --> why not 32102+n ??
detokenized1 = indexers["pretrained_transformer"].indices_to_tokens(tokenids_test1, vocab)  # throws a KeyError on special tokens that have an id >= 32102 :(

This last line throws a KeyError on token_ids >= 32102. It seems like the tokens_to_indices() function works fine but indices_to_tokens() does not; is this behavior expected...?
[EDIT]: this was because I hadn't specified any tokenizer_kwargs for the TokenIndexer or the TokenEmbedder.

I'm curious to know where you are at @tomsherborne and if you observe something similar?
Thanks

@pvcastro
Contributor Author

@NicolasAG doesn't this work if you do the resize?
In my scenario I used tokenizer_kwargs for both the tokenizer and the indexer in the dataset reader, and for the embedder in the actual model. In the model, using tokenizer_kwargs alone wasn't enough; I had to do the resize_token_embeddings as well.

@NicolasAG

@pvcastro Thanks! You reminded me that I forgot to use tokenizer_kwargs in the TokenIndexer and the TokenEmbedder! In my config file I only specified the extra tokens for the Tokenizer, not for the indexer or the embedder. This is my config now:

    "source_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "t5-base",
        "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, },
    },
    "source_token_indexers": {
        "pretrained_transformer": {
            "type": "pretrained_transformer",
            "model_name": "t5-base",
            "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, }
        }
    },
    "source_text_embedder": {
            "type": "basic",
            "token_embedders": {
                "pretrained_transformer": {
                    "type": "pretrained_transformer",
                    "model_name": "t5-base",
                    "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, }
                },
            },
        },

and it works fine without doing the resize ✔️

@pvcastro
Contributor Author

pvcastro commented Feb 6, 2021

Now that #4946 got merged, even though I was the one who opened this issue, I'd like your input before closing it. What do you think, @dirkgr and @tomsherborne?

@dirkgr
Member

dirkgr commented Feb 6, 2021

I think both sub-issues have been resolved, but I never had a local repro of the problem. Does it work now end-to-end?

@pvcastro
Contributor Author

pvcastro commented Feb 6, 2021

For me it's all good now. I was just wondering because @tomsherborne was preparing an additional contribution.

@tomsherborne
Contributor

@pvcastro I think your solution is good. I was working on something, but other commitments have ended up taking higher priority. Apologies that this never came to fruition.

@pvcastro
Contributor Author

pvcastro commented Feb 9, 2021

Thanks for the effort @tomsherborne !
