This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Train a model with transformer embeddings and additional_special_tokens #4690

Closed
10 tasks done
pvcastro opened this issue Oct 1, 2020 · 32 comments · Fixed by #4781

Comments

@pvcastro
Contributor

pvcastro commented Oct 1, 2020

Checklist

  • I have verified that the issue exists against the master branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

Hi there! I'm trying to train a transformer-based text classifier model in AllenNLP, but I need to add 5 additional special tokens in a way that's compatible with the tokenizers lib. I tried adding them to the jsonnet AllenNLP config file, and then to the transformer model's path on disk, but neither worked; each approach hit a different problem, described below.

Python traceback:

2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Epoch 0/9
2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 10065.304
2020-09-30 23:56:17,484 - WARNING - allennlp.common.util - unable to check gpu_memory_mb() due to occasional failure, continuing
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/common/util.py", line 415, in gpu_memory_mb
    encoding="utf-8",
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_monkey.py", line 526, in new_fork_exec
    return getattr(_posixsubprocess, original_name)(args, *patch_fork_exec_executable_list(args, other_args))
OSError: [Errno 12] Cannot allocate memory
2020-09-30 23:56:17,489 - INFO - allennlp.training.trainer - Training
  0%|          | 0/11817 [00:00<?, ?it/s]/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
...
...
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [102,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [102,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  0%|          | 0/11817 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 872, in train
    train_metrics = self._train_epoch(epoch)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 594, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 479, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/models/basic_classifier.py", line 121, in forward
    embedded_text = self._text_field_embedder(tokens)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 184, in forward
    transformer_output = self.transformer_model(**parameters)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 221, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
python-BaseException
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered

Related issues or possible duplicates

  • None

Environment

OS: Linux

Python version: 3.7.7

Output of pip freeze:

allennlp==1.1.0
allennlp-models==1.1.0
-e [email protected]:allenai/allennlp-server.git@bc56288b9295391051f7b7b042fe34219bfe33ab#egg=allennlp_server
attrs==19.3.0
backcall==0.2.0
bleach==3.1.5
blis==0.4.1
boto3==1.14.31
botocore==1.17.31
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.1
cycler==0.10.0
cymem==2.0.3
cytoolz==0.10.1
decorator==4.4.2
defusedxml==0.6.0
docutils==0.15.2
eland==7.7.0a1
elasticsearch-dsl==7.2.1
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
filelock==3.0.12
fire==0.3.1
Flask==1.1.2
Flask-Cors==3.0.8
ftfy==5.8
future==0.18.2
gevent==20.6.2
greenlet==0.4.16
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
iniconfig==1.0.1
ipykernel==5.3.4
ipython==7.16.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.17.2
jellyfish==0.8.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==0.16.0
jsonnet==0.16.0
jsonpickle==1.4.1
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
Keras==2.4.3
kiwisolver==1.2.0
MarkupSafe==1.1.1
matplotlib==3.3.0
mistune==0.8.4
mkl-fft==1.1.0
mkl-random==1.1.1
mkl-service==2.3.0
more-itertools==8.4.0
murmurhash==1.0.2
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.4
nltk==3.5
notebook==6.0.3
numpy==1.18.5
olefile==0.46
overrides==3.1.0
packaging==20.4
pandas==1.1.0
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.2.0
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prometheus-client==0.8.0
prompt-toolkit==3.0.5
protobuf==3.12.4
ptyprocess==0.6.0
py==1.9.0
py-rouge==1.1
pydot==1.4.1
pyemd==0.5.1
Pygments==2.6.1
pyparsing==2.4.7
Pyphen==0.9.5
pyrsistent==0.16.0
pytest==6.0.1
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.1
regex==2020.7.14
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.2
seaborn==0.11.0
Send2Trash==1.5.0
sentencepiece==0.1.91
seqeval==0.0.12
six==1.15.0
spacy==2.3.2
srsly==1.0.2
tensorboardX==2.1
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
thinc==7.4.1
threadpoolctl==2.1.0
tokenizers==0.8.1rc1
toml==0.10.1
toolz==0.10.0
torch==1.6.0+cu101
torchvision==0.7.0+cu101
tornado==6.0.4
tqdm==4.48.0
traitlets==4.3.3
transformers==3.0.2
urllib3==1.25.10
visualise-spacy-tree==0.0.6
wasabi==0.7.1
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
word2number==1.1
zipp==3.1.0
zope.event==4.4
zope.interface==5.1.0

Steps to reproduce

First I tried adding the 5 additional special tokens directly in the jsonnet model config, like this:

    "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": transformer_model,
                "max_length": transformer_dim,
                "tokenizer_kwargs": {"additional_special_tokens": [['<REL_SEP>'], ['[['], [']]'], ['<<'], ['>>']], "max_len": transformer_dim}
            }
     },

But I ran into a problem at allennlp.common.cached_transformers.get_tokenizer, because cache_key = (model_name, frozenset(kwargs.items())) tries to use the "tokenizer_kwargs" value as part of the cache key, and the additional_special_tokens list isn't hashable, so it throws the following exception:

TypeError: unhashable type: 'list'

Traceback (most recent call last):
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1465, in _exec
    runpy._run_module_as_main(module_name, alter_argv=False)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 38, in <module>
    run()
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/media/discoD/repositorios/allennlp/allennlp/commands/__init__.py", line 94, in main
    args.func(args)
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 238, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 433, in _train_worker
    local_rank=process_rank,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 387, in construct_arg
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 628, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/media/discoD/repositorios/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
    model_name, tokenizer_kwargs=tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 71, in __init__
    model_name, add_special_tokens=False, **tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/common/cached_transformers.py", line 101, in get_tokenizer
    cache_key = (model_name, frozenset(kwargs.items()))
TypeError: unhashable type: 'list'
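
For reference, the failure boils down to frozenset() needing hashable values, and a plain list isn't one. A minimal sketch outside AllenNLP (the model name is just a placeholder):

    # Minimal reproduction of the cache-key problem, independent of AllenNLP:
    kwargs = {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
    cache_key = ("any-model-name", frozenset(kwargs.items()))
    # -> TypeError: unhashable type: 'list' (the list value can't be hashed)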

I couldn't find a way to make it work by passing the tokens this way, so I ended up downloading the BERT model to my local disk and adding the tokenizer config files to the same path (the vocab size of my BERT model is 29794, so the last index is 29793). The file contents I changed are in the "Example source" section below.

After debugging, it looks like this config was at least enough to get the BERT tokenizer to recognize the 5 tokens and tokenize the training data accordingly, but then I ran into another issue once training actually began (the one pasted in the "Python traceback" section of this issue).

It looks like this error is due to the transformer model's embedding layer not being resized to match the new vocabulary size, which would normally be accomplished with code like model.resize_token_embeddings(len(tokenizer)). I didn't find any code in the AllenNLP lib that does something like this, so I think this is the cause of the issue.
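
For comparison, doing this directly with the transformers lib would look roughly like the sketch below (bert-base-cased is just a stand-in for my local model):

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # stand-in for my local BERT
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}
    )
    model = AutoModel.from_pretrained("bert-base-cased")
    # Grow the embedding matrix so the new token ids have rows to index into;
    # without this, lookups for ids >= the original vocab size hit the CUDA
    # `srcIndex < srcSelectDimSize` assert shown in the traceback above.
    model.resize_token_embeddings(len(tokenizer))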

Is there another way to accomplish this with AllenNLP that I'm not aware of? It seems like both ways of expanding the vocab size should be possible.

Example source:

added_tokens.json:

{"<REL_SEP>": 29794, "[[": 29795, "]]": 29796, "<<": 29797, ">>": 29798}

special_tokens_map.json:

{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}

tokenizer_config.json:

{"do_lower_case": false, "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}

Thanks!

@github-actions

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

@pvcastro
Contributor Author

@dirkgr ? 😅

@tomsherborne
Contributor

tomsherborne commented Oct 27, 2020

@pvcastro I've managed to get something approximately like this working by writing a wrapper Model class for the HuggingFace model that calls the HF function model.resize_token_embeddings(new_vocab_size) at the end of the constructor, as well as forcibly creating a custom AllenNLP vocab and a HuggingFace vocab. My use case is BART, so I've modified the bart.py file in allennlp-models. I'm now verifying that this doesn't remove/break the pretrained weight initialisation. I'm not sure if this is useful for you, but let me know if it helps!

@pvcastro
Contributor Author

Hi @tomsherborne!
So you had to hardcode tokenizer_kwargs directly in a .py file to pass additional_special_tokens?

@tomsherborne
Contributor

tomsherborne commented Oct 27, 2020

@pvcastro Yes, you could do that. I found it more stable/manageable to manually create an instance of the relevant HF tokenizer, use add_tokens to add each extra token I needed, and save it to disk. Then, in the ANLP config file, I reference a local path under the "model_name" argument for the tokenizer/indexer. I also converted this HF vocabulary into an ANLP vocabulary and reference it in the config as:

	"vocabulary": {
		"type": "from_files", 
		"directory": "./path/to/allennlp/version/of/extended/vocab",
		"oov_token": "<unk>",
		"padding_token": "<pad>"
	}
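
For concreteness, the tokenizer side of that workflow is roughly the sketch below (the model name, tokens, and path are just examples):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # example model
    tokenizer.add_tokens(["<extra_token_1>", "<extra_token_2>"])     # add each extra token
    tokenizer.save_pretrained("./extended_tokenizer")                # pass this path as "model_name" in the config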

@pvcastro
Contributor Author

I see. I guess I'll do the same then, until this is supported by the library.
Thanks, @tomsherborne!

@epwalsh
Member

epwalsh commented Oct 27, 2020

@pvcastro just FYI, Dirk is currently on vacation and won't be back for another week.

@tomsherborne it seems like it would be useful to have a from_pretrained_transformer vocab type? Then you could just give it the model name or path:

"vocabulary": {
    "type": "from_pretrained_transformer",
    "model_name": "bert-base-cased"
}

This feature request has come up elsewhere as well (see here, for example).

@tomsherborne
Contributor

@epwalsh I may have some aspects of the vocabulary API wrong here, but I understood that a specific type was unnecessary right now because the PretrainedTransformerIndexer will force a copy of the pretrained vocabulary into the AllenNLP vocab namespace here. Would you also need a way of forcing the correct OOV/pad tokens from the pretrained version when they aren't the default AllenNLP strings?

@epwalsh
Member

epwalsh commented Oct 27, 2020

@tomsherborne thanks for pointing that out, I actually wasn't aware of that.

Would you also need a way of forcing the correct OOV/Pad tokens from the pretrained version when they aren't the default AllenNLP strings?

Hmm, do you need that for your use case? Right now the Vocabulary object just has a single padding_token field, but we would really need to be able to specify a different padding_token for each namespace in the Vocabulary. I think that's very doable. More generally, we could have optional namespace-specific settings that override the defaults. I'm just wondering whether there's a need for that.

@NicolasAG

Hi,
I just want to add my interest in seeing this issue resolved :)

I have the exact same issue as originally stated: trying to add special tokens to the T5 pretrained tokenizer with this extra argument in my tokenizer config:

"tokenizer_kwargs": { "additional_special_tokens": ["##START##", "##END##", ...] }

but that fails because the list type is unhashable.

However, I am not sure I understand the trick of creating a Model wrapper, a new Vocab file, etc., as suggested by @tomsherborne.

I guess another solution could be to:
(1) manually add the special tokens when creating the data reader:

special_tokens_dict = {'additional_special_tokens': ['##START##', '##END##', ...]}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
logger.info(f'We have added {num_added_toks} tokens')

and then (2) when creating the model, resize the BasicTextFieldEmbedder.token_embedder dimension like this:

text_embedder.t5.resize_token_embeddings(len(vocab))

This is because I have the following config for my model, in which I use the pretrained T5 encoder as an embedder:

"source_text_embedder": {
    "type": "basic",
    "token_embedders": {
        "t5": { "type": "pretrained_transformer", "model_name": model_name, .... }
    }
}

Would that work? Will the vocab passed to the model constructor have the additional tokens...?
Anyway, I'll try it out and see how I can work around this issue, but I'm also really looking forward to this being properly supported by the ANLP library :)

@pvcastro
Contributor Author

pvcastro commented Nov 6, 2020

Hi @NicolasAG. I ended up doing something similar. In my custom reader class I did this in __init__:

        self._tokenizer = PretrainedTransformerTokenizer(
            model_name=transformer_model_name,
            tokenizer_kwargs=tokenizer_kwargs,
        )
        self._token_indexers = {
            "tokens": PretrainedTransformerIndexer(
                transformer_model_name, tokenizer_kwargs=tokenizer_kwargs
            )
        }
        special_tokens_dict = {'additional_special_tokens': ['<REL_SEP>']}
        self._tokenizer.tokenizer.add_special_tokens(special_tokens_dict)
        self._token_indexers["tokens"]._allennlp_tokenizer.tokenizer.add_special_tokens(special_tokens_dict)

In the model I did this:

        default_vocab_size = self._embedder.token_embedder_tokens.config.vocab_size
        self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings(default_vocab_size + 1)

@NicolasAG

Thanks for sharing! I'll try it out :)

@dirkgr
Member

dirkgr commented Nov 11, 2020

Can you see if #4781 fixes your original issue?

@pvcastro
Contributor Author

Hi @dirkgr , thanks. I'll give it a try and get back to you.

@pvcastro
Contributor Author

@dirkgr I think it would still be necessary to do something like what was done here

self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings

otherwise we'll end up with other errors, since the transformer model's embedding layer still has the old size and isn't adjusted to the new vocabulary.

@pvcastro
Contributor Author

I did confirm that #4781 fixes the parsing of additional tokens from the config, but, as I stated in the previous comment, it's still necessary to resize the embedding layer to accommodate the new vocab.

@tomsherborne
Contributor

@dirkgr would a PR which handles the special case of extending a pretrained HuggingFace model embedder be useful? The AllenNLP Model class already does this (here, I think), but it doesn't handle the HuggingFace case, since the HF embeddings don't have the "extend_vocab" attribute. Could this function call pretrained_model.resize_token_embeddings(new_vocab_size) as a special case?
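
Roughly, what I have in mind is a sketch like the one below (hypothetical helper name, not the actual AllenNLP code):

    def maybe_resize_transformer_embeddings(token_embedder, new_vocab_size: int) -> None:
        # Hypothetical helper: when the embedder wraps a HuggingFace model (which has no
        # extend_vocab), fall back to HF's own resize_token_embeddings instead.
        hf_model = getattr(token_embedder, "transformer_model", None)
        if hf_model is not None and new_vocab_size != hf_model.config.vocab_size:
            hf_model.resize_token_embeddings(new_vocab_size)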

@dirkgr
Member

dirkgr commented Nov 11, 2020

@tomsherborne, yes, that would be a welcome addition. I'd be happy to review that PR. I think that's not enough by itself, since we still have to call those new methods then, but it would be easy to do so.

@dirkgr
Member

dirkgr commented Nov 12, 2020

This is not closed yet. #4781 only goes part of the way.

@dirkgr dirkgr reopened this Nov 12, 2020
@tomsherborne
Contributor

@dirkgr just to let you know that I will have time for this after EMNLP this week. To get ahead of things: where should I be writing tests for an addition to the Model class?

@dirkgr
Member

dirkgr commented Nov 25, 2020

I think the best place is here: https://github.com/allenai/allennlp/blob/master/tests/models/model_test.py

There is already one test in there about extending vocabs!

@github-actions

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

@dirkgr
Member

dirkgr commented Dec 10, 2020

@tomsherborne, did you get anywhere with this?

@dirkgr dirkgr removed their assignment Dec 10, 2020
@tomsherborne
Contributor

@dirkgr Yes, apologies - it's in the works and I'll open a PR when it's done.

@NicolasAG

NicolasAG commented Jan 13, 2021

Hi,
I am also trying to figure out a way to work around this issue.
I recently updated my allennlp library to version 1.3 and I can also confirm that #4781 fixes the parsing of additional tokens from the config, but I am not sure whether we still need to call the pretrained_model.resize_token_embeddings(new_vocab_size) method.
Indeed, when I print (from my model class constructor) text_field_embedder.token_embedder_pretrained_transformer.config.vocab_size, I get the correct number of tokens, that is, 32102 (from t5-base) + n, where n is the number of tokens in my "additional_special_tokens" config param. So the text field embedder seems to know that I have extra tokens ✔️

However, the issue I find is the following:
After initializing my model (initializer(self) in the constructor) I have the following debug lines to make sure everything is ok:

logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2": probably just the oov_token and pad_token
test1 = "a test string that uses some additional_special_tokens..."
tokenized_test1 = tokenizer.tokenize(test1)
logger.info([t.text for t in tokenized_test1])  # shows the correct tokenization
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows "2" probably just the oov_token and pad_token
tokenids_test1 = indexers["pretrained_transformer"].tokens_to_indices(tokenized_test1, vocab)
logger.info(tokenids_test1)  # shows the correct tokenization with token_ids > 32102 (the initial vocab size of t5-base)
logger.info(len(vocab.get_token_to_index_vocabulary(namespace)))  # shows 32102 --> why not 32102+n ??
detokenized1 = indexers["pretrained_transformer"].indices_to_tokens(tokenids_test1, vocab)  # throws a KeyError on special tokens that have an id >= 32102 :(

This last line throws a KeyError on token_ids >= 32102. It seems like the tokens_to_indices() function works fine but indices_to_tokens() does not; is this behavior expected...?
[EDIT]: this was because I hadn't specified any tokenizer_kwargs for the TokenIndexer or the TokenEmbedder.

I'm curious to know where you are at @tomsherborne and if you observe something similar?
Thanks

@pvcastro
Contributor Author

@NicolasAG doesn't this work if you do the resize?
In my scenario I used tokenizer_kwargs for both the tokenizer and the indexer in the dataset reader, and for the embedder in the actual model. In the model, using tokenizer_kwargs alone wasn't enough; I had to do the resize_token_embeddings as well.

@NicolasAG

@pvcastro Thanks! You reminded me that I forgot to use tokenizer_kwargs in the TokenIndexer and the TokenEmbedder! In my config file I only specified the extra tokens for the Tokenizer, not for the indexer or the embedder. This is my config now:

    "source_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "t5-base",
        "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, },
    },
    "source_token_indexers": {
        "pretrained_transformer": {
            "type": "pretrained_transformer",
            "model_name": "t5-base",
            "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, }
        }
    },
    "source_text_embedder": {
            "type": "basic",
            "token_embedders": {
                "pretrained_transformer": {
                    "type": "pretrained_transformer",
                    "model_name": "t5-base",
                    "tokenizer_kwargs": { "additional_special_tokens": extra_tokens, }
                },
            },
        },

and it works fine without doing the resize ✔️

@pvcastro
Contributor Author

pvcastro commented Feb 6, 2021

Now that #4946 got merged, even though I was the one who opened this issue, I'd like your input before closing it. What do you think, @dirkgr and @tomsherborne?

@dirkgr
Member

dirkgr commented Feb 6, 2021

I think both sub-issues have been resolved, but I never had a local repro of the problem. Does it work now end-to-end?

@pvcastro
Contributor Author

pvcastro commented Feb 6, 2021

For me it's all good now. I was just wondering because @tomsherborne was preparing an additional contribution.

@tomsherborne
Contributor

@pvcastro I think your solution is good. I was working on something, but other commitments have ended up taking higher priority. Apologies that this never came to fruition.

@pvcastro
Contributor Author

pvcastro commented Feb 9, 2021

Thanks for the effort @tomsherborne !
