Rethink vocabulary API #3097
Comments
Oh, actually, this means that the model won't be getting the right vocab sizes if you use our standard training script. I was hacking my own thing, so I got the right vocab size, but I ran into this OOV error...
I can fix the error by setting the "bert" namespace as one of the non-padded namespaces (which is appropriate anyway). But this doesn't solve the vocab size issue, if the model needs to know the BERT vocab size.
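(A minimal sketch of the workaround mentioned above, assuming the indexer writes into a namespace called "bert": making that namespace non-padded stops AllenNLP from injecting its own @@PADDING@@/@@UNKNOWN@@ entries there, since the BERT wordpiece vocab already carries its own special tokens.)

```python
from allennlp.data import Vocabulary

# "bert" is assumed to be the namespace the wordpiece indexer uses.
# Non-padded namespaces don't get @@PADDING@@/@@UNKNOWN@@ prepended, so the
# namespace can hold the pretrained wordpiece vocab as-is.
vocab = Vocabulary(non_padded_namespaces=["*tags", "*labels", "bert"])

# The equivalent in a training config:
# "vocabulary": {"non_padded_namespaces": ["*tags", "*labels", "bert"]}
```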
The issue is as stated in #3233. Looking forward to any update.
Moving this to the 1.1 milestone, as I don’t think it’s pressing, nor should it be a breaking change.
It breaks the […]
It has also blocked hosting demos for models that use vocabs constructed from huggingface transformers.
Yeah, […]
I've been banging my head against this all day, because I'm trying to fix […]. I think the cleanest way to do this is to specify a […].
I'm not opposed if you think that's the best solution to this problem. It's unfortunate that we'll now have four separate places where you need to specify a particular transformer model, but I'm not sure how to get around that. It's an argument for something much more integrated, like we had in the pre-allennlp days. But that has its own problems.
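(For context on the "four separate places": in a typical config the transformer model name already shows up in the tokenizer, the token indexer, and the token embedder, and the proposal in this thread would add it to the vocabulary as well. The sketch below is illustrative only; the registered names are approximate, and the `vocabulary` entry is the hypothetical fourth spot, not an API that existed at the time of this discussion.)

```python
# Roughly the same transformer model name repeated across a training config
# (shown as a Python dict; AllenNLP normally reads this from a jsonnet file).
model_name = "bert-base-uncased"

config = {
    "dataset_reader": {
        "tokenizer": {"type": "pretrained_transformer", "model_name": model_name},     # 1
        "token_indexers": {
            "bert": {"type": "pretrained_transformer", "model_name": model_name},      # 2
        },
    },
    "model": {
        "text_field_embedder": {
            "token_embedders": {
                "bert": {"type": "pretrained_transformer", "model_name": model_name},  # 3
            },
        },
    },
    # Hypothetical fourth place discussed here: telling the Vocabulary which
    # transformer vocab to pull in, instead of counting it from the data.
    "vocabulary": {"type": "from_pretrained_transformer", "model_name": model_name},   # 4
}
```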
Maybe the reader gets a function, […]
The reader can't do that in general; if you don't have a pre-built vocabulary, vocabulary construction happens after the reader has finished its job. Or are you suggesting to change the whole pipeline such that instead of calling […]?

I suppose one option is to just have the default […]. Well, that breaks any kind of labeling task. So does having a separate transformer-specified vocab object: where does the label vocabulary come from, if you're not iterating over your instances? One possibility: add a parameter to […].

Looking back at what you were complaining about, we have always read through the data twice before starting training (though the second time is technically during training). The only way around that is to precompute your vocab and set it […].

Sorry this is a bit rambling; I'm a bit too tired at the moment to clean it up into a more coherent story. Hopefully the points still make sense.
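(One way to read the "add a parameter to […]" suggestion, sketched under assumptions: keep building the label namespaces from the instances as usual, then copy the transformer's wordpiece vocab into its own non-padded namespace. The helper below is hypothetical; `AutoTokenizer.get_vocab()` and `Vocabulary.add_token_to_namespace` are the existing pieces it combines.)

```python
from allennlp.data import Vocabulary
from transformers import AutoTokenizer


def vocab_from_instances_and_transformer(instances, model_name: str,
                                         namespace: str = "bert") -> Vocabulary:
    """Hypothetical helper: count label/token namespaces from the data as usual,
    then mirror the pretrained wordpiece vocab into `namespace` so the model can
    ask for its size and the ids match the pretrained weights."""
    vocab = Vocabulary.from_instances(
        instances, non_padded_namespaces=["*tags", "*labels", namespace]
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # get_vocab() returns {wordpiece: id}; adding in id order keeps AllenNLP's
    # indices aligned with the transformer's own ids.
    for wordpiece, _ in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
        vocab.add_token_to_namespace(wordpiece, namespace=namespace)
    return vocab
```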
I continue to have thoughts about this, but I don't think a GitHub issue is the right place for them right now.
The way we do vocab stuff with BERT / other pretrained wordpiece tokenizers is a little crazy and leads to subtle bugs. We need to get the BERT wordpiece vocab into our `Vocabulary` object so that we can pass it to the model and use it there if necessary. But if you save a vocab object with this vocabulary, it will give this error on load: […]

Somehow our existing BERT models are not actually saving the BERT vocabulary, even though it appears in the vocab object after indexing data. If I'm understanding things right, this is because we only count vocab items on our first pass through the data, and don't actually index anything when constructing the vocab. It's only after we've saved the vocab that we index stuff:
allennlp/allennlp/training/trainer_pieces.py, lines 67 to 77 (at 417a757)
This means that we are basically only avoiding the above bug because of an accident of timing in when we save the vocab. We should probably figure out a better solution for this.
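(A rough sketch of the order of operations described above; the referenced `trainer_pieces.py` lines do roughly this, but the code below is a simplified illustration, not the actual implementation. `dataset_reader`, `iterator`, `train_path`, and `serialization_dir` are assumed to be set up elsewhere.)

```python
import os

from allennlp.data import Vocabulary

# Pass 1: read the data and build the vocab by *counting* tokens; the token
# indexers are not run here, so wordpiece namespaces that only get filled in
# at indexing time are not present yet.
instances = list(dataset_reader.read(train_path))
vocab = Vocabulary.from_instances(instances)
vocab.save_to_files(os.path.join(serialization_dir, "vocabulary"))

# Pass 2 (effectively during training): indexing can mutate the in-memory
# vocab, but the files saved above are never updated, so anything added here
# is lost on reload.
iterator.index_with(vocab)
for batch in iterator(instances, num_epochs=1):
    pass
```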