How does the model handle the OOV problem? #11

Open
VieZhong opened this issue Nov 13, 2018 · 3 comments

VieZhong commented Nov 13, 2018

OOV means out-of-vocabulary words.

I can't find any code that handles this problem; maybe I missed some important step?

Looking forward to your advice or answers.


akanimax commented Mar 8, 2019

@VieZhong,

I am also looking into the same issue. Were you able to find a solution?
Let me state my problem a bit more formally:

Let's say I have a vocabulary of ["hello", "I", "am", "akanimax"], my source sentence is <"akanimax", "is", "a", "good", "boy">, and my target sentence is <"akanimax", "not", "a", "good", "boy">.
Then, while decoding "not" in the target, I have the following two questions (a sketch follows the list):

1.) When the input to the Encoder is "a", "is", "good", or "boy", what is actually sent to the encoder RNN? Is it the same embedding representing the <copy> token, or are they different randomly initialized embeddings?

2.) When "not" needs to be output, we have no option other than calling it UNK, because it is neither in χ (the set of source-sentence words) nor in V (the generation vocabulary). Is this correct?
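
To make the setup concrete, here is how I picture the per-example indexing (plain Python; all names are hypothetical, not this repository's API):

```python
# Sketch: per-example extended vocabulary (hypothetical names).
gen_vocab = {"<unk>": 0, "hello": 1, "I": 2, "am": 3, "akanimax": 4}

source = ["akanimax", "is", "a", "good", "boy"]
target = ["akanimax", "not", "a", "good", "boy"]

# Source-only words get temporary ids past the generation vocab,
# which is what makes them copyable.
oov_to_ext_id = {}
for tok in source:
    if tok not in gen_vocab and tok not in oov_to_ext_id:
        oov_to_ext_id[tok] = len(gen_vocab) + len(oov_to_ext_id)
print(oov_to_ext_id)  # {'is': 5, 'a': 6, 'good': 7, 'boy': 8}

# "not" is in neither gen_vocab nor the source, so it can only be
# labeled <unk> -- exactly my question 2.
target_ids = [gen_vocab.get(t, oov_to_ext_id.get(t, gen_vocab["<unk>"]))
              for t in target]
print(target_ids)  # [4, 0, 6, 7, 8]
```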

I would be highly grateful if you could help.

Best regards,
@akanimax

VieZhong commented Mar 8, 2019

Hi @akanimax,
I can't solve the OOV problem either.
My answers to your two questions:

  1. All words the model doesn't recognize are mapped to the same embedding token; see the sketch below.
  2. Yes, it is correct.
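
A minimal sketch of point 1, assuming the usual convention that every unknown word shares a single <unk> embedding (hypothetical code, not necessarily what this repository does):

```python
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "hello": 1, "I": 2, "am": 3, "akanimax": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

source = ["akanimax", "is", "a", "good", "boy"]
# "is", "a", "good", "boy" all map to the SAME <unk> id ...
ids = torch.tensor([vocab.get(tok, vocab["<unk>"]) for tok in source])
print(ids)  # tensor([4, 0, 0, 0, 0])

# ... so the encoder RNN receives four identical <unk> vectors.
vectors = embedding(ids)
```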

I hope this helps. My English is not very good, sorry.

nlp4whp commented Jun 6, 2019


Hi @akanimax, @VieZhong,

I think the OOV problem is handled by CopyNet's copy mechanism here.
You see, the generation vocabulary (gen_vocab_size) can stay small,
while another, larger vocabulary that includes the source-side "OOV" tokens is built for copying, and it changes with every input.

Although in a real situation we are probably unable to collect all tokens, so a target word that appears in neither vocabulary can still only be emitted as UNK.
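
A minimal sketch of that idea, mixing a small "generate" distribution with a "copy" distribution over the source, pointer-generator style (assumed shapes and names; CopyNet proper normalizes generate and copy scores in one joint softmax, and this is not the repository's actual code):

```python
import torch

gen_vocab_size = 5                           # small generation vocabulary
src_ext_ids = torch.tensor([4, 5, 6, 7, 8])  # source tokens under the extended vocab
ext_vocab_size = 9                           # gen vocab + 4 source-only OOV tokens

p_gen = torch.softmax(torch.randn(gen_vocab_size), dim=-1)     # generate distribution
p_copy = torch.softmax(torch.randn(len(src_ext_ids)), dim=-1)  # attention over source
mix = 0.5  # generate-vs-copy weight (normally predicted by the model)

# The final distribution lives over the *extended* vocabulary, so source
# OOVs like "is"/"a"/"good"/"boy" can be predicted by copying; a word in
# neither vocabulary (like "not" above) still has no slot except <unk>.
p_final = torch.zeros(ext_vocab_size)
p_final[:gen_vocab_size] += mix * p_gen
p_final.scatter_add_(0, src_ext_ids, (1 - mix) * p_copy)
print(p_final.sum())  # tensor(1.) -- a proper probability distribution
```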
