
how to pad inputs for RNN bucketing #2861

Closed
freddycct opened this issue Jul 28, 2016 · 24 comments

@freddycct
Contributor

[figure: "lstm" – encoder-decoder RNN diagram]

Hi, if I have the RNN shown above, how do I pad the inputs, or should I just skip padding them? Only the softmax output allows an ignore label, but there is nothing for the inputs.

@freddycct
Contributor Author

@pluskid can you help, please? The bucketing tutorial doesn't say much about padding.

@antinucleon
Contributor

I am working on an example of this.

@antinucleon
Contributor

I think we can use ignore_label for padding.

@freddycct
Contributor Author

freddycct commented Aug 3, 2016

@antinucleon
I was going to look into modifying embedding-inl.h. Let me know if you are doing something similar along these lines.

Well, here's the issue: currently the mx.sym.Embedding layer does not account for PAD symbols. When an out-of-range index is given to mx.sym.Embedding, it returns an uninitialized vector. What is needed is for the Embedding layer to return an all-zero vector, and for backpropagation to skip training the LSTMs when the embedding layer sees a PAD.

We can use ignore_label for the SoftmaxOutputs, but there is currently nothing for the inputs...

The current bucketing tutorial from @pluskid uses zeros, but the embedding layer treats that zero like any other symbol and will backpropagate the gradient into the embedding vector that represents the zero/PAD.

@antinucleon
Contributor

@freddycct

Can we use label 0 for padding, and ignore all label-0 entries?

@freddycct
Contributor Author

@antinucleon
the mx.sym.Embedding function needs to have:

- ignore_label (float, optional, default=-1) – the label value that will be ignored during backward (only works if use_ignore is set to true).
- use_ignore (boolean, optional, default=False) – if set to true, the ignore_label value will not contribute to the backward gradient.

This has implications for implementing a sequence-to-sequence RNN.

@antinucleon
Contributor

Let me see whether we can initialize a special embedding weight matrix in numpy so that index 0 maps to an all-zero vector. Let me double-check whether backprop will change it.
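
A minimal sketch of that idea (the sizes, variable names, and the choice of index 0 as PAD are assumptions for illustration, not from this thread):

```python
import mxnet as mx
import numpy as np

# Assumed sizes and PAD index, for illustration only.
vocab_size, embed_dim = 10000, 256
PAD = 0

# Build the embedding weight in numpy and zero out the PAD row.
embed_weight = np.random.uniform(-0.05, 0.05,
                                 (vocab_size, embed_dim)).astype(np.float32)
embed_weight[PAD, :] = 0.0

data = mx.sym.Variable('data')
weight = mx.sym.Variable('embed_weight')
embed = mx.sym.Embedding(data=data, weight=weight,
                         input_dim=vocab_size, output_dim=embed_dim,
                         name='embed')

# Pass the pre-initialized matrix in when binding, e.g.
#   arg_params = {'embed_weight': mx.nd.array(embed_weight)}
# Caveat (the point being double-checked above): unless the gradient for the
# PAD row is also suppressed, the optimizer will move that row away from zero.
```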

@freddycct
Contributor Author

@antinucleon
If you can, make the PAD symbol configurable instead of always being 0.

@antinucleon
Contributor

@freddycct
The ultimate solution is on-the-fly graph build/execution, which may appear after we switch to NNVM.

However, for now I suspect that with a large amount of training data, padding with a random embedding won't matter much, especially once we have attention.

@freddycct
Contributor Author

@antinucleon By on-the-fly graph build/execution, do you mean an imperative approach similar to Torch?

@pluskid
Contributor

pluskid commented Aug 4, 2016

I think using ignore_label on the softmax loss should be good. The input can be padded with zero or anything else. It should be fine because when the loss hits ignore_label, the gradients are zero; therefore, by the chain rule, that particular example (frame) does not contribute to the gradient accumulation for any of the model parameters.

However, there are some caveats. For example, you need to remember to ignore the padded labels when computing metrics (accuracy could be inflated if all the zeros are counted). Also, if you use BatchNorm, the statistics counters will still accumulate those zeros, etc.
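
For reference, a minimal sketch of that setup (the symbol names and the choice of 0 as the PAD label are assumptions):

```python
import mxnet as mx

# Assumed symbols; label value 0 is reserved for PAD here.
pred = mx.sym.Variable('pred')    # e.g. the decoder's per-step vocabulary scores
label = mx.sym.Variable('label')

loss = mx.sym.SoftmaxOutput(data=pred, label=label,
                            use_ignore=True, ignore_label=0,
                            name='softmax')
# Frames whose label equals ignore_label get a zero gradient, so by the chain
# rule they add nothing to the parameter updates. The caveats above (metrics,
# BatchNorm statistics) still have to be handled separately.
```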

@pluskid
Contributor

pluskid commented Aug 4, 2016

@freddycct My understanding is that the chain rule makes back-propagation a series of multiplications. If one of the components (the loss) sets its multiplier to zero, the whole chain is zero. So the embedding layer will not get a garbage gradient when the corresponding label is ignore_label. Do you agree?

@freddycct
Contributor Author

@pluskid Let me give an example of what I tried...

Let's say I have a sequence (X1, X2) that maps to (Y1, Y2), and they fall into a bucket with encoder length 3 and decoder length 4, as in the figure I showed. Note: my RNN is made up of an encoder RNN and a decoder RNN, not just a single RNN.

The training input is then
(0, X1, X2, EOS, Y1, Y2, 0)
and the training label is
(Y1, Y2, EOS, 0)
Note: I did not connect a SoftmaxOutput loss to the encoder portion of the RNN.

Then, after training on a very small example until the training loss is almost zero, the prediction for (0, X1, X2, EOS) gives (Y1, Y2), but the prediction for (X1, X2, EOS) gives some other result instead of (Y1, Y2).

That means the presence or absence of the PAD symbol affects the forward computation. It should not be this way: (0, X1, X2, EOS) should give the same result as (X1, X2, EOS).

The issue is that MXNet only recognizes PAD labels at the SoftmaxOutput loss layer, not PAD inputs at the Embedding layer.
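
For concreteness, here is that layout in numpy with made-up token ids (PAD = 0, EOS = 1, and the X/Y ids are arbitrary; none of this comes from actual code):

```python
import numpy as np

# Hypothetical token ids.
PAD, EOS, X1, X2, Y1, Y2 = 0, 1, 2, 3, 4, 5

# One example in a bucket with encoder length 3 and decoder length 4.
data  = np.array([[PAD, X1, X2, EOS, Y1, Y2, PAD]])  # fed to the Embedding layer
label = np.array([[Y1,  Y2, EOS, PAD]])              # fed to the decoder's SoftmaxOutput

# ignore_label takes care of the trailing PAD in `label`, but nothing tells the
# Embedding layer that the leading PAD in `data` is not a real token, which is
# why (PAD, X1, X2, EOS) and (X1, X2, EOS) forward to different states.
```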

@pluskid
Contributor

pluskid commented Aug 5, 2016

@freddycct Thanks for the explanation! I see the problem now: the embedding layer produces some arbitrary non-zero output for the PAD symbol, so the forwarded state of the RNN differs from an initial zero state.

@freddycct
Contributor Author

@pluskid Thank you for noticing! I hope we solve this soon!

@pluskid
Contributor

pluskid commented Aug 5, 2016

Yes, it is a little trickier than I first thought. Not only the bottom embedding layer but every upper layer that produces forward states needs to know when a frame is a padding frame, so that the forwarded state can be zeroed. I think one solution is to pass a sequence of 0-1 masks (0 for a pad frame, 1 for a normal frame) and multiply all the forward states (with broadcasting) by those masks.
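
A one-step sketch of that masking idea (the shapes and names are assumptions): a per-frame mask of shape (batch, 1) is broadcast-multiplied onto the (batch, num_hidden) state.

```python
import mxnet as mx

h = mx.sym.Variable('hidden')     # forward state at step t, shape (batch, num_hidden)
mask = mx.sym.Variable('mask_t')  # 1.0 for a real frame, 0.0 for a pad frame, shape (batch, 1)

# Broadcasting zeroes out the state of padded frames at this step.
h_masked = mx.sym.broadcast_mul(h, mask)
```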

@antinucleon
Contributor

Mask is a good idea. But I think a specially initialized embedding matrix is also enough for encoder begin-padding and decoder end-padding.

@freddycct
Contributor Author

@pluskid @antinucleon Could this be built into the mx.sym.RNN from @sbodenstein (#2795)?

@freddycct
Contributor Author

@pluskid @antinucleon Is there anything I can do to help? I could try writing a masking layer in Python using mx.operator.CustomOp, if that's possible. What do you think?
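
A rough sketch of what such a CustomOp mask layer might look like (the class names, the (batch, 1) mask shape, and the choice to give the mask itself a zero gradient are all assumptions, not anything that was actually merged):

```python
import mxnet as mx

class MaskOp(mx.operator.CustomOp):
    """Zero out forward states and gradients wherever the mask is 0."""
    def forward(self, is_train, req, in_data, out_data, aux):
        data, mask = in_data[0], in_data[1]          # mask shape: (batch, 1)
        self.assign(out_data[0], req[0], mx.nd.broadcast_mul(data, mask))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        mask = in_data[1]
        # Pass the gradient through only for unmasked frames.
        self.assign(in_grad[0], req[0], mx.nd.broadcast_mul(out_grad[0], mask))
        # The mask itself is not learned.
        self.assign(in_grad[1], req[1],
                    mx.nd.zeros(mask.shape, ctx=mask.context))

@mx.operator.register("mask")
class MaskProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MaskProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data', 'mask']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape, mask_shape = in_shape
        return [data_shape, mask_shape], [data_shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return MaskOp()

# Usage: out = mx.sym.Custom(data=hidden, mask=mask, op_type='mask')
```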

@antinucleon
Contributor

@freddycct Thank you! I am also looking into other frameworks to see how they handle padding masks.

@antinucleon
Contributor

@freddycct @pluskid
Here is what I am doing:

I am making a masked LSTM symbol. Each input needs both a data array and a mask. The mask sets the output to 0 wherever the frame is marked as padding. In general it looks like this:

[figure: sketch of the masked LSTM (original image titled "untitled presentation")]

I am writing a C++ mask op. It will either pass the output and gradient through unchanged or set them to 0, depending on the mask.
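
In Python terms, the idea would look roughly like the unroll below; `lstm_cell_fn`, the variable names, and the (batch, 1) mask slices are assumptions, and the real implementation is the C++ op mentioned above:

```python
import mxnet as mx

def masked_lstm_unroll(seq_len, lstm_cell_fn):
    """Unroll an LSTM where each step's state is zeroed on padded frames."""
    data = mx.sym.Variable('data')   # (batch, seq_len) token ids
    mask = mx.sym.Variable('mask')   # (batch, seq_len) 0/1 floats
    data_t = mx.sym.SliceChannel(data, num_outputs=seq_len, axis=1)
    mask_t = mx.sym.SliceChannel(mask, num_outputs=seq_len, axis=1)

    h = mx.sym.Variable('init_h')
    c = mx.sym.Variable('init_c')
    outputs = []
    for t in range(seq_len):
        h, c = lstm_cell_fn(data_t[t], h, c)    # assumed helper: embedding + LSTM gates
        h = mx.sym.broadcast_mul(h, mask_t[t])  # zero the output at padded steps
        c = mx.sym.broadcast_mul(c, mask_t[t])
        outputs.append(h)
    return outputs
```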

@freddycct
Contributor Author

@antinucleon Thanks! I think what you have is great! Do you think the same mask layer can be used at the outputs? That way it would also help model RNN regression with variable output sequence lengths.

@antinucleon
Contributor

Yes, mask is an operator. I will send a PR this afternoon and cc you and kid.

@freddycct
Contributor Author

@antinucleon Awesome, thanks!
