
how to pad inputs for RNN bucketing #2861

Closed
freddycct opened this issue Jul 28, 2016 · 24 comments

@freddycct
Contributor

[figure: "lstm" – encoder-decoder RNN diagram]

Hi, if I have the RNN shown above, how do I pad the inputs, or should I just skip padding them? Only the softmax output allows an ignore label, but there is nothing for the inputs.

@freddycct
Contributor Author

@pluskid can you help, please? The bucketing tutorial doesn't say much about padding.

@antinucleon
Contributor

I am working on an example of this.

@antinucleon
Contributor

I think we can use ignore_label for padding.

@freddycct
Contributor Author

freddycct commented Aug 3, 2016

@antinucleon
I was going to look into modifying embedding-inl.h. Let me know if you are doing something similar along these lines.

Well, here's the issue: currently the mx.sym.Embedding layer does not account for PAD symbols. When an out-of-range index is given to mx.sym.Embedding, it returns an uninitialized vector. What is needed is for the Embedding layer to return an all-zero vector, and for backpropagation to skip training the LSTMs when the embedding layer sees a PAD.

We can use ignore_label for the SoftmaxOutputs, but there is currently nothing for the inputs...

The current bucketing tutorial from @pluskid uses zeros, but the embedding layer treats that zero like any other symbol and will backpropagate the gradient into the embedding vector that represents the zero/PAD.

@antinucleon
Contributor

@freddycct

Can we use label 0 for padding, and ignore all label-0 entries?

@freddycct
Contributor Author

@antinucleon
the mx.sym.Embedding function needs to have:

- ignore_label (float, optional, default=-1) – the label value that will be ignored during backward (only works if use_ignore is set to true).
- use_ignore (boolean, optional, default=False) – if set to true, the ignore_label value will not contribute to the backward gradient.

This has implications for implementing a sequence-to-sequence RNN.

@antinucleon
Contributor

Let me see whether we can initialize a special embedding weight matrix in numpy so that index 0 maps to an all-zero vector. Let me double-check whether backprop will change it.
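
A minimal sketch of that idea (the sizes, variable names, and the choice of index 0 as PAD are assumptions for illustration, not from this thread):

```python
import mxnet as mx
import numpy as np

# Assumed sizes and PAD index, for illustration only.
vocab_size, embed_dim = 10000, 256
PAD = 0

# Build the embedding weight in numpy and zero out the PAD row.
embed_weight = np.random.uniform(-0.05, 0.05,
                                 (vocab_size, embed_dim)).astype(np.float32)
embed_weight[PAD, :] = 0.0

data = mx.sym.Variable('data')
weight = mx.sym.Variable('embed_weight')
embed = mx.sym.Embedding(data=data, weight=weight,
                         input_dim=vocab_size, output_dim=embed_dim,
                         name='embed')

# Pass the pre-initialized matrix in when binding, e.g.
#   arg_params = {'embed_weight': mx.nd.array(embed_weight)}
# Caveat (the point being double-checked above): unless the gradient for the
# PAD row is also suppressed, the optimizer will move that row away from zero.
```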

@freddycct
Contributor Author

@antinucleon
If you can, make the PAD symbol configurable instead of always being 0.

@antinucleon
Contributor

@freddycct
The ultimate solution is on-the-fly graph build/execution, which may appear after we switch to NNVM.

However, for now I suspect that with a large amount of training data, padding with a random embedding won't matter much, especially once we have attention.

@freddycct
Contributor Author

@antinucleon By on-the-fly graph build/execution, do you mean an imperative approach similar to Torch?

@pluskid
Contributor

pluskid commented Aug 4, 2016

I think using ignore_label on the softmax loss should be good. The input can be padded with zero or anything else. It should be fine because when the loss hits ignore_label, the gradients are zero; therefore, by the chain rule, that particular example (frame) does not contribute to the gradient accumulation for any of the model parameters.

However, there are some caveats. For example, you need to remember to ignore the padded labels when computing metrics (accuracy could be inflated if all the zeros are counted). Also, if you use BatchNorm, the statistics counters will still accumulate those zeros, etc.
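
For reference, a minimal sketch of that setup (the symbol names and the choice of 0 as the PAD label are assumptions):

```python
import mxnet as mx

# Assumed symbols; label value 0 is reserved for PAD here.
pred = mx.sym.Variable('pred')    # e.g. the decoder's per-step vocabulary scores
label = mx.sym.Variable('label')

loss = mx.sym.SoftmaxOutput(data=pred, label=label,
                            use_ignore=True, ignore_label=0,
                            name='softmax')
# Frames whose label equals ignore_label get a zero gradient, so by the chain
# rule they add nothing to the parameter updates. The caveats above (metrics,
# BatchNorm statistics) still have to be handled separately.
```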

@pluskid
Contributor

pluskid commented Aug 4, 2016

@freddycct My understanding is that the chain rule makes back-propagation a series of multiplications. If one of the components (the loss) sets its multiplier to zero, the whole chain is zero. So the embedding layer will not get a garbage gradient when the corresponding label is ignore_label. Do you agree?

@freddycct
Contributor Author

@pluskid Let me give an example of what I tried...

Let's say I have a sequence (X1, X2) that maps to (Y1, Y2), and they fall into a bucket with encoder length 3 and decoder length 4, as in the figure I showed. Note: my RNN is made up of an encoder RNN and a decoder RNN, not just a single RNN.

The training input is then
(0, X1, X2, EOS, Y1, Y2, 0)
and the training label is
(Y1, Y2, EOS, 0)
Note: I did not connect a SoftmaxOutput loss to the encoder portion of the RNN.

Then, after training on a very small example until the training loss is almost zero, the prediction for (0, X1, X2, EOS) gives (Y1, Y2), but the prediction for (X1, X2, EOS) gives some other result instead of (Y1, Y2).

That means the presence or absence of the PAD symbol affects the forward computation. It should not be this way: (0, X1, X2, EOS) should give the same result as (X1, X2, EOS).

The issue is that MXNet only recognizes PAD labels at the SoftmaxOutput loss layer, not PAD inputs at the Embedding layer.
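
For concreteness, here is that layout in numpy with made-up token ids (PAD = 0, EOS = 1, and the X/Y ids are arbitrary; none of this comes from actual code):

```python
import numpy as np

# Hypothetical token ids.
PAD, EOS, X1, X2, Y1, Y2 = 0, 1, 2, 3, 4, 5

# One example in a bucket with encoder length 3 and decoder length 4.
data  = np.array([[PAD, X1, X2, EOS, Y1, Y2, PAD]])  # fed to the Embedding layer
label = np.array([[Y1,  Y2, EOS, PAD]])              # fed to the decoder's SoftmaxOutput

# ignore_label takes care of the trailing PAD in `label`, but nothing tells the
# Embedding layer that the leading PAD in `data` is not a real token, which is
# why (PAD, X1, X2, EOS) and (X1, X2, EOS) forward to different states.
```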

@pluskid
Contributor

pluskid commented Aug 5, 2016

@freddycct Thanks for the explanation! I see the problem now: the embedding layer produces some arbitrary non-zero output for the PAD symbol, so the forwarded state of the RNN differs from an initial zero state.

@freddycct
Contributor Author

@pluskid Thank you for noticing! I hope we solve this soon!

@pluskid
Contributor

pluskid commented Aug 5, 2016

Yes, it is a little trickier than I first thought. Not only the bottom embedding layer but every upper layer that produces forward states needs to know when a frame is a padding frame, so that the forwarded state can be zeroed. I think one solution is to pass a sequence of 0-1 masks (0 for a pad frame, 1 for a normal frame) and multiply all the forward states (with broadcasting) by those masks.
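
A one-step sketch of that masking idea (the shapes and names are assumptions): a per-frame mask of shape (batch, 1) is broadcast-multiplied onto the (batch, num_hidden) state.

```python
import mxnet as mx

h = mx.sym.Variable('hidden')     # forward state at step t, shape (batch, num_hidden)
mask = mx.sym.Variable('mask_t')  # 1.0 for a real frame, 0.0 for a pad frame, shape (batch, 1)

# Broadcasting zeroes out the state of padded frames at this step.
h_masked = mx.sym.broadcast_mul(h, mask)
```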

@antinucleon
Contributor

Mask is a good idea. But I think a specially initialized embedding matrix is also enough for encoder begin-padding and decoder end-padding.

@freddycct
Contributor Author

@pluskid @antinucleon Could this be built into the mx.sym.RNN from @sbodenstein (#2795)?

@freddycct
Contributor Author

@pluskid @antinucleon Is there anything I can do to help? I could try writing a masking layer in Python using mx.operator.CustomOp, if that's possible. What do you think?
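
A rough sketch of what such a CustomOp mask layer might look like (the class names, the (batch, 1) mask shape, and the choice to give the mask itself a zero gradient are all assumptions, not anything that was actually merged):

```python
import mxnet as mx

class MaskOp(mx.operator.CustomOp):
    """Zero out forward states and gradients wherever the mask is 0."""
    def forward(self, is_train, req, in_data, out_data, aux):
        data, mask = in_data[0], in_data[1]          # mask shape: (batch, 1)
        self.assign(out_data[0], req[0], mx.nd.broadcast_mul(data, mask))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        mask = in_data[1]
        # Pass the gradient through only for unmasked frames.
        self.assign(in_grad[0], req[0], mx.nd.broadcast_mul(out_grad[0], mask))
        # The mask itself is not learned.
        self.assign(in_grad[1], req[1],
                    mx.nd.zeros(mask.shape, ctx=mask.context))

@mx.operator.register("mask")
class MaskProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MaskProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data', 'mask']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape, mask_shape = in_shape
        return [data_shape, mask_shape], [data_shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return MaskOp()

# Usage: out = mx.sym.Custom(data=hidden, mask=mask, op_type='mask')
```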

@antinucleon
Contributor

@freddycct Thank you! I am also looking into other frameworks to see how they handle padding masks.

@antinucleon
Contributor

@freddycct @pluskid
Here is what I am doing:

I am making a masked LSTM symbol. Each input needs both a data array and a mask. The mask sets the output to 0 wherever the frame is marked as padding. In general it looks like this:

[figure: sketch of the masked LSTM (original image titled "untitled presentation")]

I am writing a C++ mask op. It will either pass the output and gradient through unchanged or set them to 0, depending on the mask.
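
In Python terms, the idea would look roughly like the unroll below; `lstm_cell_fn`, the variable names, and the (batch, 1) mask slices are assumptions, and the real implementation is the C++ op mentioned above:

```python
import mxnet as mx

def masked_lstm_unroll(seq_len, lstm_cell_fn):
    """Unroll an LSTM where each step's state is zeroed on padded frames."""
    data = mx.sym.Variable('data')   # (batch, seq_len) token ids
    mask = mx.sym.Variable('mask')   # (batch, seq_len) 0/1 floats
    data_t = mx.sym.SliceChannel(data, num_outputs=seq_len, axis=1)
    mask_t = mx.sym.SliceChannel(mask, num_outputs=seq_len, axis=1)

    h = mx.sym.Variable('init_h')
    c = mx.sym.Variable('init_c')
    outputs = []
    for t in range(seq_len):
        h, c = lstm_cell_fn(data_t[t], h, c)    # assumed helper: embedding + LSTM gates
        h = mx.sym.broadcast_mul(h, mask_t[t])  # zero the output at padded steps
        c = mx.sym.broadcast_mul(c, mask_t[t])
        outputs.append(h)
    return outputs
```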

@freddycct
Contributor Author

@antinucleon Thanks! I think what you have is great! Do you think the same mask layer can be used at the outputs? That way it would also help model RNN regression with variable output sequence lengths.

@antinucleon
Contributor

Yes, mask is an operator. I will send a PR this afternoon and cc you and kid.

@freddycct
Contributor Author

@antinucleon Awesome, thanks!
