how to pad inputs for RNN bucketing #2861
@pluskid can you help, please? The bucketing tutorial didn't mention much about padding.
I am working on an example of this.
I think we can use
@antinucleon Well, here's the issue: currently the mx.sym.Embedding layer does not account for PAD symbols. When an out-of-range index is given to the mx.sym.Embedding layer, it returns an uninitialized vector. What is needed is for the Embedding layer to return an all-zero vector, and for back-propagation to skip training the LSTMs when the embedding layer sees a PAD. We can use ignore_label for the softmax outputs, but there is currently nothing for the inputs. The current bucketing tutorial from @pluskid uses zeros, but the embedding layer treats that zero like any other symbol and will back-propagate the gradient into the embedding vector representing the zero, i.e. the PAD.
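To make the point concrete, here is a minimal sketch (my own illustration, not code from the thread; the vocabulary size, dimensions, and ids are made up) showing that mx.sym.Embedding treats a PAD id of 0 like any other symbol and returns whatever happens to be in row 0 of the weight matrix:

```python
import mxnet as mx
import numpy as np

vocab_size, embed_dim = 100, 8
data = mx.sym.Variable('data')
embed = mx.sym.Embedding(data=data, input_dim=vocab_size,
                         output_dim=embed_dim, name='embed')

exe = embed.simple_bind(ctx=mx.cpu(), data=(1, 4))
exe.arg_dict['embed_weight'][:] = mx.nd.array(
    np.random.uniform(-1, 1, (vocab_size, embed_dim)))
# a sequence like (PAD, X1, X2, EOS) encoded as ids, with PAD = 0
exe.arg_dict['data'][:] = mx.nd.array([[0, 5, 7, 2]])
exe.forward()
print(exe.outputs[0].asnumpy()[0, 0])  # the PAD row is arbitrary, not all zeros
```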
Can we use label
@antinucleon
This has implications for coding a sequence-to-sequence RNN.
Let me see whether we can initialize a special embedding weight matrix in numpy so that index 0 maps to an all-zero vector. Let me also double-check whether that row gets changed during back-propagation.
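A rough sketch of that idea (assuming PAD uses index 0; this only illustrates the comment above and does not yet answer the back-prop question):

```python
import numpy as np
import mxnet as mx

vocab_size, embed_dim = 100, 8
weight = np.random.uniform(-0.1, 0.1, (vocab_size, embed_dim))
weight[0, :] = 0.0  # row 0 (the PAD index) starts as all zeros

embed_weight_init = mx.nd.array(weight)
# pass embed_weight_init in as the initial value of 'embed_weight' when binding;
# the open question above is whether back-prop will later overwrite this row.
```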
@antinucleon
@freddycct However, for now I suspect that if we have a large amount of training data, padding with a random embedding won't matter that much, especially once we have attention.
@antinucleon By on-the-fly graph build/execution, do you mean something like the imperative approach, similar to Torch?
I think using it should work. However, there are some caveats. For example, you need to remember to ignore the padded labels when computing metrics (accuracy could look higher if all the zeros are counted). Also, if you use BatchNorm, the statistics counters are still accumulating those zeros, etc.
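A sketch of the metric caveat, assuming the padded labels are 0 (the function name and signature here are mine, not an existing MXNet metric):

```python
import numpy as np

def masked_accuracy(labels, preds, pad_label=0):
    """labels: (batch, seq_len) int ids; preds: (batch, seq_len, vocab) scores."""
    pred_ids = preds.argmax(axis=-1)
    mask = labels != pad_label                      # drop padded positions
    return float((pred_ids[mask] == labels[mask]).sum()) / max(mask.sum(), 1)
```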
@freddycct My understanding is that the chain rule makes back-propagation a series of multiplications. If one of the components (the loss) sets its multiplier to zero, the whole chain will be zero. So the embedding layer will not get a garbage gradient if the corresponding label is
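A small check of that argument, as I understand it (a sketch, not from the thread): with use_ignore/ignore_label set on SoftmaxOutput, the gradient at ignored positions comes out as all zeros, so nothing upstream receives a gradient from them.

```python
import mxnet as mx
import numpy as np

data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
out = mx.sym.SoftmaxOutput(data=data, label=label,
                           use_ignore=True, ignore_label=0, name='sm')

exe = out.simple_bind(ctx=mx.cpu(), data=(2, 5), label=(2,))
exe.arg_dict['data'][:] = mx.nd.array(np.random.uniform(size=(2, 5)))
exe.arg_dict['label'][:] = mx.nd.array([0, 3])  # first sample carries the PAD label
exe.forward(is_train=True)
exe.backward()
print(exe.grad_dict['data'].asnumpy()[0])       # all zeros for the ignored row
```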
@pluskid Let me give an example of what I tried. Say I have a sequence (X1, X2) that maps to (Y1, Y2), and they fall into a bucket with encoder length 3 and decoder length 4, according to the figure I showed. Note: my RNN is made up of an encoder RNN and a decoder RNN, rather than a single RNN. The training input is then given as Then, after training on a very small example until the training loss is almost zero... That means the presence or absence of the PAD symbol affects the forward calculations. I think it should not be this way: (0, X1, X2, EOS) should give the same results as (X1, X2, EOS). The issue is that MXNet does not recognize PAD inputs at the Embedding layer, only PAD outputs at the softmax loss layer.
@freddycct Thanks for the explanation! I see the problem now: the embedding layer picks some arbitrary non-zero output for the PAD symbol, making the forwarded state of the RNN different from an initial zero state.
@pluskid Thank you for noticing! I hope we solve this soon!
Yes, it is a little trickier than I first thought. Not only the bottom embedding layer, but every upper layer that produces forward states needs to know when a frame is a padding frame, so that the forwarded state can be zeroed. I think one solution is to pass a sequence of 0-1 masks (0 being a pad frame and 1 being a normal frame), and let all the forward states be multiplied (with broadcasting) by those masks.
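A minimal sketch of that mask idea (variable names are mine): per time step the mask has shape (batch, 1), and broadcasting the multiply zeroes the forwarded state at padded frames.

```python
import mxnet as mx

hidden = mx.sym.Variable('hidden')   # (batch, num_hidden) forward state at time t
mask_t = mx.sym.Variable('mask_t')   # (batch, 1): 1.0 for a real frame, 0.0 for PAD
masked_hidden = mx.sym.broadcast_mul(hidden, mask_t)
```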
Mask is a good idea. But I think a specially initialized embedding matrix is
@pluskid @antinucleon Could this be built into the mx.sym.RNN from @sbodenstein in #2795?
@pluskid @antinucleon Is there anything I can do to help? I could try writing a masking layer in Python using mx.operator.CustomOp, if that is possible. What do you guys think?
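For what it's worth, a masking layer along those lines might look roughly like this sketch (the class names, the (batch, 1) mask shape, and the choice to zero the gradient at padded frames are all my assumptions, not an agreed design):

```python
import mxnet as mx

class MaskOp(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        data, mask = in_data[0], in_data[1]            # mask: (batch, 1) of 0/1
        self.assign(out_data[0], req[0], mx.nd.broadcast_mul(data, mask))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        mask = in_data[1]
        # zero the gradient at padded frames; the mask itself gets no gradient
        self.assign(in_grad[0], req[0], mx.nd.broadcast_mul(out_grad[0], mask))
        self.assign(in_grad[1], req[1], 0)

@mx.operator.register("mask")
class MaskProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MaskProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data', 'mask']

    def infer_shape(self, in_shape):
        data_shape = in_shape[0]
        return [data_shape, [data_shape[0], 1]], [data_shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return MaskOp()

# usage sketch: masked = mx.sym.Custom(data=hidden, mask=mask_t, op_type='mask')
```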
@freddycct Thank you! I am also looking into other frameworks to see how they deal with padding masks.
@freddycct @pluskid I am making a masked LSTM symbol. For each input, we need both the data and a mask. The mask will set the output to 0 if the frame is marked as padding. So in general it looks like: I am writing a C++ mask op. The mask op will either pass the output and gradient through unchanged or set them to 0.
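My guess at what a masked LSTM step could look like symbolically, following the standard lstm() cell from the MXNet RNN examples (the function name and the param dict keys are assumptions; the actual PR may differ):

```python
import mxnet as mx

def masked_lstm_step(in_data, mask_t, prev_c, prev_h, num_hidden, param, seqidx):
    i2h = mx.sym.FullyConnected(data=in_data, weight=param['i2h_weight'],
                                bias=param['i2h_bias'], num_hidden=num_hidden * 4,
                                name='t%d_i2h' % seqidx)
    h2h = mx.sym.FullyConnected(data=prev_h, weight=param['h2h_weight'],
                                bias=param['h2h_bias'], num_hidden=num_hidden * 4,
                                name='t%d_h2h' % seqidx)
    gates = i2h + h2h
    slices = mx.sym.SliceChannel(gates, num_outputs=4, name='t%d_slice' % seqidx)
    in_gate = mx.sym.Activation(slices[0], act_type='sigmoid')
    in_trans = mx.sym.Activation(slices[1], act_type='tanh')
    forget_gate = mx.sym.Activation(slices[2], act_type='sigmoid')
    out_gate = mx.sym.Activation(slices[3], act_type='sigmoid')
    next_c = (forget_gate * prev_c) + (in_gate * in_trans)
    next_h = out_gate * mx.sym.Activation(next_c, act_type='tanh')
    # zero the states at padded frames (mask_t: (batch, 1) of 0/1)
    next_c = mx.sym.broadcast_mul(next_c, mask_t)
    next_h = mx.sym.broadcast_mul(next_h, mask_t)
    return next_c, next_h
```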
@antinucleon Thanks! I think what you have is great! Do you think the same mask layer can be used at the outputs? That way it would also help with modeling RNN regression with variable output sequence lengths.
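If the mask were applied on the output side too, a masked regression loss might look something like this sketch (again, the variable names are assumptions):

```python
import mxnet as mx

pred = mx.sym.Variable('pred')       # (batch, output_dim) prediction at time t
target = mx.sym.Variable('target')   # (batch, output_dim) regression target
mask_t = mx.sym.Variable('mask_t')   # (batch, 1): 0 for padded frames
diff = mx.sym.broadcast_mul(pred - target, mask_t)   # padded frames contribute 0
loss = mx.sym.MakeLoss(mx.sym.sum(mx.sym.square(diff)) / mx.sym.sum(mask_t))
```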
Yes, mask is an operator. I will send a PR this afternoon and cc you and
@antinucleon Awesome, thanks!
Hi, if I have the RNN as shown above, how do I pad the inputs, or should I just not pad them at all? Only the softmax output allows an ignore label, but there is nothing for the inputs?