Implementation of DRAM model #180
Comments
@vyouman Yeah, I think 2 is best. Regarding "But I have to initialize the initial state of the second LSTM layer, how can I initialize it? I read through the LSTM code, and I think I may change the userPrevOutput and userPrevCell, right?": see #176. For your last question, I don't think you need to modify RecurrentAttention. You should be able to just replace the input layer of the Recurrent module. You might be new to rnn, but you seem to know what you are doing :)
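For example, something along these lines (a sketch only: the sizes and inputLayer are placeholders, not code from this thread):

```lua
-- nn.Recurrent(start, input, feedback, transfer, rho) accepts any module as
-- its `input` layer, so the glimpse or context network can be plugged in there
require 'nn'
require 'rnn'

local hiddenSize, rho = 256, 5
local inputLayer = nn.Linear(100, hiddenSize)   -- hypothetical stand-in for your own input module
local recurrent = nn.Recurrent(
   hiddenSize,                                  -- start: size of the hidden state
   inputLayer,                                  -- input layer: replace with your own module
   nn.Linear(hiddenSize, hiddenSize),           -- feedback layer applied to the previous output
   nn.ReLU(),                                   -- transfer function
   rho                                          -- maximum number of BPTT steps
)
```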
@nicholas-leonard
@vyouman Glad you are moving forward with this :)
@vyouman Any news on this?
Hi, I'm trying to implement the Deep Recurrent Attention Model described in the paper http://arxiv.org/pdf/1412.7755v2.pdf, but applied to image caption generation instead of image classification. I will probably be able to reuse most of the modules from the RAM model implemented in the rnn package. In this case I don't need to modify the Reinforce.lua interface or ReinforceNormal.lua, since they can already deal with a table of rewards at every time step per batch. All I need to do there is write a new Criterion, and I've written one. I also think I should modify the RecurrentAttention module.
There is a context network, a 3-layer convolutional network (or some other CNN model) presented in the paper, which extracts features from the low-resolution image and feeds them to the second recurrent layer as its initial state, so that the first location can be produced. I have come up with two approaches:
It seems that the second approach is more efficient and easier to implement than the first, since there are repeated images among the (image, caption) pairs: one image can have more than one caption.
So I want to wrap the second recurrent layer and the location network with a Recursor, to act as the locator expected by the RecurrentAttention module. Maybe I don't really need to modify the input at the first time step; the zero tensor will then go directly to the second recurrent layer.
https://github.com/Element-Research/rnn/blob/master/RecurrentAttention.lua#L44-L48
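Concretely, the locator I have in mind would look roughly like the sketch below (the sizes and the ReinforceNormal standard deviation are placeholders, and the location-network layers just follow the layout of the recurrent-visual-attention example):

```lua
require 'nn'
require 'rnn'
require 'dpnn'   -- ReinforceNormal comes from the dpnn package (bundled with rnn in some versions)

local hiddenSize, locatorStd = 256, 0.11       -- placeholder values

-- second (top) recurrent layer followed by the location network,
-- wrapped in a Recursor so it can be stepped once per time step
local locator = nn.Sequential()
   :add(nn.LSTM(hiddenSize, hiddenSize))       -- second LSTM layer of the DRAM
   :add(nn.Linear(hiddenSize, 2))              -- location network: mean of the (x, y) location
   :add(nn.HardTanh())                         -- keep the mean within [-1, 1]
   :add(nn.ReinforceNormal(2*locatorStd))      -- sample a stochastic location, trained with REINFORCE
   :add(nn.HardTanh())                         -- clamp the sampled location
locator = nn.Recursor(locator)
```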
But I have to set the initial state of that second LSTM layer; how can I initialize it? I read through the LSTM code, and I think I need to change userPrevOutput and userPrevCell, right?
https://github.com/Element-Research/rnn/blob/master/LSTM.lua#L142-L144
For example, after I get an LSTM instance lstm, I should use something like:
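(A sketch of the idea only: contextFeatures below stands in for the context network's output on the current batch, and the sizes are made up.)

```lua
require 'nn'
require 'rnn'

local batchSize, hiddenSize = 8, 256             -- placeholder sizes
local lstm = nn.LSTM(hiddenSize, hiddenSize)

-- hypothetical output of the context network for the current batch
local contextFeatures = torch.Tensor(batchSize, hiddenSize):uniform()

-- set the state used at the first time step, before calling lstm:forward()
lstm.userPrevOutput = contextFeatures
lstm.userPrevCell = torch.zeros(batchSize, hiddenSize)
```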
For the second question, I'll need to add another input to the first recurrent layer; in my case it is the word vector at every time step. Finally, instead of predicting the classes of multiple objects, I expect to predict the caption describing the image. There is also some logic needed to handle captions of variable length, so I'll probably have to encapsulate an LSTM layer for this; we can think of it as a language model without the final LogSoftmax layer, call it lm for now. Then I would wrap it, together with the glimpse network, in a Recursor to serve as the rnn, like https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L130-L131. Roughly, we would use something like:
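(Again just a sketch, not the actual code from the example: glimpse and lm below are placeholders for the real glimpse network and the caption LSTM described above.)

```lua
require 'nn'
require 'rnn'

local glimpseOutputSize, hiddenSize = 256, 256     -- placeholder sizes

local glimpse = nn.Identity()                      -- stand-in for the glimpse network taking {image, location}
local lm = nn.LSTM(glimpseOutputSize, hiddenSize)  -- stand-in for the caption language model

-- glimpse network followed by the recurrent caption model, wrapped in a
-- Recursor so it can be stepped through time as the rnn
local rnn = nn.Recursor(nn.Sequential():add(glimpse):add(lm))
```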
And at last wrap the rnn and the locator as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L145 (a rough sketch of this final wrapping follows below). So here comes the question: how can I pass an additional input like the word vector, which is not the direct input expected by the rnn of the RecurrentAttention module? The rnn here is composed of the glimpse network and a recurrent layer, where the glimpse network expects an input of {image, location}, and in this case the recurrent layer expects not only the g_t vector produced by the glimpse network but also the word vector. Should I modify the RecurrentAttention module to take more inputs? Or do I not need to modify the input of the RecurrentAttention module, and the word vector can come directly from the lm module I'm going to implement above? Do you think that is feasible, or do you have a more elegant way to implement it?
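(A sketch of that final wrapping, with throwaway placeholders for the rnn and locator built above; the point is only the constructor call.)

```lua
require 'nn'
require 'rnn'

local hiddenSize, rho = 256, 5                              -- rho: number of glimpses / time steps
local rnn = nn.Recursor(nn.LSTM(hiddenSize, hiddenSize))    -- placeholder for the glimpse + lm rnn above
local locator = nn.Recursor(nn.Linear(hiddenSize, 2))       -- placeholder for the locator above

-- RecurrentAttention(rnn, action, nStep, hiddenSize): hiddenSize is a table
-- giving the size of the rnn output at each step
local attention = nn.RecurrentAttention(rnn, locator, rho, {hiddenSize})
```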
I'm new to the rnn package and Torch7. I'd appreciate your suggestions. :p