
Implementation of DRAM model #180

Open

@vyouman

Description

Hi, I'm trying to implement the Deep Recurrent Attention Model (DRAM) described in http://arxiv.org/pdf/1412.7755v2.pdf, applied to image caption generation instead of image classification. I can probably reuse most of the modules from the RAM model implemented in the rnn package. In my case, I don't need to modify the Reinforce.lua interface or ReinforceNormal.lua, since they can already deal with a table of rewards at every time step per batch. All I need to do there is write a new Criterion, which I've done. I also think I will have to modify the RecurrentAttention module.
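For concreteness, the rnn package's SequencerCriterion can already apply a per-step criterion over a table of outputs; a minimal sketch of a caption criterion along those lines (just a sketch, assuming the model emits per-step LogSoftMax outputs and the targets are per-step word indices; the variable names are hypothetical):

require 'rnn'

-- sketch: apply NLL at every time step of the caption;
-- outputs is a table of (batchSize x vocabSize) log-probabilities,
-- targets is a table of word-index tensors, one entry per step
criterion = nn.SequencerCriterion(nn.ClassNLLCriterion())

loss = criterion:forward(outputs, targets)
gradOutputs = criterion:backward(outputs, targets)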

There is a context network, a 3-layer convolutional network (or some other CNN) presented in the paper, which extracts features from the low-resolution image and feeds them to the second recurrent layer as its initial state, from which the first location is produced. I have come up with two approaches:

  1. Assemble the context network, the second recurrent layer, and the location network into the locator expected by RecurrentAttention.
  2. Or use the context network to process the low-resolution image independently and feed the resulting feature to the second recurrent layer as its initial state at the first time step.

The second approach seems more efficient and easier to implement than the first, since the (image, caption) pairs contain repeated images: one image can have more than one caption. So I want to wrap the second recurrent layer and the location network in a Recursor to serve as the locator expected by the RecurrentAttention module. Maybe I don't really need to modify the input at the first time step; the zero tensor will then go directly to the second recurrent layer:
https://github.com/Element-Research/rnn/blob/master/RecurrentAttention.lua#L44-L48

But I have to initialize the hidden state of the second LSTM layer. How can I do that? I read through the LSTM code, and I think I can set userPrevOutput and userPrevCell, right?
https://github.com/Element-Research/rnn/blob/master/LSTM.lua#L142-L144

For example, after I get an LSTM instance lstm, I should use something like:

lstm.userPrevOutput = torch.Tensor(batchSize, outputSize):fill(1)
lstm.userPrevCell = torch.Tensor(batchSize, outputSize):fill(0.5)
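To make approach 2 concrete, here is a rough sketch, with a made-up 3-layer convnet standing in for the paper's context network (the layer sizes and the coarse-image size are arbitrary assumptions, not the paper's):

require 'rnn'

local batchSize, hiddenSize = 4, 256
local lowResImage = torch.rand(batchSize, 1, 8, 8) -- coarse input, size assumed

-- stand-in context network: 3 conv layers, flattened and projected
-- to the hidden size of the second (locator) recurrent layer
local contextNet = nn.Sequential()
contextNet:add(nn.SpatialConvolution(1, 16, 3, 3, 1, 1, 1, 1))
contextNet:add(nn.ReLU())
contextNet:add(nn.SpatialConvolution(16, 32, 3, 3, 1, 1, 1, 1))
contextNet:add(nn.ReLU())
contextNet:add(nn.SpatialConvolution(32, 32, 3, 3, 1, 1, 1, 1))
contextNet:add(nn.ReLU())
contextNet:add(nn.View(-1):setNumInputDims(3))
contextNet:add(nn.Linear(32*8*8, hiddenSize))

local lstm = nn.LSTM(hiddenSize, hiddenSize)

-- feed the context feature in as the step-1 hidden state,
-- with the cell state simply zeroed
local context = contextNet:forward(lowResImage) -- batchSize x hiddenSize
lstm.userPrevOutput = context
lstm.userPrevCell = torch.zeros(batchSize, hiddenSize)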

For the second question, I need to feed additional input to the first recurrent layer: in my case, the word vector at every time step. Finally, instead of predicting the classes of multiple objects, I expect to predict a caption describing the image. There is also some logic for dealing with variable-length captions, so I'll probably have to encapsulate an LSTM layer for that; we can think of it as a language model without the final LogSoftMax layer, call it lm for now. Then, wrapping it together with the glimpse network into an rnn as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L130-L131, we use

rnn = nn.Recurrent(opt.hiddenSize, glimpse, lm, nn[opt.transfer](), 99999)

and finally wrap the rnn and the locator as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L145. So here comes the question: how can I pass an additional input, like the word vector, that is not the direct input expected by the rnn inside RecurrentAttention? The rnn here is composed of the glimpse network and a recurrent layer; the glimpse network expects an input of {image, location}, and in this case the recurrent layer expects not only the g_t vector produced by the glimpse network but also the word vector. Should I modify the RecurrentAttention module to accept more inputs? Or can I leave RecurrentAttention's input alone and have the word vector come directly from the lm module I'm going to implement above? Do you think that's feasible, or do you have a more elegant way to implement it?
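One possibility I can think of, without modifying RecurrentAttention: let the per-step input itself be the table {image, wordVec}, so the rnn receives {{image, wordVec}, location} at each step, and reroute inside the rnn's input module. A sketch (glimpse, lm and opt as in the example above; everything else is hypothetical):

require 'rnn'

-- RecurrentAttention forwards {input, location} to the rnn each step.
-- If input is itself the table {image, wordVec}, the rnn's input module
-- can reroute: {image, location} goes to the glimpse network, wordVec
-- passes through, and the two features are joined for the recurrent layer.
local reroute = nn.ConcatTable()
-- branch 1: build {image, location} and run the glimpse network on it
reroute:add(nn.Sequential()
   :add(nn.ConcatTable()
      :add(nn.Sequential():add(nn.SelectTable(1)):add(nn.SelectTable(1))) -- image
      :add(nn.SelectTable(2)))                                            -- location
   :add(glimpse))
-- branch 2: extract the word vector unchanged
reroute:add(nn.Sequential():add(nn.SelectTable(1)):add(nn.SelectTable(2)))

-- join the glimpse feature and the word vector along the feature dimension
local inputLayer = nn.Sequential():add(reroute):add(nn.JoinTable(1, 1))

-- then use inputLayer where the example uses the glimpse network alone
rnn = nn.Recurrent(opt.hiddenSize, inputLayer, lm, nn[opt.transfer](), 99999)

The caveat is that RecurrentAttention forwards the same input at every step, so the word vector would stay constant across steps unless the module is changed to accept per-step inputs, which is why I suspect some modification to RecurrentAttention may be unavoidable after all.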

I'm new to the rnn package and Torch7, so I'd appreciate your suggestions. :p
