
Implementation of DRAM model #180

Open

@vyouman

Description

Hi, I'm trying to implement the Deep Recurrent Attention Model (DRAM) described in http://arxiv.org/pdf/1412.7755v2.pdf, applied to image caption generation instead of image classification. I can probably reuse most of the modules from the RAM model implemented in the rnn package. In my case, I don't need to modify the Reinforce.lua interface or ReinforceNormal.lua, since they can already deal with a table of rewards at every time step per batch. All I need to do there is write a new Criterion, which I've done. I also think I will have to modify the RecurrentAttention module.
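For concreteness, the rnn package's SequencerCriterion can already apply a per-step criterion over a table of outputs; a minimal sketch of a caption criterion along those lines (just a sketch, assuming the model emits per-step LogSoftMax outputs and the targets are per-step word indices; the variable names are hypothetical):

require 'rnn'

-- sketch: apply NLL at every time step of the caption;
-- outputs is a table of (batchSize x vocabSize) log-probabilities,
-- targets is a table of word-index tensors, one entry per step
criterion = nn.SequencerCriterion(nn.ClassNLLCriterion())

loss = criterion:forward(outputs, targets)
gradOutputs = criterion:backward(outputs, targets)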

There is a context network, a 3-layer convolutional network (or some other CNN) presented in the paper, which extracts features from the low-resolution image and feeds them to the second recurrent layer as its initial state, from which the first location is produced. I have come up with two approaches:

  1. Assemble the context network, the second recurrent layer, and the location network into the locator expected by RecurrentAttention.
  2. Or use the context network to process the low-resolution image independently and feed the resulting feature to the second recurrent layer as its initial state at the first time step.

The second approach seems more efficient and easier to implement than the first, since the (image, caption) pairs contain repeated images: one image can have more than one caption. So I want to wrap the second recurrent layer and the location network in a Recursor to serve as the locator expected by the RecurrentAttention module. Maybe I don't really need to modify the input at the first time step; the zero tensor will then go directly to the second recurrent layer:
https://github.com/Element-Research/rnn/blob/master/RecurrentAttention.lua#L44-L48

But I have to initialize the hidden state of the second LSTM layer. How can I do that? I read through the LSTM code, and I think I can set userPrevOutput and userPrevCell, right?
https://github.com/Element-Research/rnn/blob/master/LSTM.lua#L142-L144

For example, after I get an LSTM instance lstm, I should use something like:

lstm.userPrevOutput = torch.Tensor(batchSize, outputSize):fill(1)
lstm.userPrevCell = torch.Tensor(batchSize, outputSize):fill(0.5)
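To make approach 2 concrete, here is a rough sketch, with a made-up 3-layer convnet standing in for the paper's context network (the layer sizes and the coarse-image size are arbitrary assumptions, not the paper's):

require 'rnn'

local batchSize, hiddenSize = 4, 256
local lowResImage = torch.rand(batchSize, 1, 8, 8) -- coarse input, size assumed

-- stand-in context network: 3 conv layers, flattened and projected
-- to the hidden size of the second (locator) recurrent layer
local contextNet = nn.Sequential()
contextNet:add(nn.SpatialConvolution(1, 16, 3, 3, 1, 1, 1, 1))
contextNet:add(nn.ReLU())
contextNet:add(nn.SpatialConvolution(16, 32, 3, 3, 1, 1, 1, 1))
contextNet:add(nn.ReLU())
contextNet:add(nn.SpatialConvolution(32, 32, 3, 3, 1, 1, 1, 1))
contextNet:add(nn.ReLU())
contextNet:add(nn.View(-1):setNumInputDims(3))
contextNet:add(nn.Linear(32*8*8, hiddenSize))

local lstm = nn.LSTM(hiddenSize, hiddenSize)

-- feed the context feature in as the step-1 hidden state,
-- with the cell state simply zeroed
local context = contextNet:forward(lowResImage) -- batchSize x hiddenSize
lstm.userPrevOutput = context
lstm.userPrevCell = torch.zeros(batchSize, hiddenSize)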

For the second question, I need to feed additional input to the first recurrent layer: in my case, the word vector at every time step. Finally, instead of predicting the classes of multiple objects, I expect to predict a caption describing the image. There is also some logic for dealing with variable-length captions, so I'll probably have to encapsulate an LSTM layer for that; we can think of it as a language model without the final LogSoftMax layer, call it lm for now. Then, wrapping it together with the glimpse network into an rnn as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L130-L131, we use

rnn = nn.Recurrent(opt.hiddenSize, glimpse, lm, nn[opt.transfer](), 99999)

and finally wrap the rnn and the locator as in https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua#L145. So here comes the question: how can I pass an additional input, like the word vector, that is not the direct input expected by the rnn inside RecurrentAttention? The rnn here is composed of the glimpse network and a recurrent layer; the glimpse network expects an input of {image, location}, and in this case the recurrent layer expects not only the g_t vector produced by the glimpse network but also the word vector. Should I modify the RecurrentAttention module to accept more inputs? Or can I leave RecurrentAttention's input alone and have the word vector come directly from the lm module I'm going to implement above? Do you think that's feasible, or do you have a more elegant way to implement it?
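One possibility I can think of, without modifying RecurrentAttention: let the per-step input itself be the table {image, wordVec}, so the rnn receives {{image, wordVec}, location} at each step, and reroute inside the rnn's input module. A sketch (glimpse, lm and opt as in the example above; everything else is hypothetical):

require 'rnn'

-- RecurrentAttention forwards {input, location} to the rnn each step.
-- If input is itself the table {image, wordVec}, the rnn's input module
-- can reroute: {image, location} goes to the glimpse network, wordVec
-- passes through, and the two features are joined for the recurrent layer.
local reroute = nn.ConcatTable()
-- branch 1: build {image, location} and run the glimpse network on it
reroute:add(nn.Sequential()
   :add(nn.ConcatTable()
      :add(nn.Sequential():add(nn.SelectTable(1)):add(nn.SelectTable(1))) -- image
      :add(nn.SelectTable(2)))                                            -- location
   :add(glimpse))
-- branch 2: extract the word vector unchanged
reroute:add(nn.Sequential():add(nn.SelectTable(1)):add(nn.SelectTable(2)))

-- join the glimpse feature and the word vector along the feature dimension
local inputLayer = nn.Sequential():add(reroute):add(nn.JoinTable(1, 1))

-- then use inputLayer where the example uses the glimpse network alone
rnn = nn.Recurrent(opt.hiddenSize, inputLayer, lm, nn[opt.transfer](), 99999)

The caveat is that RecurrentAttention forwards the same input at every step, so the word vector would stay constant across steps unless the module is changed to accept per-step inputs, which is why I suspect some modification to RecurrentAttention may be unavoidable after all.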

I'm new to the rnn package and Torch7, so I'd appreciate your suggestions. :p
