
Make CMaxTable and CMinTable cunn-compatible #954

Merged: 1 commit merged into torch:master on Feb 2, 2017

Conversation

@juesato (Contributor) commented Sep 8, 2016

Since the Tensor objects are created in the updateGradInput function during the backward calls (to accommodate variable-length tables), we need to make sure they have the same type as the input tensors.

Sorry I didn't notice this when initially writing the module; let me know if there are other concerns.

@@ -19,8 +19,7 @@ end

 function CMaxTable:updateGradInput(input, gradOutput)
    for i=1,#input do
-      self.gradInput[i] = torch.Tensor()
-      self.gradInput[i]:resizeAs(input[i]):fill(0.0)
+      self.gradInput[i] = input[i]:clone():fill(0.0)
Contributor commented on the added line:
Instead of cloning and then zeroing the tensor, it is more efficient to create an empty tensor using input[i].new().
Also, you are creating new tensors at every backward pass, which is a blocking operation and currently slows down GPU computation a lot.
Doing something like

self.gradInput[i] = self.gradInput[i] or input[i].new()

will avoid repeated memory allocations.

@juesato (Contributor, Author) replied Sep 9, 2016:

I agree the repeated memory allocation seems pretty bad. I'm not sure .new() works here, though - I thought it just creates an empty Tensor, and the sizes need to match. Please correct me if I'm wrong, or if there are other ways of implementing this. For example, would resizeAs be more efficient than the clone call?

Contributor replied:

you can add something like

self.gradInput[i] = self.gradInput[i] or input[i].new()
self.gradInput[i]:resizeAs(input[i]):zero()

This way you can change the size of the input between forward calls, which is not possible with the current implementation.
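
Put together, the suggested allocation pattern looks roughly like this (a sketch for illustration only; the final version appears later in the thread):

function CMaxTable:updateGradInput(input, gradOutput)
   for i=1,#input do
      -- lazily create the buffer once; input[i].new() returns an empty
      -- tensor of the same type as input[i], so no size is needed yet
      self.gradInput[i] = self.gradInput[i] or input[i].new()
      -- resizeAs reuses the existing storage when it is already large
      -- enough, so repeated backward calls avoid fresh allocations
      self.gradInput[i]:resizeAs(input[i]):zero()
      -- ... gradient routing into self.gradInput[i] goes here ...
   end
   return self.gradInput
end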

@juesato (Contributor, Author) commented Sep 23, 2016

@soumith Sorry for the delay in getting back to this - this should be ready for review now.

@fmassa I've made your changes, so self.mask is only initialized once and self.gradInput is only initialized once per table entry; on subsequent calls the values are just overwritten in place. The only time a memory re-allocation should occur is if one of the input sequences has larger Tensors than the others, which seems inherent to supporting inputs of varying sizes.

@juesato (Contributor, Author) commented Sep 25, 2016

Note: this table layer still seems to be significantly slower than other table layers. (Based on incorporating it into a recurrent module that uses it as a merge module; this merge module takes many times as long as other merge modules.)

-      local mask = torch.gt(input[i], self.output)
-      self.maxIdx:maskedFill(mask, i)
-      self.output:maskedCopy(mask, input[i][mask])
+      self.mask = torch.gt(input[i], self.output)
Contributor commented on the added line:

This still allocates memory at every forward run, enforcing synchronization points from time to time when the memory is released.
If you could do instead something like

input[i].gt(self.mask, input[i], self.output)

you will avoid memory allocations and it should be faster.
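
In other words, the difference between the two forms is roughly the following (both forms appear in the diff later in this thread; shown here only for illustration):

-- allocates a fresh result tensor on every call:
self.mask = torch.gt(input[i], self.output)

-- writes the comparison result into the preallocated buffer instead:
self.mask:gt(input[i], self.output)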

@fmassa (Contributor) commented Sep 25, 2016

This looks much better, thanks!
I made one inline comment (which applies for both modules).
If you decide to use the mask, you might need to handle the tensor type specially (at least initialize it as torch.ByteTensor instead of torch.Tensor).
But comparison operations in cutorch now return CudaByteTensor, so you might not need separate cases for CPU and GPU - I haven't played much with this new interface to know the precise details.

@juesato (Contributor, Author) commented Sep 26, 2016

@fmassa Thanks a bunch for all the help with memory allocation - I really appreciate it. Sorry to bug you again: I believe I've written the module so that it doesn't allocate memory, but it's still as slow as before (so I suspect it's still allocating memory somewhere).

Can you let me know if anything stands out that might be slowing down the code? (Also, are there good timing/profiling tools for torch/lua?)

Right now I've written CMaxTable like so:

local CMaxTable, parent = torch.class('nn.CMaxTable', 'nn.Module')

function CMaxTable:__init()
   parent.__init(self)
   self.gradInput = {}
   self.maxIdx = torch.Tensor()
   self.mask = torch.Tensor()
end

function CMaxTable:updateOutput(input)
   self.output:resizeAs(input[1]):copy(input[1])
   self.maxIdx:resizeAs(input[1]):fill(1)
   self.maskByteTensor = self.maskByteTensor or
      (torch.type(self.output) == 'torch.CudaTensor' and
       torch.CudaByteTensor() or torch.ByteTensor())
   for i=2,#input do
      self.mask:gt(input[i], self.output)
      self.maskByteTensor:resize(self.mask:size()):copy(self.mask)
      self.maxIdx:maskedFill(self.maskByteTensor, i)
      self.output:maskedCopy(self.maskByteTensor, input[i][self.maskByteTensor])
   end
   return self.output
end

function CMaxTable:updateGradInput(input, gradOutput)
   for i=1,#input do
      self.gradInput[i] = self.gradInput[i] or input[i].new()
      self.gradInput[i]:resizeAs(input[i]):zero()
      self.mask:eq(self.maxIdx, i)
      self.maskByteTensor:copy(self.mask)
      self.gradInput[i]:maskedCopy(self.maskByteTensor, gradOutput[self.maskByteTensor])
   end

   for i=#input+1, #self.gradInput do
       self.gradInput[i] = nil
   end

   return self.gradInput
end

Alternatively, if it's easier for you to look at the diff (sorry there's no highlighting)

--- a/CMaxTable.lua
+++ b/CMaxTable.lua
@@ -4,16 +4,20 @@ function CMaxTable:__init()
    parent.__init(self)
    self.gradInput = {}
    self.maxIdx = torch.Tensor()
-   self.mask = torch.Tensor() -- reused for memory allocation efficiency
+   self.mask = torch.Tensor()
 end

 function CMaxTable:updateOutput(input)
    self.output:resizeAs(input[1]):copy(input[1])
    self.maxIdx:resizeAs(input[1]):fill(1)
+   self.maskByteTensor = self.maskByteTensor or
+      (torch.type(self.output) == 'torch.CudaTensor' and
+       torch.CudaByteTensor() or torch.ByteTensor())
    for i=2,#input do
-      self.mask = torch.gt(input[i], self.output)
-      self.maxIdx:maskedFill(self.mask, i)
-      self.output:maskedCopy(self.mask, input[i][self.mask])
+      self.mask:gt(input[i], self.output)
+      self.maskByteTensor:resize(self.mask:size()):copy(self.mask)
+      self.maxIdx:maskedFill(self.maskByteTensor, i)
+      self.output:maskedCopy(self.maskByteTensor, input[i][self.maskByteTensor])
    end
    return self.output

Profiling statistics: I'm using this within the Multi-function recurrent unit module. The baseline uses 2 modules, a nn.SelectTable(1) and nn.SelectTable(2). The speeds are:
Baseline : 1.5-1.6 ms/batch
Baseline + nn.CMulTable : 1.7 ms/batch
Baseline + CMaxTable before : 2.5 ms/batch
Baseline + CMaxTable after : 2.5 ms/batch
so I suspect there's still memory allocation.
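
For reference, timings like these can be collected with torch.Timer; a minimal sketch, assuming cutorch is loaded and that model, input, gradOutput, and nBatches are placeholders for the actual benchmark setup:

local timer = torch.Timer()
for i = 1, nBatches do
   model:forward(input)
   model:backward(input, gradOutput)
end
cutorch.synchronize()  -- wait for queued GPU kernels before reading the timer
print(string.format('%.2f ms/batch', timer:time().real * 1000 / nBatches))

The synchronize call matters because GPU kernels run asynchronously, so without it the timer can stop before the work has actually finished.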

@juesato juesato changed the title Make CMaxTable and CMinTable cunn-compatible Make CMaxTable and CMinTable cunn-compatible [work in progress] Sep 26, 2016
@fmassa (Contributor) commented Sep 26, 2016

Hi,
Thanks for your work on improving the CMaxTable implementation. It can be tricky to get all those small details the first time, thanks for your patience :)

There is still one remaining memory allocation.

When you do tensor[maskTensor], it allocates a new tensor for the output. The way to reuse tensors is to use maskedSelect, something like

output:maskedSelect(input, mask)

The [] operator is a convenience operator (like * or +), but when used with a ByteTensor it allocates and returns a new tensor at every call. Note that tensor[1] does not allocate new memory; it only creates a new tensor that shares the same storage, so there's no problem there.
This will not be that big of an issue when the new memory allocator is merged into cutorch.
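
Concretely, the allocating and non-allocating forms look roughly like this (self.selected is a hypothetical reusable buffer, named here only for illustration):

-- input[i][mask] allocates a new tensor holding the selected elements:
self.output:maskedCopy(mask, input[i][mask])

-- maskedSelect writes the selected elements into a reusable buffer instead:
self.selected = self.selected or input[i].new()
self.selected:maskedSelect(input[i], mask)
self.output:maskedCopy(mask, self.selected)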

@juesato (Contributor, Author) commented Sep 26, 2016

@fmassa Thanks a bunch for your help. I really appreciate you pointing out all these little details; otherwise they just wouldn't get fixed (I was confused about why the module was slow, but wouldn't have done anything about it). Hopefully I'll be able to get all this right the first time next time, and make it easier on you.

After making the fix, the module is indeed on par with CMulTable and adds minimal runtime to the overall model (1.65 ms/batch).

EDIT: failed my unit test, I'll look into this tomorrow.

@juesato juesato changed the title Make CMaxTable and CMinTable cunn-compatible [work in progress] Make CMaxTable and CMinTable cunn-compatible Sep 26, 2016
@fmassa (Contributor) commented Sep 26, 2016

Tests are failing.
I was perhaps not clear, but you need to use maskedSelect and maskedCopy to perform the operation you had before. It might require yet another buffer.

@juesato (Contributor, Author) commented Sep 30, 2016

@fmassa My bad on that, fixing it up. I'm still seeing slow performance (3.0 ms/batch vs 1.77 ms/batch) using something like

      self.nonmaskedInput:maskedSelect(input[i], self.mask)
      self.output:maskedCopy(self.mask, self.nonmaskedInput)

and similar with gradOutput. Is it possible that the changing size of the masked selection causes memory allocation? (For example, if there are multiple maximum elements?)

EDIT: I'm trying to get a sense of what Torch internals look like. https://github.com/torch/cutorch/blob/c2e20479ba1dad4130f77e8258a8fb6a20231b5d/lib/THC/generic/THCTensorMasked.cu#L125 looks like it allocates memory, but I'm not sure. Could you clarify this for me?

@juesato (Contributor, Author) commented Jan 29, 2017

@soumith Sorry I left this open so long, but this should be good to merge now.

@soumith merged commit db4244e into torch:master on Feb 2, 2017
@soumith (Member) commented Feb 2, 2017

thanks!
