Make CMaxTable and CMinTable cunn-compatible #954
Conversation
```diff
@@ -19,8 +19,7 @@ end

 function CMaxTable:updateGradInput(input, gradOutput)
    for i=1,#input do
-      self.gradInput[i] = torch.Tensor()
-      self.gradInput[i]:resizeAs(input[i]):fill(0.0)
+      self.gradInput[i] = input[i]:clone():fill(0.0)
```
Instead of cloning and then zeroing the tensor, it is more efficient to create an empty tensor using `input[i].new()`.
Also, you are creating new tensors at every backward pass, which is a blocking operation and currently slows down GPU computation a lot.
Doing something like

```lua
self.gradInput[i] = self.gradInput[i] or input[i].new()
```

will avoid repeated memory allocations.
I agree the repeated memory allocation seems pretty bad. I'm not sure `.new()` works here though - I thought this just creates an empty Tensor, and the sizes need to match. Please do correct me if I'm wrong, or if there are other ways of implementing this. For example, would `resizeAs` be more efficient than the clone call?
You can add something like

```lua
self.gradInput[i] = self.gradInput[i] or input[i].new()
self.gradInput[i]:resizeAs(input[i]):zero()
```

This way, you can change the size of the input between forward calls, which is not possible with the current implementation.
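Put together, the buffer-reuse pattern inside `updateGradInput` would look roughly like this (a sketch assembled from the suggestions above, with the argmax scatter step elided):

```lua
function CMaxTable:updateGradInput(input, gradOutput)
   for i = 1, #input do
      -- reuse the buffer from the previous backward pass when it exists,
      -- otherwise lazily create an empty tensor of the same type as the input
      self.gradInput[i] = self.gradInput[i] or input[i].new()
      -- resizeAs is a no-op when the size is unchanged, so repeated calls don't reallocate
      self.gradInput[i]:resizeAs(input[i]):zero()
      -- ... copy gradOutput into gradInput[i] at the positions where input[i] was the max ...
   end
   return self.gradInput
end
```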
@soumith Sorry for the delay in getting back to this - this should be ready for review now. @fmassa I've made your changes, so that …

Note: this table layer still seems to be significantly slower than other table layers (based on incorporating it into a recurrent module which uses it as a merge module, and that merge module taking many times as long as other merge modules).
```diff
-      local mask = torch.gt(input[i], self.output)
-      self.maxIdx:maskedFill(mask, i)
-      self.output:maskedCopy(mask, input[i][mask])
+      self.mask = torch.gt(input[i], self.output)
```
This still allocates memory at every `forward` run, enforcing synchronization points from time to time when the memory is released.
If you could instead do something like

```lua
input[i].gt(self.mask, input[i], self.output)
```

you will avoid memory allocations and it should be faster.
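For illustration, the difference is where the comparison result lands: `torch.gt` allocates a fresh mask tensor on every call, while the in-place form writes into a preallocated buffer (sketch only; as the later discussion shows, `maskedFill`/`maskedCopy` additionally want a ByteTensor mask, which is what the `maskByteTensor` buffer below addresses):

```lua
-- before: allocates a brand-new mask tensor on every forward call
self.mask = torch.gt(input[i], self.output)

-- after: reuses the preallocated self.mask buffer via an in-place comparison
self.mask:gt(input[i], self.output)
```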
This looks much better, thanks!
@fmassa Thanks a bunch for all the help with memory allocation - I really appreciate it. Sorry to bug you again - I believe I've written the module so it doesn't allocate memory, but it's still as slow as before (so I suspect it's still allocating memory strangely). Can you let me know if anything stands out that might be slowing down the code? (Also, are there good timing/profiling tools for torch/lua?) Right now I've written

```lua
local CMaxTable, parent = torch.class('nn.CMaxTable', 'nn.Module')

function CMaxTable:__init()
   parent.__init(self)
   self.gradInput = {}
   self.maxIdx = torch.Tensor()
   self.mask = torch.Tensor()
end

function CMaxTable:updateOutput(input)
   self.output:resizeAs(input[1]):copy(input[1])
   self.maxIdx:resizeAs(input[1]):fill(1)
   self.maskByteTensor = self.maskByteTensor or
      (torch.type(self.output) == 'torch.CudaTensor' and
       torch.CudaByteTensor() or torch.ByteTensor())
   for i=2,#input do
      self.mask:gt(input[i], self.output)
      self.maskByteTensor:resize(self.mask:size()):copy(self.mask)
      self.maxIdx:maskedFill(self.maskByteTensor, i)
      self.output:maskedCopy(self.maskByteTensor, input[i][self.maskByteTensor])
   end
   return self.output
end

function CMaxTable:updateGradInput(input, gradOutput)
   for i=1,#input do
      self.gradInput[i] = self.gradInput[i] or input[i].new()
      self.gradInput[i]:resizeAs(input[i]):zero()
      self.mask:eq(self.maxIdx, i)
      self.maskByteTensor:copy(self.mask)
      self.gradInput[i]:maskedCopy(self.maskByteTensor, gradOutput[self.maskByteTensor])
   end
   for i=#input+1, #self.gradInput do
      self.gradInput[i] = nil
   end
   return self.gradInput
end
```

Alternatively, if it's easier for you to look at the diff (sorry there's no highlighting):

```diff
--- a/CMaxTable.lua
+++ b/CMaxTable.lua
@@ -4,16 +4,20 @@ function CMaxTable:__init()
    parent.__init(self)
    self.gradInput = {}
    self.maxIdx = torch.Tensor()
-   self.mask = torch.Tensor() -- reused for memory allocation efficiency
+   self.mask = torch.Tensor()
 end

 function CMaxTable:updateOutput(input)
    self.output:resizeAs(input[1]):copy(input[1])
    self.maxIdx:resizeAs(input[1]):fill(1)
+   self.maskByteTensor = self.maskByteTensor or
+      (torch.type(self.output) == 'torch.CudaTensor' and
+       torch.CudaByteTensor() or torch.ByteTensor())
    for i=2,#input do
-      self.mask = torch.gt(input[i], self.output)
-      self.maxIdx:maskedFill(self.mask, i)
-      self.output:maskedCopy(self.mask, input[i][self.mask])
+      self.mask:gt(input[i], self.output)
+      self.maskByteTensor:resize(self.mask:size()):copy(self.mask)
+      self.maxIdx:maskedFill(self.maskByteTensor, i)
+      self.output:maskedCopy(self.maskByteTensor, input[i][self.maskByteTensor])
    end
    return self.output
```

Profiling statistics: I'm using this within the Multi-function recurrent unit module. The baseline uses 2 modules, a …
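Regarding the question above about timing tools: one common approach in Torch is `torch.Timer` together with `cutorch.synchronize()`, so that asynchronous GPU kernels are actually included in the measurement. A minimal, self-contained sketch (the tensor sizes and iteration count are illustrative, not the benchmark used in this PR):

```lua
require 'nn'
require 'cunn'   -- assumes a CUDA-enabled Torch install; pulls in cutorch

-- stand-in module and inputs, purely to illustrate the timing pattern
local m = nn.CMaxTable():cuda()
local input = { torch.CudaTensor(128, 256):uniform(),
                torch.CudaTensor(128, 256):uniform() }
local gradOut = torch.CudaTensor(128, 256):uniform()

local nIter = 100
local timer = torch.Timer()
for i = 1, nIter do
   m:forward(input)
   m:backward(input, gradOut)
end
cutorch.synchronize()   -- wait for all queued GPU work before reading the clock
print(string.format('%.3f ms/iter', timer:time().real * 1000 / nIter))
```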
Hi, there is still one remaining memory allocation. When you do `output:maskedSelect(input, mask)`, the …
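The comment above is cut off, but for reference the two forms of masked selection differ in where the result goes: the indexing form `input[i][mask]` always returns a brand-new tensor, while the destination form writes into an existing one (a hedged sketch; the buffer name is illustrative, and the destination is still resized whenever the number of selected elements changes):

```lua
-- indexing form: allocates a new tensor holding the selected elements each call
local selected = input[i][self.maskByteTensor]

-- destination form: fills an existing tensor, resizing it to the number of
-- selected elements rather than returning a fresh allocation
self.selectBuffer = self.selectBuffer or input[i].new()   -- illustrative buffer name
self.selectBuffer:maskedSelect(input[i], self.maskByteTensor)
```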
@fmassa Thanks a bunch for your help. I really appreciate you pointing out all these little details; otherwise they just wouldn't get fixed (I was confused about why the module was slow, but wouldn't have done anything about it). Hopefully I'll be able to get all this right the first time next time, and make it easier on you. After making the fix, the module is indeed on par with …

EDIT: failed my unit test, I'll look into this tomorrow.
Tests are failing.
@fmassa My bad on that, fixing it up. I'm still seeing slow performance (3.0 ms/batch vs 1.77 ms/batch) using something like …, and similar with ….

EDIT: I'm trying to get a sense of what Torch internals look like. https://github.com/torch/cutorch/blob/c2e20479ba1dad4130f77e8258a8fb6a20231b5d/lib/THC/generic/THCTensorMasked.cu#L125 looks like it allocates memory, but I'm not sure. Could you clarify this for me?
Force-pushed from 2022af4 to 3bd2333.
Force-pushed from 47b3d37 to 5b0b295.
@soumith Sorry I left this open so long, but this should be good to merge now.

thanks!
Since the `Tensor` objects are getting created in the `updateGradInput` function during the backward calls, to accommodate variable-length tables, we need to make sure that they have the same type as the input tensors. Sorry I didn't notice this when initially writing the module, and let me know if there are other concerns.
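For illustration, `tensor.new()` is what keeps the types in sync: it creates an empty tensor of the same type as the tensor it is called on, so gradients built this way come out as CUDA tensors whenever the inputs are CUDA tensors (a tiny hedged example; assumes cutorch is installed, variable names are illustrative):

```lua
require 'cutorch'   -- assumes a CUDA-enabled Torch install

local x = torch.CudaTensor(3, 4):uniform()
local g = x.new()               -- empty tensor with the same type as x
g:resizeAs(x):zero()            -- sized like x; still a torch.CudaTensor
print(torch.type(g))            -- prints 'torch.CudaTensor'
```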