
[Op] cuDNN RNN Symbol #2795

Merged 38 commits into apache:master on Jul 24, 2016

Conversation

sbodenstein
Contributor

This adds an interface to the cuDNN RNN operator. Some issues:

  1. The forward pass in inference mode reproduces https://github.com/soumith/cudnn.torch
  2. The backward pass is currently not working, due to incorrect handling of the dropout descriptor. Correct handling of the dropout state still needs to be added.
  3. It doesn't currently inherit the MXNet seed for dropout. How is this seed accessed?
  4. This symbol currently only supports data in the form [seq length, batch, input size], which is the native cuDNN format. Support for [batch, seq length, input size] should be added as well, but will probably require a temp space + transpose.
  5. Gives an option to return multiple outputs (output + 2 states for LSTM, 1 for the others). By default only a single output is returned, but sometimes you need access to the output states (e.g. when generating text).
  6. Currently only the CUDNN_LINEAR_INPUT option for cudnnRNNInputMode_t is supported.
  7. Uses a single parameter vector. It will be useful to have a Python script that converts this into a dictionary of NDArrays giving each weight + bias for each layer (see the sketch at the end of this description).

See also: #2401
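
A hedged sketch related to item 7 above (not code from this PR; the handle, descriptors, and member names are assumptions): cuDNN provides cudnnGetRNNLinLayerMatrixParams / cudnnGetRNNLinLayerBiasParams to locate each layer's weight matrix and bias inside the flat parameter buffer, which is exactly the offset/shape information a Python conversion script would need.

```cpp
// Hedged sketch: find where one linear-layer weight matrix lives inside the
// flat parameter buffer w_ptr. handle_, rnn_desc_, x_desc_, w_desc_, w_ptr,
// layer and lin_layer_id are assumed names, not identifiers from this PR.
cudnnFilterDescriptor_t mat_desc;
void *mat_ptr = nullptr;
CHECK_EQ(cudnnCreateFilterDescriptor(&mat_desc), CUDNN_STATUS_SUCCESS);
CHECK_EQ(cudnnGetRNNLinLayerMatrixParams(handle_, rnn_desc_,
                                         layer,         // pseudo-layer index
                                         x_desc_,       // descriptor of a single input step
                                         w_desc_, w_ptr,
                                         lin_layer_id,  // which matrix within the layer (gate)
                                         mat_desc, &mat_ptr),
         CUDNN_STATUS_SUCCESS);
// mat_desc now describes the matrix shape and mat_ptr points at its first
// element inside w_ptr; the offset (mat_ptr - w_ptr) and shape are what a
// parameter-view/dictionary builder would record for this layer.
CHECK_EQ(cudnnDestroyFilterDescriptor(mat_desc), CUDNN_STATUS_SUCCESS);
```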

- fixed error in output shape inference
- added cudnn destructors
- completed forward evaluation
- fixed bug where cudnnGetRNNParamsSize needs to be called after cudnnSetRNNDescriptor
- more consistent param names
- removed 'batch_first' option for now. Might add it later again
@piiswrong
Contributor

So the current problem with training is that the reserved space needs to be kept as an output, but its size is unknown during shape inference?
A simple fix is to use cudaMalloc to allocate it during op init. It's not ideal, but since it should be a small buffer it's fine for now.
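
A minimal sketch of that workaround (identifiers ending in "_" are assumed member names, not code from this PR): query the reserve-space size from cuDNN once the input descriptors are set up, allocate it with cudaMalloc when the op is initialized, and free it in the destructor, instead of exposing it as an extra output.

```cpp
// Hedged sketch of the suggested fix; handle_, rnn_desc_, seq_length_,
// x_desc_vec_ and reserve_space_ are assumed member names of the op class.
size_t reserve_bytes = 0;
CHECK_EQ(cudnnGetRNNTrainingReserveSize(handle_, rnn_desc_,
                                        seq_length_, x_desc_vec_.data(),
                                        &reserve_bytes),
         CUDNN_STATUS_SUCCESS);
CHECK_EQ(cudaMalloc(&reserve_space_, reserve_bytes), cudaSuccess);
// ... later, in the destructor:
CHECK_EQ(cudaFree(reserve_space_), cudaSuccess);
```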

@sxjscience
Member

@sbodenstein @antinucleon My concern is that variants of dropout and batch normalization for RNNs, which haven't been included in cuDNN, will soon be standardized, and we may then need to add these new features to our C++ implementation. Also, RNN in TensorFlow is implemented by combining basic symbols (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell.py). Anyway, writing a C++-compatible version of CuDNNRNN does no harm. We can still implement the wrapper in the scripting language.

@sbodenstein
Contributor Author

@piiswrong: regarding the space computation, I don't know how it's done, as it only depends on a single parameter, the cuDNN handle (cudnnDropoutGetStatesSize(cudnnHandle_t handle, size_t *size)). Is it device-dependent? If not, why not just use a global variable?

OK, will use cudaMalloc for now (and free it in ~CuDNNRNNOp()).
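
For reference, a sketch of the dropout-state handling described above (member names are assumed; this is not the PR's actual code): query the state size, allocate the states with cudaMalloc, hand them to cudnnSetDropoutDescriptor, and release everything in ~CuDNNRNNOp().

```cpp
// Hedged sketch; dropout_desc_, dropout_states_, dropout_byte_, param_.p and
// seed_ are assumed member names.
CHECK_EQ(cudnnDropoutGetStatesSize(handle_, &dropout_byte_), CUDNN_STATUS_SUCCESS);
CHECK_EQ(cudaMalloc(&dropout_states_, dropout_byte_), cudaSuccess);
CHECK_EQ(cudnnCreateDropoutDescriptor(&dropout_desc_), CUDNN_STATUS_SUCCESS);
CHECK_EQ(cudnnSetDropoutDescriptor(dropout_desc_, handle_, param_.p,
                                   dropout_states_, dropout_byte_, seed_),
         CUDNN_STATUS_SUCCESS);
// ... and in ~CuDNNRNNOp():
CHECK_EQ(cudaFree(dropout_states_), cudaSuccess);
CHECK_EQ(cudnnDestroyDropoutDescriptor(dropout_desc_), CUDNN_STATUS_SUCCESS);
```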

@sxjscience
Member

For [4], I think [seq length, batch, input size] is fine and let's just stick to this layout.

@piiswrong
Contributor

@sbodenstein Any update on this?

@sbodenstein
Contributor Author

@piiswrong: I've been on vacation the last two days. I should have some time tomorrow (Monday at the latest) to resolve the remaining issues (I found one or two extra bugs in my code, and still need to handle the dropout states correctly and test Backward against Torch). Apologies for the delay.

@sbodenstein
Contributor Author

@piiswrong: actually, I just committed a fix that should resolve the dropout issue, and it also fixes a few other bugs.

I will spend some time tomorrow testing all the various configurations against Torch. What else do I need to do to merge a first version? A set of Python tests?

- added dropout states
- fixed incorrect handling of variable outputs
@piiswrong
Contributor

Testing against a Python version would be nice. However, since this is GPU-only and currently won't be run on the test server anyway, if you can confirm consistency with Torch I think that's enough for an initial version.

@piiswrong
Contributor

piiswrong commented Jul 24, 2016

@antinucleon Do you have time to do deepmark lstm?

@sbodenstein
Contributor Author

@piiswrong: I reproduce Torch with a wide variety of settings (bidirectional, LSTM, GRU, etc.). I think it's ready to be merged.

@piiswrong
Contributor

Great. I'll merge it after tests finish

@piiswrong
Contributor

Could you update to the current master?

@sbodenstein
Contributor Author

@piiswrong: apologies, done.

@piiswrong merged commit 0460049 into apache:master on Jul 24, 2016
@sbodenstein
Contributor Author

@piiswrong: I want to add a Python function that creates a view of the parameter NDArray, giving the weights and biases of each layer as individual NDArrays. Where should this function live?

@Godricly
Contributor

@sbodenstein: Can you fix this operator for cuDNN v5.0? The function parameters are different between 5.0 and 5.1.

@sbodenstein
Contributor Author

@Godricly: I tested this only with cuDNN v5.0, and there wasn't a problem. Which parameters are different? And what is broken for 5.0?

Also, I assumed they were the same, as the release notes of v5.1 state: "cuDNN 5.1 is fully API compatible with cuDNN 5.0."

@thirdwing
Contributor

@sbodenstein Can you share the code you used to reproduce the Torch results?

seed_), CUDNN_STATUS_SUCCESS);
// RNN descriptors
CHECK_EQ(cudnnCreateRNNDescriptor(&rnn_desc_), CUDNN_STATUS_SUCCESS);
CHECK_EQ(cudnnSetRNNDescriptor(rnn_desc_,
@Godricly
Contributor

This function is different in cudnn 5.0.4, which I used.

@sbodenstein
Contributor Author

Are you referring to cudnnSetDropoutDescriptor? If so, I just downloaded the "cuDNN User Guide" from "cuDNN v5 (May 12, 2016), for CUDA 7.5" on the cuDNN site. I don't see any difference. I also looked in the user guide under "Download cuDNN v5 (May 27, 2016), for CUDA 8.0 RC". Still no difference. These are the only 5.0 releases available on the cuDNN site. So please, can you be very specific about where you are seeing a difference, and with what?

@Godricly
Contributor

Well... they made some changes between cuDNN 5.0.4 (April 2016) and cuDNN 5.0.5. Previously cudnnSetRNNDescriptor had an input parameter seqLength. I've updated my cuDNN; it's not a problem anymore.
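
For reference, a hedged sketch of the difference (the 5.0.4 RC behavior is inferred from the comment above; the later form follows the published cuDNN 5 API; parameter names such as param_.state_size are assumptions):

```cpp
// cuDNN 5.0.4 RC (per the comment above): cudnnSetRNNDescriptor took an
// additional seqLength argument that was removed in 5.0.5; the sequence
// length is instead passed to the forward/backward calls.
CHECK_EQ(cudnnSetRNNDescriptor(rnn_desc_,
                               param_.state_size,   // hiddenSize
                               param_.num_layers,   // numLayers
                               dropout_desc_,
                               CUDNN_LINEAR_INPUT,  // inputMode
                               direction_,          // uni- or bidirectional
                               mode_,               // RNN/LSTM/GRU
                               CUDNN_DATA_FLOAT),
         CUDNN_STATUS_SUCCESS);
```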

@sbodenstein
Contributor Author

Maybe you were using a release candidate?

@sbodenstein
Contributor Author

@thirdwing: sure, here.

@kikoqiu
Contributor

kikoqiu commented Jul 29, 2016

Bug report:
mxnet.symbol.RNN doesn't work with mx.model.FeedForward: the model assumes the first dimension of every shape is the batch size and splits it for multi-device training, while mxnet.symbol.RNN uses Shape3(total_layers, batch_size, param_.state_size) for the RNN init state input.
See executor_manager.py line 219:

data_shapes = {k: tuple([slices[i].stop-slices[i].start] + list(v[1:]))
               for k, v in train_data.provide_data + train_data.provide_label}

@sbodenstein
Contributor Author

@kikoqiu: this is not a bug, it's a design decision. You can use SwapAxis to convert between the [batch, seq length, input size] and [seq length, batch, input size] layouts. Otherwise, we could add support for [batch, seq length, input size] as a symbol option.

@kikoqiu
Contributor

kikoqiu commented Aug 1, 2016

Hi @sbodenstein, I see it's a design element of the cuDNN RNN and usually shouldn't be a problem. However, the code for mx.model.FeedForward will not work with it, as it assumes all input params have the shape (batch_size, ...).

@xlvector
Contributor

Is there a performance comparison between the cuDNN RNN and the previous implementation (built by combining basic symbols)? @sbodenstein
