How to use Multiple GPUs? #44
There is nothing about GPU device placement hardcoded, so TensorFlow should handle the device placement itself. I usually train with only 1 GPU (but multiple workers), so I haven't tried the multi-GPU case. Can you try running a larger model? It could be that TF decides the small model is not worth splitting across GPUs; hopefully it will put the computation of a larger one on separate devices. E.g. use one of the larger example configs.
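As a quick sanity check, one can log device placement to see where TF actually put each op; a minimal plain TF 1.x sketch (not seq2seq-specific, just an illustration):

import tensorflow as tf

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

# log_device_placement prints the device (/cpu:0, /gpu:0, ...) chosen for every op,
# which shows whether TF actually spread the computation across GPUs.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)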
I was using:
By the way, I was just expecting data parallelism — that different batches would be processed on different GPUs. Sounds very similar to your multiple worker set-up, just on one machine. (But I still don't know how to invoke that, if it's even possible.)
I see. I think it is not too common to have data parallelism on the same machine for seq2seq models, but people have found that putting different RNN layers on separate devices speeds things up, and we should do that if more than 1 GPU is available. I will need to look into data parallelism on multiple GPUs. In the best case, all we need to do is instantiate the model multiple times, each on a separate GPU, and average the losses. In that case it may only require a few lines of code change. But maybe it's more complex than that. Thanks for reporting, I'll take a look at this soon (may take 2-3 days).
Great! Thanks for taking a look. I think the use case is reasonably common among academics: launch a fresh 8-GPU instance on some public cloud, install/configure software, download data, and run an experiment. OpenNMT follows this model, I believe.
Sounds reasonable. Will add this in the next few days.
@dennybritz, may I ask what's the state of this issue? I'm currently trying to train a conversational dialogue system using this tool and would like to train the model using multiple GPUs, since our (desired) model is rather huge, with 4096 hidden units in each of the encoder/decoder, and I currently run into OOM problems when the size of my model exceeds 2048 hidden units. I'm willing to invest some time to help you implement this feature (if needed). I already took a quick look at the code but couldn't find an obvious place to put the device placement code.
The original issue of parallelizing training across multiple GPUs through data parallelism is very high on my priority list and I will add that ASAP. However, that seems different from your issue, @vongruenigen. What you want is to split the model across multiple GPUs. You're not going to fit a model that big into a single GPU. Just to do a back-of-the-envelope calculation: if you have a ~30k vocab and 4096 units, your softmax matrix alone will be around 30,000 × 4,096 ≈ 123M parameters.
It will still work, but it's not going to help you. The vast majority of parameters/memory are usually in the softmax and embeddings/inputs. That's what you need to split (or use an alternative for), and there is no "obvious" way to do that, other than maybe using a sampled or sharded softmax.
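To make the back-of-the-envelope numbers concrete, a rough calculation (plain arithmetic, assuming fp32 weights and the sizes mentioned above; the embedding size is assumed to match the hidden size):

vocab_size = 30000
hidden_units = 4096

softmax_params = vocab_size * hidden_units      # ~123M parameters for the output projection
embedding_params = vocab_size * hidden_units    # roughly the same order again for input embeddings
bytes_fp32 = 4

print((softmax_params + embedding_params) * bytes_fp32 / 1e9)  # ~1 GB for these weights alone
# Gradients and optimizer slots (e.g. Adam keeps two extra copies per weight) multiply this
# several times, before even counting the per-timestep activations of the RNN.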
@dennybritz, I was aware that a large number of parameters is placed in the softmax, but I didn't realize that it's that huge. I'm going to investigate using a sampled/sharded softmax and try to find a solution. Thanks a lot for the quick response and the clarifying explanation!
The docs say: "Distributed Training is supported out of the box using tf.learn. Cluster Configurations can be specified using the TF_CONFIG environment variable, which is parsed by the RunConfig. Refer to the Distributed Tensorflow Guide for more information." Is there any example of how this works?
For a general introduction to distributed training settings, check out the TensorFlow tutorial: https://www.tensorflow.org/deploy/distributed. I haven't seen a full example of using TF_CONFIG with this code myself, but instead of needing to change the code, I believe you should be able to set all required options via that environment variable.
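In case it helps, a sketch of what the TF_CONFIG contents generally look like for the tf.learn RunConfig (the host names, ports, and the "environment" key are assumptions; check the RunConfig docs of your TF version for the exact format). In practice you would export this JSON in the shell before launching python -m bin.train in each process:

import json
import os

cluster = {
    "master": ["machine-a:2222"],
    "ps": ["machine-a:2223"],
    "worker": ["machine-b:2222", "machine-b:2223"],
}
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    # "task" differs per process: it names the role this particular process plays.
    "task": {"type": "worker", "index": 0},
    # Some tf.learn versions also check "environment" to decide on distributed behaviour.
    "environment": "cloud",
})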
Hi @dennybritz, any news on this topic? I was trying to train a model on a machine with 8 GPUs. Here's the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 70C P0 75W / 149W | 10417MiB / 11439MiB | 71% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 52C P0 81W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 63C P0 65W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 55C P0 79W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 65C P0 64W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 50C P0 77W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 66C P0 67W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 54C P0 81W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2316 C python 10407MiB |
| 1 2316 C python 10368MiB |
| 2 2316 C python 10368MiB |
| 3 2316 C python 10368MiB |
| 4 2316 C python 10368MiB |
| 5 2316 C python 10368MiB |
| 6 2316 C python 10368MiB |
| 7 2316 C python 10368MiB |
+-----------------------------------------------------------------------------+

By the way, it seems that TensorFlow is actually using the memory of all the GPUs, but only one of them is actually used. Is this something expected?
Interesting...
@davidecaroselli I have the same problem.
@dennybritz: wanted to know if there are any updates on this.
I have the same issue.
Are there any updates or ideas? I also want to train a model with multiple GPUs. It seems @dennybritz is busy with other things.
@davidecaroselli @wolfshow I face the same problems. How did you smart guys solve them? Many thanks.
Still waiting.
I would recommend the
@davidecaroselli About the all-GPU-memory problem: TF provides the allow_growth session option so that it only allocates GPU memory as it is actually needed. I don't use seq2seq yet, but looking at its bin/train.py, I found a flag for this setting.
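For reference, the standard way to enable that behaviour in plain TF 1.x (this is the generic session option; the corresponding flag name in bin/train.py may differ):

import tensorflow as tf

# Without this, TF grabs (nearly) all memory on every visible GPU at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)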
@nptdat In fact, that doesn't solve these problems. I think the only way to make full use of the GPUs is 1. data parallelism, or 2. manually allocating each layer (or group of layers) to a GPU. However, this library seems to be abandoned...
But the results page says that @dennybritz used 8 GPUs:
@ad26kt Yeah, I was just talking about the memory allocation problem, not about how to make all GPUs do work.
Guys, the answer to all your problems is cuDNN. My tokens/sec after I use this: (numbers omitted). But after training the model, I can see utilization for only one GPU.

Update: (nvidia-smi output omitted.)
@ad26kt No, you can use data parallelism too in TensorFlow; refer to the cifar10 multi-GPU example in the TensorFlow tutorials. As @nptdat mentioned, I also suspect that allow_growth (or rather its default being off) is the reason all the memory gets used. Even if you are using only a single-GPU model, TensorFlow by default allocates the full memory on all the GPUs it can see. If you were not aware of this, the visibility of GPUs to a given application can be controlled by prepending the run command with CUDA_VISIBLE_DEVICES=<gpu_numbers_to_be_made_visible>.
@sampathchanda
@DucVuMinh TensorFlow by default takes the memory of all GPUs, because it allocates the maximum memory for your job, but it does not use their processing power. To utilize the processing power of all GPUs as well, you need to add tf.device statements wherever you want parallel processing in your code. In TensorFlow you have to assign devices manually and also compute the overall gradients by collecting the output from all devices yourself. MXNet does this automatically: you just specify a context statement listing the available GPUs, and it averages the loss of your model for you. Let me know if you have any more questions.
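To illustrate the manual tf.device plus gradient-averaging pattern described above, a minimal TF 1.x multi-tower sketch (a toy linear model stands in for the real network; NUM_GPUS and the helper names are assumptions, and only dense gradients are handled):

import tensorflow as tf

NUM_GPUS = 2  # adjust to the number of visible GPUs

def build_loss():
    # Toy stand-in for the real model: a linear layer on random data.
    x = tf.random_normal([32, 128])
    y = tf.random_normal([32, 1])
    w = tf.get_variable("w", [128, 1])
    return tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

def average_gradients(tower_grads):
    # tower_grads: one list of (grad, var) pairs per tower, all in the same variable order.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars if g is not None]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
    return averaged

optimizer = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i in range(NUM_GPUS):
    # One "tower" per GPU; variables are created by tower 0 and reused by the others.
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        tower_grads.append(optimizer.compute_gradients(build_loss()))

# Average the per-tower gradients and apply them once.
with tf.device("/cpu:0"):
    train_op = optimizer.apply_gradients(average_gradients(tower_grads))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)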
@imranshaikmuma
@DucVuMinh can you show me your code?
@DucVuMinh For the memory problem, you can try adding one more line to set the allow_growth flag mentioned above.
@imranshaikmuma
I'm using three layers of LSTM. Can you look it over for me? Thank you very much.
@papajohn On one machine, I'd guess (though I have not yet tested it) that one can use multiple GPUs by creating a cluster on a single node with one worker per GPU, using the environment variables mentioned above.
I would recommend tf.learn. It is such a good tool; much distributed training can be done with tf.contrib.learn.Experiment. Once created, an Experiment instance knows how to invoke training and eval loops in a sensible fashion for distributed training.
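A rough sketch of that Experiment wiring (the model and input functions here are toy stand-ins, not the seq2seq ones; learn_runner picks the role for each process from TF_CONFIG):

import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn import Experiment, Estimator
from tensorflow.contrib.learn.python.learn import learn_runner

def my_model_fn(features, labels, mode):
    # Toy model: a single linear layer with a squared-error loss.
    predictions = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(
        loss, global_step=tf.contrib.framework.get_global_step())
    return tf.contrib.learn.ModelFnOps(
        mode=mode, predictions=predictions, loss=loss, train_op=train_op)

def input_fn():
    x = tf.constant(np.random.rand(64, 8), dtype=tf.float32)
    y = tf.constant(np.random.rand(64, 1), dtype=tf.float32)
    return {"x": x}, y

def experiment_fn(output_dir):
    return Experiment(
        estimator=Estimator(model_fn=my_model_fn, model_dir=output_dir),
        train_input_fn=input_fn,
        eval_input_fn=input_fn,
        train_steps=1000)

# learn_runner builds a RunConfig from TF_CONFIG and runs the appropriate role
# (parameter server, master, or worker) for this process.
learn_runner.run(experiment_fn, output_dir="/tmp/my_model")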
I am encountering the same issue. If anyone finds a solution, please keep us posted.
@Benz-Tracxpoint
I found a solution using model_deploy from TF-Slim. Usage:

with tf.Graph().as_default():
    # Set up DeploymentConfig; num_clones should not exceed the number of GPUs.
    config = model_deploy.DeploymentConfig(num_clones=num_GPUs_you_want_to_use)

    # Create the global step on the device storing the variables.
    with tf.device(config.variables_device()):
        global_step = slim.create_global_step()

    # Define the inputs for each clone.
    with tf.device(config.inputs_device()):
        images, labels = LoadData(...)
        inputs_queue = slim.data.prefetch_queue((images, labels))

    # Define the optimizer.
    with tf.device(config.optimizer_device()):
        optimizer = tf.train.MomentumOptimizer(FLAGS.learning_rate, FLAGS.momentum)

    # Define the model including the loss.
    def model_fn(inputs_queue):
        images, labels = inputs_queue.dequeue()
        predictions = CreateNetwork(images)
        slim.losses.log_loss(predictions, labels)

    model_dp = model_deploy.deploy(config, model_fn, [inputs_queue], optimizer=optimizer)

    # Run training.
    slim.learning.train(model_dp.train_op, my_log_dir,
                        summary_op=model_dp.summary_op)
I think seq2seq training is not using multiple GPUs. The tokens/sec metric is the same as when I was training on a VM with only 1 GPU or 4 GPUs.

Can someone provide a demo of how to use 4 GPUs on a single machine? All I found in the docs was https://google.github.io/seq2seq/training/#distributed-training. That links to an example of how to use multiple devices using tf.device and how to use a cluster with tf.learn, but I couldn't figure out how to proceed with either approach. Thanks!

Running python -m bin.train as specified in https://google.github.io/seq2seq/nmt/ ...

Four devices are found (from logs):

Memory is allocated to all 4, but only one GPU has non-zero utilization.