How to use Multiple GPUs? #44
There is nothing about GPU device placement hardcoded, so TensorFlow should handle the device placement itself. I usually train with only 1 GPU (but multiple workers), so I haven't tried the multi-GPU case. Can you try running a larger model? It could be that TF decides the small model is not worth splitting across GPUs; hopefully it will put the computation of a larger one on separate devices. E.g. use one of the larger example configs.
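As a quick sanity check, one can log device placement to see where TF actually put each op; a minimal plain TF 1.x sketch (not seq2seq-specific, just an illustration):

import tensorflow as tf

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

# log_device_placement prints the device (/cpu:0, /gpu:0, ...) chosen for every op,
# which shows whether TF actually spread the computation across GPUs.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)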
I was using:
By the way, I was just expecting data parallelism — that different batches would be processed on different GPUs. Sounds very similar to your multiple worker set-up, just on one machine. (But I still don't know how to invoke that, if it's even possible.)
I see. I think it is not too common to have data parallelism on the same machine for seq2seq models, but people have found that putting different RNN layers on separate devices speeds things up, and we should do that if more than 1 GPU is available. I will need to look into data parallelism on multiple GPUs. In the best case, all we need to do is instantiate the model multiple times, each on a separate GPU, and average the losses. In that case it may only require a few lines of code change. But maybe it's more complex than that. Thanks for reporting, I'll take a look at this soon (may take 2-3 days).
Great! Thanks for taking a look. I think the use case is reasonably common among academics: launch a fresh 8-GPU instance on some public cloud, install/configure software, download data, and run an experiment. OpenNMT follows this model, I believe.
Sounds reasonable. Will add this in the next few days.
@dennybritz, may I ask what's the state of this issue? I'm currently trying to train a conversational dialogue system using this tool and would like to train the model using multiple GPUs, since our (desired) model is rather huge, with 4096 hidden units in each of the encoder/decoder, and I currently run into OOM problems when the size of my model exceeds 2048 hidden units. I'm willing to invest some time to help you implement this feature (if needed). I already took a quick look at the code but couldn't find an obvious place to put the device placement code.
The original issue of parallelizing training across multiple GPUs through data parallelism is very high on my priority list and I will add that ASAP. However, that seems different from your issue, @vongruenigen. What you want is to split the model across multiple GPUs. You're not going to fit a model that big into a single GPU. Just to do a back-of-the-envelope calculation: if you have a ~30k vocab and 4096 units, your softmax matrix alone will be around 30,000 × 4,096 ≈ 123M parameters.
It will still work, but it's not going to help you. The vast majority of parameters/memory are usually in the softmax and embeddings/inputs. That's what you need to split (or use an alternative for), and there is no "obvious" way to do that, other than maybe using a sampled or sharded softmax.
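To make the back-of-the-envelope numbers concrete, a rough calculation (plain arithmetic, assuming fp32 weights and the sizes mentioned above; the embedding size is assumed to match the hidden size):

vocab_size = 30000
hidden_units = 4096

softmax_params = vocab_size * hidden_units      # ~123M parameters for the output projection
embedding_params = vocab_size * hidden_units    # roughly the same order again for input embeddings
bytes_fp32 = 4

print((softmax_params + embedding_params) * bytes_fp32 / 1e9)  # ~1 GB for these weights alone
# Gradients and optimizer slots (e.g. Adam keeps two extra copies per weight) multiply this
# several times, before even counting the per-timestep activations of the RNN.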
@dennybritz, I was aware that a large number of parameters is placed in the softmax, but I didn't realize that it's that huge. I'm going to investigate using a sampled/sharded softmax and try to find a solution. Thanks a lot for the quick response and the clarifying explanation!
The docs say: "Distributed Training is supported out of the box using tf.learn. Cluster Configurations can be specified using the TF_CONFIG environment variable, which is parsed by the RunConfig. Refer to the Distributed Tensorflow Guide for more information." Is there any example of how this works?
For a general introduction to distributed training settings, check out the TensorFlow tutorial: https://www.tensorflow.org/deploy/distributed. I haven't seen a full example of using TF_CONFIG with this code myself, but instead of needing to change the code, I believe you should be able to set all required options via that environment variable.
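In case it helps, a sketch of what the TF_CONFIG contents generally look like for the tf.learn RunConfig (the host names, ports, and the "environment" key are assumptions; check the RunConfig docs of your TF version for the exact format). In practice you would export this JSON in the shell before launching python -m bin.train in each process:

import json
import os

cluster = {
    "master": ["machine-a:2222"],
    "ps": ["machine-a:2223"],
    "worker": ["machine-b:2222", "machine-b:2223"],
}
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    # "task" differs per process: it names the role this particular process plays.
    "task": {"type": "worker", "index": 0},
    # Some tf.learn versions also check "environment" to decide on distributed behaviour.
    "environment": "cloud",
})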
Hi @dennybritz, any news on this topic? I was trying to train a model on a machine with 8 GPUs. Here's the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 70C P0 75W / 149W | 10417MiB / 11439MiB | 71% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 52C P0 81W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 63C P0 65W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 55C P0 79W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 65C P0 64W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 50C P0 77W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 66C P0 67W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 54C P0 81W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2316 C python 10407MiB |
| 1 2316 C python 10368MiB |
| 2 2316 C python 10368MiB |
| 3 2316 C python 10368MiB |
| 4 2316 C python 10368MiB |
| 5 2316 C python 10368MiB |
| 6 2316 C python 10368MiB |
| 7 2316 C python 10368MiB |
+-----------------------------------------------------------------------------+

By the way, it seems that TensorFlow is actually using the memory of all the GPUs, but only one of them is actually used. Is this something expected?
Interesting...
@davidecaroselli I have the same problem.
@dennybritz: wanted to know if there are any updates on this.
I have the same issue.
Are there any updates or ideas? I also want to train a model with multiple GPUs. It seems @dennybritz is busy with other things.
@davidecaroselli @wolfshow I face the same problems. How did you smart guys solve them? Many thanks.
Still waiting.
I would recommend the
@davidecaroselli About the all-GPU-memory problem: TF provides the allow_growth session option so that it only allocates GPU memory as it is actually needed. I don't use seq2seq yet, but looking at its bin/train.py, I found a flag for this setting.
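For reference, the standard way to enable that behaviour in plain TF 1.x (this is the generic session option; the corresponding flag name in bin/train.py may differ):

import tensorflow as tf

# Without this, TF grabs (nearly) all memory on every visible GPU at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)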
@nptdat In fact, that doesn't solve these problems. I think the only way to make full use of the GPUs is 1. data parallelism, or 2. manually allocating each layer (or group of layers) to a GPU. However, this library seems to be abandoned...
But the results page says that @dennybritz used 8 GPUs:
@ad26kt Yeah, I was just talking about the memory allocation problem, not about how to make all GPUs do work.
Guys, the answer to all your problems is cuDNN. My tokens/sec after I use this: (numbers omitted). But after training the model, I can see utilization for only one GPU.

Update: (nvidia-smi output omitted.)
@ad26kt No, you can use data parallelism too in TensorFlow; refer to the cifar10 multi-GPU example in the TensorFlow tutorials. As @nptdat mentioned, I also suspect that allow_growth (or rather its default being off) is the reason all the memory gets used. Even if you are using only a single-GPU model, TensorFlow by default allocates the full memory on all the GPUs it can see. If you were not aware of this, the visibility of GPUs to a given application can be controlled by prepending the run command with CUDA_VISIBLE_DEVICES=<gpu_numbers_to_be_made_visible>.
@sampathchanda
@DucVuMinh TensorFlow by default takes the memory of all GPUs, because it allocates the maximum memory for your job, but it does not use their processing power. To utilize the processing power of all GPUs as well, you need to add tf.device statements wherever you want parallel processing in your code. In TensorFlow you have to assign devices manually and also compute the overall gradients by collecting the output from all devices yourself. MXNet does this automatically: you just specify a context statement listing the available GPUs, and it averages the loss of your model for you. Let me know if you have any more questions.
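To illustrate the manual tf.device plus gradient-averaging pattern described above, a minimal TF 1.x multi-tower sketch (a toy linear model stands in for the real network; NUM_GPUS and the helper names are assumptions, and only dense gradients are handled):

import tensorflow as tf

NUM_GPUS = 2  # adjust to the number of visible GPUs

def build_loss():
    # Toy stand-in for the real model: a linear layer on random data.
    x = tf.random_normal([32, 128])
    y = tf.random_normal([32, 1])
    w = tf.get_variable("w", [128, 1])
    return tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

def average_gradients(tower_grads):
    # tower_grads: one list of (grad, var) pairs per tower, all in the same variable order.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars if g is not None]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
    return averaged

optimizer = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i in range(NUM_GPUS):
    # One "tower" per GPU; variables are created by tower 0 and reused by the others.
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        tower_grads.append(optimizer.compute_gradients(build_loss()))

# Average the per-tower gradients and apply them once.
with tf.device("/cpu:0"):
    train_op = optimizer.apply_gradients(average_gradients(tower_grads))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)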
@imranshaikmuma
@DucVuMinh can you show me your code?
@DucVuMinh For the memory problem, you can try adding one more line to set the allow_growth flag mentioned above.
@imranshaikmuma
I'm using three layers of LSTM. Can you look it over for me? Thank you very much.
@papajohn On one machine, I'd guess (though I have not yet tested it) that one can use multiple GPUs by creating a cluster on a single node with one worker per GPU, using the environment variables mentioned above.
I would recommend tf.learn. It is such a good tool; much distributed training can be done with tf.contrib.learn.Experiment. Once created, an Experiment instance knows how to invoke training and eval loops in a sensible fashion for distributed training.
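A rough sketch of that Experiment wiring (the model and input functions here are toy stand-ins, not the seq2seq ones; learn_runner picks the role for each process from TF_CONFIG):

import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn import Experiment, Estimator
from tensorflow.contrib.learn.python.learn import learn_runner

def my_model_fn(features, labels, mode):
    # Toy model: a single linear layer with a squared-error loss.
    predictions = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(
        loss, global_step=tf.contrib.framework.get_global_step())
    return tf.contrib.learn.ModelFnOps(
        mode=mode, predictions=predictions, loss=loss, train_op=train_op)

def input_fn():
    x = tf.constant(np.random.rand(64, 8), dtype=tf.float32)
    y = tf.constant(np.random.rand(64, 1), dtype=tf.float32)
    return {"x": x}, y

def experiment_fn(output_dir):
    return Experiment(
        estimator=Estimator(model_fn=my_model_fn, model_dir=output_dir),
        train_input_fn=input_fn,
        eval_input_fn=input_fn,
        train_steps=1000)

# learn_runner builds a RunConfig from TF_CONFIG and runs the appropriate role
# (parameter server, master, or worker) for this process.
learn_runner.run(experiment_fn, output_dir="/tmp/my_model")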
I am encountering the same issue. If anyone finds a solution, please keep us posted.
@Benz-Tracxpoint
I found a solution using model_deploy from TF-Slim. Usage:

with tf.Graph().as_default():
    # Set up DeploymentConfig; num_clones should not exceed the number of GPUs.
    config = model_deploy.DeploymentConfig(num_clones=num_GPUs_you_want_to_use)

    # Create the global step on the device storing the variables.
    with tf.device(config.variables_device()):
        global_step = slim.create_global_step()

    # Define the inputs for each clone.
    with tf.device(config.inputs_device()):
        images, labels = LoadData(...)
        inputs_queue = slim.data.prefetch_queue((images, labels))

    # Define the optimizer.
    with tf.device(config.optimizer_device()):
        optimizer = tf.train.MomentumOptimizer(FLAGS.learning_rate, FLAGS.momentum)

    # Define the model including the loss.
    def model_fn(inputs_queue):
        images, labels = inputs_queue.dequeue()
        predictions = CreateNetwork(images)
        slim.losses.log_loss(predictions, labels)

    model_dp = model_deploy.deploy(config, model_fn, [inputs_queue], optimizer=optimizer)

    # Run training.
    slim.learning.train(model_dp.train_op, my_log_dir,
                        summary_op=model_dp.summary_op)
I think seq2seq training is not using multiple GPUs. The tokens/sec metric is the same as when I was training on a VM with only 1 GPU or 4 GPUs.

Can someone provide a demo of how to use 4 GPUs on a single machine? All I found in the docs was https://google.github.io/seq2seq/training/#distributed-training. That links to an example of how to use multiple devices using tf.device and how to use a cluster with tf.learn, but I couldn't figure out how to proceed with either approach. Thanks!

Running python -m bin.train as specified in https://google.github.io/seq2seq/nmt/ ...

Four devices are found (from logs):

Memory is allocated to all 4, but only one GPU has non-zero utilization.