
Question about device option #6

Open
ChristinaLK opened this issue Mar 11, 2020 · 5 comments
@ChristinaLK (Collaborator)

Does this line make this script ALWAYS use the "first" GPU on a server? What if HTCondor has assigned you a different one (e.g. GPU device 3 instead of GPU device 0)?

@sameerd (Contributor) commented Mar 11, 2020

That is an interesting question and I don't know the answer.

I assume that TensorFlow won't be able to see the other GPUs, so GPU0 will be the first GPU that it can see.

I added the following lines to the script to see which GPUs TensorFlow can see.

print("GPU Devices:")
print(tf.config.list_physical_devices('GPU'))

It is in queue and I'll report back when it is done.

@sameerd (Contributor) commented Mar 12, 2020

The test showed that tf.device("/gpu:0") refers to the first GPU that HTCondor has assigned, not the first GPU on the server. So this is working correctly.

In case it is useful, here is more detail.

The code I wrote above to print the list of devices only works on TensorFlow 2.0+; for TensorFlow 1.4 it had to be changed to the following:

from tensorflow.python.client import device_lib

# List every device TensorFlow can see and keep only the GPUs
local_device_protos = device_lib.list_local_devices()
gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
print(gpus)

The output from this job (12717904.0) was

['/device:GPU:0']

In the stderr file, TensorFlow says that it assigned PCI bus id 0000:5e:00.0 to GPU:0:

...
2020-03-11 21:51:15.974258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10312 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
...

When we look at the GPUs on the machine, we see that this PCI bus id belongs to CUDA1 and not CUDA0.

$ condor_status -long gitter2002.chtc.wisc.edu | grep -i 0000:5e:00.0
CUDA1DevicePciBusId = "0000:5E:00.0"

So, to sum up, this server has 4 GPUs (CUDA0, CUDA1, CUDA2, CUDA3). HTCondor assigned this job the 2nd GPU, i.e. CUDA1, and TensorFlow mapped that to GPU0.

Let me know if you need anything else.

@ChristinaLK (Collaborator, Author)

Awesome, thanks @sameerd! I'll pass this on.

@jmvera255

Thanks @sameerd for looking into this; Christina asked about it on my behalf. Do you recommend that users always use tf.device to instruct TensorFlow to use only the GPU that HTCondor has allocated to the job? I have someone whose log output from TensorFlow looks like it is trying to use all the GPUs on the machine the job landed on.

@sameerd (Contributor) commented Mar 12, 2020

@jmvera255 tf.device is mainly used to determine whether computations are placed on the CPU or the GPU. Looking through the logs of a test case, TensorFlow only sees the GPUs that are allocated to its own job, and tf.device("/gpu:0") will be the first of these.
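
For illustration, here is a minimal sketch (TensorFlow 2.x, not the code from the actual script; the matrix multiply is just a placeholder workload) that pins a computation to the first visible GPU and logs where it actually ran:

import tensorflow as tf

# Log device placement so the stderr file shows where each op ran
tf.debugging.set_log_device_placement(True)

# "/GPU:0" is the first GPU visible to this job (the one HTCondor assigned),
# not necessarily the first physical GPU on the server
with tf.device("/GPU:0"):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)

print(c.device)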

If someone's logs look like TensorFlow was trying to use all the GPUs, then either:

  1. The server they are running on is misconfigured. According to the HTCondor docs, it is supposed to set an environment variable called CUDA_VISIBLE_DEVICES. TensorFlow automatically reads this variable to know which GPUs it can use (the first of these becomes gpu:0). So maybe this variable is incorrect? There is a quick check sketched below.
  2. They actually requested all the GPUs in the submit file?

I'm not sure what else could be causing TensorFlow to use all the GPUs.
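
If it helps, here is a minimal sketch (again TensorFlow 2.x, just an illustration) of how a job could print what HTCondor set, to rule out the first case:

import os
import tensorflow as tf

# HTCondor should point the job at its assigned GPUs via this variable;
# if it is unset or lists every device, TensorFlow will try to use them all
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# This list should contain one entry per GPU the job was assigned
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))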
