
Question about device option #6

Open
ChristinaLK opened this issue Mar 11, 2020 · 5 comments
@ChristinaLK (Collaborator)

Does this line make this script ALWAYS use the "first" GPU on a server? What if HTCondor has assigned you a different one (e.g. GPU device 3 instead of GPU device 0)?

@sameerd (Contributor) commented Mar 11, 2020

That is an interesting question and I don't know the answer.

I assume that TensorFlow won't be able to see the other GPUs, so GPU0 will be the first GPU that it can see.

I added the following lines to the script to see which GPUs TensorFlow can see.

print("GPU Devices:")
print(tf.config.list_physical_devices('GPU'))

It is in queue and I'll report back when it is done.

@sameerd (Contributor) commented Mar 12, 2020

The test showed that tf.device("/gpu:0") refers to the first GPU that HTCondor has assigned, not the first GPU on the server. So this is working correctly.

In case it is useful, here is more detail.

The code I wrote above to print the list of devices only works on TensorFlow 2.0+; for TensorFlow 1.4 it had to be changed to the following:

from tensorflow.python.client import device_lib

# List every device TensorFlow can see and keep only the GPUs
local_device_protos = device_lib.list_local_devices()
gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
print(gpus)

The output from this job (12717904.0) was

['/device:GPU:0']

In the stderr file, TensorFlow says that it assigned PCI bus id 0000:5e:00.0 to GPU:0:

...
2020-03-11 21:51:15.974258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10312 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
...

When we look at the GPUs on the machine, we see that this PCI bus id belongs to CUDA1 and not CUDA0.

$ condor_status -long gitter2002.chtc.wisc.edu | grep -i 0000:5e:00.0
CUDA1DevicePciBusId = "0000:5E:00.0"

So, to sum up, this server has 4 GPUs (CUDA0, CUDA1, CUDA2, CUDA3). HTCondor assigned this job the 2nd GPU, i.e. CUDA1, and TensorFlow mapped that to GPU0.

Let me know if you need anything else.

@ChristinaLK (Collaborator, Author)

Awesome, thanks @sameerd! I'll pass this on.

@jmvera255

Thanks @sameerd for looking into this; Christina asked about it on my behalf. Do you recommend that users always use tf.device to instruct TensorFlow to use only the GPU that HTCondor has allocated to the job? I have someone whose log output from TensorFlow looks like it is trying to use all the GPUs on the machine the job landed on.

@sameerd (Contributor) commented Mar 12, 2020

@jmvera255 tf.device is mainly used to determine whether computations are placed on the CPU or the GPU. Looking through the logs of a test case, TensorFlow only sees the GPUs that are allocated to its own job, and tf.device("/gpu:0") will be the first of these.
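
For illustration, here is a minimal sketch (TensorFlow 2.x, not the code from the actual script; the matrix multiply is just a placeholder workload) that pins a computation to the first visible GPU and logs where it actually ran:

import tensorflow as tf

# Log device placement so the stderr file shows where each op ran
tf.debugging.set_log_device_placement(True)

# "/GPU:0" is the first GPU visible to this job (the one HTCondor assigned),
# not necessarily the first physical GPU on the server
with tf.device("/GPU:0"):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)

print(c.device)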

If someone's logs look like TensorFlow was trying to use all the GPUs, then either:

  1. The server they are running on is misconfigured. According to the HTCondor docs, it is supposed to set an environment variable called CUDA_VISIBLE_DEVICES. TensorFlow automatically reads this variable to know which GPUs it can use (the first of these becomes gpu:0). So maybe this variable is incorrect? There is a quick check sketched below.
  2. They actually requested all the GPUs in the submit file?

I'm not sure what else could be causing TensorFlow to use all the GPUs.
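
If it helps, here is a minimal sketch (again TensorFlow 2.x, just an illustration) of how a job could print what HTCondor set, to rule out the first case:

import os
import tensorflow as tf

# HTCondor should point the job at its assigned GPUs via this variable;
# if it is unset or lists every device, TensorFlow will try to use them all
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# This list should contain one entry per GPU the job was assigned
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))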
