
not working correctly when only one gpu is available #3

Open
f-fuchs opened this issue Aug 26, 2024 · 5 comments


f-fuchs commented Aug 26, 2024

Hey,

When I try to use 3 GPUs but only 2 are available, the library behaves as expected and the third process crashes.

python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 3 --nb-gpus 1
[1] 351
[2] 352
GPUOwner3 2024-08-26 10:36:12,762 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,762 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:12,767 [INFO] acquiring lock
GPUOwner2 2024-08-26 10:36:12,803 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,805 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner3 2024-08-26 10:36:22,043 [INFO] lock released
GPUOwner3 2024-08-26 10:36:22,043 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:36:22,044 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:22,081 [INFO] Set CUDA_VISIBLE_DEVICES=1
GPUOwner3 2024-08-26 10:36:31,044 [INFO] Finished
GPUOwner1 2024-08-26 10:36:31,214 [INFO] lock released
GPUOwner2 2024-08-26 10:36:31,214 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:31,214 [INFO] Allocated devices: [1]
GPUOwner2 2024-08-26 10:36:31,252 [INFO] lock released
Traceback (most recent call last):
  File "/home/fuchsfa/foundation-models/gpu-acquisitor.py", line 77, in <module>
    safe_gpu.claim_gpus(
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 153, in claim_gpus
    gpu_owner = GPUOwner(nb_gpus, placeholder_fn, logger, debug_sleep)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 132, in __init__
    raise RuntimeError(f"Required {nb_gpus} GPUs, only found these free: {free_gpus}. Somebody didn't properly declare their resources?")
RuntimeError: Required 1 GPUs, only found these free: []. Somebody didn't properly declare their resources?
[2]+  Exit 1                  python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1

But when I try to use 2 GPUs while only 1 is available, both processes get the one available GPU. Can I prevent this?

python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1
[3] 6117
GPUOwner2 2024-08-26 10:35:35,666 [INFO] Running on a machine with single GPU used for actual display
GPUOwner2 2024-08-26 10:35:35,667 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Running on a machine with single GPU used for actual display
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:44,742 [INFO] Allocated devices: [0]
GPUOwner2 2024-08-26 10:35:44,850 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:35:53,743 [INFO] Finished
GPUOwner2 2024-08-26 10:35:53,850 [INFO] Finished
[2]   Done                    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
[3]+  Done                    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
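
For reference, a minimal sketch of what a gpu-acquisitor.py script along these lines could look like, reconstructed from the commands and the traceback above. The flag names match the commands; the logging setup and the sleep standing in for real work are assumptions, and only safe_gpu.claim_gpus with its nb_gpus/logger arguments is confirmed by the traceback.

```python
# Sketch reconstructed from the commands and traceback above; details are assumptions.
import argparse
import logging
import time

from safe_gpu import safe_gpu

parser = argparse.ArgumentParser()
parser.add_argument("--backend", default="pytorch")   # assumed: selects the placeholder backend, unused here
parser.add_argument("--id", type=int, required=True)  # assumed: only used for the logger name
parser.add_argument("--nb-gpus", type=int, default=1)
args = parser.parse_args()

# Log format chosen to match the output shown above.
logging.basicConfig(format="%(name)s %(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(f"GPUOwner{args.id}")
logger.setLevel(logging.INFO)

# claim_gpus() is the call visible in the traceback; it sets CUDA_VISIBLE_DEVICES
# and serializes allocation against other safe_gpu users via a lock.
safe_gpu.claim_gpus(nb_gpus=args.nb_gpus, logger=logger)

time.sleep(10)  # stand-in for actual GPU work
logger.info("Finished")
```
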
ibenes (Collaborator) commented Aug 26, 2024

Hello @f-fuchs,
thanks for bringing this up! I'm currently very busy with other things (a thesis deadline, in fact ;-) ). I will look into this in September, hopefully as early as next week. Please poke me if I don't.

f-fuchs (Author) commented Sep 18, 2024

Hey @ibenes,

I hope everything went well with your thesis. Have you had a chance to look into this yet?

ibenes (Collaborator) commented Oct 8, 2024

Hi @f-fuchs!
Thank you for asking ;-) It's done now, on to safe-gpu issues!

As far as I can tell from your example, your single-GPU machine is not in exclusive mode. This case has special handling in safe-gpu. The default case for us is GPUs in our local cluster, which are in exclusive mode and need to be allocated appropriately. The other case is the occasional GPU in someone's PC, which is not exclusive and runs a couple of GUI-related processes. There we don't want to check that the card is free (because it is not), and safe-gpu simply sends the current process to it.

Is that roughly your situation? I know the test for exclusivity is not perfect; if you'd like the behaviour of safe-gpu to change in your case, could you attach the output of nvidia-smi -q here?
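
For anyone following along: the field in question is the per-GPU "Compute Mode" entry in nvidia-smi -q. Below is a small illustration of how that mode can be read programmatically with pynvml (the nvidia-ml-py bindings); this is only a sketch of the concept, not necessarily how safe-gpu performs its own exclusivity test.

```python
# Illustration only: read the per-GPU compute mode that nvidia-smi -q reports.
import pynvml

pynvml.nvmlInit()
mode_names = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "Default",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "Exclusive Thread",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive Process",
}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    print(f"GPU {i}: compute mode = {mode_names.get(mode, mode)}")
pynvml.nvmlShutdown()
```
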

f-fuchs (Author) commented Oct 8, 2024

Okay, now I am confused 😕 I ran nvidia-smi -q while having one GPU and while having four; both times the Compute Mode was set to Default, but with four GPUs it works correctly.

If this is the intended behavior, that's also fine; I currently don't need it to work with one GPU. 👌
Just happy it works with multiple ones 😄

ibenes (Collaborator) commented Oct 8, 2024

Thanks for the input; we will update the test for exclusivity. I will keep this issue open until then 💪
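
Purely as an illustration of one possible direction (not the maintainers' plan): a stricter check could ignore the compute mode and treat a GPU as free only if it currently runs no compute processes. A hypothetical sketch with pynvml:

```python
# Hypothetical sketch of a stricter "is this GPU free?" check; not the actual
# safe-gpu implementation.
import pynvml

def free_gpu_indices() -> list[int]:
    """Return indices of GPUs that currently run no compute processes."""
    pynvml.nvmlInit()
    try:
        free = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            if not procs:
                free.append(i)
        return free
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print("Free GPUs:", free_gpu_indices())
```
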
