
not working correctly when only one gpu is available #3

Open
f-fuchs opened this issue Aug 26, 2024 · 5 comments


f-fuchs commented Aug 26, 2024

Hey,

When I try to use 3 GPUs but only 2 are available, the library behaves as expected and the third process crashes.

python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 3 --nb-gpus 1
[1] 351
[2] 352
GPUOwner3 2024-08-26 10:36:12,762 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,762 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:12,767 [INFO] acquiring lock
GPUOwner2 2024-08-26 10:36:12,803 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,805 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner3 2024-08-26 10:36:22,043 [INFO] lock released
GPUOwner3 2024-08-26 10:36:22,043 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:36:22,044 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:22,081 [INFO] Set CUDA_VISIBLE_DEVICES=1
GPUOwner3 2024-08-26 10:36:31,044 [INFO] Finished
GPUOwner1 2024-08-26 10:36:31,214 [INFO] lock released
GPUOwner2 2024-08-26 10:36:31,214 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:31,214 [INFO] Allocated devices: [1]
GPUOwner2 2024-08-26 10:36:31,252 [INFO] lock released
Traceback (most recent call last):
  File "/home/fuchsfa/foundation-models/gpu-acquisitor.py", line 77, in <module>
    safe_gpu.claim_gpus(
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 153, in claim_gpus
    gpu_owner = GPUOwner(nb_gpus, placeholder_fn, logger, debug_sleep)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 132, in __init__
    raise RuntimeError(f"Required {nb_gpus} GPUs, only found these free: {free_gpus}. Somebody didn't properly declare their resources?")
RuntimeError: Required 1 GPUs, only found these free: []. Somebody didn't properly declare their resources?
[2]+  Exit 1                  python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1

But when I try to use 2 GPUs while only 1 is available, both processes get the one available GPU. Can I prevent this?

python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1
[3] 6117
GPUOwner2 2024-08-26 10:35:35,666 [INFO] Running on a machine with single GPU used for actual display
GPUOwner2 2024-08-26 10:35:35,667 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Running on a machine with single GPU used for actual display
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:44,742 [INFO] Allocated devices: [0]
GPUOwner2 2024-08-26 10:35:44,850 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:35:53,743 [INFO] Finished
GPUOwner2 2024-08-26 10:35:53,850 [INFO] Finished
[2]   Done                    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
[3]+  Done                    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
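
For reference, a minimal sketch of what a gpu-acquisitor.py script along these lines could look like, reconstructed from the commands and the traceback above. The flag names match the commands; the logging setup and the sleep standing in for real work are assumptions, and only safe_gpu.claim_gpus with its nb_gpus/logger arguments is confirmed by the traceback.

```python
# Sketch reconstructed from the commands and traceback above; details are assumptions.
import argparse
import logging
import time

from safe_gpu import safe_gpu

parser = argparse.ArgumentParser()
parser.add_argument("--backend", default="pytorch")   # assumed: selects the placeholder backend, unused here
parser.add_argument("--id", type=int, required=True)  # assumed: only used for the logger name
parser.add_argument("--nb-gpus", type=int, default=1)
args = parser.parse_args()

# Log format chosen to match the output shown above.
logging.basicConfig(format="%(name)s %(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(f"GPUOwner{args.id}")
logger.setLevel(logging.INFO)

# claim_gpus() is the call visible in the traceback; it sets CUDA_VISIBLE_DEVICES
# and serializes allocation against other safe_gpu users via a lock.
safe_gpu.claim_gpus(nb_gpus=args.nb_gpus, logger=logger)

time.sleep(10)  # stand-in for actual GPU work
logger.info("Finished")
```
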
ibenes (Collaborator) commented Aug 26, 2024

Hello @f-fuchs,
thanks for bringing this up! I'm currently very busy with other things (a thesis deadline, in fact ;-) ). I will look into this in September, hopefully as early as next week. Please poke me if I don't.

f-fuchs (Author) commented Sep 18, 2024

Hey @ibenes,

I hope everything went well with your thesis. Have you had a chance to look into this yet?

ibenes (Collaborator) commented Oct 8, 2024

Hi @f-fuchs!
Thank you for asking ;-) It's done now, on to safe-gpu issues!

As far as I can tell from your example, your single-GPU machine is not in exclusive mode. This case has special handling in safe-gpu. The default case for us is GPUs in our local cluster, which are in exclusive mode and need to be allocated appropriately. The other case is the occasional GPU in someone's PC, which is not exclusive and runs a couple of GUI-related processes. There we don't want to check that the card is free (because it is not), and safe-gpu simply sends the current process to it.

Is that roughly your situation? I know the test for exclusivity is not perfect; if you'd like the behaviour of safe-gpu to change in your case, could you attach the output of nvidia-smi -q here?
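
For anyone following along: the field in question is the per-GPU "Compute Mode" entry in nvidia-smi -q. Below is a small illustration of how that mode can be read programmatically with pynvml (the nvidia-ml-py bindings); this is only a sketch of the concept, not necessarily how safe-gpu performs its own exclusivity test.

```python
# Illustration only: read the per-GPU compute mode that nvidia-smi -q reports.
import pynvml

pynvml.nvmlInit()
mode_names = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "Default",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "Exclusive Thread",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive Process",
}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    print(f"GPU {i}: compute mode = {mode_names.get(mode, mode)}")
pynvml.nvmlShutdown()
```
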

f-fuchs (Author) commented Oct 8, 2024

Okay, now I am confused 😕 I ran nvidia-smi -q while having one GPU and while having four; both times the Compute Mode was set to Default, but with four GPUs it works correctly.

If this is the intended behavior, that's also fine; I currently don't need it to work with one GPU. 👌
Just happy it works with multiple ones 😄

ibenes (Collaborator) commented Oct 8, 2024

Thanks for the input; we will update the test for exclusivity. I will keep this issue open until then 💪
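
Purely as an illustration of one possible direction (not the maintainers' plan): a stricter check could ignore the compute mode and treat a GPU as free only if it currently runs no compute processes. A hypothetical sketch with pynvml:

```python
# Hypothetical sketch of a stricter "is this GPU free?" check; not the actual
# safe-gpu implementation.
import pynvml

def free_gpu_indices() -> list[int]:
    """Return indices of GPUs that currently run no compute processes."""
    pynvml.nvmlInit()
    try:
        free = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            if not procs:
                free.append(i)
        return free
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print("Free GPUs:", free_gpu_indices())
```
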
