
shared usage of GPUs #7

@DiTo97


Hi @ExpectationMax,

How difficult would it be to allow shared usage of GPUs given a known memory constraint in advance? This would be similar to the way many job-scheduling systems allocate the correct number of workers.

The utility is nice as-is, but I think such a feature would be very useful for larger GPUs (over 16 GB of memory).

For instance, we could add a memory argument to the available options of each command and keep track of the per-GPU memory usage, instead of treating each GPU as an exclusive flag (in use or free). If memory is not set, we could assume either a default memory allocation request or control of a full GPU device, regardless of capacity. Of course, we would have to retrieve the capacity of each available GPU device and make sure that any given process does not exceed its requested memory allocation.
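A minimal sketch of the bookkeeping described above, assuming the utility tracked per-GPU usage in a pool. All names here (`GpuPool`, `request`, `release`) are illustrative, not part of the existing tool:

```python
class GpuPool:
    """Hypothetical per-GPU memory bookkeeping (sizes in MiB)."""

    def __init__(self, capacities_mib):
        # capacities_mib: total memory per GPU index, e.g. {0: 16384, 1: 16384}
        self.capacities = dict(capacities_mib)
        self.used = {gpu: 0 for gpu in self.capacities}

    def request(self, memory_mib=None):
        """Reserve memory on the first GPU that fits; return its index or None.

        If no memory is given, fall back to claiming a whole idle device,
        mirroring the current exclusive in-use/free behaviour.
        """
        for gpu, total in self.capacities.items():
            if memory_mib is None:
                if self.used[gpu] == 0:      # whole device must be idle
                    self.used[gpu] = total
                    return gpu
                continue
            if self.used[gpu] + memory_mib <= total:
                self.used[gpu] += memory_mib
                return gpu
        return None  # no GPU can satisfy the request

    def release(self, gpu, memory_mib):
        """Give back a previous reservation."""
        self.used[gpu] = max(0, self.used[gpu] - memory_mib)
```

With two 16 GiB GPUs, three 8192 MiB requests would pack two onto GPU 0 and one onto GPU 1, and a subsequent full-device request would be refused until something is released.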

For the latter, the main deep learning frameworks have ways to cap per-process memory, but I do not know how we could enforce it at the device level regardless of the framework.
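Framework-agnostic capacity retrieval could lean on the `nvidia-smi` CLI, which can report per-device totals in machine-readable form. A hedged sketch (the query flags are real `nvidia-smi` options; the function names are illustrative):

```python
import subprocess

# nvidia-smi prints one CSV line per GPU, e.g. "16384, 1024"
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_memory_report(text):
    """Parse nvidia-smi CSV output into {gpu_index: (total_mib, used_mib)}."""
    report = {}
    for index, line in enumerate(text.strip().splitlines()):
        total, used = (int(field) for field in line.split(","))
        report[index] = (total, used)
    return report

def query_gpus():
    """Run nvidia-smi (requires an NVIDIA driver) and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_report(result.stdout)
```

This only observes memory, though; it does not enforce a limit on a process, which is the part I am unsure how to do outside the frameworks.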

Alternatively, are there any other utilities you know of that already integrate this feature and that I could use?
