[#124] FEATURE: add support for multiple GPU nodes #113
jacksonjacobs1 wants to merge 3 commits into choosehappy:v2.0 from
Conversation
Description

The current `start_dlproc` method in quickannotator does not correctly manage GPU resources when multiple GPUs are available for a single Ray actor:

```python
def start_dlproc(self, allow_pred=True):
    if self.getProcRunningSince() is not None:
        self.logger.warning("Already running, not starting again")
        return

    self.logger.info(f"Starting up {build_actor_name(annotation_class_id=self.annotation_class_id)}")
    self.setProcRunningSince()

    total_gpus = ray.cluster_resources().get("GPU", 0)
    self.logger.info(f"Total GPUs available: {total_gpus}")

    scaling_config = ray.train.ScalingConfig(
        num_workers=int(total_gpus),
        use_gpu=True,
        resources_per_worker={"GPU": .01},
        placement_strategy="STRICT_SPREAD"
    )

    trainer = ray.train.torch.TorchTrainer(
        train_pred_loop,
        scaling_config=scaling_config,
        train_loop_config={
            'annotation_class_id': self.annotation_class_id,
            'tile_size': self.tile_size,
            'magnification': self.magnification
        }
    )

    self.hexid = trainer.fit().hex()
    self.logger.info(f"DLActor started with hexid: {self.hexid}")
    return self.hexid
```

Proposed solution

Configure the Ray cluster as follows:

```shell
# Causes Ray to not modify CUDA_VISIBLE_DEVICES, essentially allowing us to manage it ourselves
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
# Explicitly manage CUDA_VISIBLE_DEVICES. This isn't ideal, but it prevents the error "ValueError: '0' is not in list"
export CUDA_VISIBLE_DEVICES=0,1
```

Then `start_dlproc` no longer needs to set the placement strategy. The `train_pred_loop` function should be modified to look like this:

```python
def trainpred_func2(config):
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    model = resnet18(num_classes=10)
    cuda_dev = torch.device('cuda', ray.train.get_context().get_local_rank())
    model = ray.train.torch.prepare_model(model, cuda_dev)
    time.sleep(10)

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1})
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()
```
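The hardcoded `num_workers=2` above could instead be derived from the GPU count the cluster reports, which also addresses setups with more or fewer GPUs. A minimal sketch, assuming a live Ray cluster when actually used; `make_scaling_kwargs` is a hypothetical helper, not part of this PR:

```python
def make_scaling_kwargs(total_gpus: float, gpu_fraction: float = 0.1) -> dict:
    """Hypothetical helper: build ray.train.ScalingConfig keyword arguments
    from the number of GPUs the cluster reports, instead of hardcoding them."""
    if total_gpus < 1:
        # No GPUs visible: fall back to a single CPU-only worker.
        return {"num_workers": 1, "use_gpu": False}
    return {
        "num_workers": int(total_gpus),
        "use_gpu": True,
        # Fractional GPU requests let several workers share one physical device.
        "resources_per_worker": {"GPU": gpu_fraction},
    }

# In start_dlproc this could then be used as (sketch, assumes a running cluster):
#   total = ray.cluster_resources().get("GPU", 0)
#   scaling_config = ray.train.ScalingConfig(**make_scaling_kwargs(total))
```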
choosehappy left a comment:
not sure if this was actually requested for review ; )
| "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0", | ||
|
|
||
| // We set CUDA_VISIBLE_DEVICES here, as each container will need to set visible GPUs independently. | ||
| "CUDA_VISIBLE_DEVICES": "0,1" |
is it possible to set this dynamically? what if someone has, e.g., 10 GPUs, or only 1 GPU?
The hardcoded values were set for simplicity. This PR is not yet ready for review - I still need to test whether QA works with a multi-node, multi-GPU cluster.
That said, there are ways to set CUDA_VISIBLE_DEVICES dynamically:
1. Run an export command after container setup (can be added to the Dockerfile, devcontainer.json, or docker compose file):

   ```shell
   export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd "," -)
   ```

2. Use nvidia-container-toolkit's API.

   In devcontainer.json:

   ```jsonc
   {
       "name": "My GPU Dev Container",
       "runArgs": ["--gpus=all"],
       "workspaceFolder": "/workspace"
   }
   ```

   In the docker compose yaml:

   ```yaml
   services:
     app:
       image: your-image
       runtime: nvidia
       environment:
         - NVIDIA_VISIBLE_DEVICES=all
   ```
But it's unclear to me whether option 2 avoids the bug that you noticed with ray requiring CUDA_VISIBLE_DEVICES to be explicitly set:
ray-project/ray#49985 (comment)
If not, we could run the following command within the container to ensure CUDA_VISIBLE_DEVICES is set:
```shell
export CUDA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES
```
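The nvidia-smi-based export from option 1 could also be performed from Python at application startup. A sketch under the assumption that `nvidia-smi` is on the container's PATH; both function names are hypothetical:

```python
import os
import subprocess

def visible_devices_from_smi(smi_output: str) -> str:
    """Join the UUID lines printed by
    `nvidia-smi --query-gpu=uuid --format=csv,noheader`
    into a CUDA_VISIBLE_DEVICES value, mirroring the shell pipeline above."""
    return ",".join(line.strip() for line in smi_output.splitlines() if line.strip())

def set_visible_devices() -> None:
    # Assumes nvidia-smi is available inside the container.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_from_smi(out)
```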
2. Modify `devcontainer.json` to suit your use case. In particular, change the value of `CUDA_VISIBLE_DEVICES` to your desired GPU ids.
i see - are folks likely to read the readme in detail though? or perhaps we should have some explicit messages appear on the screen/log during bootup to draw their attention to these components?
https://jacksonjjacobs.com/openproject/work_packages/124
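One way to surface these settings without relying on the README, as suggested above, is a boot-time log message. A minimal sketch; `gpu_visibility_banner` is a hypothetical name and the exact wording is illustrative:

```python
def gpu_visibility_banner(env: dict) -> str:
    """Summarize the GPU-related environment at bootup so users notice the
    settings without having to read the README in detail."""
    devices = env.get("CUDA_VISIBLE_DEVICES")
    noset = env.get("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES")
    if devices is None:
        return ("WARNING: CUDA_VISIBLE_DEVICES is not set; "
                "multi-GPU Ray workers may fail to map devices.")
    return (f"GPU config: CUDA_VISIBLE_DEVICES={devices}, "
            f"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES={noset}")

# At bootup this could be emitted via the existing logger (sketch):
#   logger.info(gpu_visibility_banner(dict(os.environ)))
```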