[Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices #27564
Conversation
Code Review
This pull request correctly replaces the usage of CUDA_VISIBLE_DEVICES with programmatic device selection using torch.cuda.set_device for data parallelism on CUDA-like devices. This is a good improvement to avoid NCCL initialization issues and improve performance. The logic for calculating the device rank seems correct. I've found one minor issue with a boundary check that could lead to a crash in specific scenarios.
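For reference, a minimal sketch of the kind of programmatic device selection and boundary check discussed here; the parameter names and the rank formula are illustrative assumptions, not the actual vLLM implementation:

```python
import torch

def select_dp_device(dp_rank_local: int, tp_size: int, local_rank: int) -> torch.device:
    """Bind this worker to a CUDA device programmatically instead of
    masking devices through CUDA_VISIBLE_DEVICES (hypothetical helper)."""
    device_index = dp_rank_local * tp_size + local_rank
    # Boundary check: raise a clear error rather than crashing later inside
    # CUDA/NCCL when the index exceeds the number of visible devices.
    if device_index >= torch.cuda.device_count():
        raise ValueError(
            f"Computed device index {device_index} is out of range for "
            f"{torch.cuda.device_count()} visible CUDA devices"
        )
    device = torch.device(f"cuda:{device_index}")
    torch.cuda.set_device(device)
    return device
```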
@zhuohan123 @22quinn any chance you could test this branch with your external launcher?
Please add a test that uses the external launcher?
Has this PR been tested with the external launcher? Looking for a confirmation before merging.

Just tested manually with #27548 and it failed. Logs:
@tlrmchlsmth @njhill @22quinn External launcher works with this PR now. With external launcher we don't update
@ilmarkov looks like the test failures are related to this PR.
Double checked on my end with:
Re-applying #26709
In this PR we replace the CUDA_VISIBLE_DEVICES setting in DP initialization with an appropriate torch.cuda.set_device call. This allows us to avoid NCCL initialization issues and to improve performance.

We keep the old CUDA_VISIBLE_DEVICES approach in the cases of Ray or the external launcher, as well as non-CUDA-like devices. Changing those is left for follow-up PRs.
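A rough sketch of the resulting behavior described above; the flag names (use_ray, use_external_launcher, is_cuda_like) and helper name are assumptions for illustration, not vLLM's actual API:

```python
import os
import torch

def bind_dp_worker_device(device_index: int,
                          use_ray: bool,
                          use_external_launcher: bool,
                          is_cuda_like: bool) -> None:
    if use_ray or use_external_launcher or not is_cuda_like:
        # Old behavior, kept for these cases (left for follow-up PRs):
        # restrict visibility so the worker only sees its assigned device.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    else:
        # New behavior for CUDA-like devices: keep all devices visible and
        # select one in-process via torch.cuda.set_device.
        torch.cuda.set_device(device_index)
```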