
Conversation

@ilmarkov (Contributor) commented Oct 27, 2025

Re-applying #26709

In this PR we replace the CUDA_VISIBLE_DEVICES setting in DP initialization with an appropriate torch.cuda.set_device call.

This allows us to:

  • Avoid slow NCCL initialization (which gets confused by multiple devices with the same id in one world group)
  • Avoid breaking the GPU -> NIC mapping, which is required for DeepEP performance
  • Allow using torch symmetric memory for the DP group

We keep the old CUDA_VISIBLE_DEVICES approach in the cases of Ray or the external launcher, as well as for non-CUDA-like devices; migrating those is left for follow-up PRs.
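
A minimal sketch of the change in approach (hedged: the function and variable names below, e.g. select_dp_device, dp_rank_local, gpus_per_dp_rank, are illustrative and not the exact vLLM identifiers):

    import torch

    def select_dp_device(local_rank: int, dp_rank_local: int, gpus_per_dp_rank: int) -> torch.device:
        # Old approach: narrow CUDA_VISIBLE_DEVICES so each DP rank only sees its
        # slice of GPUs. Every process then uses the same low device ids, which
        # confuses NCCL initialization and can break the GPU -> NIC mapping that
        # DeepEP relies on.
        # os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
        #     str(dp_rank_local * gpus_per_dp_rank + i) for i in range(gpus_per_dp_rank))

        # New approach: keep all GPUs visible and bind the process to its device
        # explicitly.
        device_id = dp_rank_local * gpus_per_dp_rank + local_rank
        assert device_id < torch.cuda.device_count(), (
            f"DP adjusted local rank {device_id} is out of bounds."
        )
        device = torch.device(f"cuda:{device_id}")
        torch.cuda.set_device(device)
        return device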



@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly replaces the use of CUDA_VISIBLE_DEVICES with programmatic device selection via torch.cuda.set_device for data parallelism on CUDA-like devices. This is a good improvement that avoids NCCL initialization issues and improves performance. The logic for calculating the device rank looks correct. I've found one minor issue with a boundary check that could lead to a crash in specific scenarios.

Signed-off-by: ilmarkov <[email protected]>
@njhill (Member) left a comment

Thanks @ilmarkov!

I'll add the same comment I made on #26709: we can look at making things more consistent in a follow-up PR, w.r.t. other platforms and local_rank computation.

@njhill (Member) commented Oct 27, 2025

@zhuohan123 @22quinn any chance you could test this branch with your external launcher?

@tlrmchlsmth (Member) left a comment

Please add a test that uses the external launcher?

@njhill (Member) commented Oct 29, 2025

> Please add a test that uses the external launcher?

@22quinn recently opened a PR for this: #27548

@tlrmchlsmth added the "ready" label (ONLY add when PR is ready to merge / full CI is needed) on Oct 29, 2025
@tlrmchlsmth (Member)

Has this PR been tested with the external launcher? Looking for confirmation before merging.

@22quinn (Collaborator) commented Oct 29, 2025

Just tested manually with #27548 and it failed. Logs:

Traceback (most recent call last):
  File "/data/users/quinnzhu/gitrepos/vllm/torchrun.py", line 98, in <module>
    llm = LLM(
          ^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/entrypoints/llm.py", line 335, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/llm_engine.py", line 188, in from_engine_args
    return cls(
           ^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/llm_engine.py", line 122, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/core_client.py", line 95, in make_client
    return InprocClient(vllm_config, executor_class, log_stats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/core_client.py", line 264, in __init__
    self.engine_core = EngineCore(*args, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/core.py", line 102, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/executor/abstract.py", line 98, in __init__
    self._init_executor()
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/executor/uniproc_executor.py", line 132, in _init_executor
    super()._init_executor()
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
    self.driver_worker.init_device()
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/worker/worker_base.py", line 310, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/worker/gpu_worker.py", line 193, in init_device
    assert self.local_rank < torch.cuda.device_count(), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: DP adjusted local rank 13 is out of bounds. 
(The same traceback repeats for the three other failing ranks, ending with:)
AssertionError: DP adjusted local rank 12 is out of bounds.
AssertionError: DP adjusted local rank 9 is out of bounds.
AssertionError: DP adjusted local rank 8 is out of bounds.
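
For intuition about the failure above, a hedged arithmetic sketch of how a DP-adjusted device id such as 13 can exceed torch.cuda.device_count() on an 8-GPU node when the adjustment is applied under a launcher that already assigns one process per GPU (numbers and names are illustrative only):

    # Illustrative numbers; the exact formula in vLLM may differ.
    gpus_per_node = 8      # what torch.cuda.device_count() would report
    gpus_per_dp_rank = 8   # e.g. a TP*PP group assumed to span the node
    dp_rank_local = 1      # a second DP rank assumed colocated on the node
    local_rank = 5
    device_id = dp_rank_local * gpus_per_dp_rank + local_rank
    print(device_id, device_id < gpus_per_node)  # 13 False -> the assert above fires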

@ilmarkov (Contributor, Author)
@tlrmchlsmth @njhill @22quinn The external launcher works with this PR now. With the external launcher we don't update CUDA_VISIBLE_DEVICES, so we don't need to adjust ranks in this PR at all.
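
A hedged sketch of the resulting dispatch described in this comment (helper names such as init_dp_device, uses_ray, and external_launcher are illustrative assumptions, not the actual vLLM config fields):

    import torch

    def init_dp_device(local_rank: int, dp_rank_local: int, gpus_per_dp_rank: int,
                       uses_ray: bool, external_launcher: bool) -> None:
        if uses_ray or external_launcher:
            # Ray / external-launcher paths keep the previous behavior: device
            # visibility is left to the launcher and local_rank is used as-is,
            # with no DP adjustment.
            torch.cuda.set_device(local_rank)
            return
        # Default multiprocessing path: all GPUs stay visible, so the device id
        # is offset by this process's DP rank within the node.
        torch.cuda.set_device(dp_rank_local * gpus_per_dp_rank + local_rank)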

@tlrmchlsmth (Member)
@ilmarkov looks like the test failures are related to this PR

@tlrmchlsmth (Member)
Double checked on my end with:
torchrun --nproc-per-node=4 examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=2 --enable-ep

@tlrmchlsmth tlrmchlsmth merged commit 60f76ba into vllm-project:main Oct 30, 2025
52 checks passed