
Conversation

@ilmarkov (Contributor) commented Oct 27, 2025

Re-applying #26709

In this PR we replace the CUDA_VISIBLE_DEVICES setting in DP initialization with an appropriate torch.cuda.set_device call.

This allows us to:

  • Avoid slow NCCL initialization (which gets confused by multiple devices with the same id in one world group)
  • Avoid breaking the GPU -> NIC mapping, which is required for DeepEP performance
  • Allow using torch symmetric memory for the DP group

We keep the old CUDA_VISIBLE_DEVICES approach in the cases of Ray or the external launcher, as well as for non-CUDA-like devices; migrating those is left for follow-up PRs.
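
A minimal sketch of the change in approach (hedged: the function and variable names below, e.g. select_dp_device, dp_rank_local, gpus_per_dp_rank, are illustrative and not the exact vLLM identifiers):

    import torch

    def select_dp_device(local_rank: int, dp_rank_local: int, gpus_per_dp_rank: int) -> torch.device:
        # Old approach: narrow CUDA_VISIBLE_DEVICES so each DP rank only sees its
        # slice of GPUs. Every process then uses the same low device ids, which
        # confuses NCCL initialization and can break the GPU -> NIC mapping that
        # DeepEP relies on.
        # os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
        #     str(dp_rank_local * gpus_per_dp_rank + i) for i in range(gpus_per_dp_rank))

        # New approach: keep all GPUs visible and bind the process to its device
        # explicitly.
        device_id = dp_rank_local * gpus_per_dp_rank + local_rank
        assert device_id < torch.cuda.device_count(), (
            f"DP adjusted local rank {device_id} is out of bounds."
        )
        device = torch.device(f"cuda:{device_id}")
        torch.cuda.set_device(device)
        return device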



@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly replaces the use of CUDA_VISIBLE_DEVICES with programmatic device selection via torch.cuda.set_device for data parallelism on CUDA-like devices. This is a good improvement that avoids NCCL initialization issues and improves performance. The logic for calculating the device rank looks correct. I've found one minor issue with a boundary check that could lead to a crash in specific scenarios.

Signed-off-by: ilmarkov <[email protected]>
@njhill (Member) left a comment

Thanks @ilmarkov!

I'll add the same comment I made on #26709: we can look at making things more consistent in a follow-up PR, w.r.t. other platforms and local_rank computation.

@njhill (Member) commented Oct 27, 2025

@zhuohan123 @22quinn any chance you could test this branch with your external launcher?

@tlrmchlsmth (Member) left a comment

Please add a test that uses the external launcher?

@njhill (Member) commented Oct 29, 2025

> Please add a test that uses the external launcher?

@22quinn recently opened a PR for this: #27548

@tlrmchlsmth added the "ready" label (ONLY add when PR is ready to merge / full CI is needed) on Oct 29, 2025
@tlrmchlsmth (Member)

Has this PR been tested with the external launcher? Looking for confirmation before merging.

@22quinn (Collaborator) commented Oct 29, 2025

Just tested manually with #27548 and it failed. Logs:

Traceback (most recent call last):
  File "/data/users/quinnzhu/gitrepos/vllm/torchrun.py", line 98, in <module>
    llm = LLM(
          ^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/entrypoints/llm.py", line 335, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/llm_engine.py", line 188, in from_engine_args
    return cls(
           ^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/llm_engine.py", line 122, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/core_client.py", line 95, in make_client
    return InprocClient(vllm_config, executor_class, log_stats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/core_client.py", line 264, in __init__
    self.engine_core = EngineCore(*args, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/engine/core.py", line 102, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/executor/abstract.py", line 98, in __init__
    self._init_executor()
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/executor/uniproc_executor.py", line 132, in _init_executor
    super()._init_executor()
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
    self.driver_worker.init_device()
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/worker/worker_base.py", line 310, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/quinnzhu/gitrepos/vllm/vllm/v1/worker/gpu_worker.py", line 193, in init_device
    assert self.local_rank < torch.cuda.device_count(), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: DP adjusted local rank 13 is out of bounds. 
(The same traceback repeats for the three other failing ranks, ending with:)
AssertionError: DP adjusted local rank 12 is out of bounds.
AssertionError: DP adjusted local rank 9 is out of bounds.
AssertionError: DP adjusted local rank 8 is out of bounds.
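
For intuition about the failure above, a hedged arithmetic sketch of how a DP-adjusted device id such as 13 can exceed torch.cuda.device_count() on an 8-GPU node when the adjustment is applied under a launcher that already assigns one process per GPU (numbers and names are illustrative only):

    # Illustrative numbers; the exact formula in vLLM may differ.
    gpus_per_node = 8      # what torch.cuda.device_count() would report
    gpus_per_dp_rank = 8   # e.g. a TP*PP group assumed to span the node
    dp_rank_local = 1      # a second DP rank assumed colocated on the node
    local_rank = 5
    device_id = dp_rank_local * gpus_per_dp_rank + local_rank
    print(device_id, device_id < gpus_per_node)  # 13 False -> the assert above fires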

@ilmarkov (Contributor, Author)
@tlrmchlsmth @njhill @22quinn The external launcher works with this PR now. With the external launcher we don't update CUDA_VISIBLE_DEVICES, so we don't need to adjust ranks in this PR at all.
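
A hedged sketch of the resulting dispatch described in this comment (helper names such as init_dp_device, uses_ray, and external_launcher are illustrative assumptions, not the actual vLLM config fields):

    import torch

    def init_dp_device(local_rank: int, dp_rank_local: int, gpus_per_dp_rank: int,
                       uses_ray: bool, external_launcher: bool) -> None:
        if uses_ray or external_launcher:
            # Ray / external-launcher paths keep the previous behavior: device
            # visibility is left to the launcher and local_rank is used as-is,
            # with no DP adjustment.
            torch.cuda.set_device(local_rank)
            return
        # Default multiprocessing path: all GPUs stay visible, so the device id
        # is offset by this process's DP rank within the node.
        torch.cuda.set_device(dp_rank_local * gpus_per_dp_rank + local_rank)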

@tlrmchlsmth (Member)
@ilmarkov looks like the test failures are related to this PR

@tlrmchlsmth (Member)
Double checked on my end with:
torchrun --nproc-per-node=4 examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=2 --enable-ep

@tlrmchlsmth tlrmchlsmth merged commit 60f76ba into vllm-project:main Oct 30, 2025
52 checks passed