Describe the bug
When I run batch inference with the command below, the process always gets killed for no apparent reason after running for a while:
#!/bin/bash
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift infer \
--model /home/yyao/landai/RFT/output/v7-20250529-180509/checkpoint-1000-merged \
--logprobs true \
--gpu_memory_utilization 0.8 \
--temperature 0.6 \
--infer_backend vllm \
--max_new_tokens 2048 \
--val_dataset /home/yyao/landai/RFT/test_swift_first.jsonl \
--tensor_parallel_size 4
After further investigation, I found the following errors in the log:
INFO 05-31 18:20:47 [executor_base.py:112] # cuda blocks: 91424, # CPU blocks: 18724
INFO 05-31 18:20:47 [executor_base.py:117] Maximum concurrency for 128000 tokens per request: 11.43x
INFO 05-31 18:20:50 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%|          | 0/35 [00:00<?, ?it/s]
(VllmWorkerProcess pid=33974) INFO 05-31 18:20:50 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=33972) INFO 05-31 18:20:50 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=33973) INFO 05-31 18:20:50 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:20<00:00, 1.73it/s]
INFO 05-31 18:21:10 [model_runner.py:1592] Graph capturing finished in 20 secs, took 0.43 GiB
(VllmWorkerProcess pid=33974) INFO 05-31 18:21:10 [model_runner.py:1592] Graph capturing finished in 20 secs, took 0.43 GiB
(VllmWorkerProcess pid=33972) INFO 05-31 18:21:11 [model_runner.py:1592] Graph capturing finished in 20 secs, took 0.43 GiB
(VllmWorkerProcess pid=33973) INFO 05-31 18:21:11 [model_runner.py:1592] Graph capturing finished in 20 secs, took 0.43 GiB
INFO 05-31 18:21:11 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 167.56 seconds
WARNING 05-31 18:21:12 [sampling_params.py:347] temperature 1e-06 is less than 0.01, which may cause numerical errors nan or inf in tensors. We have maxed it out to 0.01.
[INFO:swift] default_system: 'The dialogue between the user and the assistant. The user asks a question, and the assistant provides a solution. The assistant first thinks through the reasoning process in their mind, then provides the answer to the user. The reasoning process is enclosed in <think> </think> tags. For example, <think> Here is the reasoning process </think> Here is the answer.'
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
[INFO:swift] max_length: 128000
[INFO:swift] norm_bbox: none
[INFO:swift] Start time of running main: 2025-05-31 18:21:12.545350
[INFO:swift] request_config: RequestConfig(max_tokens=2048, temperature=0.6, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stop=[], seed=None, stream=False, logprobs=True, top_logprobs=None, n=1, best_of=None, presence_penalty=0.0, frequency_penalty=0.0, length_penalty=1.0)
[INFO:swift] val_dataset: Dataset({
features: ['messages', 'images'],
num_rows: 17248
})
100%|██████████| 1024/1024 [02:32<00:00, 6.73it/s]
100%|██████████| 1024/1024 [02:20<00:00, 7.28it/s]
12%|█▏        | 2048/17248 [04:52<35:57, 7.04it/s]
[rank0]:[E531 18:34:21.609446864 ProcessGroupNCCL.cpp:1753] [PG ID 2 PG GUID 3 Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank1]:[E531 18:36:20.824654229 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1410, OpType=BROADCAST, NumelIn=381480, NumelOut=381480, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E531 18:36:20.825346881 ProcessGroupNCCL.cpp:2168] [PG ID 2 PG GUID 3 Rank 1] failure detected by watchdog at work sequence id: 1410 PG status: last enqueued work: 1416, last completed work: 1409
[rank1]:[E531 18:36:20.825380284 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E531 18:36:20.825396818 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E531 18:36:20.825411129 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E531 18:36:20.830525286 ProcessGroupNCCL.cpp:1895] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1410, OpType=BROADCAST, NumelIn=381480, NumelOut=381480, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x1455b936c1b6 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x1455673fec74 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x1455674007d0 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x1455674016ed in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x1455b97a95c0 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x1455bae94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x1455baf26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E531 18:36:20.945987176 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1410, OpType=BROADCAST, NumelIn=381480, NumelOut=381480, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
[rank3]:[E531 18:36:20.946297825 ProcessGroupNCCL.cpp:2168] [PG ID 2 PG GUID 3 Rank 3] failure detected by watchdog at work sequence id: 1410 PG status: last enqueued work: 1416, last completed work: 1409
[rank3]:[E531 18:36:20.946326487 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E531 18:36:20.946341858 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E531 18:36:20.946357047 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E531 18:36:20.946386815 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1410, OpType=BROADCAST, NumelIn=381480, NumelOut=381480, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[rank2]:[E531 18:36:20.946688051 ProcessGroupNCCL.cpp:2168] [PG ID 2 PG GUID 3 Rank 2] failure detected by watchdog at work sequence id: 1410 PG status: last enqueued work: 1416, last completed work: 1409
[rank2]:[E531 18:36:20.946713589 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E531 18:36:20.946729356 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E531 18:36:20.946743699 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E531 18:36:20.951086120 ProcessGroupNCCL.cpp:1895] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1410, OpType=BROADCAST, NumelIn=381480, NumelOut=381480, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14f24f96c1b6 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x14f1fddfec74 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x14f1fde007d0 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x14f1fde016ed in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x14f2500f95c0 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x14f251694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E531 18:36:20.951577516 ProcessGroupNCCL.cpp:1895] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1410, OpType=BROADCAST, NumelIn=381480, NumelOut=381480, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x151e9836c1b6 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x151e463fec74 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x151e464007d0 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x151e464016ed in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x151e988605c0 in /home/yyao/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x151e99e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x151e99f26850 in /lib/x86_64-linux-gnu/libc.so.6)
ERROR 05-31 18:36:21 [multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 33972 died, exit code: -6
INFO 05-31 18:36:21 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
[rank0]:[F531 18:42:21.610666408 ProcessGroupNCCL.cpp:1575] [PG ID 2 PG GUID 3 Rank 0] [PG ID 2 PG GUID 3 Rank 0] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
/home/yyao/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 15 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/yyao/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
This is a strange problem.
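For the next run I can collect more NCCL diagnostics. The watchdog messages above suggest enabling the FlightRecorder via TORCH_NCCL_TRACE_BUFFER_SIZE and, if the 480-second heartbeat is a false positive, raising TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC or setting TORCH_NCCL_ENABLE_MONITORING=0. Below is a minimal sketch of the same reproduction command with these variables added; the specific values (2000 and 1800) are arbitrary choices of mine, not recommendations from the log, and NCCL_DEBUG=INFO is standard NCCL verbose logging.

#!/bin/bash
# Same reproduction command as above, with extra NCCL diagnostics enabled.
# NCCL_DEBUG=INFO              : verbose NCCL logging.
# TORCH_NCCL_TRACE_BUFFER_SIZE : any non-zero value enables FlightRecorder, as the watchdog log suggests (2000 is an arbitrary choice).
# TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC : raise the heartbeat timeout in case the 480 s watchdog abort is a false positive (1800 is an arbitrary choice).
NCCL_DEBUG=INFO \
TORCH_NCCL_TRACE_BUFFER_SIZE=2000 \
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift infer \
  --model /home/yyao/landai/RFT/output/v7-20250529-180509/checkpoint-1000-merged \
  --logprobs true \
  --gpu_memory_utilization 0.8 \
  --temperature 0.6 \
  --infer_backend vllm \
  --max_new_tokens 2048 \
  --val_dataset /home/yyao/landai/RFT/test_swift_first.jsonl \
  --tensor_parallel_size 4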
Your hardware and system info
NVIDIA RTX 6000 Ada × 4
torch 2.6.0
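In case it helps narrow things down: the vLLM startup log above suggests eager mode or a smaller max_num_seqs when CUDA-graph capture is suspected, so I can also try the variant sketched below. The --enforce_eager and --max_num_seqs flag names are assumptions based on ms-swift's vLLM arguments and may differ between versions (to be verified with swift infer --help); the value 128 is an arbitrary reduction, not a recommended setting.

#!/bin/bash
# Variant of the reproduction command that disables CUDA-graph capture and
# lowers decoding concurrency, per the hint in the vLLM startup log.
# NOTE: --enforce_eager / --max_num_seqs names are assumptions; verify with `swift infer --help`.
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift infer \
  --model /home/yyao/landai/RFT/output/v7-20250529-180509/checkpoint-1000-merged \
  --infer_backend vllm \
  --enforce_eager true \
  --max_num_seqs 128 \
  --gpu_memory_utilization 0.8 \
  --temperature 0.6 \
  --max_new_tokens 2048 \
  --logprobs true \
  --val_dataset /home/yyao/landai/RFT/test_swift_first.jsonl \
  --tensor_parallel_size 4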