Skip to content

More logging in socketPollConnect and let some error cases retry #1618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

YinglinSun
Copy link

Recently we noticed a hang case during NCCL bootstrap.

Some signature in job error log:

rank_604.s12gn3064.4031852.log:[0604/1024][2025-02-16 22:31:34,422] [misc/socket.cc:586] [s12gn3064:pid=3942408] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events
rank_607.s12gn3064.4031852.log:[0607/1024][2025-02-16 22:31:34,418] [misc/socket.cc:586] [s12gn3064:pid=3942411] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events

For our case, the issue is caused by the destination node dropping TCP SYN packets, which triggers TCP connection timeout on src side. However, this is general case. When poll returns 1 and there is no POLLOUT event, it could be POLLERR or POLLHUP. For such cases, we would want to go to socketConnectCheck for retry, instead of returning error.

Testing:
we reproduced the issue. The job ran into this case and retry worked. The job completed successfully.

[03/16][2025-02-25 15:46:48,389] [misc/socket.cc:602] [s11gn1268:pid=3987720] NCCL WARN socketPollConnect poll() failed, ret 1, error Connection timed out, revents 8, connect to 172.16.30.229:39953
[03/16][2025-02-25 15:46:48,390] [misc/socket.cc:551] [s11gn1268:pid=3987720] NCCL INFO socketPollConnect: connect returned Connection timed out, retrying (1/34) after sleep for 100 msec

@sjeaugey
Copy link
Member

Thanks Yinglin for the patch. Aside from a minor logging discrepancy, the patch looks good to me. We'll work on integrating this for 2.27.

Recently we noticed a hang case during NCCL bootstrap.

Some signature in job error log:
rank_604.s12gn3064.4031852.log:[0604/1024][2025-02-16 22:31:34,422] [misc/socket.cc:586] [s12gn3064:pid=3942408] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events
rank_607.s12gn3064.4031852.log:[0607/1024][2025-02-16 22:31:34,418] [misc/socket.cc:586] [s12gn3064:pid=3942411] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events

For our case, the issue is caused by the destination node dropping TCP SYN packets, which triggers TCP connection timeout on src side. However, this is general case. When poll returns 1 and there is no POLLOUT event, it could be POLLERR or POLLHUP. For such cases, we would want to go to socketConnectCheck for retry, instead of returning error.

Testing:
we reproduced the issue. The job ran into this case and retry worked. The job completed successfully.

[03/16][2025-02-25 15:46:48,389] [misc/socket.cc:602] [s11gn1268:pid=3987720] NCCL WARN socketPollConnect poll() failed, ret 1, error Connection timed out, revents 8, connect to 172.16.30.229:39953
[03/16][2025-02-25 15:46:48,390] [misc/socket.cc:551] [s11gn1268:pid=3987720] NCCL INFO socketPollConnect: connect returned Connection timed out, retrying (1/34) after sleep for 100 msec
@YinglinSun
Copy link
Author

YinglinSun commented Feb 27, 2025

Thanks Yinglin for the patch. Aside from a minor logging discrepancy, the patch looks good to me. We'll work on integrating this for 2.27.

Thanks for the review. For integration is there anything I need to do from my end? first time submitting patch to nccl ...

@sjeaugey
Copy link
Member

Thanks for the review. For integration is there anything I need to do from my end? first time submitting patch to nccl ...

No, only thing you can do is update the patch based on comments but once we reach the final state of the patch, we'll take care of merging at the right time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants