More logging in socketPollConnect and let some error cases retry #1618
We recently noticed a hang during NCCL bootstrap.
Some signature in job error log:
In our case, the issue is caused by the destination node dropping TCP SYN packets, which triggers a TCP connection timeout on the source side. The problem is more general, though: when poll returns 1 and POLLOUT is not set, the event may be POLLERR or POLLHUP. In those cases we want to proceed to socketConnectCheck so the connection can be retried, instead of returning an error.
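A minimal sketch of the intended decision logic, written as standalone C rather than as the actual patch. The names `classify_poll`, `pending_socket_error`, `is_retryable`, and the `next_step` enum are hypothetical and are not NCCL identifiers. The set of retryable error codes is also an assumption made for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical names for illustration; these are not NCCL identifiers. */
enum next_step { STEP_KEEP_POLLING, STEP_CHECK_CONNECT, STEP_FAIL };

/* Decide what to do after poll() on a socket with a non-blocking connect
 * in flight. poll_ret is the return value of poll() on a single pollfd,
 * revents is that pollfd's revents field. */
enum next_step classify_poll(int poll_ret, short revents) {
  if (poll_ret == 0) return STEP_KEEP_POLLING; /* timeout, nothing ready yet */
  if (poll_ret != 1) return STEP_FAIL;         /* poll() itself failed */
  /* POLLOUT, POLLERR and POLLHUP all mean the connect attempt has concluded.
   * Read SO_ERROR next to find out whether it succeeded or how it failed. */
  if (revents & (POLLOUT | POLLERR | POLLHUP)) return STEP_CHECK_CONNECT;
  return STEP_FAIL;                            /* e.g. POLLNVAL only */
}

/* Fetch the pending error of the connect attempt (0 means connected). */
int pending_socket_error(int fd) {
  int err = 0;
  socklen_t len = sizeof(err);
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0) return errno;
  return err;
}

/* Assumed retry policy: a timed-out or refused connect is worth retrying. */
int is_retryable(int err) {
  return err == ETIMEDOUT || err == ECONNREFUSED;
}
```

With dropped SYN packets, the kernel eventually reports the failure through POLLERR (often together with POLLHUP), and `SO_ERROR` then yields `ETIMEDOUT`. Routing that case to the connect check, rather than treating a missing POLLOUT as fatal, is what lets the retry happen.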
Testing:
We reproduced the issue: the job hit this case, the retry succeeded, and the job completed successfully.