
Multi-node eval hang due to rank-0-only train_loader iteration desynchronization #11

@water-vapor

Description

In pretrain.py, rank 0 iterates over train_loader just to print its length:

URM/pretrain.py, line 909 at commit 41de2c3:

    for set_name, batch, global_batch_size in train_loader:

This advances the dataset iterator on rank 0 only. On multi-node runs, the ranks become desynchronized, and the second evaluation hangs at the metric dist.reduce call (NCCL timeout) near the end of eval. Removing lines 909-910 (the loop and its print statement) fixes the hang. Did you observe similar behavior with this codebase?
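
For anyone hitting the same thing, here is a minimal, self-contained sketch of the failure mode. The loader, step counts, and tensor names are assumed for illustration and are not taken from pretrain.py; it only reproduces the mechanism. When rank 0 alone drains a stateful iterator, the ranks can end up running different numbers of steps, so they call collectives like dist.reduce a different number of times and the last call blocks until the NCCL timeout:

```python
# Hypothetical demo, not the repo's code. Run under
# `torchrun --nproc_per_node=2 demo.py`.
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # "nccl" on GPU nodes
rank = dist.get_rank()

# Stand-in for a stateful per-rank dataset iterator.
train_iter = iter(range(10))

# The reported pattern: rank 0 alone consumes the iterator to count
# batches, leaving its data stream ahead of every other rank's.
if rank == 0:
    num_batches = sum(1 for _ in train_iter)
    print(f"train batches: {num_batches}")

# Later work that depends on the iterator's remaining state now differs
# per rank: 0 steps on rank 0, 10 steps everywhere else.
steps = sum(1 for _ in train_iter)

for _ in range(steps):
    metric = torch.zeros(1)
    # Ranks disagree on how many times this collective runs, so the
    # non-zero ranks block here (NCCL would eventually raise a timeout).
    dist.reduce(metric, dst=0)

dist.destroy_process_group()
```

The fix in the issue is simply to delete the rank-0-only loop. If the count is actually needed, an alternative is to obtain it without consuming the iterator, e.g. len(train_loader) when the loader defines __len__, or to have every rank perform the same iteration so they stay in lockstep.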
