In pretrain.py, rank 0 iterates train_loader just to print a length:
```python
for set_name, batch, global_batch_size in train_loader:
```
This advances the dataset iterator on rank 0 only. On multi-node runs, the ranks become desynchronized, and the second evaluation hangs at the metric `dist.reduce` (NCCL timeout) near the end of eval. Removing lines 909-910 (along with the print statement) fixes the hang. Did you observe similar behavior with this codebase?
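For anyone trying to reproduce the mechanism without a multi-node setup, here is a minimal plain-Python sketch (no torch, names are illustrative and not from pretrain.py) of why a rank-0-only iteration desynchronizes data-parallel ranks:

```python
# Minimal sketch: two "ranks" each hold an identical data iterator.
# Rank 0 alone peeks at its loader (the buggy debug loop), silently
# consuming one batch on that rank only.

def make_loader():
    # Each rank builds an identical iterator over batch indices.
    return iter(range(4))

rank0, rank1 = make_loader(), make_loader()

# The rank-0-only debug iteration consumes batch 0 on rank 0.
next(rank0)

# Training/eval steps: every rank pulls its "next" batch, expecting
# them to line up. After the peek, rank 0 is one batch ahead, so a
# collective keyed on matching steps (e.g. dist.reduce of a metric)
# pairs mismatched batches -- and rank 0 runs out one step early,
# which is where the other ranks block until the NCCL timeout fires.
pairs = []
try:
    while True:
        pairs.append((next(rank0), next(rank1)))
except StopIteration:
    pass

print(pairs)  # ranks are off by one: [(1, 0), (2, 1), (3, 2)]
```

The fix in the report (dropping the debug loop) restores the invariant that every rank consumes the loader identically; an alternative is to compute the length without touching the shared iterator at all.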