Gradient accumulation + DDP scenario #94

Dengligedeng · 2025-01-22T02:33:51Z

Thanks very much for the great work. I think that when gradient accumulation is combined with DDP, the gradients are only all-reduced at the moment the optimizer calls the closure. Besides, the closure only contains the last batch's training data, so it can't do the gradient recalculation needed for step 2.
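
To make the second point concrete, here is a toy sketch in plain Python. It does not use PyTorch or DDP, and every name in it (`grad`, `micro_batches`, `full_closure`, `w_perturbed`) is hypothetical; it only assumes a two-step optimizer that re-evaluates a closure at new weights in its second step. It compares a closure that captured the last micro-batch with one that replays all accumulated micro-batches.

```python
# Toy illustration only: scalar model y_hat = w * x with a mean squared error loss.
# No PyTorch and no DDP are involved; all names are hypothetical.

def grad(w, batch):
    """Gradient of the mean squared error w.r.t. w over one micro-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

micro_batches = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 5.0), (4.0, 9.0)],
    [(5.0, 11.0), (6.0, 12.0)],
]

w = 0.5

# Gradient accumulation: average the gradients of all micro-batches.
accumulated = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)

# A closure rebuilt on every iteration captures only the current micro-batch,
# so after the loop it refers to the last micro-batch alone.
closure = None
for mb in micro_batches:
    closure = lambda w_new, mb=mb: grad(w_new, mb)

# A closure that replays every accumulated micro-batch instead.
def full_closure(w_new):
    return sum(grad(w_new, mb) for mb in micro_batches) / len(micro_batches)

# Stand-in for the weights at which the second step recomputes the gradient.
w_perturbed = w + 0.1

last_only = closure(w_perturbed)      # gradient from the last micro-batch only
replayed = full_closure(w_perturbed)  # gradient over all accumulated micro-batches
```

In this toy setup `last_only` and `replayed` differ, which is the mismatch I mean: the second-step gradient would be computed on less data than the accumulated first-step gradient.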
