
use DDP.no_sync when grad_accum and DDP #215

@oclivegriffin

Description


When using DDP and gradient accumulation at the same time, we should use the `DDP.no_sync` context manager to get some free training speed: it skips the gradient all-reduce on every micro-batch except the last one, so cross-process communication happens only once per optimizer step instead of once per backward pass.

https://chatgpt.com/share/e/68efa990-9ee8-800c-99da-b078c3d2ac7b
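A minimal sketch of the pattern, assuming a DDP-wrapped model; the model, optimizer, `accum_steps`, and `micro_batches` here are illustrative stand-ins, not this repo's actual training loop. It uses a single-process `gloo` group so it runs standalone:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Single-process gloo group so the sketch runs on CPU without launchers.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(4, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    accum_steps = 4
    micro_batches = [torch.randn(8, 4) for _ in range(accum_steps)]

    opt.zero_grad()
    for i, x in enumerate(micro_batches):
        is_last = i == accum_steps - 1
        if is_last:
            # Last micro-batch: backward() outside no_sync, so DDP
            # fires the all-reduce exactly once per optimizer step.
            loss = model(x).pow(2).mean() / accum_steps
            loss.backward()
        else:
            # Non-final micro-batches: accumulate gradients locally,
            # skipping the per-backward all-reduce.
            with model.no_sync():
                loss = model(x).pow(2).mean() / accum_steps
                loss.backward()
    opt.step()

    dist.destroy_process_group()
    return float(loss)

if __name__ == "__main__":
    main()
```

Note that `no_sync` only suppresses synchronization; gradients still accumulate in `param.grad` as usual, and the final unwrapped `backward()` reduces the full accumulated gradient.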

