Fix: Prevent DDP Crash by Enabling find_unused_parameters=True in GITMol Training #10304


Open · wants to merge 4 commits into master

Conversation

@drivanov (Contributor) commented Jun 4, 2025

This PR fixes a DDP runtime error triggered on some machines during multi-GPU training of the GITMol model, caused by conditionally unused parameters in the model's forward pass.

🧠 Root Cause
The following call:

        from accelerate import Accelerator

        accelerator = Accelerator()

wraps the model in DistributedDataParallel with find_unused_parameters=False by default (once the model is passed to accelerator.prepare()). This means:

  • All model parameters must contribute to the loss every iteration.
  • If any are unused (e.g., due to conditional logic), the process crashes with:
        RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
        This error indicates that your module has parameters that were not used in producing loss. You can
        enable unused parameter detection by passing the keyword argument find_unused_parameters=True to
        torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs
        participate in calculating loss.
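As the error message suggests, the flag can be flipped when constructing the Accelerator. A minimal sketch of how this is typically done with Accelerate's DistributedDataParallelKwargs handler (variable names here are illustrative, not necessarily the ones used in the example script):

        from accelerate import Accelerator, DistributedDataParallelKwargs

        # Tell DDP to tolerate parameters that receive no gradient in a given
        # iteration; DDP then traverses the autograd graph each step to detect them.
        ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
        accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])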

The GITMol model is a multi-modal architecture that processes molecular graphs, SMILES strings, images, and captions. Due to this design, not all model parameters are guaranteed to participate in the forward pass for every batch (e.g., if an input modality is missing or conditionally ignored). This causes DDP to crash unless configured to tolerate such behavior.
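For illustration only (this is not GITMol's actual code), a toy module showing how a skipped modality leaves parameters without gradients:

        from typing import Optional

        import torch
        from torch import nn


        class ToyMultiModal(nn.Module):
            """Toy stand-in for a multi-modal model: the image branch is
            skipped whenever no image is provided for the batch."""

            def __init__(self) -> None:
                super().__init__()
                self.text_proj = nn.Linear(16, 8)
                self.image_proj = nn.Linear(32, 8)  # unused whenever image is None

            def forward(self, text: torch.Tensor,
                        image: Optional[torch.Tensor] = None) -> torch.Tensor:
                out = self.text_proj(text)
                if image is not None:
                    # Under DDP with find_unused_parameters=False, skipping this
                    # branch leaves image_proj without gradients and the reducer
                    # raises the RuntimeError shown above.
                    out = out + self.image_proj(image)
                return out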

📈 Impact

  • Fixes the DDP crash on systems where conditional execution leaves some parameters unused.
  • Improves robustness across varied hardware and dataset configurations.
  • Safe for all training runs: find_unused_parameters=True adds a small per-step overhead (an extra traversal of the autograd graph) but avoids hard failures.

🧪 Tested On
GB200 & B200 nodes across multiple runs.

Training runs no longer crash under multi-GPU execution.

@drivanov requested a review from wsad1 as a code owner on June 4, 2025 17:42