Fix: Prevent DDP Crash by Enabling find_unused_parameters=True in GITMol Training #10304


Open · wants to merge 4 commits into master

Conversation

@drivanov (Contributor) commented Jun 4, 2025

This PR fixes a DDP runtime error triggered on some machines during multi-GPU training of the GITMol model, caused by conditionally unused parameters in the model's forward pass.

🧠 Root Cause
The following call:

        from accelerate import Accelerator

        accelerator = Accelerator()

wraps the model in DistributedDataParallel with find_unused_parameters=False by default (once the model is passed to accelerator.prepare()). This means:

  • All model parameters must contribute to the loss every iteration.
  • If any are unused (e.g., due to conditional logic), the process crashes with:
        RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
        This error indicates that your module has parameters that were not used in producing loss. You can
        enable unused parameter detection by passing the keyword argument find_unused_parameters=True to
        torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs
        participate in calculating loss.
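As the error message suggests, the flag can be flipped when constructing the Accelerator. A minimal sketch of how this is typically done with Accelerate's DistributedDataParallelKwargs handler (variable names here are illustrative, not necessarily the ones used in the example script):

        from accelerate import Accelerator, DistributedDataParallelKwargs

        # Tell DDP to tolerate parameters that receive no gradient in a given
        # iteration; DDP then traverses the autograd graph each step to detect them.
        ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
        accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])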

The GITMol model is a multi-modal architecture that processes molecular graphs, SMILES strings, images, and captions. Due to this design, not all model parameters are guaranteed to participate in the forward pass for every batch (e.g., if an input modality is missing or conditionally ignored). This causes DDP to crash unless configured to tolerate such behavior.
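For illustration only (this is not GITMol's actual code), a toy module showing how a skipped modality leaves parameters without gradients:

        from typing import Optional

        import torch
        from torch import nn


        class ToyMultiModal(nn.Module):
            """Toy stand-in for a multi-modal model: the image branch is
            skipped whenever no image is provided for the batch."""

            def __init__(self) -> None:
                super().__init__()
                self.text_proj = nn.Linear(16, 8)
                self.image_proj = nn.Linear(32, 8)  # unused whenever image is None

            def forward(self, text: torch.Tensor,
                        image: Optional[torch.Tensor] = None) -> torch.Tensor:
                out = self.text_proj(text)
                if image is not None:
                    # Under DDP with find_unused_parameters=False, skipping this
                    # branch leaves image_proj without gradients and the reducer
                    # raises the RuntimeError shown above.
                    out = out + self.image_proj(image)
                return out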

📈 Impact

  • Fixes the DDP crash on systems where conditional execution leaves some parameters unused.
  • Improves robustness across varied hardware and dataset configurations.
  • Safe for all training runs: find_unused_parameters=True adds a small per-step overhead (an extra traversal of the autograd graph) but avoids hard failures.

🧪 Tested On
GB200 & B200 nodes across multiple runs.

Training runs no longer crash under multi-GPU execution.

@drivanov requested a review from wsad1 as a code owner on June 4, 2025 17:42