fegin (Contributor) commented Feb 19, 2025

If we don't wait for the first quorum, the trainer proceeds to the forward pass and may use incorrect weights while it is still healing.
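
To make the intended behavior concrete, here is a minimal sketch of blocking on the first quorum only. It is not the actual change in this PR: `ToyManager`, `start_quorum`, `train_step`, and the string-valued `weights` are hypothetical names invented for illustration, and the quorum and recovery are simulated with a short sleep on a background thread.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional
import time


class ToyManager:
    """Hypothetical stand-in for a fault-tolerant manager (not a real library API)."""

    def __init__(self, healing: bool) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._healing = healing
        self._first_quorum_done = False
        self._quorum_future: Optional[Future] = None
        # A healing trainer starts with stale weights until recovery finishes.
        self.weights = "stale" if healing else "current"

    def _run_quorum(self) -> None:
        # Simulate the quorum computation plus weight recovery.
        time.sleep(0.05)
        if self._healing:
            self.weights = "current"
            self._healing = False

    def start_quorum(self) -> None:
        # The quorum always runs on a background thread.
        self._quorum_future = self._executor.submit(self._run_quorum)
        if not self._first_quorum_done:
            # Assumption for this sketch: only the first quorum is awaited,
            # so the first forward pass cannot read stale weights.
            self._quorum_future.result()
            self._first_quorum_done = True

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


def train_step(manager: ToyManager) -> str:
    manager.start_quorum()
    # Stand-in for the forward pass: report which weights it would read.
    return manager.weights
```

In this sketch a trainer created with `healing=True` sees recovered weights on its first `train_step`, because `start_quorum` does not return until the first quorum has completed. Later steps submit the quorum without waiting, which is where the question of recovery in non-start cases comes in.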

facebook-github-bot added the CLA Signed label Feb 19, 2025
d4l3k (Member) commented Feb 19, 2025

@fegin do you have more details on where this is being triggered? We can recover in non-start cases, so we should figure out how to resolve this.

Are we not zeroing grads correctly during recovery?
