[NeoML] DistributedTraining uses IsDnnInferenced #1110

Draft

favorart wants to merge 1 commit into neoml-lib:master from favorart:golikovDistributedTrainRunOnce
Conversation

@favorart (Contributor) commented Sep 5, 2024

Previously, if you performed a RunOnce (even on random data) before a RunAndBackward, it no longer counted as the first run, and you could then send batches however you wished, so some dnn might never be trained. Now that is no longer possible.

All dnns must have their paramBlobs initialized before solver->Train() can run for all of them (at minimum, a RunOnce must be completed for each dnn for this to happen).

solver->Train() must run for all dnns, because every dnn must hold the same paramBlobs in each epoch.
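The invariant described above can be illustrated with a small sketch. This is not NeoML code (the real API is C++: CDnn::RunOnce, RunAndBackward, the distributed solver); it is a hypothetical Python model showing why parameters must be lazily initialized on every replica before the shared solver step, and why the step must apply to all replicas so their parameters stay identical after each epoch.

```python
# Hedged sketch (not the NeoML API): models the PR's invariant that every
# replica's parameters (paramBlobs) must be initialized by a first forward
# run before the shared solver step, which then updates all replicas in
# lockstep so they end each epoch with identical parameters.

class Replica:
    def __init__(self):
        self.param = None   # paramBlobs: created lazily on the first run
        self.grad = 0.0

    def run_once(self, x):
        # The first forward run allocates/initializes the parameters,
        # analogous to RunOnce marking the dnn as inferenced.
        if self.param is None:
            self.param = 1.0
        return self.param * x

    def run_and_backward(self, x, target):
        y = self.run_once(x)
        self.grad = 2.0 * (y - target) * x   # gradient of squared error


def solver_train(replicas, lr=0.1):
    # The distributed solver step requires every replica to be initialized,
    # then applies the averaged gradient to all of them, keeping params equal.
    assert all(r.param is not None for r in replicas), \
        "all dnns need initialized paramBlobs before Train()"
    avg_grad = sum(r.grad for r in replicas) / len(replicas)
    for r in replicas:
        r.param -= lr * avg_grad
        r.grad = 0.0


replicas = [Replica(), Replica()]
# Each replica sees its own batch, but all must run before the solver step.
replicas[0].run_and_backward(1.0, 0.0)
replicas[1].run_and_backward(2.0, 0.0)
solver_train(replicas)
params = {r.param for r in replicas}   # one value: replicas stay in sync
```

If one replica skipped its forward run, the assertion in solver_train would fire, which mirrors the requirement that RunOnce complete for every dnn before training.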

@favorart favorart added the bug Something isn't working label Sep 5, 2024
Signed-off-by: Kirill Golikov <kirill.golikov@abbyy.com>
@favorart favorart force-pushed the golikovDistributedTrainRunOnce branch from 5ae5642 to ae3256d Compare September 5, 2024 10:59
@favorart favorart marked this pull request as draft September 9, 2024 18:25
