[NeoML] DistributedTraining uses IsDnnInferenced #1110

Draft

favorart wants to merge 1 commit into neoml-lib:master from favorart:golikovDistributedTrainRunOnce
Conversation

@favorart (Contributor) commented Sep 5, 2024

Previously, if you performed a RunOnce (even on random data) before a RunAndBackward, it no longer counted as the first run, and you could then send batches however you wished, so some dnn might never be trained. Now that is no longer possible.

All dnns must have their paramBlobs initialized before solver->Train() can run for all of them (at minimum, a RunOnce must be completed for each dnn for this to happen).

solver->Train() must run for all dnns, because every dnn must hold the same paramBlobs in each epoch.
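The invariant described above can be illustrated with a small sketch. This is not NeoML code (the real API is C++: CDnn::RunOnce, RunAndBackward, the distributed solver); it is a hypothetical Python model showing why parameters must be lazily initialized on every replica before the shared solver step, and why the step must apply to all replicas so their parameters stay identical after each epoch.

```python
# Hedged sketch (not the NeoML API): models the PR's invariant that every
# replica's parameters (paramBlobs) must be initialized by a first forward
# run before the shared solver step, which then updates all replicas in
# lockstep so they end each epoch with identical parameters.

class Replica:
    def __init__(self):
        self.param = None   # paramBlobs: created lazily on the first run
        self.grad = 0.0

    def run_once(self, x):
        # The first forward run allocates/initializes the parameters,
        # analogous to RunOnce marking the dnn as inferenced.
        if self.param is None:
            self.param = 1.0
        return self.param * x

    def run_and_backward(self, x, target):
        y = self.run_once(x)
        self.grad = 2.0 * (y - target) * x   # gradient of squared error


def solver_train(replicas, lr=0.1):
    # The distributed solver step requires every replica to be initialized,
    # then applies the averaged gradient to all of them, keeping params equal.
    assert all(r.param is not None for r in replicas), \
        "all dnns need initialized paramBlobs before Train()"
    avg_grad = sum(r.grad for r in replicas) / len(replicas)
    for r in replicas:
        r.param -= lr * avg_grad
        r.grad = 0.0


replicas = [Replica(), Replica()]
# Each replica sees its own batch, but all must run before the solver step.
replicas[0].run_and_backward(1.0, 0.0)
replicas[1].run_and_backward(2.0, 0.0)
solver_train(replicas)
params = {r.param for r in replicas}   # one value: replicas stay in sync
```

If one replica skipped its forward run, the assertion in solver_train would fire, which mirrors the requirement that RunOnce complete for every dnn before training.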

@favorart favorart added the bug Something isn't working label Sep 5, 2024
Signed-off-by: Kirill Golikov <kirill.golikov@abbyy.com>
@favorart favorart force-pushed the golikovDistributedTrainRunOnce branch from 5ae5642 to ae3256d Compare September 5, 2024 10:59
@favorart favorart marked this pull request as draft September 9, 2024 18:25
