Hello,
I have two questions about the adjusted learning rate that I couldn't find explanations for in your tech report or in other discussions.
- First, I understand that you flatten the gradient into 2D when its dim is > 2. However, when adjusting the lr, you define A and B as the first two dimensions of the parameter matrix. Doesn't this lead to a mismatch in the value of B when p.dim > 2? Wouldn't it be necessary, for consistency, to flatten the parameter shape as well (e.g. `p.view(p.size(0), -1).shape`) before computing A and B? (A small sketch illustrating both points follows below.)
- Secondly, it seems you don't apply the adjusted learning rate during weight decay: `p.data.mul_(1 - lr * wd)`. Is this intentional, and if so, could you elaborate on the reasoning? Is this how Adam(W) applies weight decay?
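
To make both points concrete, here is a minimal sketch of what I mean. The shapes, the lr/wd values, and the `adjusted_lr` helper (including the `0.2 * sqrt(max(A, B))` factor) are only my paraphrase and assumptions for illustration, not the exact code from the repo:

```python
import math
import torch

def adjusted_lr(lr, shape):
    # A, B taken as the first two dimensions of `shape`
    A, B = shape[0], shape[1]
    # My paraphrase of the adjustment; the exact formula in the repo may differ
    return lr * 0.2 * math.sqrt(max(A, B))

# Hypothetical conv-style weight with p.dim() == 4
p = torch.randn(64, 3, 3, 3)
lr, wd = 0.02, 0.1

# As I read the current code: A, B = 64, 3 (taken from the raw shape)
lr_raw_shape = adjusted_lr(lr, p.shape)

# What I would have expected: A, B = 64, 27, i.e. the flattened 2D shape
# that matches the 2D view used for the gradient
lr_flat_shape = adjusted_lr(lr, p.view(p.size(0), -1).shape)

print(lr_raw_shape, lr_flat_shape)  # these differ whenever p.dim() > 2

# Second question: weight decay uses the raw lr, not the adjusted one
p.data.mul_(1 - lr * wd)               # as in the current code
# p.data.mul_(1 - lr_flat_shape * wd)  # vs. decaying with the adjusted lr
```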
Kind regards,
Alexandre De Moor