Hello,
I have two questions about the adjusted learning rate that I couldn't find explanations for in your tech report or in other discussions.
- First, I understand that you flatten the gradient into 2D when its dim is > 2. However, when adjusting the lr, you define A and B as the first two dimensions of the parameter matrix. Doesn't this lead to a mismatch in the value of B when p.dim > 2? Wouldn't it be necessary, for consistency, to flatten the parameter shape as well (e.g. `p.view(p.size(0), -1).shape`) before computing A and B? (A small sketch illustrating both points follows below.)
- Secondly, it seems you don't apply the adjusted learning rate during weight decay: `p.data.mul_(1 - lr * wd)`. Is this intentional, and if so, could you elaborate on the reasoning? Is this how Adam(W) applies weight decay?
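
To make both points concrete, here is a minimal sketch of what I mean. The shapes, the lr/wd values, and the `adjusted_lr` helper (including the `0.2 * sqrt(max(A, B))` factor) are only my paraphrase and assumptions for illustration, not the exact code from the repo:

```python
import math
import torch

def adjusted_lr(lr, shape):
    # A, B taken as the first two dimensions of `shape`
    A, B = shape[0], shape[1]
    # My paraphrase of the adjustment; the exact formula in the repo may differ
    return lr * 0.2 * math.sqrt(max(A, B))

# Hypothetical conv-style weight with p.dim() == 4
p = torch.randn(64, 3, 3, 3)
lr, wd = 0.02, 0.1

# As I read the current code: A, B = 64, 3 (taken from the raw shape)
lr_raw_shape = adjusted_lr(lr, p.shape)

# What I would have expected: A, B = 64, 27, i.e. the flattened 2D shape
# that matches the 2D view used for the gradient
lr_flat_shape = adjusted_lr(lr, p.view(p.size(0), -1).shape)

print(lr_raw_shape, lr_flat_shape)  # these differ whenever p.dim() > 2

# Second question: weight decay uses the raw lr, not the adjusted one
p.data.mul_(1 - lr * wd)               # as in the current code
# p.data.mul_(1 - lr_flat_shape * wd)  # vs. decaying with the adjusted lr
```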
Kind regards,
Alexandre De Moor