Enabling LR scaling for a specific layer (ex. down-projection...) during pretraining #1262
Conversation
Any updates?
I think this is a really nice and important addition to the code base, which is also emphasized by further research on other parametrizations (https://arxiv.org/abs/2407.05872). However, I think the implementation would ideally be even more flexible and allow specification of multiple layers and learning rates to support all current and future use cases (see, e.g., page 3, Table 1 in the linked paper). Aside from that, I think it would be helpful to make the change backward-compatible, i.e., keep the old `--head-lr-mult` argument. Then in the argument parsing function, do something like:

```python
import warnings

if args.head_lr_mult != 1.0:
    warnings.warn(
        '--head-lr-mult is deprecated; please use the '
        '--scale-lr-layer and --lr-multiplier arguments instead.'
    )
    assert args.scale_lr_layer is None and args.lr_multiplier == 1.0, \
        'cannot set --scale-lr-layer or --lr-multiplier when --head-lr-mult is given.'
    args.scale_lr_layer = 'head'
    args.lr_multiplier = args.head_lr_mult
```
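As a minimal sketch of the multi-layer variant suggested above (the flag names `--scale-lr-layers` and `--lr-multipliers`, and this helper, are assumptions, not part of this PR), one could accept comma-separated lists:

```python
# Hypothetical extension: several layer-name/multiplier pairs, e.g.
#   --scale-lr-layers head,linear_fc2 --lr-multipliers 2.0,0.286
# Flag names and this helper are illustrative only.
def parse_lr_scaling(args):
    layers = args.scale_lr_layers.split(',') if args.scale_lr_layers else []
    mults = [float(m) for m in args.lr_multipliers.split(',')] if args.lr_multipliers else []
    assert len(layers) == len(mults), \
        '--scale-lr-layers and --lr-multipliers must have equal length.'
    # Map each target layer name to its LR multiplier.
    return dict(zip(layers, mults))
```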
Thanks @janEbert.
What do you think?
Let me tag @jaredcasper (a maintainer).
Marking as stale. No activity in 60 days.
Hope it gets another look!
This PR enables scaling the learning rate of a specific layer by giving its name in `--scale-lr-layer` and the multiplier in `--lr-multiplier`, reusing the existing internal logic of `scale_lr_cond` and `lr_mult` (see the sketch below).
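For context, here is a simplified sketch (not the verbatim Megatron code) of how a `scale_lr_cond` predicate and `lr_mult` are typically consumed when building optimizer parameter groups:

```python
# Simplified sketch of param-group construction with LR scaling.
# scale_lr_cond(name, param) -> bool selects the parameters whose
# learning rate is multiplied by lr_mult.
def get_param_groups(model, scale_lr_cond=None, lr_mult=1.0):
    scaled, regular = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if scale_lr_cond is not None and scale_lr_cond(name, param):
            scaled.append(param)
        else:
            regular.append(param)
    groups = [{'params': regular, 'lr_mult': 1.0}]
    if scaled:
        groups.append({'params': scaled, 'lr_mult': lr_mult})
    return groups
```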
Motivation:
MuP and several interesting papers that followed (e.g. Depth-MuP) suggest, among other techniques such as scaling layers' outputs and adjusting initializations, using different LRs depending on width in order to enhance feature learning and avoid having output layers dominate the learning process. When combined with proper initializations and output scaling, this yields a stable setting, especially for sweeping and scaling hyperparameters for pretraining.
Implementation:
Generalizes the existing use of this feature for the LM `head` during finetuning by making it possible to specify the name of the target layer as well as the LR multiplier, and extends its use to pretraining. When no layer is specified, the `scale_lr_cond` argument is `None` and no LR scaling is applied, as in the sketch below.
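A minimal sketch of that behavior, assuming the argument names used in this PR:

```python
# With no --scale-lr-layer given, scale_lr_cond stays None and the
# optimizer applies no LR scaling; otherwise, match parameters whose
# name contains the target layer name.
scale_lr_cond = None
lr_mult = 1.0
if args.scale_lr_layer is not None:
    def scale_lr_cond(name, param):
        # e.g. 'linear_fc2' matches '...mlp.linear_fc2.weight'
        return args.scale_lr_layer in name
    lr_mult = args.lr_multiplier
```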
Why?:
A GPT-like model typically has an FFN factor > 1 (it's 3.5 for Llama 3.1 70B), which suggests that the down-projection (`linear_fc2` in Megatron) requires a lower LR, theoretically `LR x 1/ffn_factor`.
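For example (flag names as introduced by this PR; the exact value is illustrative), one could pass `--scale-lr-layer linear_fc2 --lr-multiplier 0.286`, since 1/3.5 ≈ 0.286.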
This way, we don't have to add a new argument (e.g. `downproj-lr-mult`) each time we want to test scaling of a certain layer (e.g. `linear_fc2`).

P.S.:
Scaling layers' outputs (before residual connections), as introduced in Depth-MuP to account for depth scaling, will be suggested in a separate PR. Same for init.