Training Improvements: MultipackV2, Statistics, Mock Data #483
Conversation
Force-pushed 74d8c3b to 5ec71f5
Signed-off-by: Oleg Silkin <[email protected]> Signed-off-by: --global <[email protected]>
Force-pushed 5ec71f5 to 77589a4
Signed-off-by: --global <[email protected]>
@@ -228,3 +229,9 @@ class TrainingArgs(BaseModel):
        default=False,
        description="Whether to use Liger kernels for training.",
    )
    # TODO(osilkin): Create a better API for this, should not merge into library this way
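The hunk above adds a boolean flag to the pydantic `TrainingArgs` model. As a minimal sketch of what the declaration looks like in isolation (the field name `use_liger` is an assumption inferred from the description string, not confirmed by the diff):

```python
# Hedged sketch: a boolean pydantic field matching the diffed default and
# description. The field name `use_liger` is assumed, not taken from the PR.
from pydantic import BaseModel, Field


class TrainingArgs(BaseModel):
    use_liger: bool = Field(
        default=False,
        description="Whether to use Liger kernels for training.",
    )


# Defaults off; callers opt in explicitly.
args = TrainingArgs(use_liger=True)
```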
I know Fynn has been working on "SDK-ifying" the sampler specifically, @RobotSail; maybe we should sync on this with the training team.
This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has merge conflicts that must be resolved before it can be merged.
Adds a number of enhancements to improve training performance, provide clarity on training times, and support more robust experimentation.
Multipack V2
Multipack V2 has been tested as a batch sampler and found to improve training throughput, particularly for long-context models. However, it does not support non-padding-free models; those must continue to use Multipack V1.
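To illustrate the idea behind a multipack-style batch sampler, here is a minimal, hypothetical sketch: variable-length samples are greedily packed into batches so that each batch's total token count stays under a fixed budget (a first-fit-decreasing bin-packing heuristic). The function name and heuristic are illustrative assumptions, not the PR's actual implementation.

```python
# Hypothetical sketch of length-aware batch packing, NOT the PR's code.
# Samples are sorted longest-first, then each is placed into the first
# batch that still has room under the token budget.
def pack_batches(lengths, token_budget):
    """Group sample indices into batches whose summed lengths fit the budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, loads = [], []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= token_budget:
                batches[b].append(i)
                loads[b] += lengths[i]
                break
        else:
            # No existing batch has room; open a new one.
            batches.append([i])
            loads.append(lengths[i])
    return batches


# Example with a 52k token budget, echoing the MBL used in the experiments.
batches = pack_batches([4000, 12000, 30000, 45000, 9000], token_budget=52000)
```

Packing this way keeps per-batch token counts roughly uniform, which is where the throughput gain for long-context training comes from: fewer wasted pad tokens per step.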
Experimental setup
Constants:
GPUs: 8xA100
MBL: 52k
Distributed Backend: FSDP
MSL: 50k
Liger: on
Independent variables:
Todos: