[BE] Lr scheduler flatten #794
base: main
Conversation
```diff
@@ -183,9 +183,9 @@ def __init__(
         "model": ModelWrapper(model_parts),
         "optimizer": optimizers,
         "dataloader": dataloader,
         "lr_scheduler": lr_schedulers,
```
I think it won't be this simple. Both `OptimizersContainer` and `ModelWrapper` define `state_dict` and `load_state_dict` to handle flattening and unflattening. Since we don't have things like `get_model_state_dict` and `set_model_state_dict` for the LR scheduler in `torch.distributed.checkpoint.state_dict`, we will likely need to manually write something for the `LambdaLR` we are using. See #738 (comment).

Let's work with @fegin on this.
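A minimal sketch of what such a manual wrapper could look like. The class name `SchedulersContainer` and the per-index flattening scheme are illustrative assumptions, not the code in this PR; only `state_dict`/`load_state_dict` and `LambdaLR` come from the discussion above.

```python
from torch.optim.lr_scheduler import LambdaLR


class SchedulersContainer:
    """Illustrative wrapper: one LambdaLR per model part (e.g. per PP stage)."""

    def __init__(self, schedulers: list[LambdaLR]) -> None:
        self.schedulers = schedulers

    def step(self) -> None:
        for scheduler in self.schedulers:
            scheduler.step()

    def state_dict(self) -> dict:
        # Flatten every per-part scheduler state into a single mapping, so the
        # checkpointer sees one "lr_scheduler" entry, mirroring how
        # OptimizersContainer and ModelWrapper flatten their states.
        return {str(i): s.state_dict() for i, s in enumerate(self.schedulers)}

    def load_state_dict(self, state_dict: dict) -> None:
        # Unflatten: route each saved state back to its own scheduler.
        for i, scheduler in enumerate(self.schedulers):
            scheduler.load_state_dict(state_dict[str(i)])
```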
Compared `lr_schedulers` before and after the flattening, with and without checkpointing: the `lr_scheduler` values are consistent with the changes here.
Does it support DCP resharding, e.g. changing the PP degree from 2 to 4 across two jobs?
I think this PR doesn't address the resharding issue, hence the [BE] prefix. Supporting LR scheduler resharding deserves a separate PR.
Currently, lr_scheduler is stored differently from optimizer, model, and dataloader: each scheduler is registered under its own key, `"lr_scheduler_0"`, `"lr_scheduler_1"`, ..., in the checkpoint state.

This PR flattens lr_scheduler so that all the schedulers are stored as a list under `self.state['lr_scheduler']`, consistent with optimizer, model, and dataloader (see the layout sketch below).
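Roughly, the intended change to the top-level checkpoint state layout looks like this; the keys are taken from the diff and the description above, and the values are elided placeholders rather than the real stateful wrappers.

```python
# Before this PR: one top-level key per scheduler.
state_before = {
    "model": ...,
    "optimizer": ...,
    "dataloader": ...,
    "lr_scheduler_0": ...,
    "lr_scheduler_1": ...,
}

# After this PR: a single "lr_scheduler" entry covering all schedulers,
# consistent with how model, optimizer, and dataloader are registered.
state_after = {
    "model": ...,
    "optimizer": ...,
    "dataloader": ...,
    "lr_scheduler": ...,  # e.g. a container wrapping the list of schedulers
}
```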
The PR is tested in two parts:
1. Before and after this PR, the `lr_scheduler` values are the same (a minimal sketch of this kind of check is shown below the memory traces).
2. Memory trace:
   - Before the flatten, rerun `llama3_8b.toml` from step 5 to step 10:
   - After the flatten, rerun `llama3_8b.toml` from step 5 to step 10:
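As a rough illustration of the first check, here is a minimal sketch that steps a `LambdaLR`, round-trips its `state_dict`, and compares the learning rates. The toy model, `AdamW`, and the warmup lambda are assumptions for the example, not the actual torchtitan test.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Toy setup (assumption): a tiny model with a linear warmup schedule.
model = torch.nn.Linear(4, 4)
warmup = lambda step: min(1.0, (step + 1) / 10)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = LambdaLR(optimizer, lr_lambda=warmup)

lrs = []
for _ in range(5):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr())

saved = scheduler.state_dict()  # what the checkpoint would store

# A fresh scheduler (e.g. a restarted job) restored from the saved state.
optimizer2 = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler2 = LambdaLR(optimizer2, lr_lambda=warmup)
scheduler2.load_state_dict(saved)

# The restored scheduler reports the same learning rate as before saving.
assert scheduler2.get_last_lr() == lrs[-1]
```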