Conversation

tushar00jain
Contributor

Differential Revision: D83846965
@facebook-github-bot

@tushar00jain has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83846965.

enable: bool = False
"""Whether to enable checkpoint"""

enable_ft: bool = True
Contributor
enable_ft is too vague. I confused it with FT.enabled, which is a different thing. Can we pick a longer, more meaningful name?
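
For illustration, a hypothetical sketch of what a more descriptive field could look like, in the same dataclass/docstring style as the diff (the name below is an assumption, not the PR's actual rename):

enable_ft_dataloader_checkpoints: bool = True
"""When running with torchft, persist the data loader state in checkpoints.
If False, the data loader position is reconstructed from the step count on
restart. Hypothetical rename of `enable_ft`, shown only to illustrate the
naming suggestion above."""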

def load_state_dict(self, state_dict: dict[str, Any]):
    self.step = state_dict["step"]
    self.ntokens_seen = state_dict["ntokens_seen"]
    if not self.job_config.checkpoint.enable_ft and self.ft_manager is not None:
Contributor
Setting the dataloader state_dict here is not safe: data.loader_state_dict() may be invoked before or after this load_state_dict().

Contributor Author
@tushar00jain Oct 7, 2025

> Setting the dataloader state_dict here is not safe.

Do you mean it's not safe to call inside this method? Why is that the case, and is it safe to call after this load_state_dict returns?

The method doesn't set the data loader state dict right now, but we could set it manually based on the step count. What's the best way to do that? I could put that in a separate PR.
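
For reference, a minimal sketch of what "set it manually based on step count" could look like, assuming a deterministic data loader and a fixed global batch size (the helper and its assumptions are illustrative, not part of this PR):

def fast_forward_dataloader(dataloader, step: int, global_batch_size: int):
    # Hypothetical helper: with a deterministic sample order, skipping
    # step * global_batch_size samples reproduces the position the
    # loader had at `step`.
    it = iter(dataloader)
    for _ in range(step * global_batch_size):
        next(it, None)  # discard samples that were already consumed
    return it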

Essentially, with this change users don't have to rely on any external storage, which reduces the setup time to get things up and running, since we don't really need model checkpoints when we have torchft. And if checkpoint storage has issues, this can act as a kill switch to disable the storage entirely so it doesn't impact training.

@fegin
Contributor
commented Oct 3, 2025

I would also suggest not doing the internal code first. The internal TorchTitan is always days behind the OSS one, so landing internally first can cause merge issues.


enable_ft: bool = True
"""
Checkpoints data loader state if enabled. Otherwise infers the data loader
state from the step count.
"""
Contributor
The description is a little vague to me: how does this option help with checkpointing? Do I have to turn on FT.enabled to use this?
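
As an aside, my reading of how the two flags interact, pieced together from the diff above rather than from documented behavior (the states/dataloader names are placeholders):

# checkpoint.enable_ft only matters when torchft is in use,
# i.e. when ft_manager is not None (per the diff above).
if self.ft_manager is not None:
    if self.job_config.checkpoint.enable_ft:
        # data loader state is saved into / restored from the checkpoint
        states["dataloader"] = self.dataloader.state_dict()
    else:
        # data loader state is not checkpointed; its position is
        # inferred from the step count instead
        pass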
