feat: refactor main_ds.py (3/n) Checkpointer Class #605

cdoern · 2025-06-10T15:07:43Z

Introduce a new design for key components of main_ds.py, namely splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint saving initialization into classes:

Model
Accelerator
Checkpointer

The Checkpointer class introduces a unified approach to our various checkpointing techniques. A user can pass in their checkpointing style (full_state or hf_format), and the checkpointer, via checkpointer.checkpoint, will save the model using the selected method and other techniques (LoRA).

This PR adds the new class and unit tests for it.

see previous PRs #572 and #594

note: this is probably the last of these large refactors for now; smaller follow-up PRs for cleanup will come next.

Signed-off-by: Charlie Doern <[email protected]>

model_conf from `AutoConfig` has some key info we need in the checkpointer. Associate it with the model class and its subclasses.

Signed-off-by: Charlie Doern <[email protected]>

github-actions · 2025-06-10T16:54:18Z

E2E (NVIDIA L40S x4) (python 3.11) workflow launched on this PR: View run

github-actions · 2025-06-10T20:34:32Z

e2e workflow succeeded on this PR: View run, congrats!

fynnsu

Left some comments below.

I'm also assuming that the contents of the specific methods (like save_fsdp_lora_model) are largely unchanged. Is that correct?

fynnsu · 2025-06-11T13:43:58Z

src/instructlab/training/checkpointer.py

+        print("[None] Skipping checkpointing.")
+
+    # pylint: disable=unused-argument
+    def save_fsdp_lora_model(


Potentially for a future PR, but I think it would be cleaner to have an abstract base Checkpointer class and then FSDPLoRACheckpointer, HFFormatAccelerateCheckpointer, etc. subclasses which each implement their own checkpoint method, instead of doing our own custom routing with self._checkpoint_fn.

I made a similar argument for Model class before... Class hierarchies are exactly meant for such scenarios.
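A minimal sketch of the hierarchy suggested in this thread, for illustration only. The subclass names come from the comment above; `BaseCheckpointer`, the `samples_seen` parameter on `checkpoint`, and the directory layout are assumptions, and the method bodies are placeholders that only compute a target path rather than the real save logic.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseCheckpointer(ABC):
    """Hypothetical base class: each subclass owns exactly one save strategy."""

    def __init__(self, output_dir: Path) -> None:
        self.output_dir = output_dir

    @abstractmethod
    def checkpoint(self, samples_seen: int) -> Path:
        """Save a checkpoint and return the directory it targets."""


class FSDPLoRACheckpointer(BaseCheckpointer):
    def checkpoint(self, samples_seen: int) -> Path:
        # Placeholder: the real FSDP LoRA save logic would go here.
        return self.output_dir / "fsdp_lora" / f"samples_{samples_seen}"


class HFFormatAccelerateCheckpointer(BaseCheckpointer):
    def checkpoint(self, samples_seen: int) -> Path:
        # Placeholder: the real HF-format save logic would go here.
        return self.output_dir / "hf_format" / f"samples_{samples_seen}"
```

With this shape the training loop only ever calls `checkpoint()`, and the choice of strategy is made once, when the object is constructed.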

fynnsu · 2025-06-11T13:44:26Z

src/instructlab/training/main_ds.py

-            accelerator,
-            samples_seen,
-            is_lora=bool(args.lora_r),
+        checkpointer.save_hf_format_accelerate(


Shouldn't this be checkpointer.checkpoint()?

fynnsu · 2025-06-11T13:48:32Z

src/instructlab/training/model.py

@@ -50,11 +50,13 @@ def __init__(
        flash_enabled: bool = False,
        lora_config: Optional[LoraConfig] = None,
        lora_quant_bits: int = 0,
+        model_conf=None,


What is model_conf for? Currently we only seem to be using it to access model_conf.model_type. Could we just store model_type instead?

Also, none of those accesses of model_conf.model_type currently check that model_conf is not None before reading the attribute, so this will raise an error if model_conf is ever actually None (its current default value).
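One way to address both points, sketched under assumptions: resolve the model type once, with an explicit None check, and store only that string. `resolve_model_type` is a hypothetical helper, not part of this PR, and `SimpleNamespace` stands in for a real config object.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_model_type(model_conf: Optional[object]) -> Optional[str]:
    """Return model_conf.model_type, or None when no config was supplied."""
    if model_conf is None:
        return None
    # getattr with a default also covers configs that lack the attribute.
    return getattr(model_conf, "model_type", None)


# Stand-in for a config object; the value here is made up for the example.
example_conf = SimpleNamespace(model_type="llama")
model_type = resolve_model_type(example_conf)
missing_type = resolve_model_type(None)
```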

fynnsu · 2025-06-11T13:50:29Z

src/instructlab/training/main_ds.py

+    checkpointer = Checkpointer(
+        strategy=strategy, model=m, optimizer=optimizer, accelerator=accelerator
+    )
+    checkpointer.load_latest_full_state(Path(args.output_dir))


Can args.output_dir be set in Checkpointer.__init__? We currently pass it into every load/checkpoint function, but its value doesn't change between calls (and shouldn't).

booxter · 2025-06-12T01:01:21Z

src/instructlab/training/checkpointer.py

+    def save_fsdp_lora_model(
+        self,
+        output_dir: Path,
+        **kwargs,


please remove kwargs that are not used; transform those used into specific arguments for the information that needs to be passed (with proper names, types etc.)

booxter · 2025-06-12T01:05:47Z

src/instructlab/training/checkpointer.py

+        model: Model,
+        optimizer: torch.optim.Optimizer,
+        accelerator: Accelerator,
+        strategy="all",


make it enum
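A rough sketch of what an enum for `strategy` could look like. `CheckpointStrategy` and `parse_strategy` are hypothetical names, and the member values are assumptions taken from the strings that appear in this thread ("full_state", "hf_format", the "all" default) plus a "none" value guessed from the "[None] Skipping checkpointing." message.

```python
from enum import Enum


class CheckpointStrategy(str, Enum):
    """Hypothetical enum; the values are assumed from strings seen in this PR."""

    FULL_STATE = "full_state"
    HF_FORMAT = "hf_format"
    ALL = "all"
    NONE = "none"


def parse_strategy(value: str) -> CheckpointStrategy:
    # Enum lookup by value raises ValueError for unknown strategy strings,
    # so a typo fails at construction time instead of at save time.
    return CheckpointStrategy(value.lower())
```

Mixing in `str` keeps the members comparable to the plain strings existing callers may still pass.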

booxter · 2025-06-12T01:06:03Z

src/instructlab/training/checkpointer.py

+    # pylint: disable=unused-argument
+    def save_full_state(
+        self,
+        output_dir,


define all args' types.

booxter · 2025-06-12T01:07:59Z

src/instructlab/training/checkpointer.py

+        print("[None] Skipping checkpointing.")
+
+    # pylint: disable=unused-argument
+    def save_fsdp_lora_model(


I made a similar argument for Model class before... Class hierarchies are exactly meant for such scenarios.

booxter · 2025-06-12T01:10:33Z

tests/unit/test_checkpointer.py

+
+@pytest.fixture
+def mock_accelerator():
+    accelerator = MagicMock(spec=Accelerator)


instead of mocks, you could introduce a new subclass for TestAccelerator that would "do nothing" / "do bare minimum" for test purposes. Same for the rest. Why do we have to have mocks just to create an object? Are init methods destructive / invasive? (Maybe it should be fixed then - it should be generally safe / cheap to create objects.)
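A minimal sketch of the "do nothing" test double idea. To stay runnable without accelerate installed, `FakeAccelerator` here is a duck-typed stand-in rather than a true subclass of `Accelerator`; it mirrors only `is_main_process`, `wait_for_everyone`, and `save_state`. `save_state_to` is a hypothetical helper standing in for whatever code is under test.

```python
from pathlib import Path
from typing import List, Optional


class FakeAccelerator:
    """Test double that records calls instead of touching devices or disk."""

    def __init__(self) -> None:
        self.is_main_process = True
        self.saved_dirs: List[Optional[str]] = []

    def wait_for_everyone(self) -> None:
        # Nothing to synchronize in a single-process test.
        pass

    def save_state(self, output_dir: Optional[str] = None, **kwargs) -> Optional[str]:
        # Record the requested directory; nothing is written.
        self.saved_dirs.append(output_dir)
        return output_dir


def save_state_to(accelerator, output_dir: Path) -> None:
    """Hypothetical code under test: save state, then sync all processes."""
    accelerator.save_state(str(output_dir))
    accelerator.wait_for_everyone()


fake = FakeAccelerator()
save_state_to(fake, Path("ckpt"))
```

A test can then assert on `fake.saved_dirs` directly, without configuring return values on a mock.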

feat: add Checkpointer class and usage

0f28c81

Signed-off-by: Charlie Doern <[email protected]>

mergify bot added the testing and ci-failure labels Jun 10, 2025

feat: add test_checkpointer unit test suite

276e9b3

Signed-off-by: Charlie Doern <[email protected]>

cdoern force-pushed the refactor-checkpoint branch from 5dca802 to 276e9b3 on June 10, 2025 15:15

mergify bot removed the ci-failure label Jun 10, 2025

fix: associate model_conf with Model

e075ad3

model_conf from `AutoConfig` has some key info we need in the checkpointer. Associate it with the model class and its subclasses.

Signed-off-by: Charlie Doern <[email protected]>

cdoern mentioned this pull request Jun 10, 2025

chore: Remove hf_format= argument for save_checkpoint #604

Closed

booxter mentioned this pull request Jun 10, 2025

feat: log disk space usage info, warn if close to exhaustion #603

Closed

fynnsu reviewed Jun 11, 2025


booxter reviewed Jun 12, 2025

