
[core] fix pipeline loading by waiting till transformer is saved. #226

Merged (5 commits) on Jan 17, 2025

Conversation

sayakpaul (Collaborator)

Currently, we wait for all processes to complete before saving the final state dict (which happens only under the main process), and then proceed to distributed validation. The problem is that the non-main processes are not synchronized after the save, so distributed validation can start before state dict serialization has finished.

Below is a traceback showing the problem:
ERROR:finetrainers:Traceback (most recent call last):
  File "/fsx/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 844, in train
    self.validate(step=global_step, final_validation=True)
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 873, in validate
    pipeline = self._get_and_prepare_pipeline_for_validation(final_validation=final_validation)
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 1134, in _get_and_prepare_pipeline_for_validation
    transformer = self.model_config["load_diffusion_models"](model_id=self.args.output_dir)["transformer"]
  File "/fsx/sayak/finetrainers/finetrainers/models/cogvideox/lora.py", line 45, in load_diffusion_models
    transformer = CogVideoXTransformer3DModel.from_pretrained(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/fsx/sayak/diffusers/src/diffusers/models/modeling_utils.py", line 857, in from_pretrained
    model_file = _get_model_file(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/fsx/sayak/diffusers/src/diffusers/utils/hub_utils.py", line 309, in _get_model_file
    raise EnvironmentError(
OSError: Error no file named diffusion_pytorch_model.bin found in directory cogvideox-news-anchoring.

This PR fixes the problem.
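In Accelerate-based trainers, a fix of this kind typically amounts to placing a barrier (e.g. `accelerator.wait_for_everyone()`) after the main-process save and before validation begins. The sketch below is not the PR's actual code; it simulates the ordering guarantee with `threading.Barrier`, where rank 0 plays the main process that serializes the checkpoint:

```python
import threading

def run_worker(rank: int, barrier: threading.Barrier, events: list) -> None:
    # Only the "main process" (rank 0) serializes the state dict.
    if rank == 0:
        events.append("save_done")
    # The fix: every rank waits here until rank 0 has finished saving,
    # so validation can never observe a half-written checkpoint directory.
    barrier.wait()
    events.append(f"validate_rank_{rank}")

world_size = 4
barrier = threading.Barrier(world_size)
events = []  # list.append is thread-safe in CPython
threads = [
    threading.Thread(target=run_worker, args=(rank, barrier, events))
    for rank in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The save is always recorded before any validation event.
assert events[0] == "save_done"
print(events)
```

Without the barrier, a non-zero rank could reach validation and try `from_pretrained(output_dir)` while rank 0 is still writing, which is exactly the `OSError: Error no file named diffusion_pytorch_model.bin` seen above.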

The other changes fix unneeded media being logged during validation.

Fixed run: https://wandb.ai/sayakpaul/finetrainers-cog/runs/l5raemd

I'm completely okay with splitting the other changes into a separate PR if you prefer.

a-r-r-o-w (Owner) left a comment

Thanks! Just two small changes to be more explicit that we're dealing with list types.

Two review comments on finetrainers/trainer.py (resolved)
sayakpaul merged commit da9d7d9 into main on Jan 17, 2025 (1 check passed)
sayakpaul deleted the fix-transformer-loading branch on January 17, 2025 at 14:34