
[core] fix pipeline loading by waiting till transformer is saved. #226

Merged (5 commits) on Jan 17, 2025

Conversation

sayakpaul (Collaborator)

Currently, we wait for all processes to complete before saving the final state dict (which happens only under the main process), and then proceed to distributed validation. The problem is that the non-main processes are not synchronized after the save, so distributed validation can start before state dict serialization has finished.

Below is a traceback showing the problem:
ERROR:finetrainers:Traceback (most recent call last):
  File "/fsx/sayak/finetrainers/train.py", line 35, in main
    trainer.train()
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 844, in train
    self.validate(step=global_step, final_validation=True)
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 873, in validate
    pipeline = self._get_and_prepare_pipeline_for_validation(final_validation=final_validation)
  File "/fsx/sayak/finetrainers/finetrainers/trainer.py", line 1134, in _get_and_prepare_pipeline_for_validation
    transformer = self.model_config["load_diffusion_models"](model_id=self.args.output_dir)["transformer"]
  File "/fsx/sayak/finetrainers/finetrainers/models/cogvideox/lora.py", line 45, in load_diffusion_models
    transformer = CogVideoXTransformer3DModel.from_pretrained(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/fsx/sayak/diffusers/src/diffusers/models/modeling_utils.py", line 857, in from_pretrained
    model_file = _get_model_file(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/fsx/sayak/diffusers/src/diffusers/utils/hub_utils.py", line 309, in _get_model_file
    raise EnvironmentError(
OSError: Error no file named diffusion_pytorch_model.bin found in directory cogvideox-news-anchoring.

This PR fixes the problem.
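In Accelerate-based trainers, a fix of this kind typically amounts to placing a barrier (e.g. `accelerator.wait_for_everyone()`) after the main-process save and before validation begins. The sketch below is not the PR's actual code; it simulates the ordering guarantee with `threading.Barrier`, where rank 0 plays the main process that serializes the checkpoint:

```python
import threading

def run_worker(rank: int, barrier: threading.Barrier, events: list) -> None:
    # Only the "main process" (rank 0) serializes the state dict.
    if rank == 0:
        events.append("save_done")
    # The fix: every rank waits here until rank 0 has finished saving,
    # so validation can never observe a half-written checkpoint directory.
    barrier.wait()
    events.append(f"validate_rank_{rank}")

world_size = 4
barrier = threading.Barrier(world_size)
events = []  # list.append is thread-safe in CPython
threads = [
    threading.Thread(target=run_worker, args=(rank, barrier, events))
    for rank in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The save is always recorded before any validation event.
assert events[0] == "save_done"
print(events)
```

Without the barrier, a non-zero rank could reach validation and try `from_pretrained(output_dir)` while rank 0 is still writing, which is exactly the `OSError: Error no file named diffusion_pytorch_model.bin` seen above.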

The other changes fix unneeded media being logged during validation.

Fixed run: https://wandb.ai/sayakpaul/finetrainers-cog/runs/l5raemd

I'm completely okay with splitting the other changes into a separate PR if you prefer.

a-r-r-o-w (Owner) left a comment

Thanks! Just two small changes to be more explicit that we're dealing with list types.

Two review comments on finetrainers/trainer.py (resolved)
sayakpaul merged commit da9d7d9 into main on Jan 17, 2025 (1 check passed)
sayakpaul deleted the fix-transformer-loading branch on January 17, 2025 at 14:34