
StableDiffusionImg2ImgPipeline OSError: Consistency check failed #1549

Open
emreaniloguz opened this issue Jul 10, 2023 · 16 comments
Labels
bug Something isn't working


@emreaniloguz

emreaniloguz commented Jul 10, 2023

Describe the bug

I'm trying to run the DreamPose repository. When the UNet fine-tuning finished, the code saved the fine-tuned network with this snippet:

# Inside the training loop: save a checkpoint every 500 steps.
if accelerator.is_main_process and global_step % 500 == 0:
    pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
        args.pretrained_model_name_or_path,
        # adapter=accelerator.unwrap_model(adapter),
        unet=accelerator.unwrap_model(unet),
        tokenizer=tokenizer,
        image_encoder=accelerator.unwrap_model(clip_encoder),
        clip_processor=accelerator.unwrap_model(clip_processor),
        revision=args.revision,
    )
    pipeline.save_pretrained(os.path.join(args.output_dir, f'checkpoint-{epoch}'))
    model_path = args.output_dir + f'/unet_epoch_{epoch}.pth'
    torch.save(unet.state_dict(), model_path)
    adapter_path = args.output_dir + f'/adapter_{epoch}.pth'
    torch.save(adapter.state_dict(), adapter_path)

It failed with: OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors). (You can find the full output in the Logs section.)

  • I have modified the force_download parameter to be True, but nothing changed (see the sketch below).
  • I have enough space to save the model.
  • I'm using the latest huggingface-hub version.
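
For reference, here is a minimal sketch of the retry the error message suggests, using huggingface_hub directly (illustrative only; the repo id is the one from my training command, not DreamPose code):

from huggingface_hub import snapshot_download

# Re-download the full snapshot, ignoring anything already in the cache.
cached_folder = snapshot_download(
    "CompVis/stable-diffusion-v1-4",
    force_download=True,    # discard cached (possibly corrupted) files
    resume_download=False,  # restart each file instead of resuming a partial one
)
print(cached_folder)  # local path of the downloaded snapshot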

Reproduction

No response

Logs

Fetching 14 files:   0%|                                 | 0/14 [00:00<?, ?it/s]Force download:  True
Force download:  True
Fetching 14 files:  21%|█████▎                   | 3/14 [00:06<00:23,  2.11s/it]
Traceback (most recent call last):
  File "finetune-unet.py", line 458, in <module>
    main(args)
  File "finetune-unet.py", line 438, in main
    pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 908, in from_pretrained
    cached_folder = cls.download(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 1349, in download
    cached_folder = snapshot_download(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 235, in snapshot_download
    thread_map(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
    return hf_hub_download(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1365, in hf_hub_download
    http_get(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 547, in http_get
    raise EnvironmentError(
OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).
We are sorry for the inconvenience. Please retry download and pass `force_download=True, resume_download=False` as argument.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.
Downloading model.safetensors: 100%|█████████| 492M/492M [00:05<00:00, 83.3MB/s]
Steps: 100%|██████████████| 500/500 [06:10<00:00,  1.35it/s, loss=0.95, lr=1e-5]
Traceback (most recent call last):
  File "***/anaconda3/envs/***/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/launch.py", line 941, in launch_command
    simple_launcher(args)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['***/anaconda3/envs/***/bin/python', 'finetune-unet.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=demo/sample_emre/train', '--output_dir=demo/custom-chkpts_default', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=1e-5', '--num_train_epochs=500', '--dropout_rate=0.0', '--custom_chkpt=checkpoints/unet_epoch_20.pth', '--revision', 'ebb811dd71cdc38a204ecbdd6ac5d580f529fd8c', '--use_8bit_adam']' returned non-zero exit status 1.

System info

- huggingface_hub version: 0.15.1
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: ***.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 1.13.1+cu116
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.0
- hf_transfer: N/A
- gradio: N/A
- numpy: 1.24.4
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: ***.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: ***.cache/huggingface/assets
- HF_TOKEN_PATH: ***.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
@emreaniloguz emreaniloguz added the bug Something isn't working label Jul 10, 2023
@Wauplin
Contributor

Wauplin commented Jul 10, 2023

Hi @emreaniloguz, thanks for reporting the issue. Can you provide the URL of the fine-tuned model's repo on the Hub, please? I would like to investigate it myself. If you can't make the model public for privacy reasons, would it be possible to create an org, add the model to it (as private), and add my account to the org so that I can have access? Also, for completeness, can you paste the full code you use to instantiate the model? Thank you in advance.

@emreaniloguz
Author

Hi @Wauplin, I didn't exactly understand what you mean by "Can you provide the URL of the fine-tuned model's repo on the Hub". If I understand correctly, you want me to share my final fine-tuned model, but there isn't one because of the error. You can access the pre-trained model's Hub URL from here. Please correct me if I'm missing something.

@Wauplin
Contributor

Wauplin commented Jul 10, 2023

Oh ok, I misunderstood the original issue then. So basically you try to download weights from https://huggingface.co/CompVis/stable-diffusion-v1-4 and you get this error? Just to be sure, could you:

  1. Delete the cached repo: run huggingface-cli delete-cache and select "Model CompVis/stable-diffusion-v1-4". For a better CLI UI, it's best to install huggingface_hub[cli] first (a Python equivalent is sketched below).
  2. Upgrade deps with pip install huggingface_hub==0.16.4. Last week we released a fix in the HTTP session we use. I doubt it will fix your issue, but it's worth trying.
  3. Retry the download.

I'm sorry in advance if you have a limited connection, but this should cross out some possible causes of your bug, and I'd like to try it before investigating further.
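
For completeness, here is a rough Python equivalent of step 1 (a sketch assuming huggingface_hub >= 0.14; the interactive CLI remains the more convenient route):

from huggingface_hub import scan_cache_dir

# Scan the local cache and collect every cached revision of the repo.
cache_info = scan_cache_dir()
revisions = [
    rev.commit_hash
    for repo in cache_info.repos
    if repo.repo_id == "CompVis/stable-diffusion-v1-4"
    for rev in repo.revisions
]

# Build a deletion strategy for those revisions, then execute it.
strategy = cache_info.delete_revisions(*revisions)
print(f"Will free {strategy.expected_freed_size_str}")
strategy.execute()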

@Wauplin
Contributor

Wauplin commented Jul 10, 2023

Wow, actually the issue is very intriguing 🤯 It seems that for some reason the safety_checker/model.safetensors and the text_encoder/model.safetensors files have been mixed up.

Here are the actual sizes of the files on S3:

$ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/safety_checker/model.safetensors | grep size
x-linked-size: 1215981830
$ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/text_encoder/model.safetensors | grep size
x-linked-size: 492265879

Given the error message you got (OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).), this cannot be a coincidence.
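
The same check can be reproduced from Python with huggingface_hub's metadata helpers (a sketch; hf_hub_url and get_hf_file_metadata are assumed available in the installed version):

from huggingface_hub import get_hf_file_metadata, hf_hub_url

# Ask the Hub for each file's metadata without downloading the file itself.
for subfolder in ("safety_checker", "text_encoder"):
    url = hf_hub_url(
        "CompVis/stable-diffusion-v1-4",
        filename="model.safetensors",
        subfolder=subfolder,
    )
    meta = get_hf_file_metadata(url)
    print(subfolder, meta.size)  # the size the Hub reports (x-linked-size)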

@emreaniloguz
Author

Oh ok, I misunderstood the original issue then. So basically you try to download weights from https://huggingface.co/CompVis/stable-diffusion-v1-4 and you get this error? Just to be sure, could you:

1. Delete the cached repo: run `huggingface-cli delete-cache` and select `"Model CompVis/stable-diffusion-v1-4"`. For a better CLI UI, it's best to install `huggingface_hub[cli]` first.

2. Upgrade deps with `pip install huggingface_hub==0.16.4`. Last week we released a fix in the HTTP session we use. I doubt it will fix your issue, but it's worth trying.

3. Retry the download.

I'm sorry in advance if you have a limited connection, but this should cross out some possible causes of your bug, and I'd like to try it before investigating further.

I've done everything that you mentioned and started fine-tuning, but the result is the same OSError.

@emreaniloguz
Author

Wow, actually the issue is very intriguing 🤯 It seems that for some reason the safety_checker/model.safetensors and the text_encoder/model.safetensors files have been mixed up.

Here are the actual sizes of the files on S3:

$ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/safety_checker/model.safetensors | grep size
x-linked-size: 1215981830
$ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/text_encoder/model.safetensors | grep size
x-linked-size: 492265879

Given the error message you got (OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).), this cannot be a coincidence.

This is interesting :)

@Wauplin
Contributor

Wauplin commented Jul 10, 2023

I've done everything that you mentioned and started fine-tuning, but the result is the same OSError.

Ok thanks for confirming. That's so weird 😬 I'll try to reproduce it myself and let you know.

@Wauplin
Contributor

Wauplin commented Jul 10, 2023

Just to be sure, what happens if you delete your cache and run

from diffusers import StableDiffusionImg2ImgPipeline

model = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

?

@emreaniloguz
Author

Just to be sure, what happens if you delete your cache and run

from diffusers import StableDiffusionImg2ImgPipeline

model = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

?

Here is my output:

[2023-07-10 14:01:43,910] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)ain/model_index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 541/541 [00:00<00:00, 38.6kB/s]
Downloading (…)69ce/vae/config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 127kB/s]
Downloading model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:13<00:00, 88.1MB/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:14<00:00,  1.09it/s]
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overridden.

I think it's the correct model.safetensors, right?

@Wauplin
Contributor

Wauplin commented Jul 10, 2023

Hmmm, so no errors at all when using the one from diffusers... But I wouldn't say it is because of the DreamPose implementation either, since the failing part is really an internal consistency check within huggingface_hub 🤔

(Though now that you have successfully cached the repo locally, you should be able to continue with your training. It doesn't fix the actual issue, but at least it unblocks you, right?)
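
Concretely, once the repo is fully cached you can load it without touching the network (a sketch; local_files_only makes the load fail fast rather than re-download):

from diffusers import StableDiffusionImg2ImgPipeline

# Use only the local cache; raise immediately if any file is missing.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    local_files_only=True,
)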

@emreaniloguz
Author

Hmmm, so no errors at all when using the one from diffusers... But I wouldn't say it is because of the DreamPose implementation either, since the failing part is really an internal consistency check within huggingface_hub 🤔

(Though now that you have successfully cached the repo locally, you should be able to continue with your training. It doesn't fix the actual issue, but at least it unblocks you, right?)

I'll share the result in 5 min.

@emreaniloguz
Author

emreaniloguz commented Jul 10, 2023

The error is the same, but I think it must be related to the force_download parameter that I've hardcoded into the huggingface_hub library. The code tries to download the text_encoder model.safetensors file. I'll restore the library to its default version and give it a try. I'll also write here if it works.

@emreaniloguz
Author

emreaniloguz commented Jul 13, 2023

I first ran this script, where the safetensors were okay. Then I restored huggingface-hub to its default code, with the force_download parameter unchanged. Alas, the error remains.


@Wauplin
Contributor

Wauplin commented Jul 13, 2023

@emreaniloguz Just to be sure, the error now is 'text_config_dict' is provided which will be used to initialize 'CLIPTextConfig'. The value 'text_config["id2label"]' will be overridden., right? So not related to the initial consistency check failure? If that's the case, it'd be best to open an issue on the diffusers or DreamPose repository to get some more help.


Btw, our conversation made me realize that force_download was not correctly taken into account in diffusers, hence the hardcoded value that you needed to set. I've made a PR (huggingface/diffusers#4036), so it should be fixed in the next release, or if you install from the git source.
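
Once that fix is released, the flag should be forwardable straight through the pipeline call (a sketch; this assumes the linked PR is merged):

from diffusers import StableDiffusionImg2ImgPipeline

# force_download is forwarded down to the Hub download machinery.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    force_download=True,  # re-download files, ignoring the local cache
)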

@emreaniloguz
Author

To update the issue: I deleted the "revision" argument everywhere and was able to get past the problem, but the results were not what I expected. Someone else could try this as well.

@nathanshearer

nathanshearer commented Jan 20, 2025

I found this bug while trying to download https://huggingface.co/CompVis/stable-diffusion-v1-4 including all LFS data, and then searching for the file sizes and hashes reported by a failed LFS download.

$ git lfs fetch --all
fetch: 515 objects found, done.
fetch: Fetching all references...
expected OID 4666d0f9b718a6ed165ce95b8aac0d3d78031b8906fdc88ca8e735af5261788c, got 7b3a12df205cb3c74dd4eae4354d93f606ae6b3bc29d5d06fd97921cb9ad8a81 after 492265879 bytes written
error: failed to fetch some objects from 'https://huggingface.co/CompVis/stable-diffusion-v1-4.git/info/lfs'

In commit b32fdef93b6679cae16f5beb019a5dc60a030cc1, several files were added, including stable-diffusion-v1-4/model.safetensors, and the LFS pointer for that file specified a size of 1215981833 bytes with SHA-256 4666d0f9b718a6ed165ce95b8aac0d3d78031b8906fdc88ca8e735af5261788c.

However, that file does not exist on the server.

A different commit, 249dd2d739844dea6a0bc7fc27b3c1d014720b28, updates the LFS pointer for safety_checker/model.safetensors from SHA-256 4666d0f9b718a6ed165ce95b8aac0d3d78031b8906fdc88ca8e735af5261788c to 9d6a233ff6fd5ccb9f76fd99618d73369c52dd3d8222376384d0e601911089e8, which is the current version of the file.

Was an incorrect LFS file checked into the repository and then fixed in a later commit?

The missing file can be found here: https://huggingface.co/ckpt/anything-v3.0/blob/refs%2Fpr%2F1/safety_checker/model.safetensors
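
For anyone hitting the same mismatch, here is a small sketch (a hypothetical helper, not part of huggingface_hub or git-lfs) to check a downloaded file against the OID its LFS pointer expects:

import hashlib

def lfs_oid(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, i.e. its LFS OID."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "4666d0f9b718a6ed165ce95b8aac0d3d78031b8906fdc88ca8e735af5261788c"
print(lfs_oid("model.safetensors") == expected)  # False for the mismatched file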
