
Did you ever encounter issues with Trainer's _save_checkpoint and FSDP? #21

Open
geomlyd opened this issue Nov 15, 2024 · 13 comments

@geomlyd

geomlyd commented Nov 15, 2024

This may be system-dependent and a bit of a long shot, but I'm having some issues running training. All goes seemingly well until _save_checkpoint is called (if it matters, I'm running a toy training session with a dataset of 10 samples), at which point I get a PyTorch synchronization error (originating from a gather op) indicating that, e.g., rank 0's optimizer state has a large number of parameters (presumably as many as the backbone's) while rank 1's is empty.

I'm using srun on a machine with 4 A40s. Did you ever encounter anything similar?

@xiaoqian-shen
Collaborator

We did not encounter this problem when running on H100s or A100s.

@IceFlameWorm

I've run into the same issue. Have you managed to solve it? @geomlyd

@geomlyd
Author

geomlyd commented Dec 2, 2024

Hello @IceFlameWorm, not exactly: I never managed to get FSDP to run properly. What I did instead, and what seems to work so far, is switch to deepspeed. I did this by removing all FSDP-related arguments from the launching script and adding the --deepspeed $my_deepspeed_cfg_file argument, which is passed on to the base class of HuggingFace's trainer. In case it's helpful, I'm also attaching here the .json config I used for deepspeed.

Note that I say seems to work because a) I've been able to run some training experiments that did what I expected, but my aim so far was not to reproduce the paper's results, so I can't confirm that everything runs exactly as it should, and b) I've noticed that at the very end of training, if there has been a checkpoint resumption in between, the program crashes with some sort of triton-related error message. Nevertheless, this seems to happen after the model has been saved to disk, so it doesn't appear to have serious negative effects.
zero2.json
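
For orientation, a minimal ZeRO-2 config compatible with HuggingFace's Trainer generally looks something like the sketch below. This is illustrative only and not necessarily identical to the attached zero2.json; the "auto" values let the Trainer fill in settings from its own arguments:

# write a minimal ZeRO-2 config for the HF Trainer (values are illustrative)
cat > zero2.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF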

@IceFlameWorm

Thanks for your reply @geomlyd. I also noticed that although the training script crashes at the end, some checkpoint files do get saved. I tried to load the model from these files to run the quick inference code, but I got the following error: ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect. Do you have any solutions or suggestions? Thank you in advance.

@IceFlameWorm

@xiaoqian-shen We both still haven't solved this issue. Could you share your dev environment settings?

@geomlyd
Author

geomlyd commented Dec 2, 2024

In my experience, your best bet is probably to try deepspeed; inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints, without any training).

@IceFlameWorm

All right, maybe your advice is the only solution for me now.

@IceFlameWorm

@xiaoqian-shen Hi, although an error occurred at line 1109, trainer.train(), in train.py after finishing a finetune, some files were still saved, like the following:
[screenshot of the saved checkpoint files: screenshot-20241203-112530]

But when I tried to load this finetuned model using the quick inference code, the following errors were thrown out:

Traceback (most recent call last):
  File "/data/home/agent_ln/projects/LongVU/infer.py", line 20, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/data/home/agent_ln/projects/LongVU/longvu/builder.py", line 159, in load_pretrained_model
    model = CambrianQwenForCausalLM.from_pretrained(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect.

@xiaoqian-shen
Collaborator

Hi, @IceFlameWorm

It seems that you are loading the FSDP checkpoint. Remove this file and the checkpoint should be successfully loaded for inference.

@IceFlameWorm

Hi, @IceFlameWorm

It seems that you are loading the FSDP checkpoint. Remove this file and the checkpoint should be successfully loaded for inference.

I removed those safetensors files and renamed pytorch_model_fsdp.bin to pytorch_model.bin, and then inference seemed OK. Is that the right way to do it? @xiaoqian-shen
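
For reference, the cleanup described above amounts to roughly the following, where checkpoint-XXXX is a placeholder for whatever checkpoint directory the run produced:

cd checkpoint-XXXX                            # placeholder for the saved checkpoint directory
rm ./*.safetensors                            # remove the sharded safetensors files that fail to load
mv pytorch_model_fsdp.bin pytorch_model.bin   # rename the FSDP state dict as described above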

@orzgugu

orzgugu commented Dec 9, 2024

In my experience, your best bet is probably to try deepspeed; inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints, without any training).

I recently encountered the FSDP problem and want to switch to deepspeed. I'd like to ask you for help: could you provide the relevant code? Thank you very much!

@geomlyd
Author

geomlyd commented Dec 18, 2024

Hello @orzgugu, I don't think I had to make any changes to the code itself. See my comment above: remove all FSDP-related arguments provided to train.py and instead pass --deepspeed $my_deepspeed_cfg_file with the deepspeed json config I uploaded above (other configurations may work as well), as sketched below.
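
To make that concrete, the change to the launch command is along these lines. Everything except --fsdp, --fsdp_config, and --deepspeed is a placeholder for whatever your existing srun/torchrun invocation already passes:

# before: FSDP arguments forwarded to HuggingFace's TrainingArguments
torchrun --nproc_per_node=4 train.py \
  --fsdp "full_shard auto_wrap" \
  --fsdp_config fsdp_config.json \
  --output_dir ./checkpoints   # ...plus the rest of your training arguments

# after: drop the FSDP arguments and point the Trainer at the deepspeed config
torchrun --nproc_per_node=4 train.py \
  --deepspeed zero2.json \
  --output_dir ./checkpoints   # ...plus the rest of your training arguments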

However, unrelated to FSDP or deepspeed, I made some changes to the code that saves/loads checkpoints, since my cluster environment has time limits on submitted jobs. I adapted these from changes I had previously found necessary to make VideoLlaVa's codebase resume training properly. Note that I'm not sure whether things would work just as well without them; I just thought it would be safer to do this as a precaution, and so far everything has worked. I can't directly upload .py files here, so I link to a repo I've made containing just these files.

@jzhzhang

@xiaoqian-shen Met the same problem on H800s. Is there an official solution available for this issue?
