
Did you ever encounter issues with Trainer's _save_checkpoint and FSDP? #21

Open
geomlyd opened this issue Nov 15, 2024 · 13 comments

@geomlyd

geomlyd commented Nov 15, 2024

This may be system-dependent and a bit of a long shot, but I'm having some issues running training. All goes seemingly well until _save_checkpoint is called (if it matters, I'm running a toy training session with a dataset of 10 samples), at which point I get a PyTorch synchronization error (originating from a gather op) indicating that, e.g., rank 0's optimizer state has a large number of parameters (presumably as many as the backbone's) while rank 1's is empty.

I'm using srun on a machine with 4 A40s. Did you ever encounter anything similar?

@xiaoqian-shen
Collaborator

We did not encounter this problem when running on H100s or A100s.

@IceFlameWorm

I've run into the same issue. Have you managed to solve it? @geomlyd

@geomlyd
Author

geomlyd commented Dec 2, 2024

Hello @IceFlameWorm, not exactly: I never managed to get FSDP to run properly. What I did instead, and what seems to work so far, is switch to deepspeed. I did this by removing all FSDP-related arguments from the launching script and adding the --deepspeed $my_deepspeed_cfg_file argument, which is passed on to the base class of HuggingFace's trainer. In case it's helpful, I'm also attaching here the .json config I used for deepspeed.

Note that I say seems to work because a) I've been able to run some training experiments that did what I expected, but my aim so far was not to reproduce the paper's results, so I can't confirm that everything runs exactly as it should, and b) I've noticed that at the very end of training, if there has been a checkpoint resumption in between, the program crashes with some sort of triton-related error message. Nevertheless, this seems to happen after the model has been saved to disk, so it doesn't appear to have serious negative effects.
zero2.json
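
For orientation, a minimal ZeRO-2 config compatible with HuggingFace's Trainer generally looks something like the sketch below. This is illustrative only and not necessarily identical to the attached zero2.json; the "auto" values let the Trainer fill in settings from its own arguments:

# write a minimal ZeRO-2 config for the HF Trainer (values are illustrative)
cat > zero2.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF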

@IceFlameWorm

Thanks for your reply @geomlyd. I also noticed that although the training script crashes at the end, some checkpoint files do get saved. I tried to load the model from these files to run the quick inference code, but I got the following error: ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect. Do you have any solutions or suggestions? Thank you in advance.

@IceFlameWorm

@xiaoqian-shen We both still haven't solved this issue. Could you share your dev environment settings?

@geomlyd
Author

geomlyd commented Dec 2, 2024

In my experience, your best bet is probably to try deepspeed; inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints, without any training).

@IceFlameWorm

All right, maybe your advice is the only solution for me now.

@IceFlameWorm

@xiaoqian-shen Hi, although an error occurred at line 1109, trainer.train(), in train.py after finishing a finetune, some files were still saved, like the following:
[screenshot of the saved checkpoint files: screenshot-20241203-112530]

But when I tried to load this finetuned model using the quick inference code, the following errors were thrown out:

Traceback (most recent call last):
  File "/data/home/agent_ln/projects/LongVU/infer.py", line 20, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/data/home/agent_ln/projects/LongVU/longvu/builder.py", line 159, in load_pretrained_model
    model = CambrianQwenForCausalLM.from_pretrained(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect.

@xiaoqian-shen
Collaborator

Hi, @IceFlameWorm

It seems that you are loading the FSDP checkpoint. Remove this file and the checkpoint should be successfully loaded for inference.

@IceFlameWorm

Hi, @IceFlameWorm

It seems that you are loading the FSDP checkpoint. Remove this file and the checkpoint should be successfully loaded for inference.

I removed those safetensors files and renamed pytorch_model_fsdp.bin to pytorch_model.bin, and then inference seemed OK. Is that the right way to do it? @xiaoqian-shen
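
For reference, the cleanup described above amounts to roughly the following, where checkpoint-XXXX is a placeholder for whatever checkpoint directory the run produced:

cd checkpoint-XXXX                            # placeholder for the saved checkpoint directory
rm ./*.safetensors                            # remove the sharded safetensors files that fail to load
mv pytorch_model_fsdp.bin pytorch_model.bin   # rename the FSDP state dict as described above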

@orzgugu

orzgugu commented Dec 9, 2024

In my experience, your best bet is probably to try deepspeed; inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints, without any training).

I recently encountered the FSDP problem and want to switch to deepspeed. I'd like to ask you for help: could you provide the relevant code? Thank you very much!

@geomlyd
Author

geomlyd commented Dec 18, 2024

Hello @orzgugu, I don't think I had to make any changes to the code itself. See my comment above: remove all FSDP-related arguments provided to train.py and instead pass --deepspeed $my_deepspeed_cfg_file with the deepspeed json config I uploaded above (other configurations may work as well), as sketched below.
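
To make that concrete, the change to the launch command is along these lines. Everything except --fsdp, --fsdp_config, and --deepspeed is a placeholder for whatever your existing srun/torchrun invocation already passes:

# before: FSDP arguments forwarded to HuggingFace's TrainingArguments
torchrun --nproc_per_node=4 train.py \
  --fsdp "full_shard auto_wrap" \
  --fsdp_config fsdp_config.json \
  --output_dir ./checkpoints   # ...plus the rest of your training arguments

# after: drop the FSDP arguments and point the Trainer at the deepspeed config
torchrun --nproc_per_node=4 train.py \
  --deepspeed zero2.json \
  --output_dir ./checkpoints   # ...plus the rest of your training arguments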

However, unrelated to FSDP or deepspeed, I made some changes to the code that saves/loads checkpoints, since my cluster environment has time limits on submitted jobs. I adapted these from changes I had previously found necessary to make VideoLlaVa's codebase resume training properly. Note that I'm not sure whether things would work just as well without them; I just thought it would be safer to do this as a precaution, and so far everything has worked. I can't directly upload .py files here, so I link to a repo I've made containing just these files.

@jzhzhang

@xiaoqian-shen Met the same problem on H800s. Is there an official solution available for this issue?
