Did you ever encounter issues with Trainer's `_save_checkpoint` and FSDP? #21
Comments
We did not encounter this problem when running on H100s or A100s.
I'm running into the same issue; have you managed to solve it? @geomlyd
Hello @IceFlameWorm, not exactly: I never managed to get FSDP to run properly. What I did instead, and what seems to work so far, was to switch to deepspeed. I did this by removing all FSDP-related arguments from the launching script and adding the deepspeed config argument. Note that I say seems to work because a) I've been able to run some training experiments that did what I expected, but my aim so far was not to reproduce the paper's results, so I can't confirm that everything runs exactly as it should, and b) I've noticed that at the very end of training, if there has been a checkpoint resumption in between, the program crashes with some sort of triton-related error message. Nevertheless, this seems to happen after the model has been saved to disk, so it doesn't appear to have serious negative effects.
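For anyone making the same switch, a minimal sketch of what I mean is below, assuming the standard HuggingFace `TrainingArguments` API; the output directory, batch size, and DeepSpeed config file name are placeholders, and the repo's actual launch script may wire these arguments up differently.

```python
# Minimal sketch (not this repo's actual script): swapping FSDP for DeepSpeed
# via HuggingFace TrainingArguments. All values below are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",            # placeholder
    per_device_train_batch_size=1,     # placeholder
    bf16=True,
    # previously something like: fsdp="full_shard auto_wrap", fsdp_config="fsdp.json"
    # instead, point the Trainer at a DeepSpeed config (e.g. a ZeRO-2/3 JSON):
    deepspeed="ds_config_zero2.json",  # assumed config file name
)
```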
Thanks for your reply @geomlyd. I also noticed that although the training script crashes at the end, some checkpoint files are indeed saved. I tried to load the model from these files to run the quick inference code, but I got the following error: ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect. Do you have any solutions or suggestions? Thank you in advance.
@xiaoqian-shen Neither of us has solved this issue yet; could you share your dev environment settings?
From my experience, your best bet is probably to use deepspeed: inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints without any training).
All right, maybe your advice is the only solution for me now.
Hi @xiaoqian-shen, although an error occurred at line 1109, some checkpoint files were still saved. But when I tried to load this finetuned model using the quick inference code, the following errors were thrown:
Hi @IceFlameWorm, it seems that you are loading the FSDP checkpoint. Remove that file and the checkpoint should load successfully for inference.
I removed those safetensors files and renamed pytorch_model_fsdp.bin to pytorch_model.bin, and then inference seemed OK. Is that the right way to do it? @xiaoqian-shen
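In code, what I did is roughly the following sketch; the checkpoint path and exact file names are assumptions, so check what your output directory actually contains first.

```python
# Sketch of the workaround above (file names are assumptions): drop the sharded
# safetensors files and rename the FSDP state dict so that from_pretrained
# treats it as a regular pytorch_model.bin.
from pathlib import Path

ckpt_dir = Path("outputs/checkpoint-1000")  # illustrative checkpoint directory

# remove the (incomplete) safetensors shards and their index, if present
for shard in ckpt_dir.glob("*.safetensors"):
    shard.unlink()
(ckpt_dir / "model.safetensors.index.json").unlink(missing_ok=True)

# expose the FSDP checkpoint under the name the loader expects
(ckpt_dir / "pytorch_model_fsdp.bin").rename(ckpt_dir / "pytorch_model.bin")
```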
I recently ran into the FSDP problem and want to switch to deepspeed. I would like to ask you for help: can you provide the relevant code? Thank you very much!
Hello @orzgugu, I don't think I had to make changes to the code itself. See my comment above: you should remove all fsdp-related arguments provided to the launching script and add the deepspeed config argument instead. However, unrelated to FSDP or deepspeed, I made some changes to the code that saves/loads checkpoints, since my cluster environment has time limits on submitted jobs. I based these on some changes I had previously found necessary for making VideoLlaVa's codebase resume training properly. Note that I'm not sure whether things would work just as well without them; I just thought it would be better to do this as a precaution, and so far everything has worked. I can't directly upload my modified files here.
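Since I can't attach the files, the gist of the resumption logic looks roughly like the sketch below, assuming a standard HuggingFace Trainer setup; `trainer` and `training_args` stand in for whatever this repo's training script already constructs, and my actual edits may differ in detail.

```python
# Rough sketch of checkpoint resumption with a HuggingFace Trainer, as a
# stand-in for the changes described above (the real edits may differ).
from transformers.trainer_utils import get_last_checkpoint

# assumed: `trainer` and `training_args` are built elsewhere in the training script
last_ckpt = None
if training_args.output_dir:
    # returns the newest "checkpoint-*" folder, or None if there isn't one yet
    last_ckpt = get_last_checkpoint(training_args.output_dir)

# resume from the most recent checkpoint if one exists, otherwise start fresh
trainer.train(resume_from_checkpoint=last_ckpt)
```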
@xiaoqian-shen I'm running into the same problem on H800s. Is there an official solution available for this issue?
Original issue (@geomlyd):
This may be system-dependent and a bit of a long shot, but I'm having some issues running training. All goes seemingly well until `_save_checkpoint` is called (if it matters, I'm running a toy training session with a dataset of 10 samples), at which point I receive some kind of pytorch synchronization error (originating from a `gather` op) indicating that, e.g., rank 0's optimizer state has a large number of parameters (presumably as many as the backbone's) while rank 1's is empty. I'm using `srun` on a machine with 4 A40s. Did you ever encounter anything similar?