Getting errors when trying to replicate the distilling operation #2
This error is primarily because flash attention does not support the head mask. So we actually did not use flash attention during sparsification and only used it during distillation. We have updated the codebase to not use flash attention for sparsification, so you can give it a try now.
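To illustrate the point, a minimal sketch (not the repo's actual attention code, and assuming the flash-attn 2 Python interface): eager attention exposes the attention probabilities, so a per-head mask can be applied to them, while the fused flash attention kernel offers no such hook.

```python
import torch
from flash_attn import flash_attn_func  # fused kernel; its API has no head_mask argument

def eager_attention(q, k, v, head_mask=None):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    if head_mask is not None:
        # head_mask broadcasts over the attention probabilities, which is what the
        # sparsification step needs in order to silence individual heads.
        probs = probs * head_mask
    return torch.matmul(probs, v)

# flash_attn_func(q, k, v, causal=True) computes everything inside one fused kernel,
# so there is no place to inject a per-head mask; hence flash attention is only
# safe to enable for distillation, not for sparsification.
```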
Thanks for the update. The sparsification now starts. However, I'm getting an out of memory error running on one A100 GPU with 80GB VRAM:
How much memory does this need? In the tutorial, you mentioned it's possible to prune with 1xA100.
That is strange. I use exactly one A100-80G for sparsification. Are you sparsifying a larger llama, say llama-13b?
No, just llama2 7b.
Then could you please provide other details, e.g., hyperparameters?
What hyperparameters are you referring to? The commands I ran are copy-pasted from the Tutorial page in this repo. My dataset is 1.5GB (the dataset used during pruning). I'm trying to sparsify an already fine-tuned llama2 model; I don't believe that will make a difference in terms of memory usage?
For example, are you using a larger sequence length? I used a relatively short sequence length during pruning, i.e., 512.
These are the two commands I ran:
I can try with a smaller pruning length? Will that affect quality?
Oh, I am awfully sorry that I did not give the tip that the data for pruning should be rebuilt with the shorter sequence length (512) mentioned above. Besides, we did not test much whether the pruning length will affect the quality. Maybe we could examine that later.
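For reference, a hedged sketch of what rebuilding the pruning data with a 512-token sequence length could look like; the repo's own preprocessing script may differ, and the tokenizer path is only illustrative.

```python
from transformers import AutoTokenizer

SEQ_LEN = 512  # the shorter pruning sequence length mentioned above

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_blocks(texts, seq_len=SEQ_LEN):
    """Concatenate tokenized texts and cut them into fixed-length blocks."""
    ids = []
    for text in texts:
        ids.extend(tokenizer(text).input_ids)
    # drop the trailing remainder so every block is exactly seq_len tokens
    n_blocks = len(ids) // seq_len
    return [ids[i * seq_len:(i + 1) * seq_len] for i in range(n_blocks)]
```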
No problem, thanks for helping me. I will try again with a shorter sequence length. Perhaps you should update your tutorial page with the correct seq_length as well.
Yes, I shall update it later. If you further encounter any questions, please let me know.
The sparsification is running now, thank you for the help! I'm wondering how long it takes on an A100? It has been running for the whole night (about 10 hours) now. Wondering if it's stuck or if it's supposed to take that long.
It should take a long time; in my case, 1GB of data would take more than 1 day to go :<
Ran sparsification for 1.5 days, got this error:
Basically, I wrote the NaN loss detection in case of any loss spiking, which may potentially result in unexpected behavior in pruning. However, I have not encountered this issue during pruning in my experiments. The issue in your case is perhaps correlated with your data, so I suggest adding an if-nan-then-continue logic to skip the data, or directly using float32 instead of float16. BTW, I am trying to integrate deepspeed and flash attention into the pruning process so that you could achieve higher speed :/
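A minimal sketch of the "if-nan-then-continue" workaround; `dataloader`, `model`, and `optimizer` are placeholders, and the repo's actual pruning loop will differ.

```python
import torch

for step, batch in enumerate(dataloader):
    loss = model(**batch).loss
    if not torch.isfinite(loss):
        # skip the offending batch instead of letting a NaN/Inf loss poison pruning
        print(f"step {step}: non-finite loss, skipping batch")
        optimizer.zero_grad()
        continue
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```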
Hi @l3utterfly I have updated the pruning process with deepspeed and flash attention, which largely reduces the compute time from more than 1 day to several hours. Hope you will find it useful!
Thank you so much! I will start a new run tonight!
I tried to distill with the new code. I noticed the pruning process now recommends 8 A100 GPUs? I am using 1 A100 GPU and it still runs out of memory with my 1GB dataset. A lower number of GPUs should only affect the time taken and not memory usage, right?
With fewer GPUs, maybe the batch size should also be decreased accordingly, since deepspeed will permit a larger batch size per GPU when more GPUs are used.
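For illustration, how the effective batch size ties together under DeepSpeed; the key names follow the DeepSpeed config schema, but the numbers here are hypothetical.

```python
micro_batch_per_gpu = 4
grad_accum_steps = 8
num_gpus = 1  # was 8 in the recommended setup

ds_config = {
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": grad_accum_steps,
    # DeepSpeed checks: train_batch_size == micro_batch * grad_accum * num_gpus
    "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
}
# With 1 GPU instead of 8, either accept an 8x smaller effective batch, or keep
# train_micro_batch_size_per_gpu small and raise gradient_accumulation_steps so
# the run stays within 80GB of VRAM.
```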
I reduced the batch size. Also, what should be the values here?
In your case, the two parameters should be removed. And I am not quite sure whether deepspeed would work for 1 GPU or not.
Should I remove the deepspeed argument then?
Not really; deepspeed is integrated for speedup, and removing it will result in errors...
I don't have access to 2 GPUs sadly. Are you using the 40GB A100 or the 80GB version?
80GB version.
I am closing this issue since it is not active; feel free to reopen it as you like.
Trying with the llama2 base weights.
I get the following error:
After hardcoding use_cache=False and continuing, I get the following error:
Can you help please?
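As a side note, instead of hardcoding use_cache=False inside the modeling file, the KV cache can usually be turned off through the Hugging Face config; a hedged sketch (the model path is illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_cache=False,  # training / gradient-checkpointing paths expect the cache off
)
# or, after loading:
model.config.use_cache = False
```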
Also:
from modules.fused_rope_monkey_patch_llama import apply_rotary_pos_emb
this seems to be a wrong import? Should it be:
from modules.modeling_llama import apply_rotary_pos_emb
in the file flash_attn_monkey_patch_llama.py, line 10?