How to increase batch size by using multiple gpus? #3207
Hello! My fine-tuned model needs a large batch size to get the best performance. I have multiple GPUs with 40 GB of VRAM each. How can I use them together to enlarge the batch size? Currently I can only set the batch size to 3 per GPU, and it seems the GPUs won't share the data. How can I make the total batch size 24?

Comments
Hello! I would recommend having a read through #2831, where we discuss sharing negatives across devices (i.e., using multiple GPUs to create one big batch in which the in-batch negatives are shared). So, you can use a large per-device batch size, e.g. 32, and you can scale this 32 up to any number that you wish (larger is generally better, to an extent). Then, using multiple GPUs should allow you to parallelize this quite well. Each batch stays on its own GPU, but you're processing 8 times as many large batches as with just one GPU.
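As a concrete illustration (a minimal sketch only; the loss, model, and dataset below are placeholders, not taken from this thread), a large per-device batch with in-batch negatives could look like this, using `CachedMultipleNegativesRankingLoss` so the batch size is not limited by what fits in a single forward pass:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Placeholder model and dataset; swap in your own.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")

# The cached (GradCache) variant lets the in-batch-negative batch grow beyond what fits
# in one forward pass; mini_batch_size bounds memory use, not the number of negatives.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=32,  # scale this up as far as memory allows
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

When a script like this is launched with 8 data-parallel processes, each GPU trains on its own batch of 32, matching the "8 times as many batches" framing above.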
Thank you so much! It works. But there is also another big issue... It seems the model gets loaded repeatedly, which causes the GPU memory to OOM. Some other people have also reported this problem. Some 7B models cannot be trained even on an 80 GB GPU like the A100.
Fair enough! For that, you would need FSDP. FSDP is partially supported in Sentence Transformers, but it hasn't been tested extensively. See the documentation here: https://sbert.net/docs/sentence_transformer/training/distributed.html#fsdp
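As a rough sketch of what the linked page describes (the exact layer class to wrap is an assumption here and depends on the model architecture), FSDP is switched on through the training arguments, and the script is then launched with `accelerate launch` or `torchrun`:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Shard the model's parameters across GPUs instead of keeping a full copy on each one.
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=32,
    fsdp=["full_shard", "auto_wrap"],
    fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"},  # adjust to your model's encoder layer class
)
```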
Thank you so much! Now it runs properly on 1 GPU. But if I extend it to 8, it leads to OOM. When I use torchrun, it seems all of the processes run on GPU 0, and when I use the accelerator, GPU 0 also hits OOM errors. Could you help me figure out this problem? Thank you so much!
Did you keep the rest of the setup the same? You shouldn't have to set up the accelerator yourself; that should be taken care of. The training script should be pretty much the same as for 1 GPU (the only difference is that you have to wrap your main code in a `main()` function guarded by `if __name__ == "__main__":`, as shown below).

Edit: I realise now that it might be that the device placement is too naive. Instead, you should use:

```python
import os

from sentence_transformers import SentenceTransformer


def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # 1. Load a model to finetune
    model = SentenceTransformer(
        model_name_or_path="Alibaba-NLP/gte-Qwen2-7B-instruct",
        model_kwargs={
            "device_map": "auto",
        },
        tokenizer_kwargs={
            "model_max_length": 512,
            "truncation": True,
        },
        device=f"cuda:{local_rank}",
    )
    # set the max input seq length to 512
    model.max_seq_length = 512


if __name__ == "__main__":
    main()
```

This uses the `LOCAL_RANK` environment variable to place each process's model on its own GPU.
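For reference, a script like this (here hypothetically called `script.py`) would then be launched with e.g. `torchrun --nproc_per_node=8 script.py` or `accelerate launch script.py`; both launchers set a distinct `LOCAL_RANK` for every process.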
Hi~
Hmm, perhaps then the
Case 1: Case 2: Case 3: Case 4: This is a really strange problem. I don't know whether it is caused by my script.py or not. Here is the code of the script.py: