Distributed only utilizing the GPU on the host machine #1218
Hi, the batch size must be divisible by the number of workers, so using batch size 1 means one of two possibilities.
My guess is the second of the above. I would add the environment variable
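
(A minimal sketch of the divisibility constraint described above, assuming MLX's `mx.distributed` API; the variable names are hypothetical.)

```python
import mlx.core as mx

# Each worker gets an equal slice of the global batch, so the global
# batch size must divide evenly by the number of workers.
group = mx.distributed.init()
global_batch_size = 1  # the value from the run command in question

if global_batch_size % group.size() != 0:
    raise ValueError(
        f"batch size {global_batch_size} is not divisible by "
        f"{group.size()} workers"
    )

per_worker = global_batch_size // group.size()
print(f"rank {group.rank()}: {per_worker} samples per step")
```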
I wanted to reopen to talk about speed of inference because it seems slow... we now have 5 Mac mini M4 Pro 64 GB machines training and I am getting ~50 tok/sec. Is that normal? The run command is:
It does seem a bit low. First run the training on a single machine to get a baseline number: how many tps of training throughput does a single machine with batch size 1 get? Note the iterations per second and compare before and after adding the distributed training. The GPU utilization seems pretty high, but maybe the sampling from mactop (or whatever the program on the left is) is misleading. I am getting ~45 on one M2 Ultra (and a different dataset) without gradient checkpointing, so it doesn't seem completely off. It should probably scale better, though, so getting the above numbers would be interesting.
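
(As a concrete way to get that baseline, a hedged sketch: the `python -m mlx_lm.lora` invocation and its flags are my assumption of the trainer being used, and the model/data paths are placeholders, not taken from the thread.)

```sh
# Hypothetical single-machine baseline (no mpirun), assuming the
# mlx-lm LoRA trainer; model and data paths are placeholders.
python -m mlx_lm.lora \
  --model ./my-70b-model \
  --train \
  --data ./my-dataset \
  --batch-size 1 \
  --iters 50 \
  --steps-per-report 10   # compare the reported it/sec and tok/sec here
```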
By the way, since you have gradient checkpointing enabled you can increase the per-node batch size, which will probably improve the throughput a bit even though the iterations will be slower. It will also decrease the relative communication time, so it is almost certain that you can get a 5x speedup compared to a single Mac mini.
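
(To make that concrete, the same sketch under the same assumptions, launched across the 5 machines with a larger global batch; the hostnames in hosts.txt and all paths are placeholders.)

```sh
# Hypothetical distributed run: with 5 workers the batch size must be a
# multiple of 5, e.g. 25 (5 samples per node), per the divisibility rule.
mpirun --hostfile hosts.txt -np 5 \
  python -m mlx_lm.lora \
    --model ./my-70b-model \
    --train \
    --data ./my-dataset \
    --batch-size 25 \
    --grad-checkpoint
```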
Well, maybe it's right. Surprised!
Yep, that looks right. Keep in mind that if you were to use a distributed batch size of 25 you would likely be close to 60 tps, so practically perfect scaling. It does sound very low, but it is a 70B parameter model nonetheless... 50 tps means ~1M tokens every 5 hours, which I think is sufficient to finetune on many a dataset overnight.
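
(For reference, the arithmetic behind that estimate:)

```python
# Back-of-the-envelope check of the "~1M tokens every 5 hours" figure.
tps = 50                       # observed training throughput
tokens_per_hour = tps * 3600   # 180,000 tokens/hour
print(tokens_per_hour * 5)     # 900,000, i.e. ~1M tokens in 5 hours
```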
Thanks for your help! Yeah, the issue is just that my dataset is super small for this one.
I have been playing with distributed training and generation and am running into a weird issue where only the computer that runs `mpirun` actually utilizes the GPU. The rest load the models into memory but do not actually use any GPU. The command I am using to run the training is:

I have tried not using Thunderbolt etc. and still have the same issue. If I run `mpirun` from any of my hosts, the host that I start the program on uses its GPU but no other computer does. My hosts.txt is pretty simple, ssh works, everything connects, and training runs; it just runs at 30 tokens/second with no distribution.
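
(One quick way to check whether the distributed group actually formed, a minimal sketch assuming MLX's `mx.distributed` API: if the backend fails to initialize, every process reports a group of size 1 and each host trains alone, which would match the symptom described above.)

```python
# check_group.py -- verify the MPI group formed across hosts.
import mlx.core as mx

group = mx.distributed.init()
# "rank 0 of 1" printed on every machine means the processes never
# joined a single group and are each running independently.
print(f"rank {group.rank()} of {group.size()}")
```

Launched the same way as the training, e.g. `mpirun --hostfile hosts.txt -np 5 python check_group.py`.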