Distributed only utilizing the GPU on the host machine #1218

Open · austinbv opened this issue Jan 21, 2025 · 8 comments

@austinbv

I have been playing with distributed training and generation and am running into a weird issue where only the computer that runs mpirun actually utilizes its GPU. The other machines load the model into memory but never actually use their GPUs.

[Image]

The command I am using to run the training is:

mpirun \
    --mca btl_tcp_links 4 \
    --mca btl_tcp_if_include bridge0 \
    --verbose \
    --mca btl_base_verbose 100 \
    -np 4 \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages  \
    --hostfile ./hosts.txt \
    /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --batch-size 1

I have tried not using Thunderbolt, etc., and I still see the same issue. Whichever host I start mpirun from uses its GPU, but none of the other machines do.

My hosts.txt is pretty simple; ssh works, everything connects, and training runs, but it only runs at ~30 tokens/second with no distribution.

ai-mac-1.local  slots=1
ai-mac-2.local  slots=1
ai-mac-3.local  slots=1
ai-mac-4.local  slots=1

@angeloskath
Member

Hi, the batch size must be divisible by the number of workers, so using batch size 1 means one of two possibilities:

  1. It still hasn't started training, and the scripts will fail when they reach line 91 in mlx_lm/tuner/trainer.py.
  2. MLX failed to find MPI, so it failed to initialize the distributed group, and each machine is running as if it were an independent training run.

My guess is the 2nd of the above. I would add the environment variable -x DYLD_LIBRARY_PATH=/opt/homebrew/lib (presumably). Another option is to run mpirun with its full path, e.g. /opt/homebrew/bin/mpirun.
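
A quick way to confirm which case you are in (a sketch, reusing the paths from your command above) is to launch a one-liner with the same mpirun setup and see what the distributed group reports:

# Sanity check: each host should report a different rank if MPI was found
/opt/homebrew/bin/mpirun -np 4 --hostfile ./hosts.txt \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages \
    -x DYLD_LIBRARY_PATH=/opt/homebrew/lib \
    /opt/homebrew/bin/uv run python -c \
        "import mlx.core as mx; g = mx.distributed.init(); print(f'rank {g.rank()} of {g.size()}')"

If MPI was picked up, each process prints a distinct rank out of 4; if every process reports rank 0 of 1, the distributed group did not initialize and each machine is training on its own.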

@austinbv
Author

[Image]

Nailed it, thank you!

austinbv reopened this Jan 22, 2025

@austinbv
Author

I wanted to reopen to talk about training speed, because it seems slow...

We now have 5 Mac mini M4 Pro (64 GB) machines training and are getting ~50 tok/sec. Is that normal?

The run command is:

/opt/homebrew/bin/mpirun \
    --mca btl_tcp_links 4 \
    --mca btl_tcp_if_include bridge0 \
    -np 5 \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages  \
    -x DYLD_LIBRARY_PATH=/opt/homebrew/lib \
    --hostfile ./hosts.txt \
    /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --num-layers 4 \
        --val-batches 1 \
        --learning-rate 5e-5 \
        --steps-per-eval 1000 \
        --save-every 500 \
        --grad-checkpoint \
        --batch-size 5 # I am using a pretty small dataset

[Image]

@angeloskath
Member

It does seem a bit low. First, run the training on a single machine to get a baseline number: how many tokens/sec of training throughput does a single machine get with batch size 1? Note the iterations per second and compare before and after adding distributed training. The GPU utilization seems pretty high, but the sampling from mactop (or whatever the program on the left is) may be misleading.

I am getting ~45 on one M2 Ultra (and a different dataset) without gradient checkpointing, so it doesn't seem completely off. It should probably scale better though, so getting the above numbers would be interesting.

@angeloskath
Member

By the way, since you have gradient checkpointing enabled you can increase the per-node batch size, which will probably improve the throughput a bit even though the iterations will be slower. It will also decrease the relative communication time, so it is almost certain that you can get a 5x speedup compared to a single Mac mini.
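
For example, this is a sketch of the same command as above with only the batch size changed, so each of the 5 workers gets 5 samples per step (tune this to what fits in memory):

/opt/homebrew/bin/mpirun \
    --mca btl_tcp_links 4 \
    --mca btl_tcp_if_include bridge0 \
    -np 5 \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages \
    -x DYLD_LIBRARY_PATH=/opt/homebrew/lib \
    --hostfile ./hosts.txt \
    /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --num-layers 4 \
        --val-batches 1 \
        --learning-rate 5e-5 \
        --steps-per-eval 1000 \
        --save-every 500 \
        --grad-checkpoint \
        --batch-size 25 # 5 samples per node across 5 workers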

@austinbv
Author

Well, maybe it's right:

% /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --num-layers 4 \
        --val-batches 1 \
        --learning-rate 5e-5 \
        --steps-per-eval 1000 \
        --save-every 500 \
        --grad-checkpoint \
        --batch-size 5
Loading pretrained model
Fetching 13 files: 100%|██████████████████████| 13/13 [00:00<00:00, 143112.73it/s]
Loading datasets
Training
Trainable parameters: 0.023% (16.384M/70553.706M)
Starting training..., iters: 1000
Iter 1: Val loss 5.819, Val took 6.471s
Iter 10: Train loss 3.570, Learning Rate 5.000e-05, It/sec 0.053, Tokens/sec 12.006, Trained Tokens 2265, Peak mem 41.579 GB

surprised

@angeloskath
Member

Yep, that looks right. Keep in mind that if you were to use a distributed batch size of 25 you would likely be close to 60 tps, so practically perfect scaling.

It does sound very low, but it is a 70B-parameter model nonetheless...

50 tps means ~1M tokens every 5 hours, which I think is sufficient to finetune many a dataset overnight.
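
As a rough check on the scaling claim: the single-machine run above gets ~12 tok/sec, and 5 machines × ~12 tok/sec ≈ 60 tok/sec, which is where the "close to 60 tps" figure comes from; likewise 50 tok/sec × 3600 s/hour × 5 hours ≈ 900k tokens, i.e. roughly 1M.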

@austinbv
Author

Thanks for your help! Yeah, the issue is just that my dataset is super small for this one.
