Distributed only utilizing the GPU on the host machine #1218

Open · austinbv opened this issue Jan 21, 2025 · 8 comments

@austinbv

I have been playing with distributed training and generation and am running into a weird issue where only the computer that runs mpirun actually utilizes its GPU. The other machines load the model into memory but never actually use their GPUs.

[Image]

The command I am using to run the training is:

mpirun \
    --mca btl_tcp_links 4 \
    --mca btl_tcp_if_include bridge0 \
    --verbose \
    --mca btl_base_verbose 100 \
    -np 4 \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages  \
    --hostfile ./hosts.txt \
    /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --batch-size 1

I have tried not using Thunderbolt, etc., and I still see the same issue. Whichever host I start mpirun from uses its GPU, but none of the other machines do.

My hosts.txt is pretty simple; ssh works, everything connects, and training runs, but it only runs at ~30 tokens/second with no distribution.

ai-mac-1.local  slots=1
ai-mac-2.local  slots=1
ai-mac-3.local  slots=1
ai-mac-4.local  slots=1

@angeloskath
Member

Hi, the batch size must be divisible by the number of workers, so using batch size 1 means one of two possibilities:

  1. It still hasn't started training, and the scripts will fail when they reach line 91 in mlx_lm/tuner/trainer.py.
  2. MLX failed to find MPI, so it failed to initialize the distributed group, and each machine is running as if it were an independent training run.

My guess is the 2nd of the above. I would add the environment variable -x DYLD_LIBRARY_PATH=/opt/homebrew/lib (presumably). Another option is to run mpirun with its full path, e.g. /opt/homebrew/bin/mpirun.
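
A quick way to confirm which case you are in (a sketch, reusing the paths from your command above) is to launch a one-liner with the same mpirun setup and see what the distributed group reports:

# Sanity check: each host should report a different rank if MPI was found
/opt/homebrew/bin/mpirun -np 4 --hostfile ./hosts.txt \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages \
    -x DYLD_LIBRARY_PATH=/opt/homebrew/lib \
    /opt/homebrew/bin/uv run python -c \
        "import mlx.core as mx; g = mx.distributed.init(); print(f'rank {g.rank()} of {g.size()}')"

If MPI was picked up, each process prints a distinct rank out of 4; if every process reports rank 0 of 1, the distributed group did not initialize and each machine is training on its own.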

@austinbv
Author

[Image]

Nailed it, thank you!

austinbv reopened this Jan 22, 2025

@austinbv
Author

I wanted to reopen to talk about training speed, because it seems slow...

We now have 5 Mac mini M4 Pro (64 GB) machines training and are getting ~50 tok/sec. Is that normal?

The run command is:

/opt/homebrew/bin/mpirun \
    --mca btl_tcp_links 4 \
    --mca btl_tcp_if_include bridge0 \
    -np 5 \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages  \
    -x DYLD_LIBRARY_PATH=/opt/homebrew/lib \
    --hostfile ./hosts.txt \
    /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --num-layers 4 \
        --val-batches 1 \
        --learning-rate 5e-5 \
        --steps-per-eval 1000 \
        --save-every 500 \
        --grad-checkpoint \
        --batch-size 5 # I am using a pretty small dataset

[Image]

@angeloskath
Member

It does seem a bit low. First, run the training on a single machine to get a baseline number: how many tokens/sec of training throughput does a single machine get with batch size 1? Note the iterations per second and compare before and after adding distributed training. The GPU utilization seems pretty high, but the sampling from mactop (or whatever the program on the left is) may be misleading.

I am getting ~45 on one M2 Ultra (and a different dataset) without gradient checkpointing, so it doesn't seem completely off. It should probably scale better though, so getting the above numbers would be interesting.

@angeloskath
Member

By the way, since you have gradient checkpointing enabled you can increase the per-node batch size, which will probably improve the throughput a bit even though the iterations will be slower. It will also decrease the relative communication time, so it is almost certain that you can get a 5x speedup compared to a single Mac mini.
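
For example, this is a sketch of the same command as above with only the batch size changed, so each of the 5 workers gets 5 samples per step (tune this to what fits in memory):

/opt/homebrew/bin/mpirun \
    --mca btl_tcp_links 4 \
    --mca btl_tcp_if_include bridge0 \
    -np 5 \
    -x PATH=$(pwd)/.venv/bin:$PATH \
    -x PYTHONPATH=$(pwd)/.venv/lib/python3.12/site-packages \
    -x DYLD_LIBRARY_PATH=/opt/homebrew/lib \
    --hostfile ./hosts.txt \
    /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --num-layers 4 \
        --val-batches 1 \
        --learning-rate 5e-5 \
        --steps-per-eval 1000 \
        --save-every 500 \
        --grad-checkpoint \
        --batch-size 25 # 5 samples per node across 5 workers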

@austinbv
Author

Well, maybe it's right:

% /opt/homebrew/bin/uv run mlx_lm.lora \
        --model mlx-community/Llama-3.3-70B-Instruct-4bit \
        --data ./converted \
        --train \
        --iters 1000 \
        --num-layers 4 \
        --val-batches 1 \
        --learning-rate 5e-5 \
        --steps-per-eval 1000 \
        --save-every 500 \
        --grad-checkpoint \
        --batch-size 5
Loading pretrained model
Fetching 13 files: 100%|██████████████████████| 13/13 [00:00<00:00, 143112.73it/s]
Loading datasets
Training
Trainable parameters: 0.023% (16.384M/70553.706M)
Starting training..., iters: 1000
Iter 1: Val loss 5.819, Val took 6.471s
Iter 10: Train loss 3.570, Learning Rate 5.000e-05, It/sec 0.053, Tokens/sec 12.006, Trained Tokens 2265, Peak mem 41.579 GB

surprised

@angeloskath
Member

Yep, that looks right. Keep in mind that if you were to use a distributed batch size of 25 you would likely be close to 60 tps, so practically perfect scaling.

It does sound very low, but it is a 70B-parameter model nonetheless...

50 tps means ~1M tokens every 5 hours, which I think is sufficient to finetune many a dataset overnight.
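
As a rough check on the scaling claim: the single-machine run above gets ~12 tok/sec, and 5 machines × ~12 tok/sec ≈ 60 tok/sec, which is where the "close to 60 tps" figure comes from; likewise 50 tok/sec × 3600 s/hour × 5 hours ≈ 900k tokens, i.e. roughly 1M.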

@austinbv
Author

Thanks for your help! Yeah, the issue is just that my dataset is super small for this one.
