Unable to run Multi-GPU ORPO training on Gemma model #2267

Open
6 of 8 tasks
chimezie opened this issue Jan 17, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@chimezie

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I expect to be able to run multi-GPU (2× 48GB A40s) ORPO training on a google/gemma-2-27b fine-tune using the axolotlai/axolotl-cloud:main-latest Docker image. Due to an issue with transformers and Gemma models, I had to pull the latest transformers from its git branch to pick up the fix (see the sketch below). I expect axolotl to work with that version of transformers.
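
For context, picking up the transformers fix amounted to installing from the main branch inside the container, roughly as follows. This is a sketch only: the exact branch/commit was simply whatever main pointed at that day, and the command is illustrative rather than the verbatim step I ran.

pip install --upgrade git+https://github.com/huggingface/transformers.git@main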

Current behaviour

I'm getting the following traceback:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_kwargs" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_kwargs" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
[2025-01-14 15:44:54,784] [INFO] [datasets.<module>:54] [PID:1673] PyTorch version 2.5.1+cu124 available.
[2025-01-14 15:44:54,863] [INFO] [datasets.<module>:54] [PID:1672] PyTorch version 2.5.1+cu124 available.
[2025-01-14 15:44:55,799] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-14 15:44:55,871] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpe_gpimn3/test.c -o /tmp/tmpe_gpimn3/test.o
[2025-01-14 15:44:55,892] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpe_gpimn3/test.o -laio -o /tmp/tmpe_gpimn3/a.out
[2025-01-14 15:44:55,931] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-14 15:44:56,019] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpi0208ze4/test.c -o /tmp/tmpi0208ze4/test.o
[2025-01-14 15:44:56,039] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpi0208ze4/test.o -laio -o /tmp/tmpi0208ze4/a.out
[2025-01-14 15:44:56,334] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpo8_51lpi/test.c -o /tmp/tmpo8_51lpi/test.o
[2025-01-14 15:44:56,355] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpo8_51lpi/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lc[…]/tmp/tmpo8_51lpi/a.out
[2025-01-14 15:44:56,486] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpjimw45bl/test.c -o /tmp/tmpjimw45bl/test.o
[2025-01-14 15:44:56,506] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpjimw45bl/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lc[…]/tmp/tmpjimw45bl/a.out
/workspace/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/workspace/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
[2025-01-14 15:44:57,358] [DEBUG] [axolotl.normalize_config:87] [PID:1673] [RANK:1] bf16 support detected, enabling for this configuration.
[2025-01-14 15:44:57,502] [INFO] [axolotl.normalize_config:211] [PID:1673] [RANK:1] cuda memory usage baseline: 0.000GB (+0.652GB misc)
[2025-01-14 15:44:57,540] [DEBUG] [axolotl.normalize_config:87] [PID:1672] [RANK:0] bf16 support detected, enabling for this configuration.
[2025-01-14 15:44:57,651] [INFO] [axolotl.normalize_config:211] [PID:1672] [RANK:0] cuda memory usage baseline: 0.000GB (+0.652GB misc)
[rank1]:[W114 15:44:57.325053801 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[2025-01-14 15:44:57,795] [INFO] [axolotl._load_preprocessed_ds:45] [PID:1672] [RANK:0] Loading prepared dataset from disk at /workspace/data/Dibia/orpo/last_run_prepared/5a470795eef38a3841b[…]1fb2...
[rank0]:[W114 15:44:58.554332014 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.

[rank1]:[E114 16:14:58.934051560 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
[rank1]:[E114 16:14:58.935256620 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E114 16:14:58.940127569 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800096 milliseconds before timing out.
[rank0]:[E114 16:14:58.940745885 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[2025-01-14 16:14:58,504] [DEBUG] [axolotl.train.train:47] [PID:1672] [RANK:0] loading tokenizer... google/gemma-2-27b
[2025-01-14 16:14:58,563] [INFO] [axolotl._load_preprocessed_ds:45] [PID:1673] [RANK:1] Loading prepared dataset from disk at /workspace/data/Dibia/orpo/last_run_prepared/5a470795eef38a3841b[…]1fb2...
[rank1]:[E114 16:14:58.306997582 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E114 16:14:58.307026389 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E114 16:14:58.307033610 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E114 16:14:58.308462964 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7634bfab9446 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x763475a19772 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x763475a20bb3 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x763475a2261d in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7634c02285c0 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7634c5a57ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7634c5ae8a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E114 16:14:59.616227190 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E114 16:14:59.616255187 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E114 16:14:59.616262320 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E114 16:14:59.618075672 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78eec60b9446 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x78ee7c019772 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78ee7c020bb3 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78ee7c02261d in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x78eec64ee5c0 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x78eecc028ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x78eecc0b9a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0114 16:14:59.121000 1543 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1672 closing signal SIGTERM
E0114 16:14:59.386000 1543 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 1673) of binary: /root/miniconda3/envs/py3.11/bin/python3

Steps to reproduce

accelerate launch -m axolotl.cli.train /path/to/config.yaml                                                                                 
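
To capture more detail about where the two ranks stall before the 30-minute NCCL timeout, the same command can be run with standard PyTorch/NCCL debug variables enabled. This is an illustrative diagnostic sketch only (NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are upstream NCCL/PyTorch settings, not axolotl options):

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
  accelerate launch -m axolotl.cli.train /path/to/config.yaml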

Config yaml

base_model: google/gemma-2-27b
lora_model_dir: /path/to/previous/axolotl/run/output/

save_safetensors: true

load_in_8bit: false
load_in_4bit: true
strict: false

rl: orpo
orpo_alpha: 0.1

chat_template: gemma
datasets:
  - path: argilla/ultrafeedback-binarized-preferences-cleaned
    type: chat_template.argilla

dataset_prepared_path: /path/to/last_run_prepared
val_set_size: 0.1
output_dir: /path/to/output/orpo-out

adapter: qlora

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00000800

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4  
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

The one in the axolotlai/axolotl-cloud:main-latest Docker image

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
chimezie added the bug (Something isn't working) label on Jan 17, 2025
@chimezie
Author

Related to #2256
