Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I expect to be able to run multi-GPU (2× 48 GB A40) ORPO training on a google/gemma-2-27b fine-tune using the axolotlai/axolotl-cloud:main-latest Docker image. Due to an issue between transformers and Gemma models, I had to install the latest transformers from their git branch to pick up the fix. I expect axolotl to work with this updated transformers.
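For reference, the transformers fix was pulled into the running container with something along the lines of the command below (the exact ref/commit isn't recorded here, so treat this as an approximation):

```bash
# Approximate command used inside the axolotl-cloud container to pick up the
# Gemma fix from the transformers development branch (exact ref not recorded).
pip install --upgrade git+https://github.com/huggingface/transformers.git
```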
Current behaviour
I'm getting the following traceback:
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_kwargs" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_kwargs" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
[2025-01-14 15:44:54,784] [INFO] [datasets.<module>:54] [PID:1673] PyTorch version 2.5.1+cu124 available.
[2025-01-14 15:44:54,863] [INFO] [datasets.<module>:54] [PID:1672] PyTorch version 2.5.1+cu124 available.
[2025-01-14 15:44:55,799] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-14 15:44:55,871] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpe_gpimn3/test.c -o /tmp/tmpe_gpimn3/test.o
[2025-01-14 15:44:55,892] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpe_gpimn3/test.o -laio -o /tmp/tmpe_gpimn3/a.out
[2025-01-14 15:44:55,931] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-14 15:44:56,019] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpi0208ze4/test.c -o /tmp/tmpi0208ze4/test.o
[2025-01-14 15:44:56,039] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpi0208ze4/test.o -laio -o /tmp/tmpi0208ze4/a.out
[2025-01-14 15:44:56,334] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpo8_51lpi/test.c -o /tmp/tmpo8_51lpi/test.o
[2025-01-14 15:44:56,355] [INFO] [root.spawn:60] [PID:1673] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpo8_51lpi/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpo8_51lpi/a.out
[2025-01-14 15:44:56,486] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpjimw45bl/test.c -o /tmp/tmpjimw45bl/test.o
[2025-01-14 15:44:56,506] [INFO] [root.spawn:60] [PID:1672] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpjimw45bl/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpjimw45bl/a.out
/workspace/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/workspace/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
[2025-01-14 15:44:57,358] [DEBUG] [axolotl.normalize_config:87] [PID:1673] [RANK:1] bf16 support detected, enabling for this configuration.
[2025-01-14 15:44:57,502] [INFO] [axolotl.normalize_config:211] [PID:1673] [RANK:1] cuda memory usage baseline: 0.000GB (+0.652GB misc)
[2025-01-14 15:44:57,540] [DEBUG] [axolotl.normalize_config:87] [PID:1672] [RANK:0] bf16 support detected, enabling for this configuration.
[2025-01-14 15:44:57,651] [INFO] [axolotl.normalize_config:211] [PID:1672] [RANK:0] cuda memory usage baseline: 0.000GB (+0.652GB misc)
[rank1]:[W114 15:44:57.325053801 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[2025-01-14 15:44:57,795] [INFO] [axolotl._load_preprocessed_ds:45] [PID:1672] [RANK:0] Loading prepared dataset from disk at /workspace/data/Dibia/orpo/last_run_prepared/5a470795eef38a3841b…1fb2...
[rank0]:[W114 15:44:58.554332014 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[E114 16:14:58.934051560 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
[rank1]:[E114 16:14:58.935256620 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E114 16:14:58.940127569 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800096 milliseconds before timing out.
[rank0]:[E114 16:14:58.940745885 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[2025-01-14 16:14:58,504] [DEBUG] [axolotl.train.train:47] [PID:1672] [RANK:0] loading tokenizer... google/gemma-2-27b
[2025-01-14 16:14:58,563] [INFO] [axolotl._load_preprocessed_ds:45] [PID:1673] [RANK:1] Loading prepared dataset from disk at /workspace/data/Dibia/orpo/last_run_prepared/5a470795eef38a3841b…1fb2...
[rank1]:[E114 16:14:58.306997582 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E114 16:14:58.307026389 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E114 16:14:58.307033610 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E114 16:14:58.308462964 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7634bfab9446 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x763475a19772 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x763475a20bb3 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x763475a2261d in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7634c02285c0 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7634c5a57ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7634c5ae8a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E114 16:14:59.616227190 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E114 16:14:59.616255187 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E114 16:14:59.616262320 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E114 16:14:59.618075672 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78eec60b9446 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x78ee7c019772 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78ee7c020bb3 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78ee7c02261d in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x78eec64ee5c0 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x78eecc028ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x78eecc0b9a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
W0114 16:14:59.121000 1543 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1672 closing signal SIGTERM
E0114 16:14:59.386000 1543 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 1673) of binary: /root/miniconda3/envs/py3.11/bin/python3
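For what it's worth, re-running with the `accelerate launch` defaults made explicit (as the warning at the top of the log suggests) and with NCCL debug logging enabled looks roughly like the sketch below; the config path is a placeholder, not the actual file used here:

```bash
# Sketch only: make the accelerate defaults explicit and enable NCCL debug output
# to get more detail on the hanging ALLREDUCE. /workspace/config.yml is a placeholder.
NCCL_DEBUG=INFO accelerate launch \
  --num_processes 2 \
  --num_machines 1 \
  --mixed_precision bf16 \
  --dynamo_backend no \
  -m axolotl.cli.train /workspace/config.yml
```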
Steps to reproduce
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
The one in the axolotlai/axolotl-cloud:main-latest docker image
Acknowledgements