Open
Labels
Contributions Welcome (we welcome contributions to fix this issue!), Medium Priority (will be worked on after all high priority issues), Optimizers (issues or feature requests relating to optimizers)
Description
Hello,
I'm running bitsandbytes==0.41.1 in a Python 3.10 miniconda environment on 8x A100 GPUs (using accelerate for multi-GPU), with CUDA 12.2.
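For anyone trying to compare environments, here is a minimal sketch of how the package versions listed above could be collected. It uses only the standard library; the helper name `collect_versions` and the default package list are my own choices for illustration, not part of bitsandbytes or accelerate, and it does not report the CUDA or driver version.

```python
import platform
from importlib import metadata


def collect_versions(packages=("bitsandbytes", "accelerate", "torch")):
    """Map each package name to its installed version, or None if it is not installed."""
    # Hypothetical helper for illustration; not an API of any of the listed libraries.
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same conda environment used for training should give the versions that are actually picked up at launch time.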
I'm having problems resuming training (DPO) from a checkpoint:
Error invalid argument at line 393 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3777589) of binary: /mnt/miniconda3/envs/synlm/bin/python3.10
Traceback (most recent call last):
  File "/mnt/miniconda3/envs/synlm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I'm not quite sure how to debug this error. Any ideas?