>=4 nodes (4*4 GPUs): training hangs at zero_first #2275
Comments
Thanks for the report. Which barrier did you delete? It seems the last error you got was related to eval. Would you be able to do a small experiment and see whether you can run without evaluation?
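For reference, "run without evaluation" here means skipping the Trainer's evaluation loop entirely; at the transformers level that corresponds to the switch sketched below (in axolotl this is driven from the YAML config, e.g. by not configuring an eval split — the exact config key to flip is an assumption worth checking against your schema):

```python
# hypothetical illustration of the suggested experiment at the transformers
# level -- axolotl derives its TrainingArguments from the YAML config
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./qwen_lora_stage1",  # reusing the reporter's output dir
    eval_strategy="no",  # never enter the evaluation loop whose traceback appears below
)
```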
I modified torch and accelerate to work around this bug.
`export NCCL_IB_HCA=mlx5_0,mlx5_1` resolves the hang at this place even without deleting `barrier()`, but there is a further hang afterwards, inside the accelerate library.
1. With 16 GPUs, the ranks return: True False False False False False False False False False False False False False False False.
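For context, one way to inspect the True/False pattern above is to gather every rank's view of its local-main flag (a hypothetical debug script, not part of axolotl; launch it with the same accelerate command as the failing job). If I read accelerate's semantics correctly, a healthy 4-node x 4-GPU job should report four True values, one per node, so a single True out of 16 may itself indicate a node/local-rank misconfiguration:

```python
# hypothetical debug script: compare each rank's is_local_main_process flag
import torch.distributed as dist
from accelerate import PartialState

state = PartialState()  # initializes torch.distributed and sets the CUDA device
flags = [None] * state.num_processes
# every rank contributes its flag; all ranks receive the full list
dist.all_gather_object(flags, state.is_local_main_process)
if state.is_main_process:
    print(flags)  # on 4 nodes x 4 GPUs, expect True at ranks 0, 4, 8, 12
```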
Please check that this issue hasn't been reported before.
Expected Behavior
Training should not hang and should proceed normally.
Current behaviour
accelerate hangs here, in axolotl/src/axolotl/utils/data/sft.py:

```python
with zero_first(is_local_main_process()):
```

After deleting `barrier()` in `zero_first(is_main)`, it then hangs here, in /usr/local/lib/python3.10/site-packages/transformers/trainer.py:

```python
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
    self.model, self.optimizer, self.lr_scheduler
)
```
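For reference, `zero_first` is (paraphrasing from memory of `src/axolotl/utils/distributed.py`; the exact code in your checkout may differ) a small context manager that lets the main process run the wrapped block first while every other rank waits on a barrier:

```python
from contextlib import contextmanager
import torch.distributed as dist

@contextmanager
def zero_first(is_main: bool):
    """Run the wrapped block on the main process first, then on the others."""
    if not is_main:
        dist.barrier()  # non-main ranks block until main finishes the body
    yield
    if is_main:
        dist.barrier()  # main releases the waiting ranks once it is done
```

If the ranks can never complete this barrier (for example because NCCL selected the wrong InfiniBand interface), the job hangs here first; deleting the `barrier()` does not fix anything, it only defers the hang to the next collective call, which is exactly the `accelerator.prepare` shown above.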
After adding

```bash
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_HCA=mlx5_0,mlx5_1
```

I then get these errors:

```
[rank14]: Traceback (most recent call last):
[rank14]: File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank14]: return _run_code(code, main_globals, None,
[rank14]: File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
[rank14]: exec(code, run_globals)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 71, in
[rank14]: fire.Fire(do_cli)
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank14]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank14]: component, remaining_args = _CallAndUpdateTrace(
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank14]: component = fn(*varargs, **kwargs)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 66, in do_cli
[rank14]: do_train(parsed_cfg, parsed_cli_args)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 42, in do_train
[rank14]: model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank14]: File "/home/axolotl/src/axolotl/train.py", line 185, in train
[rank14]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank14]: return inner_training_loop(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2592, in _inner_training_loop
[rank14]: self._maybe_log_save_evaluate(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3050, in _maybe_log_save_evaluate
[rank14]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3004, in _evaluate
[rank14]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
[rank14]: output = eval_loop(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4256, in evaluation_loop
[rank14]: losses = self.gather_function((losses.repeat(batch_size)))
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank14]: data = self.gather(input_data)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2456, in gather
[rank14]: return gather(tensor)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 384, in wrapper
[rank14]: output = gather_object([shapes])
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 459, in gather_object
[rank14]: return _gpu_gather_object(object)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 440, in _gpu_gather_object
[rank14]: torch.distributed.all_gather_object(output_objects, object)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank14]: return func(*args, **kwargs)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2453, in all_gather_object
[rank14]: object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank14]: return _unpickler(io.BytesIO(buf)).load()
[rank14]: _pickle.UnpicklingError: invalid load key, '\n'.
[rank13]: Traceback (most recent call last):
[rank13]: File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank13]: return _run_code(code, main_globals, None,
[rank13]: File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
[rank13]: exec(code, run_globals)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 71, in
[rank13]: fire.Fire(do_cli)
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank13]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank13]: component, remaining_args = _CallAndUpdateTrace(
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank13]: component = fn(*varargs, **kwargs)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 66, in do_cli
[rank13]: do_train(parsed_cfg, parsed_cli_args)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 42, in do_train
[rank13]: model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank13]: File "/home/axolotl/src/axolotl/train.py", line 185, in train
[rank13]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank13]: return inner_training_loop(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2592, in _inner_training_loop
[rank13]: self._maybe_log_save_evaluate(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3050, in _maybe_log_save_evaluate
[rank13]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3004, in _evaluate
[rank13]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
[rank13]: output = eval_loop(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4256, in evaluation_loop
[rank13]: losses = self.gather_function((losses.repeat(batch_size)))
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank13]: data = self.gather(input_data)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2456, in gather
[rank13]: return gather(tensor)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 384, in wrapper
[rank13]: output = gather_object([shapes])
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 459, in gather_object
[rank13]: return _gpu_gather_object(object)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 440, in _gpu_gather_object
[rank13]: torch.distributed.all_gather_object(output_objects, object)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank13]: return func(*args, **kwargs)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2453, in all_gather_object
[rank13]: object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank13]: return _unpickler(io.BytesIO(buf)).load()
[rank13]: _pickle.UnpicklingError: invalid load key, '\n'.
```
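An `UnpicklingError: invalid load key` from `all_gather_object` suggests the gathered byte buffers arrived corrupted or truncated in flight rather than a pickle-level bug, so it can help to exercise the same pickle -> all-gather -> unpickle path outside the trainer. A minimal sketch (hypothetical standalone script; launch it across the same 4 nodes, e.g. with torchrun):

```python
# hypothetical smoke test: exercises the code path that failed in
# evaluation_loop, independent of transformers/axolotl
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
# bind each process to its local GPU (LOCAL_RANK is set by the launcher)
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

gathered = [None] * world_size
dist.all_gather_object(gathered, {"rank": rank, "payload": "x" * (rank + 1)})

if rank == 0:
    print(gathered)  # expect one dict per rank, with growing payloads
dist.destroy_process_group()
```

If this script also fails or hangs across nodes, the problem lies in the NCCL/InfiniBand setup rather than in axolotl or transformers.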
Steps to reproduce
```bash
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=10800
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_HCA=mlx5_0,mlx5_1

if [ -d ./qwen_lora_stage1 ]; then
    rm -rf ./qwen_lora_stage1
fi
mkdir -p ./qwen_lora_stage1

accelerate launch -m --config_file accelerate_config3.yaml axolotl.cli.train examples/medusa/qwen25_72b_lora_stage1.yml
```
DeepSpeed ZeRO-3 is enabled.
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
Python 3.10
axolotl branch-commit
af727ee
Acknowledgements