>=4-nodes(4*4gpu) training hangs at zero_first #2275

Open
sankexin opened this issue Jan 22, 2025 · 3 comments
Labels
bug Something isn't working

Comments

sankexin commented Jan 22, 2025

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should not hang and should proceed normally.

Current behaviour

Accelerate hangs here:

axolotl/src/axolotl/utils/data/sft.py:
with zero_first(is_local_main_process()):
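
For anyone reproducing this, a minimal debug sketch (not part of axolotl) to log which ranks actually reach this point; it assumes the RANK and LOCAL_RANK environment variables set by accelerate launch / torchrun:

```python
# Debug sketch: drop this right before the zero_first(...) call to see which
# ranks arrive there and when. RANK/LOCAL_RANK are set by the launcher.
import datetime
import os

rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(
    f"[{datetime.datetime.now().isoformat()}] "
    f"rank={rank} local_rank={local_rank} reached zero_first",
    flush=True,
)
```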

After deleting barrier() from zero_first(is_main), it then hangs here:

/usr/local/lib/python3.10/site-packages/transformers/trainer.py:
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
self.model, self.optimizer, self.lr_scheduler
)
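
When it hangs there, a per-process stack dump shows which call each rank is blocked in. A small stdlib-only sketch (a hypothetical addition near the top of the training entrypoint, not an axolotl feature):

```python
# Debug sketch: every 300 s, dump all Python thread stacks of this process to
# stderr, so a rank stuck in accelerator.prepare() or a NCCL collective shows
# exactly where it is blocked.
import faulthandler
import sys

faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)
```

Alternatively, `py-spy dump --pid <pid>` on each node gives the same information without modifying any code.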

After adding:
```
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_HCA=mlx5_0,mlx5_1
```

I then get these errors:
```
[rank14]: Traceback (most recent call last):
[rank14]: File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank14]: return _run_code(code, main_globals, None,
[rank14]: File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
[rank14]: exec(code, run_globals)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 71, in
[rank14]: fire.Fire(do_cli)
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank14]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank14]: component, remaining_args = _CallAndUpdateTrace(
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank14]: component = fn(*varargs, **kwargs)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 66, in do_cli
[rank14]: do_train(parsed_cfg, parsed_cli_args)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 42, in do_train
[rank14]: model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank14]: File "/home/axolotl/src/axolotl/train.py", line 185, in train
[rank14]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank14]: return inner_training_loop(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2592, in _inner_training_loop
[rank14]: self._maybe_log_save_evaluate(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3050, in _maybe_log_save_evaluate
[rank14]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3004, in _evaluate
[rank14]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
[rank14]: output = eval_loop(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4256, in evaluation_loop
[rank14]: losses = self.gather_function((losses.repeat(batch_size)))
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank14]: data = self.gather(input_data)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2456, in gather
[rank14]: return gather(tensor)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 384, in wrapper
[rank14]: output = gather_object([shapes])
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 459, in gather_object
[rank14]: return _gpu_gather_object(object)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 440, in _gpu_gather_object
[rank14]: torch.distributed.all_gather_object(output_objects, object)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank14]: return func(*args, **kwargs)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2453, in all_gather_object
[rank14]: object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank14]: return _unpickler(io.BytesIO(buf)).load()
[rank14]: _pickle.UnpicklingError: invalid load key, ''.
[rank13]: Traceback (most recent call last):
[rank13]: File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank13]: return _run_code(code, main_globals, None,
[rank13]: File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
[rank13]: exec(code, run_globals)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 71, in
[rank13]: fire.Fire(do_cli)
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank13]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank13]: component, remaining_args = _CallAndUpdateTrace(
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank13]: component = fn(*varargs, **kwargs)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 66, in do_cli
[rank13]: do_train(parsed_cfg, parsed_cli_args)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 42, in do_train
[rank13]: model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank13]: File "/home/axolotl/src/axolotl/train.py", line 185, in train
[rank13]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank13]: return inner_training_loop(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2592, in _inner_training_loop
[rank13]: self._maybe_log_save_evaluate(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3050, in _maybe_log_save_evaluate
[rank13]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3004, in _evaluate
[rank13]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
[rank13]: output = eval_loop(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4256, in evaluation_loop
[rank13]: losses = self.gather_function((losses.repeat(batch_size)))
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank13]: data = self.gather(input_data)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2456, in gather
[rank13]: return gather(tensor)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 384, in wrapper
[rank13]: output = gather_object([shapes])
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 459, in gather_object
[rank13]: return _gpu_gather_object(object)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 440, in _gpu_gather_object
[rank13]: torch.distributed.all_gather_object(output_objects, object)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank13]: return func(*args, **kwargs)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2453, in all_gather_object
[rank13]: object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank13]: return _unpickler(io.BytesIO(buf)).load()
[rank13]: _pickle.UnpicklingError: invalid load key, '
'.
```
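
The UnpicklingError above suggests the bytes received by all_gather_object were corrupted or truncated in transit, which points at the NCCL/IB transport rather than at axolotl or transformers themselves. A standalone smoke test can exercise the same collective in isolation; a rough sketch (hypothetical script and filename), launched with torchrun on the same nodes and NCCL settings:

```python
# smoke_test_gather.py -- standalone check of torch.distributed.all_gather_object
# over the same NCCL/IB setup, independent of axolotl/transformers/deepspeed.
# Launch with torchrun, e.g.:
#   torchrun --nnodes=4 --nproc_per_node=4 ... smoke_test_gather.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Same collective that fails in the traceback above.
    payload = {"rank": rank, "shape": [rank + 1, 4]}
    gathered = [None] * world_size
    dist.all_gather_object(gathered, payload)

    if rank == 0:
        print("all_gather_object ok:", gathered)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```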

Steps to reproduce

```bash
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=10800
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_HCA=mlx5_0,mlx5_1

if [ -d ./qwen_lora_stage1 ]; then
    rm -rf ./qwen_lora_stage1
    mkdir -p ./qwen_lora_stage1
else
    mkdir -p ./qwen_lora_stage1
fi

accelerate launch -m --config_file accelerate_config3.yaml axolotl.cli.train examples/medusa/qwen25_72b_lora_stage1.yml
```

DeepSpeed ZeRO-3 is used.

Config yaml

Possible solution

No response

Which Operating Systems are you using?

  • Linux

Python Version

3.10

axolotl branch-commit

af727ee

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
sankexin added the bug label on Jan 22, 2025
sankexin changed the title from "4-node(4*4gpu) training hangs at zero_first" to ">=4-nodes(4*4gpu) training hangs at zero_first" on Jan 22, 2025
NanoCode012 (Collaborator) commented:

Thanks for the report.

> after delete barrier() in zero_first(is_main)

Which barrier did you delete? Could you perhaps add a log to check whether is_local_main_process() is behaving correctly on each node? Each node should have a master.

It seems the last error you got was related to eval. Would you be able to do a small experiment and see whether you can run without evaluation (via eval_strategy: no)?

sankexin (Author) commented:

I modified torch and accelerate to work around this bug:

/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py: buf = pickle.dumps(tensor.numpy()[:tensor_size])
/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py: # @verify_operation
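
For what it's worth, the same experiment can be tried without editing files under site-packages, by monkey-patching at runtime from the launch script. This is a rough debugging sketch only: the private name _tensor_to_object and its signature are taken from the traceback above and may differ between torch versions, and it logs the received buffer rather than fixing the corruption.

```python
# Debug sketch (not a fix): wrap torch's private _tensor_to_object so the raw
# received buffer is logged before unpickling. A healthy pickle buffer normally
# starts with b'\x80'; anything else means the bytes arrived corrupted/padded.
import torch.distributed.distributed_c10d as c10d

_orig_tensor_to_object = c10d._tensor_to_object


def _logged_tensor_to_object(tensor, tensor_size, *args, **kwargs):
    head = bytes(tensor.cpu().numpy().tobytes()[:8])
    print(
        f"[debug] object_size={tensor_size} buffer_elems={tensor.numel()} head={head!r}",
        flush=True,
    )
    return _orig_tensor_to_object(tensor, tensor_size, *args, **kwargs)


c10d._tensor_to_object = _logged_tensor_to_object
```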

sankexin (Author) commented Jan 24, 2025:

> Which barrier did you delete? Could you perhaps add a log to check whether is_local_main_process() is behaving correctly on each node? Each node should have a master.
>
> It seems the last error you got was related to eval. Would you be able to do a small experiment and see whether you can run without evaluation (via eval_strategy: no)?

I changed zero_first from:

def zero_first(is_main):
    """
    runs the wrapped context so that rank 0 runs first before other ranks
    """
    if not is_main:  # other ranks wait first
        barrier()
    yield
    if is_main:  # then rank 0 waits after it has run the context
        barrier()

to:

def zero_first(is_main):
    """
    runs the wrapped context so that rank 0 runs first before other ranks
    """
    yield

Setting export NCCL_IB_HCA=mlx5_0,mlx5_1 resolves the hang at this spot even without deleting barrier(), but there is still a later hang inside the accelerate library.

1. With 16 GPUs, is_local_main_process() returns: True False False False False False False False False False False False False False False False (only global rank 0 is True; see the sketch below).
2. Yes, it is the evaluation in transformers.
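
If is_local_main_process() is meant to be a per-node check (as the "each node should have a master" comment above suggests), then on 4 nodes x 4 GPUs one would expect four True values, one per node, rather than a single True on global rank 0. A minimal sketch of that expectation (hypothetical check, relying on the RANK/LOCAL_RANK env vars set by accelerate launch / torchrun):

```python
# Hypothetical per-node check: exactly one process per node (LOCAL_RANK == 0)
# should report True here, i.e. 4 True values across 16 ranks on 4x4 GPUs.
import os

rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(
    f"rank={rank} local_rank={local_rank} expected_local_main={local_rank == 0}",
    flush=True,
)
```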
