>=4-nodes(4*4gpu) training hangs at zero_first #2275

Open
sankexin opened this issue Jan 22, 2025 · 3 comments
Labels
bug Something isn't working

Comments

sankexin commented Jan 22, 2025

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should not hang and should proceed normally.

Current behaviour

Accelerate hangs here:

axolotl/src/axolotl/utils/data/sft.py:
with zero_first(is_local_main_process()):
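
For anyone reproducing this, a minimal debug sketch (not part of axolotl) to log which ranks actually reach this point; it assumes the RANK and LOCAL_RANK environment variables set by accelerate launch / torchrun:

```python
# Debug sketch: drop this right before the zero_first(...) call to see which
# ranks arrive there and when. RANK/LOCAL_RANK are set by the launcher.
import datetime
import os

rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(
    f"[{datetime.datetime.now().isoformat()}] "
    f"rank={rank} local_rank={local_rank} reached zero_first",
    flush=True,
)
```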

After deleting barrier() from zero_first(is_main), it then hangs here:

/usr/local/lib/python3.10/site-packages/transformers/trainer.py:
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
self.model, self.optimizer, self.lr_scheduler
)
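
When it hangs there, a per-process stack dump shows which call each rank is blocked in. A small stdlib-only sketch (a hypothetical addition near the top of the training entrypoint, not an axolotl feature):

```python
# Debug sketch: every 300 s, dump all Python thread stacks of this process to
# stderr, so a rank stuck in accelerator.prepare() or a NCCL collective shows
# exactly where it is blocked.
import faulthandler
import sys

faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)
```

Alternatively, `py-spy dump --pid <pid>` on each node gives the same information without modifying any code.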

After adding:
```
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_HCA=mlx5_0,mlx5_1
```

I then get these errors:
```
[rank14]: Traceback (most recent call last):
[rank14]: File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank14]: return _run_code(code, main_globals, None,
[rank14]: File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
[rank14]: exec(code, run_globals)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 71, in
[rank14]: fire.Fire(do_cli)
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank14]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank14]: component, remaining_args = _CallAndUpdateTrace(
[rank14]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank14]: component = fn(*varargs, **kwargs)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 66, in do_cli
[rank14]: do_train(parsed_cfg, parsed_cli_args)
[rank14]: File "/home/axolotl/src/axolotl/cli/train.py", line 42, in do_train
[rank14]: model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank14]: File "/home/axolotl/src/axolotl/train.py", line 185, in train
[rank14]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank14]: return inner_training_loop(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2592, in _inner_training_loop
[rank14]: self._maybe_log_save_evaluate(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3050, in _maybe_log_save_evaluate
[rank14]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3004, in _evaluate
[rank14]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
[rank14]: output = eval_loop(
[rank14]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4256, in evaluation_loop
[rank14]: losses = self.gather_function((losses.repeat(batch_size)))
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank14]: data = self.gather(input_data)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2456, in gather
[rank14]: return gather(tensor)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 384, in wrapper
[rank14]: output = gather_object([shapes])
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 459, in gather_object
[rank14]: return _gpu_gather_object(object)
[rank14]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 440, in _gpu_gather_object
[rank14]: torch.distributed.all_gather_object(output_objects, object)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank14]: return func(*args, **kwargs)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2453, in all_gather_object
[rank14]: object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank14]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank14]: return _unpickler(io.BytesIO(buf)).load()
[rank14]: _pickle.UnpicklingError: invalid load key, ''.
[rank13]: Traceback (most recent call last):
[rank13]: File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank13]: return _run_code(code, main_globals, None,
[rank13]: File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
[rank13]: exec(code, run_globals)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 71, in
[rank13]: fire.Fire(do_cli)
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank13]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank13]: component, remaining_args = _CallAndUpdateTrace(
[rank13]: File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank13]: component = fn(*varargs, **kwargs)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 66, in do_cli
[rank13]: do_train(parsed_cfg, parsed_cli_args)
[rank13]: File "/home/axolotl/src/axolotl/cli/train.py", line 42, in do_train
[rank13]: model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank13]: File "/home/axolotl/src/axolotl/train.py", line 185, in train
[rank13]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank13]: return inner_training_loop(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2592, in _inner_training_loop
[rank13]: self._maybe_log_save_evaluate(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3050, in _maybe_log_save_evaluate
[rank13]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3004, in _evaluate
[rank13]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
[rank13]: output = eval_loop(
[rank13]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4256, in evaluation_loop
[rank13]: losses = self.gather_function((losses.repeat(batch_size)))
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank13]: data = self.gather(input_data)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2456, in gather
[rank13]: return gather(tensor)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 384, in wrapper
[rank13]: output = gather_object([shapes])
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 459, in gather_object
[rank13]: return _gpu_gather_object(object)
[rank13]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 440, in _gpu_gather_object
[rank13]: torch.distributed.all_gather_object(output_objects, object)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank13]: return func(*args, **kwargs)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2453, in all_gather_object
[rank13]: object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank13]: File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank13]: return _unpickler(io.BytesIO(buf)).load()
[rank13]: _pickle.UnpicklingError: invalid load key, '
'.
```
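
The UnpicklingError above suggests the bytes received by all_gather_object were corrupted or truncated in transit, which points at the NCCL/IB transport rather than at axolotl or transformers themselves. A standalone smoke test can exercise the same collective in isolation; a rough sketch (hypothetical script and filename), launched with torchrun on the same nodes and NCCL settings:

```python
# smoke_test_gather.py -- standalone check of torch.distributed.all_gather_object
# over the same NCCL/IB setup, independent of axolotl/transformers/deepspeed.
# Launch with torchrun, e.g.:
#   torchrun --nnodes=4 --nproc_per_node=4 ... smoke_test_gather.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Same collective that fails in the traceback above.
    payload = {"rank": rank, "shape": [rank + 1, 4]}
    gathered = [None] * world_size
    dist.all_gather_object(gathered, payload)

    if rank == 0:
        print("all_gather_object ok:", gathered)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```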

Steps to reproduce

```bash
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=10800
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_HCA=mlx5_0,mlx5_1

if [ -d ./qwen_lora_stage1 ]; then
    rm -rf ./qwen_lora_stage1
    mkdir -p ./qwen_lora_stage1
else
    mkdir -p ./qwen_lora_stage1
fi

accelerate launch -m --config_file accelerate_config3.yaml axolotl.cli.train examples/medusa/qwen25_72b_lora_stage1.yml
```

DeepSpeed ZeRO-3 is used.

Config yaml

Possible solution

No response

Which Operating Systems are you using?

  • Linux

Python Version

3.10

axolotl branch-commit

af727ee

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
sankexin added the bug label on Jan 22, 2025
sankexin changed the title from "4-node(4*4gpu) training hangs at zero_first" to ">=4-nodes(4*4gpu) training hangs at zero_first" on Jan 22, 2025
NanoCode012 (Collaborator) commented:

Thanks for the report.

> after delete barrier() in zero_first(is_main)

Which barrier did you delete? Could you perhaps add a log to check whether is_local_main_process() is behaving correctly on each node? Each node should have a master.

It seems the last error you got was related to eval. Would you be able to do a small experiment and see whether you can run without evaluation (via eval_strategy: no)?

sankexin (Author) commented:

I modified torch and accelerate to work around this bug:

/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py: buf = pickle.dumps(tensor.numpy()[:tensor_size])
/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py: # @verify_operation
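
For what it's worth, the same experiment can be tried without editing files under site-packages, by monkey-patching at runtime from the launch script. This is a rough debugging sketch only: the private name _tensor_to_object and its signature are taken from the traceback above and may differ between torch versions, and it logs the received buffer rather than fixing the corruption.

```python
# Debug sketch (not a fix): wrap torch's private _tensor_to_object so the raw
# received buffer is logged before unpickling. A healthy pickle buffer normally
# starts with b'\x80'; anything else means the bytes arrived corrupted/padded.
import torch.distributed.distributed_c10d as c10d

_orig_tensor_to_object = c10d._tensor_to_object


def _logged_tensor_to_object(tensor, tensor_size, *args, **kwargs):
    head = bytes(tensor.cpu().numpy().tobytes()[:8])
    print(
        f"[debug] object_size={tensor_size} buffer_elems={tensor.numel()} head={head!r}",
        flush=True,
    )
    return _orig_tensor_to_object(tensor, tensor_size, *args, **kwargs)


c10d._tensor_to_object = _logged_tensor_to_object
```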

sankexin (Author) commented Jan 24, 2025:

> Which barrier did you delete? Could you perhaps add a log to check whether is_local_main_process() is behaving correctly on each node? Each node should have a master.
>
> It seems the last error you got was related to eval. Would you be able to do a small experiment and see whether you can run without evaluation (via eval_strategy: no)?

I changed zero_first from:

def zero_first(is_main):
    """
    runs the wrapped context so that rank 0 runs first before other ranks
    """
    if not is_main:  # other ranks wait first
        barrier()
    yield
    if is_main:  # then rank 0 waits after it has run the context
        barrier()

to:

def zero_first(is_main):
    """
    runs the wrapped context so that rank 0 runs first before other ranks
    """
    yield

Setting export NCCL_IB_HCA=mlx5_0,mlx5_1 resolves the hang at this spot even without deleting barrier(), but there is still a later hang inside the accelerate library.

1. With 16 GPUs, is_local_main_process() returns: True False False False False False False False False False False False False False False False (only global rank 0 is True; see the sketch below).
2. Yes, it is the evaluation in transformers.
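
If is_local_main_process() is meant to be a per-node check (as the "each node should have a master" comment above suggests), then on 4 nodes x 4 GPUs one would expect four True values, one per node, rather than a single True on global rank 0. A minimal sketch of that expectation (hypothetical check, relying on the RANK/LOCAL_RANK env vars set by accelerate launch / torchrun):

```python
# Hypothetical per-node check: exactly one process per node (LOCAL_RANK == 0)
# should report True here, i.e. 4 True values across 16 ranks on 4x4 GPUs.
import os

rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(
    f"rank={rank} local_rank={local_rank} expected_local_main={local_rank == 0}",
    flush=True,
)
```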
