Training error #9

Open
hw-liang opened this issue Jul 13, 2024 · 4 comments

@hw-liang

When I tried training with one node and one GPU, I always hit the following error:

[2024-07-13 12:20:37,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 2046131) of binary: /data/conda_envs/360dvd/bin/python
Traceback (most recent call last):
  File "/data/conda_envs/360dvd/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========================================================
train.py FAILED
---------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-13_12:20:37
  host       : moss
  rank       : 0 (local_rank: 0)
  exitcode   : -11 (pid: 2046131)
  error_file : <N/A>
  traceback  : Signal 11 (SIGSEGV) received by PID 2046131
=========================================================

Any clues about how to solve it?
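
Exit code -11 means the worker process died with SIGSEGV before any Python traceback could be printed. One way to narrow it down is to run the imports from train.py one at a time with faulthandler enabled; the sketch below assumes the crash happens at import time, and the module list is a guess rather than the repository's exact imports:

# isolate_segfault.py -- hedged sketch; adjust the imports to match the top of train.py
import faulthandler
faulthandler.enable()  # prints a Python traceback if a later import segfaults

import torch
print("torch OK:", torch.__version__)

from decord import VideoReader  # a common culprit, see the later comments in this thread
print("decord OK")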

@Bryant-Teng

I get the same error. My guess is that it is an environment problem. Can anyone provide any ideas or suggestions?
(360dvd) root@autodl-container-8d2745b08b-cf3d6ddc:~/autodl-tmp/360/360DVD-main# CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train.py --config configs/training/training.yaml
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 8086) of binary: /root/miniconda3/envs/360dvd/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/360dvd/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

@ecustfry

I also get the same error. If you have solved it, I hope you can help me. Thank you.

@losehu
Copy link

losehu commented Jul 25, 2024

I encountered an error while running the code shown in the first image. However, the code in the second image works perfectly fine. Upon investigation, I discovered that the line from decord import VideoReader was causing the issue.

[First screenshot: the code that triggers the error]

[Second screenshot: the code that runs without error]

I referred to this GitHub issue and made some changes as suggested. You can try this solution as well.
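
For reference, the workaround usually suggested for this decord segfault is to make sure torch is imported before decord, or to defer the decord import until a video is actually read. The sketch below only illustrates that idea under those assumptions; read_video_frames is a made-up helper, not the repository's actual code:

import torch  # importing torch before decord avoids the segfault in some environments

def read_video_frames(path, indices):
    # Lazy import: decord is only loaded once a video is actually read,
    # after torch (and the libraries it bundles) have been initialized.
    from decord import VideoReader, cpu

    vr = VideoReader(path, ctx=cpu(0))
    frames = vr.get_batch(indices).asnumpy()  # (len(indices), H, W, 3) uint8 array
    return torch.from_numpy(frames)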

@ecustfry

Thanks for your advice; I solved the problem successfully!
