Training error #9

Open
hw-liang opened this issue Jul 13, 2024 · 4 comments

@hw-liang

When I tried training with one node and one GPU, I always hit the following error:

[2024-07-13 12:20:37,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 2046131) of binary: /data/conda_envs/360dvd/bin/python
Traceback (most recent call last):
  File "/data/conda_envs/360dvd/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========================================================
train.py FAILED
---------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-13_12:20:37
  host       : moss
  rank       : 0 (local_rank: 0)
  exitcode   : -11 (pid: 2046131)
  error_file : <N/A>
  traceback  : Signal 11 (SIGSEGV) received by PID 2046131
=========================================================

Any clues about how to solve it?
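
Exit code -11 means the worker process died with SIGSEGV before any Python traceback could be printed. One way to narrow it down is to run the imports from train.py one at a time with faulthandler enabled; the sketch below assumes the crash happens at import time, and the module list is a guess rather than the repository's exact imports:

# isolate_segfault.py -- hedged sketch; adjust the imports to match the top of train.py
import faulthandler
faulthandler.enable()  # prints a Python traceback if a later import segfaults

import torch
print("torch OK:", torch.__version__)

from decord import VideoReader  # a common culprit, see the later comments in this thread
print("decord OK")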

@Bryant-Teng

I get the same error. My guess is that it is an environment problem. Can anyone provide any ideas or suggestions?
(360dvd) root@autodl-container-8d2745b08b-cf3d6ddc:~/autodl-tmp/360/360DVD-main# CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train.py --config configs/training/training.yaml
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 8086) of binary: /root/miniconda3/envs/360dvd/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/360dvd/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

@ecustfry

I also get the same error. If you have solved it, I hope you can help me. Thank you.

@losehu
Copy link

losehu commented Jul 25, 2024

I encountered an error while running the code shown in the first image. However, the code in the second image works perfectly fine. Upon investigation, I discovered that the line from decord import VideoReader was causing the issue.

[First screenshot: the code that triggers the error]

[Second screenshot: the code that runs without error]

I referred to this GitHub issue and made some changes as suggested. You can try this solution as well.
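
For reference, the workaround usually suggested for this decord segfault is to make sure torch is imported before decord, or to defer the decord import until a video is actually read. The sketch below only illustrates that idea under those assumptions; read_video_frames is a made-up helper, not the repository's actual code:

import torch  # importing torch before decord avoids the segfault in some environments

def read_video_frames(path, indices):
    # Lazy import: decord is only loaded once a video is actually read,
    # after torch (and the libraries it bundles) have been initialized.
    from decord import VideoReader, cpu

    vr = VideoReader(path, ctx=cpu(0))
    frames = vr.get_batch(indices).asnumpy()  # (len(indices), H, W, 3) uint8 array
    return torch.from_numpy(frames)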

@ecustfry

Thanks for your advice; I solved the problem successfully!
