Training error #9
Comments
I get the same error. I suspect it is an environment issue. Can anyone offer an idea or suggestion?
I also get the same error. If you have solved it, I hope you can help me, thank you.
I encountered an error while running the code shown in the first image, while the code in the second image works perfectly fine. Upon investigation, I found that the line `from decord import VideoReader` was causing the issue. I referred to this GitHub issue and made the changes suggested there. You can try this solution as well; a sketch is shown below.
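For reference, a minimal sketch of one commonly reported workaround for decord-related segfaults: import `torch` first and load decord lazily inside the reader function. This is an assumption about the kind of change suggested in that issue, not the confirmed fix, and the function name `load_video_frames` is hypothetical.

```python
# Sketch (assumption): defer the decord import so it happens after torch has been
# loaded, instead of importing VideoReader at module import time.
import torch  # imported before decord on purpose

def load_video_frames(path, indices):
    """Read selected frames from a video with decord, importing it lazily."""
    from decord import VideoReader, cpu  # lazy import to avoid import-order segfaults
    vr = VideoReader(path, ctx=cpu(0))
    frames = vr.get_batch(indices).asnumpy()  # (len(indices), H, W, 3) uint8 array
    return torch.from_numpy(frames)
```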
Thanks for your advice, I solved the problem successfully!
When I tried training with one node and one GPU, I always got the following error:
```
[2024-07-13 12:20:37,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 2046131) of binary: /data/conda_envs/360dvd/bin/python
Traceback (most recent call last):
  File "/data/conda_envs/360dvd/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/conda_envs/360dvd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========================================================
train.py FAILED
---------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-13_12:20:37
  host      : moss
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 2046131)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2046131
=========================================================
```
Any clues about how to solve it?
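Since the thread above points at the decord import as one possible cause of this SIGSEGV, here is a minimal diagnostic sketch (an assumption, not part of the original report): run it in the same conda environment with `faulthandler` enabled, importing the modules in the same order the training script does. If this snippet also dies with signal 11, the decord installation or import order is the culprit rather than `train.py`.

```python
# Quick segfault isolation check (run in the 360dvd conda env).
import faulthandler
faulthandler.enable()  # dump a Python traceback even on SIGSEGV

import torch                      # typical training-script import order
from decord import VideoReader    # if this line crashes, fix the decord install / import order

print("torch:", torch.__version__)
print("decord import OK")
```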