recommendation_v2/torchrec_dlrm Fatal Python error: Segmentation fault #758

Closed
rvernica opened this issue Aug 2, 2024 · 1 comment

rvernica commented Aug 2, 2024

I'm trying to run TorchRec DLRM following the README. I've tried both the torchx and torchrun examples, and both crash with a segmentation fault. I'm on Fedora 40, Linux 6.9.11-200.fc40.x86_64.

> torchx run -s local_cwd dist.ddp -j 1x2 --script dlrm_main.py
torchx 2024-08-02 12:14:26 INFO     Tracker configurations: {}
torchx 2024-08-02 12:14:26 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2024-08-02 12:14:26 INFO     Log directory is: /tmp/torchx_bujouexw
local_cwd://torchx/dlrm_main-kdchrcgxwd9p5c
torchx 2024-08-02 12:14:26 INFO     Waiting for the app to finish...
dlrm_main/0 [2024-08-02 12:14:27,907] torch.distributed.run: [WARNING] 
dlrm_main/0 [2024-08-02 12:14:27,907] torch.distributed.run: [WARNING] *****************************************
dlrm_main/0 [2024-08-02 12:14:27,907] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
dlrm_main/0 [2024-08-02 12:14:27,907] torch.distributed.run: [WARNING] *****************************************
dlrm_main/0 Fatal Python error: Segmentation fault
dlrm_main/0 
dlrm_main/0 Current thread 0x00007f3e4a1d2b80 (most recent call first):
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
dlrm_main/0   File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
dlrm_main/0   File "/.local/bin/torchrun", line 8 in <module>
dlrm_main/0 
dlrm_main/0 Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
torchx 2024-08-02 12:14:30 INFO     Job finished: FAILED
torchx 2024-08-02 12:14:30 ERROR    AppStatus:
  msg: <NONE>
  num_restarts: 0
  roles: []
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: file:///tmp/torchx_bujouexw/torchx/dlrm_main-kdchrcgxwd9p5c
> torchrun --nnodes 1 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint localhost --rdzv_id 54321 --role trainer dlrm_main.py
[2024-08-02 12:17:40,294] torch.distributed.run: [WARNING] 
[2024-08-02 12:17:40,294] torch.distributed.run: [WARNING] *****************************************
[2024-08-02 12:17:40,294] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-08-02 12:17:40,294] torch.distributed.run: [WARNING] *****************************************
Fatal Python error: Segmentation fault

Current thread 0x00007f45fbfd2b80 (most recent call first):
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
  File "/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
  File "/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
  File "/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
  File "/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/.local/bin/torchrun", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
zsh: segmentation fault (core dumped)  torchrun --nnodes 1 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint  
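
Both tracebacks end in `_call_store` in `c10d_rendezvous_backend.py`, before `dlrm_main.py` itself runs. One way to check whether the crash is independent of the DLRM script is to exercise a store directly. The following is a minimal sketch, not what torchrun does internally: it assumes the rendezvous is backed by a `TCPStore`, runs a single set/get round trip in a child process so a crash does not take down the caller, and reports how the child exited. The helper names (`free_port`, `probe_store`, `describe`) are hypothetical.

```python
import signal
import socket
import subprocess
import sys

# Child program: create a one-process TCPStore and do a single set/get round trip.
# This only approximates the rendezvous code path; it is an assumption that a
# broken install would fail here too.
CHILD = """
from datetime import timedelta
from torch.distributed import TCPStore
store = TCPStore("127.0.0.1", {port}, 1, True, timedelta(seconds=10))
store.set("probe", "ok")
assert store.get("probe") == b"ok"
"""


def free_port():
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def probe_store(timeout=120):
    """Run the store round trip in a child interpreter and return its exit code."""
    cmd = [sys.executable, "-X", "faulthandler", "-c", CHILD.format(port=free_port())]
    return subprocess.run(cmd, timeout=timeout).returncode


def describe(rc):
    """Translate a subprocess return code into a short diagnosis."""
    if rc == 0:
        return "store round trip succeeded"
    if rc < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return f"child killed by {signal.Signals(-rc).name}"
    return f"child exited with code {rc}"


if __name__ == "__main__":
    print(describe(probe_store()))
```

If the child is killed by `SIGSEGV` here as well, the problem is likely in the PyTorch install rather than in the DLRM example; a clean round trip does not rule out other causes.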

rvernica commented Aug 5, 2024

Reinstalling PyTorch seems to have fixed this.
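
The thread does not record which versions were installed before or after the reinstall. For anyone hitting the same crash, a small sketch like the one below can capture that information so the two states can be compared. It uses only the standard library, so it works even when importing `torch` itself crashes. The package list is an assumption about what this example depends on, and `installed_versions` is a hypothetical helper name.

```python
import sys
from importlib import metadata

# Distribution names to report. Which packages matter here is an assumption;
# adjust the tuple to match the environment.
PACKAGES = ("torch", "torchrec", "fbgemm-gpu", "torchx", "numpy")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python:", sys.version.split()[0], "at", sys.executable)
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the metadata does not import the packages, so the output is available even in an environment where `torchrun` segfaults.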

@rvernica rvernica closed this as completed Aug 5, 2024