TorchRec DLRM No such file or directory: 'sbatch' #759

Closed
rvernica opened this issue Aug 5, 2024 · 1 comment

rvernica commented Aug 5, 2024

The TorchRec DLRM README provides an example of using torchx remotely:

torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py

This example fails with:

> torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
torchx 2024-08-05 11:53:42 INFO     Tracker configurations: {}
torchx 2024-08-05 11:53:42 INFO     Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 11:53:42 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 11:53:42 INFO     Reusing original image `ghcr.io/pytorch/torchx:0.7.0` for role[0]=dlrm_main. Either a patch was built or no changes to workspace was detected.
Traceback (most recent call last):
  File "/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 268, in run
    self._run(runner, args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 228, in _run
    app_handle = runner.run_component(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 200, in run_component
    handle = self.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 308, in schedule
    app_id = sched.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/schedulers/slurm_scheduler.py", line 388, in schedule
    p = subprocess.run(req.cmd, stdout=subprocess.PIPE, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

It appears that the sbatch executable (the Slurm job submission command) cannot be found on the PATH.
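The FileNotFoundError at the bottom of the traceback is raised by Python's subprocess module when the program it is asked to start cannot be resolved, not by a missing data file. A minimal sketch of that behavior, using a made-up command name (no-such-scheduler-command-xyz is hypothetical and only stands in for a missing sbatch):

```python
import subprocess


def try_run(cmd):
    """Run cmd and return its exit code, or None if the executable is missing."""
    try:
        return subprocess.run(cmd, stdout=subprocess.PIPE, check=False).returncode
    except FileNotFoundError as exc:
        # Raised before the child starts: the program name could not be resolved.
        print(f"executable not found: {exc.filename}")
        return None


# Hypothetical command name, assumed not to exist on the machine.
result = try_run(["no-such-scheduler-command-xyz", "--version"])
```

Here result is None because the command never started, which is the same failure mode as the sbatch call in slurm_scheduler.py.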

I'm using the latest revision of the master branch.

rvernica commented Aug 5, 2024

Fixed by installing Slurm, which provides the sbatch command: sudo dnf install slurm
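To confirm the fix before rerunning torchx, a small pre-flight check along these lines might help. It is only a sketch: missing_commands is a hypothetical helper, and sbatch is the only command taken from the traceback. Any other Slurm client commands you pass in are your own assumption about what the scheduler needs.

```python
import shutil


def missing_commands(names=("sbatch",)):
    """Return the command names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_commands()
if missing:
    print("not on PATH:", ", ".join(missing))
else:
    print("all required commands found")
```

If sbatch is still reported as missing after the install, the directory it was installed into is probably not on the PATH of the shell that launches torchx.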

rvernica closed this as completed Aug 5, 2024