The TorchRec DLRM README provides an example of running torchx remotely:
torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
This example fails with:
> torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
torchx 2024-08-05 11:53:42 INFO Tracker configurations: {}
torchx 2024-08-05 11:53:42 INFO Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 11:53:42 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 11:53:42 INFO Reusing original image `ghcr.io/pytorch/torchx:0.7.0` for role[0]=dlrm_main. Either a patch was built or no changes to workspace was detected.
Traceback (most recent call last):
  File "/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 268, in run
    self._run(runner, args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 228, in _run
    app_handle = runner.run_component(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 200, in run_component
    handle = self.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 308, in schedule
    app_id = sched.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/schedulers/slurm_scheduler.py", line 388, in schedule
    p = subprocess.run(req.cmd, stdout=subprocess.PIPE, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'
It appears that the `sbatch` file is missing. I'm using the latest revision of the master branch.
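For anyone reproducing this: the traceback ends in `subprocess.run`, which raises `FileNotFoundError` when the named executable cannot be found on `PATH`. `sbatch` is Slurm's job submission command, so one possible cause is that the Slurm client tools are not installed or not on `PATH` on the machine where `torchx run` is invoked. A minimal sketch for checking this before launching is below; the helper name `find_missing_commands` is hypothetical and not part of torchx.

```python
import shutil


def find_missing_commands(commands):
    """Return the commands that cannot be resolved on the current PATH.

    Hypothetical helper for illustration; not part of torchx or TorchRec.
    """
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # "sbatch" is the executable named in the FileNotFoundError above.
    missing = find_missing_commands(["sbatch"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("sbatch was found on PATH")
```

If `sbatch` is reported as missing, the command would likely need to be run from a host that has the Slurm client tools available, such as a cluster login node.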