I am using a SLURMCluster object to run some simple Python functions in parallel on an HPC cluster. When I run the script by manually passing each parameter to the SLURMCluster object, the jobs are submitted, connect, run, and return properly. However, when I move those parameters to a dask.yaml file (in ~/.config/dask/dask.yaml), the jobs are submitted but never connect, finish, or return; instead they hang until I kill the running Python process and cancel the subsequently submitted jobs. Both ways yield the same job script with identical options specified.
What could be causing this?
Below are copies of my dask.yaml file, as well as the SLURMCluster object with the parameters I use when I specify everything manually:
from dask_jobqueue import SLURMCluster

CORES = 2

#### This works
cluster = SLURMCluster(name='worker_bee',
                       queue='normal',
                       project='TG-EAR180014',
                       processes=1,
                       cores=CORES,
                       memory='2GB',
                       interface='ib0',
                       header_skip=['--mem', '--cpus-per-task='],
                       job_extra=['-N {}'.format(CORES)])

#### This doesn't work
cluster = SLURMCluster()
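As a quick check of the claim that both constructions yield the same job script, the generated submission script can be printed for each case. This is a minimal sketch; the manual_cluster/config_cluster names are only illustrative, and the keyword values mirror the ones above:

from dask_jobqueue import SLURMCluster

# Job script built from explicitly passed parameters
manual_cluster = SLURMCluster(name='worker_bee', queue='normal', project='TG-EAR180014',
                              processes=1, cores=2, memory='2GB', interface='ib0',
                              header_skip=['--mem', '--cpus-per-task='],
                              job_extra=['-N 2'])
print(manual_cluster.job_script())

# Job script built from the values in ~/.config/dask/dask.yaml
config_cluster = SLURMCluster()
print(config_cluster.job_script())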
dask.yaml
jobqueue:
  slurm:
    name: worker-bee
    project: TG-EAR180014
    queue: normal
    cores: 2
    memory: 2GB
    processes: 1
    interface: ib0
    death-timeout: 60        # Number of seconds to wait if a worker can not find a scheduler
    local-directory: null    # Location of fast local storage like /scratch or $TMPDIR

    # SLURM resource manager options
    shebang: "#!/usr/bin/env bash"
    walltime: '00:30'
    extra: []
    env-extra: []
    ncpus: null
    header-skip: ['--mem', '--cpus-per-task=']
    job-extra: ['-N 2']
    log-directory: null
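A related sketch, assuming only that the file above sits in ~/.config/dask/, is to ask dask which values it actually loaded before constructing the cluster; dask.config.get returns the merged configuration that SLURMCluster() falls back on:

import dask
import dask.config

# The whole jobqueue.slurm section as dask sees it
print(dask.config.get('jobqueue.slurm'))

# Individual keys can be inspected the same way, e.g. the network interface
print(dask.config.get('jobqueue.slurm.interface'))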