
SLURMCluster jobs not running when using parameters from dask.yaml #394

Closed
@Ovec8hkin

Description

I am using a SLURMCluster object to run some simple python functions in parallel on an HPC cluster. When I run the script by manually passing each parameter to the SLURMCluster object, the jobs are submitted, connect, run, and return properly. However, when I move those parameters to a dask.yaml file (in ~/.config/dask/dask.yaml), the jobs are submitted but never connect, finish, or return; instead they hang until I kill the running python process and cancel the subsequently submitted jobs. Both ways yield the same job script with identical options specified.

What could be causing this?

Below are copies of my dask.yaml file, as well as the SLURMCluster object with the parameters I use when specifying everything manually:

from dask_jobqueue import SLURMCluster

CORES = 2

#### This works
cluster = SLURMCluster(name='worker_bee',
                       queue='normal',
                       project='TG-EAR180014',
                       processes=1,
                       cores=CORES,
                       memory='2GB',
                       interface='ib0',
                       header_skip=['--mem', '--cpus-per-task='],
                       job_extra=['-N {}'.format(CORES)]
)


#### This doesn't work
cluster = SLURMCluster()
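
In both cases the cluster is exercised the same way. For context, here is a minimal sketch of the driving code (Client and scale are standard dask.distributed / dask-jobqueue API; my_task is a placeholder for the simple functions being run):

from dask.distributed import Client

def my_task(i):
    return i * i   # stand-in for the real work

cluster.scale(2)           # request workers; dask-jobqueue submits the SLURM jobs
client = Client(cluster)   # attach a client to the cluster's scheduler

futures = client.map(my_task, range(10))
results = client.gather(futures)   # in the dask.yaml case this never completes;
                                   # the workers never connect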

dask.yaml

jobqueue:
  slurm:
    name: worker-bee
    project: TG-EAR180014
    queue: normal

    cores: 2
    memory: 2GB
    processes: 1

    interface: ib0
    death-timeout: 60           # Number of seconds to wait if a worker cannot find a scheduler
    local-directory: null       # Location of fast local storage like /scratch or $TMPDIR

    # SLURM resource manager options
    shebang: "#!/usr/bin/env bash"
    walltime: '00:30'
    extra: []
    env-extra: []
    ncpus: null
    header-skip: ['--mem', '--cpus-per-task=']

    job-extra: ['-N 2']
    log-directory: null
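
For reference, the two variants can be checked against each other by printing the config dask actually resolved and the job script it would submit (a sketch using dask.config.get and dask-jobqueue's job_script() method):

import dask
from dask_jobqueue import SLURMCluster

# Config as dask resolved it from ~/.config/dask/dask.yaml
print(dask.config.get('jobqueue.slurm'))

# The sbatch script this variant would submit; it matches the
# script produced by the fully parameterized call above
cluster = SLURMCluster()
print(cluster.job_script())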
