I need to submit thousands of tasks. Because of the maximum size limit on job arrays, the tasks are divided into groups, with one job array submitted per group:
submitted_jobs = []
for group_idx, group_jobs_to_run in enumerate(groups):
    with executor.batch():  # one job array per group
        for idx in group_jobs_to_run:  # note: idx is the user-defined index, not the Slurm job id
            task_args, task_kwargs = get_task_args(idx)
            job = executor.submit(slurm_tasks, *task_args, **task_kwargs)
            submitted_jobs.append(job)

# wait for results
_ = [job.wait() for job in submitted_jobs]
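For reference, the groups above are just the task indices chunked to the cluster's array size limit (a minimal sketch; n_tasks and MAX_ARRAY_SIZE are placeholders, the latter standing in for the cluster's MaxArraySize):

MAX_ARRAY_SIZE = 1000  # placeholder: the cluster's MaxArraySize limit
all_task_indices = list(range(n_tasks))  # n_tasks: total number of tasks to run
groups = [
    all_task_indices[i : i + MAX_ARRAY_SIZE]
    for i in range(0, len(all_task_indices), MAX_ARRAY_SIZE)
]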
I use job.wait() to wait for all tasks to complete. However, I found that it usually triggers the per-user RPC limit on my Slurm cluster, and sometimes even stalls the whole cluster. I get warnings like these:
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:07:41,285) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:07:41,285) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:08:51,594) - Call #7 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:08:51,594) - Call #7 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:15:56,718) - Call #9 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:15:56,718) - Call #9 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:25:23,108) - Call #10 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:25:23,108) - Call #10 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 23:15:34,829) - Call #15 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 23:15:34,829) - Call #15 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 23:25:36,817) - Call #16 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 23:25:36,817) - Call #16 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
It seems that submitit issues too many duplicate requests at the same time, exceeding the per-user RPC limit on my cluster. I expected the Job.wait() method to block without querying each task's state in parallel, and I'm not sure what mechanism in submitit causes the duplicated Slurm calls.
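In the meantime, a workaround I am considering is to drop the per-job wait() calls and poll at a single, coarse interval myself, which should reduce how often the scheduler is queried (a minimal sketch; POLL_INTERVAL_S is an arbitrary value I chose, not a submitit setting):

import time

POLL_INTERVAL_S = 300  # arbitrary interval, chosen to stay well under the RPC limit

pending = list(submitted_jobs)
while pending:
    time.sleep(POLL_INTERVAL_S)
    # job.done() checks the job state; polling at a coarse interval limits how often this happens
    pending = [job for job in pending if not job.done()]

results = [job.result() for job in submitted_jobs]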