Determine if a job failed due to exceeding the time limit #1776

Open
lee-jin-gyu96 opened this issue Sep 25, 2024 · 0 comments

Hello,

I'm wondering if there's a clean way to determine whether a job failed because Slurm timed it out, or because of an "actual" error.
As far as I can tell, I have to parse the error message and check whether it contains `Job not requeued because: timed-out and not checkpointable`.
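
For reference, the check I'm doing now looks roughly like the sketch below. The helper name `is_timeout_failure` and the constant `TIMEOUT_MARKER` are made up for illustration. The only thing taken from the real output is the marker text itself, and the sketch assumes that text appears verbatim somewhere in the failure message.

```python
# Marker text copied from the failure message quoted above.
TIMEOUT_MARKER = "Job not requeued because: timed-out and not checkpointable"


def is_timeout_failure(error_text: str) -> bool:
    """Return True if the failure text contains the timeout marker.

    Hypothetical helper: it only does a substring check, so it will
    stop working if the wording of the message ever changes.
    """
    return TIMEOUT_MARKER in error_text
```

The obvious weakness is that this depends on the exact wording of a log message, which is why I'm asking for something more robust.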

That works, but I'd be grateful for any advice if there's a better way to do this.

(For context: I have a job that should finish within X minutes. If it takes longer than X minutes, there's a problem with the input, but I can't diagnose that problem before running the job. So the goal is to let my program continue running when a Slurm job fails because it timed out.)
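
One alternative I've considered is asking Slurm directly for the job's final state instead of reading the error text. This is only a sketch under assumptions: it requires that Slurm accounting is enabled and that `sacct` is on the `PATH`, and the function names `parse_sacct_state` and `job_timed_out` are hypothetical. Slurm reports the state `TIMEOUT` for jobs that hit their time limit.

```python
import subprocess


def parse_sacct_state(sacct_output: str) -> str:
    """Extract the job state from `sacct --parsable2 --format=State` output.

    Returns the first word of the first non-empty line, so a value such
    as "CANCELLED by 1000" is reduced to "CANCELLED". Returns "" if the
    output is empty.
    """
    for line in sacct_output.splitlines():
        line = line.strip()
        if line:
            return line.split("|")[0].split()[0]
    return ""


def job_timed_out(job_id: str) -> bool:
    """Return True if Slurm accounting reports the job state as TIMEOUT.

    Assumes accounting is enabled; -X restricts the output to the job
    allocation itself rather than its individual steps.
    """
    result = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--parsable2", "--format=State"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sacct_state(result.stdout) == "TIMEOUT"
```

With something like this, my program could catch the job failure, call `job_timed_out(job_id)`, and carry on only when it returns `True`. I don't know whether accounting data is always available right after the job ends, so I'm not sure this is actually better.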

Thank you in advance!
