
get_job_status API call can fail immediately after a job completes #156

Open
Mossop opened this issue Feb 28, 2019 · 0 comments

Mossop commented Feb 28, 2019

Summary

A get_job_status API call can fail if made immediately after the job completes. In my case I am making the call in response to the job_complete web-hook. The failure is "Failed to fetch job details: Error: Failed to fetch key: jobs/jjso98s8024: File not found".

If I'm reading this right, there is a race condition here:

1. When a job completes on the local server, we call finishJob (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1097).
2. finishJob enqueues a task to store the job's final state in storage (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1097).
3. It then sends out the web-hook (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1404).
4. Only afterwards do we remove the job from activeJobs (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1115).

So at the point the job is removed from activeJobs, it isn't yet in storage. That means the web-hook request to me, and my subsequent get_job_status request back to the server, are racing against the queued storage write (see the sketch below). Surprisingly, I'm losing that race about 50% of the time.
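Here's a minimal, self-contained Node.js sketch of the ordering. The names (activeJobs, finishJob, getJobStatus) are simplified stand-ins for Cronicle's internals, and setTimeout/setImmediate stand in for the storage queue and the web-hook round-trip; this illustrates the race, it is not Cronicle's actual code:

```js
// Simplified stand-ins for Cronicle's internals (hypothetical, for illustration).
const activeJobs = { jjso98s8024: { id: 'jjso98s8024', complete: 1 } };
const storage = {};

function finishJob(job) {
  // 1. The final job record is only *enqueued* for storage here;
  //    setTimeout stands in for the async storage queue + fs latency.
  setTimeout(() => { storage['jobs/' + job.id] = job; }, 5);

  // 2. The job_complete web-hook fires before that write lands...
  fireWebHook(job);

  // 3. ...and the job is removed from activeJobs right away.
  delete activeJobs[job.id];
}

function fireWebHook(job) {
  // The web-hook consumer immediately calls get_job_status back.
  setImmediate(() => {
    try {
      console.log('status:', getJobStatus(job.id));
    } catch (err) {
      console.error(err.message); // this is the branch I'm hitting
    }
  });
}

function getJobStatus(id) {
  // The handler no longer finds the job in activeJobs, so it falls
  // through to storage -- which does not have the record yet.
  if (activeJobs[id]) return activeJobs[id];
  if (storage['jobs/' + id]) return storage['jobs/' + id];
  throw new Error('Failed to fetch key: jobs/' + id + ': File not found');
}

finishJob(activeJobs['jjso98s8024']);
```

Running this prints the "Failed to fetch key" error, because the setImmediate callback (the web-hook round-trip) wins against the 5 ms storage write.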

I can't quite tell whether this is the same issue as https://github.com/jhuckaby/Cronicle/blob/master/lib/api/job.js#L392, since I'm only running a single master server with no slaves.

Steps to reproduce the problem

Upon receiving a job_complete web-hook, immediately send a get_job_status API call (a sketch of my consumer follows).
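For reference, this is roughly what my web-hook consumer does. I'm assuming the hook payload carries action and id fields, and the port/endpoint match the API log below; the API key is redacted:

```js
const http = require('http');

const API_KEY = '<redacted>'; // substitute a real API key
const CRONICLE = { host: 'localhost', port: 3012 }; // matches the API log below

// Tiny web-hook receiver; point the event's Web Hook URL at this server.
http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => { body += chunk; });
  req.on('end', () => {
    res.end('OK');
    const hook = JSON.parse(body);
    if (hook.action !== 'job_complete') return;

    // Immediately ask Cronicle for the job's status.
    const payload = JSON.stringify({ api_key: API_KEY, id: hook.id });
    const apiReq = http.request({
      method: 'POST',
      host: CRONICLE.host,
      port: CRONICLE.port,
      path: '/api/app/get_job_status/v1',
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(payload),
      },
    }, (apiRes) => {
      let data = '';
      apiRes.on('data', (chunk) => { data += chunk; });
      // Roughly half the time this logs the "File not found" error.
      apiRes.on('end', () => console.log(data));
    });
    apiReq.end(payload);
  });
}).listen(5000);
```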

Your Setup

Operating system and version?

OS X 10.14.3

Node.js version?

11.9.0

Cronicle software version?

0.8.28

Are you using a multi-server setup, or just a single server?

Single server.

Are you using the filesystem as back-end storage, or S3/Couchbase?

Filesystem (I imagine the problem would be worse for S3).

Can you reproduce the crash consistently?

Roughly 50% of the time.

Log Excerpts

User.log:

[1551335738.198][2019-02-27 22:35:38][cantor.klickitat.local][46581][User][error][job][Failed to fetch job details: Error: Failed to fetch key: jobs/jjso98s8024: File not found][]

API.log:

[1551335738.197][2019-02-27 22:35:38][cantor.klickitat.local][46581][API][debug][6][Handling API request: POST http://localhost:3012/api/app/get_job_status/v1][{}]
[1551335738.197][2019-02-27 22:35:38][cantor.klickitat.local][46581][API][debug][9][API Params][{"api_key":"<redacted>","id":"jjso98s8024"}]
[1551335738.197][2019-02-27 22:35:38][cantor.klickitat.local][46581][API][debug][9][Activating namespaced API handler: app/api_get_job_status for URI: http://localhost:3012/api/app/get_job_status/v1][]

Filesystem.log:

[1551335738.191][2019-02-27 22:35:38][cantor.klickitat.local][46581][Filesystem][debug][9][Storing JSON Object: jobs/jjso98s8024][data/jobs/59/f8/b6/59f8b667e0dc0700a4df186445df729b.json]
[1551335738.194][2019-02-27 22:35:38][cantor.klickitat.local][46581][Filesystem][debug][9][Storing Binary Stream Object: jobs/jjso98s8024/log.txt.gz][data/jobs/31/55/51/3155513e934694cc3056f19427091a8a.gz]
[1551335738.198][2019-02-27 22:35:38][cantor.klickitat.local][46581][Filesystem][debug][9][Fetching Object: jobs/jjso98s8024][data/jobs/59/f8/b6/59f8b667e0dc0700a4df186445df729b.json]
[1551335738.201][2019-02-27 22:35:38][cantor.klickitat.local][46581][Filesystem][debug][9][Store operation complete: jobs/jjso98s8024][]
[1551335738.202][2019-02-27 22:35:38][cantor.klickitat.local][46581][Filesystem][debug][9][Store operation complete: jobs/jjso98s8024/log.txt.gz][]

You can see that the failed fetch (at 22:35:38.198) lands after the store operation begins (22:35:38.191) but before it completes (22:35:38.201).
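Until the ordering in finishJob changes, the only mitigation I can see on my side is to retry the status fetch for a little while, to outlast the queued storage write. Something like this, where fetchJobStatus is a hypothetical wrapper around the POST shown above:

```js
// Hypothetical client-side mitigation (not a fix for the underlying
// ordering): retry get_job_status until the storage write lands.
// fetchJobStatus is assumed to POST to the API as shown above and
// resolve with the parsed JSON response.
async function getJobStatusWithRetry(id, attempts = 5, delayMs = 100) {
  for (let i = 0; i < attempts; i++) {
    const resp = await fetchJobStatus(id);
    if (resp.code === 0) return resp; // Cronicle returns code 0 on success
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('job ' + id + ' never appeared in storage');
}
```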

jhuckaby added the bug label Mar 10, 2019