Summary
Calling the get_job_status API can fail if called immediately after the job completes. In my case I am doing so in response to the job_complete web-hook. The failure is "Failed to fetch job details: Error: Failed to fetch key: jobs/jjso98s8024: File not found".
If I'm reading this right, there is a race condition here:
When a job completes on the local server, we call finishJob (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1097).
This enqueues a task that stores the job's final state in storage (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1097).
It then sends out the web-hook (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1404).
Afterwards, we remove the job from activeJobs (https://github.com/jhuckaby/Cronicle/blob/master/lib/job.js#L1115).
So when we remove the job from activeJobs, it is not yet in storage.
The web-hook request to me, followed by my subsequent request back to the server, is therefore racing against the queued task that stores the job into storage. Surprisingly, I'm losing that race about 50% of the time.
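To make the ordering concrete, here is a tiny, self-contained Node.js simulation of the sequence above. Everything in it (the Maps, the 10 ms delay, the function bodies) is an illustrative stand-in of mine, not Cronicle's actual code; the forced delay just guarantees the losing side of the race, whereas against the real server I only lose it about half the time.

```js
// Illustrative stand-ins for the real pieces: a storage back-end, the in-memory
// activeJobs table, and the finishJob / get_job_status flow described above.
const storage = new Map();
const activeJobs = new Map();

function finishJob(id, details) {
  // 1. Enqueue the storage write; it completes a little later (forced to 10 ms here).
  setTimeout(() => storage.set('jobs/' + id, details), 10);

  // 2. Fire the job_complete web-hook; the receiver's get_job_status request
  //    comes back over HTTP, so it also arrives on a later tick.
  setImmediate(() => webHookReceiver(id));

  // 3. Remove the job from activeJobs, so only storage can answer for it now.
  activeJobs.delete(id);
}

function getJobStatus(id) {
  if (activeJobs.has(id)) return activeJobs.get(id);
  if (storage.has('jobs/' + id)) return storage.get('jobs/' + id);
  throw new Error('Failed to fetch key: jobs/' + id + ': File not found');
}

function webHookReceiver(id) {
  // The web-hook handler immediately asks for the job's final status.
  try {
    console.log('get_job_status:', getJobStatus(id));
  } catch (err) {
    console.log('get_job_status failed:', err.message); // the error from the summary
  }
}

activeJobs.set('jjso98s8024', { state: 'running' });
finishJob('jjso98s8024', { state: 'complete', code: 0 });
```

Running this prints the "File not found" failure because the web-hook round trip beats the queued storage write; with real I/O timing instead of the fixed delay, it becomes the roughly 50/50 behaviour I'm seeing.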
I can't quite tell if this is the same issue as https://github.com/jhuckaby/Cronicle/blob/master/lib/api/job.js#L392, as I'm just running a single master server with no slaves.
Steps to reproduce the problem
Upon receiving a job_complete web-hook, send a get_job_status API call.
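For reference, here is roughly the receiver I'm using, trimmed down to plain Node.js with no dependencies. The port 4000, the payload fields action and id, and the placeholder API key are my own assumptions/config; the get_job_status endpoint and parameters match the API.log excerpt below.

```js
const http = require('http');

const CRONICLE = { host: 'localhost', port: 3012 };  // my Cronicle instance (see API.log)
const API_KEY = '<redacted>';

// POST /api/app/get_job_status/v1 with { api_key, id }, resolving with the parsed JSON reply.
function getJobStatus(jobId) {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify({ api_key: API_KEY, id: jobId });
    const req = http.request({
      host: CRONICLE.host,
      port: CRONICLE.port,
      path: '/api/app/get_job_status/v1',
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body) }
    }, (res) => {
      let data = '';
      res.on('data', (chunk) => { data += chunk; });
      res.on('end', () => resolve(JSON.parse(data)));
    });
    req.on('error', reject);
    req.end(body);
  });
}

// Web-hook receiver: as soon as job_complete arrives, immediately ask for the job's status.
http.createServer((req, res) => {
  let data = '';
  req.on('data', (chunk) => { data += chunk; });
  req.on('end', () => {
    res.end('OK');
    const payload = JSON.parse(data);
    if (payload.action === 'job_complete') {       // assuming the payload's action/id fields
      getJobStatus(payload.id)
        .then((reply) => console.log('get_job_status reply:', JSON.stringify(reply)))
        .catch((err) => console.log('get_job_status request failed:', err.message));
    }
  });
}).listen(4000);
```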
Your Setup
Operating system and version?
OS X 10.14.3
Node.js version?
11.9.0
Cronicle software version?
0.8.28
Are you using a multi-server setup, or just a single server?
Single server.
Are you using the filesystem as back-end storage, or S3/Couchbase?
Filesystem (I imagine the problem would be worse for S3).
Can you reproduce the crash consistently?
Roughly 50% of the time.
Log Excerpts
User.log:
13:[1551335738.198][2019-02-27 22:35:38][cantor.klickitat.local][46581][User][error][job][Failed to fetch job details: Error: Failed to fetch key: jobs/jjso98s8024: File not found][]
API.log:
[1551335738.197][2019-02-27 22:35:38][cantor.klickitat.local][46581][API][debug][6][Handling API request: POST http://localhost:3012/api/app/get_job_status/v1][{}]
[1551335738.197][2019-02-27 22:35:38][cantor.klickitat.local][46581][API][debug][9][API Params][{"api_key":"<redacted>","id":"jjso98s8024"}]
[1551335738.197][2019-02-27 22:35:38][cantor.klickitat.local][46581][API][debug][9][Activating namespaced API handler: app/api_get_job_status for URI: http://localhost:3012/api/app/get_job_status/v1][]
Filesystem.log:
You can see that the failure to find the job in storage happens between the attempt to store it and the completion of that store.
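In the meantime, the obvious workaround on my side is to retry the call briefly when it fails this way. A rough, untested sketch (the attempt count, the delay, and the assumption that the API reply uses the usual { code, description } shape with code === 0 on success are all mine):

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry get_job_status a few times to give the queued storage write time to land.
// getJobStatus is the same promise-returning helper as in the receiver sketch above.
async function getJobStatusWithRetry(jobId, attempts = 5, delayMs = 250) {
  let reply;
  for (let i = 0; i < attempts; i++) {
    reply = await getJobStatus(jobId);
    if (reply.code === 0) return reply;  // assuming code === 0 means success
    await sleep(delayMs);
  }
  return reply;                          // still failing after all attempts
}
```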