-
Notifications
You must be signed in to change notification settings - Fork 224
feat: sidekiq scheduled jobs improvements - scheduled operation and add links
#1717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: sidekiq scheduled jobs improvements - scheduled operation and add links
#1717
Conversation
03a5fdb to
4eca039
Compare
…d add links When a Job is pushed via `perform_in` or friends, it goes through the following life-cycle: 1. First the Job is pushed with `at` attribute in Redis from `Sidekiq::Client` - this will now create `<...> scheduled` span (before it was `<...> publish` 2. Then the Job is handled from `Sidekiq::Scheduled#enqueue` and pushed without `at` attribute in Redis from `Sidekiq::Client` - this will keep creating `<...> publish` span 3. Then the Job is handled by the Worker as usual Additionally when using `:propagation_style = :link` and `:trace_poller_enqueue = true` a handy set of links are added between the 3 actors above.
44cda99 to
6a3983f
Compare
This reverts commit 4ed7a70.
scheduled operation and add linksscheduled operation and add links
arielvalentin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your submission. I have some concerns about how this changes/breaks primary use cases.
Please review my comments and provide any feedback to help me understand how we can support your needs.
| attributes[SemanticConventions::Trace::PEER_SERVICE] = instrumentation_config[:peer_service] if instrumentation_config[:peer_service] | ||
|
|
||
| scheduled_at = job['at'] | ||
| op = scheduled_at.nil? ? 'publish' : 'scheduled' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 This naming does not match any of the semantic convention naming for messaging.
Is there something special about scheduled spans that help you disambiguate them from any other publisher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will answer in the next comment ...
| tracer.in_span(span_name, attributes: attributes, kind: :producer) do |span| | ||
| # In case this is Scheduled job, there is already context injected, so link to that context | ||
| # NOTE: :propagation_style = :child is not supported as it is quite tricky when :trace_poller_enqueue = true | ||
| extracted_context = OpenTelemetry.propagation.extract(job, context: OpenTelemetry::Context::ROOT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite interesting and do not see a case covered in the spec https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/messaging/messaging-spans.md#span-name
If I understand this correctly, this is disconnecting the parent/child context and treats a Sidekiq scheduled job from the place where it was enqueued?
This seems unexpected to do from the client all cases of links because it is going to disconnect the clients, publishers and consumers from each other generating 3 traces ids instead of 2. By default, even in a link situation, I would expect the publisher span to use parent/child propagation and the consumer to generate links.
E.g. you have a web request that uses wait_until, I would expect the http.server span to have a parent/child relationship with the messaging.producer span:
def checkout
# ... schedule housekeeping clean up
GuestsCleanupJob.set(wait_until: Date.tomorrow.noon).perform_later(guest)
endWhen the GuestsCleanupJob then executes it would create a new trace, and link the web request trace to it.
Sidekiq maps wait_until to the at attribute AFAICT: https://github.com/sidekiq/sidekiq/blob/96f867cb58b7fa0a6a832af1a732a339aa0eb61f/lib/sidekiq/job.rb#L194
With these changes, it will end up changing this expected behavior and that is undesirable in my opinion.
My recommendation here would be for you to add a utility method to your code or identify something more specific in Sidekiq that would address your use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is how Sidekiq is working, to what I have found:
<SomeJob>.perform_in(1.minute)pushes the Job definition to a dedicated queue (sorted set) with nameschedule, using theatto sort the setSidekiq::Scheduled::Pollerruns roughly everyaverage_scheduled_poll_interval,- During the each run, it picks the items in the
schedulethat are due and pushes to the defined<SomeJob>.queue <SomeJob>.performis invoked
When it comes to OpenTelemetry instrumentation of that:
Currently when when using span_naming=:job_class (default is :queue), propagation_style=:link (default is :link) this results in the following spans
<SomeJob> publishand attrs{"messaging.destination.name": "high", ...}time=T, id=123, trace=ABC, parent=123- If using
trace_poller_enqueue: true-Sidekiq::Scheduled::Poller#enqueuetime~=T+1.minute, id=567, trace=XYZ, parent=nil, having no parent and it enqueues a batch of Jobs- no relation / link to p.1
- If using
trace_poller_enqueue: true-<SomeJob> publishand attrs{"messaging.destination.name": "high", ...}time~=T+1.minute, id=678, trace=XYZ, parent=567- no relation / link to p.1
<SomeJob> processand attrs{"messaging.destination.name": "high", ...}time~=T+1.minute+operational delay, id=987, trace=DEF- relation / link to p.3, but only if using
trace_poller_enqueue: true
- relation / link to p.3, but only if using
Caveats:
- If using the default
trace_poller_enqueue: false- the P.2, P.3 are completely missed - If using the default
trace_poller_enqueue: true- the P.2, P.3 and P.4 are correlated, but the Span from P.2 (Sidekiq::Scheduled::Poller#enqueue) can't be child of many parents as each there are NxP.1 each on its own Trace, thus only Link is available to correlate. - If using the default
trace_poller_enqueue: true- the The P.3 (<SomeJob> publish) spans are put on the trace of P.2. The spans of P.2 eventually with quite some work, may be refactored to be put as child of P.1 and have links to P.2, but that is not something I am keen on doing as it is very hard.
- If using the default
trace_poller_enqueue: false(the default) all of the P.2 and P.3 spans and correlation are missing
My goal is to have meaningful correlation from P.1 to P.4, regardless same Trace or Links, whatever is easy and feasible.
With the above caveats in place, this PR changes
<SomeJob> scheduledand attrs{"messaging.destination.name": "high", ...}time=T, id=123, trace=ABC, parent=123- changed from
publishtoscheduledto distinguish from P.3
- changed from
- If using
trace_poller_enqueue: true-Sidekiq::Scheduled::Poller#enqueuetime~=T+1.minute, id=567, trace=XYZ, parent=nil, having no parent and it enqueues a batch of Jobs- no relation / link to p.1
- If using
trace_poller_enqueue: true-<SomeJob> publishand attrs{"messaging.destination.name": "high", ...}time~=T+1.minute, id=678, trace=XYZ, parent=567- if using
propagation_style=:link- a Link to P.1 is added
- if using
<SomeJob> processand attrs{"messaging.destination.name": "high", ...}time~=T+1.minute+operational delay, id=987, trace=DEF- relation / link to p.3, but only if using
trace_poller_enqueue: true
- relation / link to p.3, but only if using
Resulting in this more meaningful list and correlation with links or traces
I have been thinking to do some monkey patching or whatever, but it is way harder, moreover this PR has value for others.
So @arielvalentin how about, the following?
- Keep correlation as improved (limited Trace and added Links)
- If you are still very keen on obliging the standards, I can easily change P.1 spans addressing feat: sidekiq scheduled jobs improvements -
scheduledoperation and add links #1717 (comment)- from
<SomeJob> **scheduled**and attrs{"messaging.destination.name": "**high**", ...}time=T, id=123, trace=ABC, parent=123 - to
<SomeJob> **publish**and attrs{"messaging.destination.name": "**schedule**", ...}time=T, id=123, trace=ABC, parent=123
- from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a sidekiq user so I'm thinking about this from the perspective of an active job user who may use the sidekiq adapter.
I see your point though of using non standard semconv in the cases where the naming pattern is Job Name instead of semantic conventions. I think in those cases using scheduled in the span name should be fine; however I think the semconv cases should remain untouched.
I'd need to review your comments more carefully since again, I'm not a sidekiq user so I don't quite understand the nuances of the cause of disconnect between the producer and consumer spans in the cases you outlined above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also am not a sidekiq user but am coming from the sem conv side. I am wondering if the messaging conventions are the correct conventions to even be following.
When I do some searching I get the impression that it is a task/job system and the official website seems to back this up.
Scale your app with Ruby's fastest job system, up to 20x faster than the competition!
Based on that I suspect it would be highly unlikely to be able to get sidekiq listed in sem conv as a messaging system.
So with that in mind as well as the sidekiq docs I find myself asking myself how would I define this and what I come up with is:
- Define an internal Workflow span to handle scheduling a task to run. Could be added to https://github.com/thompson-tomo/semantic-conventions/blob/feature/%231688_Workflows/docs/workflow/workflow-spans.md
- An optional child db span for capturing the db write operation ie redis. This would follow db semconv.
- we have an internal span for capturing the poller and also add to that doc.
- An optional child db span for capturing the db read operation ie redis. This would follow db semconv.
- we use the task completion span in https://github.com/thompson-tomo/semantic-conventions/blob/feature/%231688_Workflows/docs/workflow/workflow-spans.md as a child for each job. That job could have a link back to the scheduler span. In addition we could hopefully propogate an attribute ie workflow.trigger.id
I know this would be a significant change and additional feedback should be sought from others but i feel moving away from using messaging spans better sets up given the metric definitions available in the Workflow pr. I am more than willing to refine the attribute usage in those spans based on findings when attempting to implement it. Also in coming up with span name convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the conventions as long as things are well observable and correlated, however this is a huge change that I am not keen on doing as part of this PR.
This PR is intended as an unobtrusive incremental change, so it can be quickly approved and merged, so let's focus on this. Later I would love helping in the refactoring with real life usage and feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be a big change but my concern is introducing the schedule messaging operation as a way to try and force the scenario of scheduling a task into the messaging specs. As a middle ground as if I understand correctly one of the issues is differentiating between the scheduling and publishing event. In that case why don't we if we are scheduling a task follow a definition which is not messaging. That way we aren't making the situation worse and I wouldn't have an issue if a couple of attributes were shared to assist with interoperability ie message.message.id
|
I think it is worthwhile considering the discussion in open-telemetry/semantic-conventions#1688 (comment) hence my comment #1717 (comment) |
When a Job is pushed via
perform_inor friends, it goes through the following life-cycle:atattribute in Redis fromSidekiq::Client- this will now create<...> scheduledspan (before it was<...> publishSidekiq::Scheduled#enqueueand pushed withoutatattribute in Redis fromSidekiq::Client- this will keep creating<...> publishspanAdditionally when using
:propagation_style = :linkand:trace_poller_enqueue = truea handy set of links are added between the 3 actors above.