-
Notifications
You must be signed in to change notification settings - Fork 461
EW-9366 Provide return event timestamp if span is reported without valid incomingRequest #5282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
src/workerd/io/trace.c++
Outdated
| // TODO(o11y): Once we report the user tracing spanOpen event as soon as a span is created, we | ||
| // should be able to fold this virtual call and just get the timestamp directly. | ||
| span.emplace(kj::mv(operationName), startTime.orDefault(obs->getTime())); | ||
| span.emplace(kj::mv(operationName), startTime.orDefault([&obs]() { obs->getTime() })); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this saves us a timer lookup whenever we already have a timestamp, something we want to get right for performance.
|
The generated output of |
|
As discussed with Mar in a call, it might be possible to work around this edge case by restructuring |
a8de521 to
f6e2966
Compare
|
@jmorrell-cloudflare this addresses a warning encountered in edge testing – in a rare edge case, we are unable to get a timestamp when closing a span following an exception for DO-based requests. This PR proposes just providing the return timestamp in that case (which is already available at that time) without reporting a warning, as it's the best time we have (unless we figure out a different way to manage these span lifetimes later). Does that approach work for you? |
|
Missing tests or a link to a repro. If this happens on abort only, is it that concerning that we wouldn't have a timestamp to use? What would it take to change the spans not to outlive the aborted context and report early? |
Mar I thought we already discussed the rationale for this change including 1) and 2) at the Friday meeting. We also touched on 3) – might be possible but I'm not sure if it can be done safely as summarized in #5282 (comment) – we would want some more senior folks to weigh in and it would be hard to rule out that it would introduce other bugs. My understanding is that getting span lifetimes right 100% of the time is not in scope for our next milestone. |
…lid incomingRequest Edge testing showed that the given error only happens after the return event timestamp has already been provided, which happens in one edge case when a DO request is aborted and spans are only reported as the IncomingRequest is being destructed. At that time, we can't look up the time anymore, but the return event timestamp will still be helpful. As indicated in the TODO comment for the case where the IoContext itself is no longer available, we'll want to change span lifetimes at a later time so that they don't outlive the IncomingRequest/IoContext, but for now we'll provide the reasonable timestamp available to us. Also perform some cleanup while we're at it (promote warnings to errors since we've always had the return event available in this case so if they still appear that's an actual issue, avoid superfluous timer lookups).
f6e2966 to
b9c7d97
Compare
Edge testing showed that the given error only happens after the return event timestamp has already been provided, which happens in one edge case when a DO request is aborted and spans are only reported as the IncomingRequest is being destructed. At that time, we can't look up the time anymore, but the return event timestamp will still be helpful.
As indicated in the TODO comment for the case where the IoContext itself is no longer available, we'll want to change span lifetimes at a later time so that they don't outlive the IncomingRequest/IoContext, but for now we'll provide the reasonable timestamp available to us.
Also perform some cleanup while we're at it (promote warnings to errors since we've always had the return event available in this case so if they still appear that's an actual issue, avoid superfluous timer lookups).