-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(user-interaction): deduplicate interaction spans by grouping them #743
feat(user-interaction): deduplicate interaction spans by grouping them #743
Conversation
a4b939c
to
0f94b70
Compare
This seems like a really unstable way to do this. How can you be sure you aren't going to grab unrelated handlers? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should find a way to fix it in a way to prevent creating 2 spans from one click instead of combining them into one extra span which sounds and looks very very wrong. For me this is blocker
If a click event has multiple handlers, those handlers should be individual spans and you need to have a parent for them to be properly related. Imagine a scenario where there is a button with 2 handlers. One handler makes an API call and updates the page, and the other handler sends analytics data to your analytics service. In the current state, these are two separate traces which is obviously wrong. This PR suggests to make a "click" span, with a child span for each handler. You are suggesting the API call and the analytics should be a single span? |
This is a different case than this and what we discussed about Although your explanation makes sense now. So main question,
|
Codecov Report
@@ Coverage Diff @@
## main #743 +/- ##
==========================================
+ Coverage 96.87% 97.09% +0.21%
==========================================
Files 11 11
Lines 640 653 +13
Branches 126 131 +5
==========================================
+ Hits 620 634 +14
+ Misses 20 19 -1
|
In practice, I think this timeout will rarely happen, if at all. It should be safe to remove it. Each task will run to completion and schedule a macro task (0ms timeout) to capture the next event handler. The other event handler tasks will now run before that timeout task, since they were scheduled before it (thus capturing them). Each time a new span is created, all subsequent event tasks that are click events will be considered part of that span. To increase confidence, we can compare event targets. Since it's a task queue, I believe this is safe even when it happens very quickly (e.g. user click, listeners run, user clicks, listeners run).
This changeset does not combine them into one extra span. It wraps all interactions (e.g. deduplicates them) into a single span. I have integrated this changeset into an example here and added verbose logging. Please look at this and compare the patch vs. the current version of this instrumentation (uncomment the instrumentations as needed in As you will see, the current version creates six listeners, compared to the patch which creates one. The application only registers 4 listeners. The other two are from a third party script on the page (gtm.js). If you inspect the logs carefully, you can observe the flow through each patch of the zone.js methods (e.g. runTask, scheduleTask, cancelTask). It's been somewhat difficult to follow the flow through these callstacks, I found logs to be helpful when there are many tasks to run. Hopefully this explains the solution a bit more... Furthermore, please look at the test updates. It sets up two click handlers on the same target, and asserts that only one span is created.
In this PR, there is only one span per interaction. The individual handlers do not create individual spans. All downstream spans from each event handler share this "root interaction span". In your example, there is no need for the span redundancy unless very granular span durations are needed on individual spans. The network (fetch or xhr) spans that would be created from those two requests (API call+update page and analytics request) would be properly nested. This means that handlers that only interact with the client (e.g. just a page update) will not contain a new child span, but their duration is still captured in the "user interaction". However, it is possible to add this level of span granularity. @obecny to answer your question, this is a bug fix to address (1). I am not suggesting (2). This instrumentation is called "user-interaction", but the spans being captured are "user-interaction-listeners". The extra spans introduce a lot of noise into span data, making it very difficult or impossible to analyze data. |
@obecny @dyladan I have added new tests for the 50ms window logic, and enhanced its ability to handle concurrency. It is now grouped by the event target. If two clicks occur at the same time on different targets, it will still create two unique spans. The 50ms timer is necessary for the case when a handler is long running. After 50ms, the instrumentation will clear the "current interaction", allowing a subsequent click to start a new span. This doesn't impact the original span, since it's carried on the task's zone. What this means is that if the user is able to double click faster than the application can schedule the click event tasks and its subsequent completion timeout, those two clicks will be treated as one span. Remember, this is only when the user clicks on the same target extremely quickly. Here's an example task queue for the scenario:
Here's an example task queue for why the 50ms timeout is necessary:
|
1b7a535
to
f438f7c
Compare
Tests are updated to force macro tasks when necessary. New test added for multiple listeners on a single target.
Otherwise, StackContextManager is still the context manager in the zone tests for the first one.
In userInteractionl.test.ts
9b41e9b
to
7aebb40
Compare
I have removed the additional concurrency guard. In practice, this end up creating unwanted span duplicates. For our use case, we prefer there to be a single span per interaction and prefer to prevent span duplicates as much as possible. So I'm bringing my updated implementation back here. Any thoughts @dyladan @obecny? I'd prefer to not maintain my own fork of this instrumentation, and without this fix it is unusable for us. Given this issue was reported by another user it seems I'm not alone in thinking that duplicate spans are not desired. I spent a good chunk of time looking into this and would appreciate some feedback. |
sorry for the unacceptably long period of non-response. @obecny recently stepped down as a maintainer and he the primary maintainer of the browser instrumentations so this slipped through the cracks a little on me. Please bear with me a bit since I'm not a react expert and I'm still working to understand the problem and your proposed solution, so let me know if any of this is incorrect: This is a react-specific issue. This is caused by the react event delegation system. React creates an event listener for each event type on the root node. When a developer adds an event to a node like Please let me know if anything in the above paragraph is incorrect, as it is crucial to fully understand the problem before I can evaluate a solution. The proposed solution is to treat all click events that start within a 50ms window as the same click event, so if multiple click event listeners are called during this window, the span is started when the first event listener starts and ended when the last event listener ends. The underlying assumption is that there will never be two legitimate clicks within 50ms of each other, and that if two click event handlers are called, it must be because the first one has delegated the event to the second one. Please let me know if anything in the above paragraph is incorrect. |
Fixing this by grouping interactions within 50ms it doesn't really seems right for me. It might have different consequences and potential bugs once this is done and then we will have gordian knot. I still think this could be resolved a bit different. In principal if react is being used I would think of some way of detecting this and maybe add events / delegate them differently. If that means a bit different behaviour for react I would rather use this than adding extra event. What if user clicks twice in 49ms or 51ms, will this still produce 2 events or 1 and why is it 50ms instead of 60 or 40 or something else what about slower device will 50ms be still enough for example if there is some heavy CPU usage during click. How do we ensure this will work correctly in all cases. I'm sceptic towards such solution as this might be flaky :/ |
last thing have you considered using / creating a different context manager specific for react to handle those things differently ? |
I generally agree here. The underlying assumption that no clicks can happen within 50ms of each other seems like something that works 99.9% of the time but is very hard to debug when it actually breaks something. I wonder if there is some way we can detect that an event handler is the react root node event handler and ignore it?
I don't think the context manager is the correct place to fix this bug. It is not a context loss. |
Not sure if this is what you meant, but I think that having react-specific instrumentation (rather than react-targeted heuristics) is the answer here. If that instrumentation gets substantially more complex/different than this |
@dyladan this is in response to your comment asking for clarification.
This is not a react-specific issue. If you look at my example in the previous comment, you'll see no React in use. It's all vanilla JS and The issue is applicable to all listeners that share the same target. The example shows some possible scenarios:
If a solution using the context manager makes more sense, I'm all for that. However, we mostly care about having usable data right now. Generating this number of spans makes the data from the current state of the user interaction instrumentation uninterpretable. Since they all share the same attributes (e.g. same xpath), there's nothing to correlate in our tracing backend. This is another negative impact of unique-per-listener solution, since it's possible the tracing backend will "mask" a slow performing listener with another one that is more performant since they share the same xpath. While this solution is heuristical, we have already found it immensely valuable since we can more effectively identify slow/erroring listeners. Knowing that something is performing slowly or erroring on a single event target significantly eliminates the number of places we need to look. It's a tradeoff in the end, I guess. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the implementation LGTM and I think I am convinced by the arguments you made for including this change.
window.clearTimeout(spanData.interactionCompleteTimeout); | ||
spanData.interactionCompleteTimeout = undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is run in the root zone in order to exempt it from task count (because _shouldCountTask
will return false
because the root zone should never have a span?). I think there should be a comment to make that clear because otherwise this could be easily broken by someone in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's the right idea. This is ensuring that instrumentation itself isn't tricked into thinking that tasks scheduled/executed/cancelled by the instrumentation within the context of a click listener task should be counted. The task count stored on spanData
is always in reference to the tasks scheduled/executed/cancelled in userland, so we make sure not to muddle that when registering async tasks in the instrumentation.
- Add comment clarifying where/why things need to run in the root zone
api.context.with( | ||
api.trace.setSpan(api.context.active(), span), | ||
() => { | ||
plugin._currentInteractionZone = Zone.current; | ||
} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this required to be run in a context? Does it have something to do with Zone.current
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your question is valid, it doesn't really need to run in context. I believe I did this to create a separate Zone from the first listener task's Zone since the goal here is to share a Zone across multiple tasks that may not have common parent.
At the time it felt semantically off to state the first registered listener's zone should be reused. But in practice, I don't think it matters and potentially makes more sense.
- Remove additional execution context
// This allows new clicks to create spans when the previous interaction is still not complete | ||
const spanData = plugin._spansData.get(span); | ||
if (spanData) { | ||
Zone.root.run(() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Please leave a comment why this needs to be run in the root zone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same reasoning as above, but I'll make sure to mark all locations where the instrumentation registers an async task since by rule these run in the same execution context. Without doing this you're going to get incorrect taskCount state values.
- Add comment clarifying why the root zone needs to be used internally for the instrumentation
Thanks for doing this! I just recently started looking into this package but walked away from it initially b/c I saw that a single click was always causing multiple traces in my app. Will definitely use this once it is released. |
Anything I can do to help here? I'd like to see this merged. |
@stephencranedesign of course, thanks for the appreciation! 😄 I also initially avoided this instrumentation (for over a year!) before finally taking a crack at it. But once I realized its potential with more reasonable data, I decided the time investment to understand zone.js was worth it. We've been using this patch in production at Squarespace since November now. Funnily enough, within six hours of release our OTel instrumentation automatically detected a frontend outage (runtime error) for the first time! Since then, it's helped uncover several unexpected regressions across our UI. Without interaction tracing, you only get network tracing automatically (plus a small amount of non-network coverage via document load). Errors reported in a network trace on the frontend will typically be reported by backend code as well, so interaction tracing can really add immense value. |
@dyladan @stephencranedesign I should be able to get the change in this week. I was thinking it might be useful to add more configuration, in the case that consumers want to tweak the heuristic settings or use the existing behavior today. I also wanted to briefly explore the idea of addressing this within the ContextManager, as I do believe there might be an exact way to do this. I have yet to see how that might work though, and I have a hunch it would be a drastic change. But it's been some time so maybe a fresh look might shed some light. Finally, I would like to automatically record exceptions that occur in the interaction span's context. The configuration I was thinking might be: interface UserInteractionInstrumentationConfig {
// ...existing configuration
// whether or not to attempt grouping similar listeners into a single span
joinInteractionSpans: boolean;
// time window after which similar listeners won't be grouped (only applicable if joinInteractionSpans is true)
interactionWindowMs: number;
} Thoughts? |
that sounds reasonable & helpful. edit: although, I'd vote for getting the current change set in this pr merged as soon as possible so if there's a decent amount of work involved in getting exceptions on the trace maybe hold that off for a separate pr.
Based on the convo in this pr and my experience trying to use this plugin in an app, it seems like this new implementation is a bug fix and not a new feature to opt into. So I'm not sure I'm on board with providing configuration to turn the bug fix off.
That's my two cents :) |
Awesome! really cool to hear success stories like that
I could see it either way I guess. When instrumentation changes behavior it can be confusing for users who may have dashboards and alerts set up and that sort of thing. I think in this case we probably don't need a config but would like input from the other @open-telemetry/javascript-maintainers also |
Any updates here? Are you waiting on feedback from @dyladan or the javascript-maintainers? |
@stephencranedesign yep. I think just feedback on the configuration is all we need. While waiting, I've been thinking about the issue in general of why we need the heuristic in the first place. I think I've found a different approach to the implementation that doesn't patch zone.js, and leverages the public Zone API. It's almost a total rewrite since the approach is so different. I'm still developing it, but I think it would enable accurate task tracking across async event listeners that are bound to the same EventTarget. It works by creating a new ZoneSpec (called UserInteractionZoneSpec) that contains task tracking state internally. When an event listener is invoked, the instrumentation checks if an active instance of this ZoneSpec exists for a given target xpath and event type, then runs the listener in that context. If it doesn't exist, then it forks that UserInteractionZoneSpec and clears the initial state (as it is the first listener to run), then stores this as the "active zone" for that target xpath/event type on the instrumentation singleton. By running the subsequent listeners in the same zone, they will be able to share the task tracking state within the Zone. Thus we are able to detect when all async tasks complete across the listeners, and invoke a final callback While we wait, I'm going to post a second PR demonstrating this solution since it might address all of concerns of this PR. It also simplifies this instrumentation significantly (so far). Here is what the ZoneSpec looks like: Zone['UserInteractionTracingZoneSpec'] = {
name: 'UserInteractionTracingZone',
// setTimeout, microTasks
onScheduleTask: function (delegate, current, target, task) {
if (this.properties.flushed) {
return delegate.scheduleTask(target, task);
}
this.properties.count++;
delegate.scheduleTask(target, task);
},
// clearTimeout, when setTimeout finishes, when promises resolve
onInvokeTask: function (
delegate,
current,
target,
task,
applyThis,
applyArgs
) {
delegate.invokeTask(target, task, applyThis, applyArgs);
if (!this.properties.flushed) {
this.properties.count--;
}
},
onHasTask: function (delegate, current, target, hasTask) {
if (this.properties.count === 0 && !this.properties.flushed) {
this.properties.flushed = true;
target.run(this.onFlush);
}
},
onFlush: function () {},
properties: {
count: 0,
flushed: false,
xpath: null,
span: null,
},
}; Then in the instrumentation where we patch const patchedListener = function(this, ...args) {
const targetXpath = getElementXPath(this);
let activeListenerZone = plugin.activeListenerZones.get(targetXpath);
if (!activeListenerZone) {
const span = tracer.startSpan(...);
activeListenerZone = Zone.current.fork(
...Zone['UserInteractionZoneSpec'],
onFlush() {
span.end();
plugin.activeListenerZones.delete(targetXpath);
},
properties: {
span,
count: 0,
flushed: false,
xpath: targetXpath,
},
);
plugin.activeListenerZones.set(targetXpath, activeListenerZone);
}
}; |
Really looking forward to having this merged. Great work here! |
I have put up a PR with the alternative solution I discussed above (#980). It is still a work in progress but mostly feature complete. It needs additional test cases, verification of accuracy by hand with a timeline trace, and potentially some additional aggregates that would be useful for analyzing interaction traces. But as of right now it mostly replicates the existing behavior. If you're curious to see how it behaves with this solution I've updated the StackBlitz. |
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. |
This PR was closed because it has been stale for 14 days with no activity. |
Which problem is this PR solving?
Fixes #548
Short description of the changes