-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
datadog: restore previous behavior of Span::setSampled
#32264
datadog: restore previous behavior of Span::setSampled
#32264
Conversation
Signed-off-by: David Goffredo <[email protected]>
I've changed this PR to a "draft" until I figure out adequate integration testing. @Smeb, I'm wondering about the interaction, in your setup, between the lua filter extension and trace context propagation. What concerns me is the following possible scenario:
As far as I can see, this scenario would occur using both the new code and the old code. I want to create a minimal Envoy setup that demonstrates the difference in behavior between <1.27 and >=1.27, so that I can verify that this fix has the older behavior. Do you have an Envoy configuration that I could use to test that? I understand that you're working at the Istio layer. Maybe a |
👋 @dgoffredo I've set up a minimal reproduction here. You can see the difference in behaviour by changing the envoy proxy version. The version with working trace sampling gives between 1-2 traces per 20 requests, the newer version gives 20 traces per request. |
That's very helpful, @Smeb. I'll play around with your example, adding a local build of Envoy that includes the changes proposed here, and see if the problem goes away. If it all works, then we can re-review this and I'll propose the corresponding backport PRs. |
Using a modified version of @Smeb's issue reproduction, I've discovered that the changes proposed here do not solve the sampling problem. The reason that this issue did not occur with the older OpenTracing-based Datadog tracing library is that the rule-based sampling decision is made the first time any span's OpenTracing The async HTTP client first creates a child span representing the request operation, then overrides the sampling decision, and finally injects trace context into the request headers. In the old library, as a unintended consequence of calling OpenTracing's Even with my proposed change to the new code, where the "override" is performed only if there is not already a decision, we still don't have a fix. This is because, in the new code, there is no earlier decision. Creating a child span does not produce a "locked" sampling decision in the new library. Either I must complicate this Datadog-specific Envoy code further to simulate the previous behavior, or we must consider one of the other mitigations that I describe in the "The Fix" section of this PR's description. |
I'm going to alter this pull request so that the Still, let's plan to revisit what the contract of |
@dgoffredo Is there any update on the above? We'd love to get this resolved for the long term if possible. |
I have not revisited this for some time, my apologies. I will revisit it sometime next week at the latest. |
It looks like the root issue has been recently fixed. There's probably scenarios where the old behaviour do not match the new one, I'll investigate more next week. |
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
The Problem and Its Fix
@Smeb reported a Datadog tracing related issue with an Istio setup that recently had been upgraded to an Envoy at least as recent as v1.27. These newer Envoy versions contain a rewrite of the Datadog tracing extension.
The symptom was that the sampling decision for the Istio traces was 100% even when configured to be 10%.
In our investigation to find the root cause, we identified a mismatch between the contract of Envoy::Tracing::Span::setSampled(bool) and how it is sometimes used.
The problem can be thought of as a bug that I introduced during my rewrite of the Datadog tracing extension, but it reveals what might be a latent issue in how trace sampling is performed by Envoy, or at least by the lua filter extension.
The Problem
The rewritten Datadog tracing extension performs an unconditional sampling decision override whenever Span::setSampled(bool) is called.
Before the rewrite, the older OpenTracing-based Datadog tracing extension sometimes did not override an existing sampling decision when
Span::setSampled(bool)
was called. The different call stacks are explained in my comment on the parent issue.When the lua filter extension is used, it always specifies a sampling decision for requests that it makes. There are two possibilities:
httpCall
in Lua code, which is bound to this C++ function. There is no way to specify null (nil
), which would indicate "use existing decision."true
(not null!).As a result, any HTTP call made by the lua filter extension always results is a non-null sampling decision being specified in
RequestOptions
, which results inSpan::setSampled(bool)
always being invoked.Span::setSampled(bool)
is documented as overriding any existing sampling decision:So, even if Envoy is configured to sample only 10% of traces, any trace that involves an HTTP call made by the lua filter extension will override the sampling decision to "keep," thus increasing the actual sampling rate.
The old behavior of the Datadog tracing extension, which violated the contract of
Span::setSampled(bool)
, prevented this issue by ignoring the specified override whenever a sampling decision had been made previously, such as when a decision had been extracted from the downstream request or when a decision was made per the tracer's configuration.The Fix
We have a few options for how to correct the offending behavior encountered by @Smeb:
options.request_options_.setSampled(absl::nullopt)
immediately after constructingoptions
, thus changing the effective default value ofRequestOptions::sampled_
to null. This could change the trace sampling behavior for all users of the lua filter extension.true
. This could change the trace sampling behavior for all Envoy users (relevant usage).setSampled
is used (for example, in the healthcheck filter).Span::setSampled(bool)
, and defer what else to do, if anything, for later.This pull request implements option (4). I modify the implementation of Datadog's
Span::setSampled(bool)
so that it honors the specified sampling decision only if a decision has not already been made. This makessetSampled
behave as it did before the Datadog tracing extension rewrite introduced with release v1.27.Testing
See the included changes to
Span
's unit test.Note: I still need to verify that these modifications fix @Smeb's precise use case. It could be that the sampling decision made by the lua filter extension is the first decision encountered, in which case I don't think this fix would be effective. If that turns out to be true, then deeper revision of the Datadog tracing extension will be necessary (or, we could go with one of the other three fixes proposed above).
Other Info
Please label this pull request for backport review. It fixes a bug that affects the
v1.27
,v1.28
, andv1.29
release branches. I'm happy to propose those backports once this pull request is reviewed.