Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datadog: restore previous behavior of Span::setSampled #32264

Conversation

dgoffredo
Copy link
Contributor

@dgoffredo dgoffredo commented Feb 8, 2024

The Problem and Its Fix

@Smeb reported a Datadog tracing related issue with an Istio setup that recently had been upgraded to an Envoy at least as recent as v1.27. These newer Envoy versions contain a rewrite of the Datadog tracing extension.

The symptom was that the sampling decision for the Istio traces was 100% even when configured to be 10%.

In our investigation to find the root cause, we identified a mismatch between the contract of Envoy::Tracing::Span::setSampled(bool) and how it is sometimes used.

The problem can be thought of as a bug that I introduced during my rewrite of the Datadog tracing extension, but it reveals what might be a latent issue in how trace sampling is performed by Envoy, or at least by the lua filter extension.

The Problem

The rewritten Datadog tracing extension performs an unconditional sampling decision override whenever Span::setSampled(bool) is called.

Before the rewrite, the older OpenTracing-based Datadog tracing extension sometimes did not override an existing sampling decision when Span::setSampled(bool) was called. The different call stacks are explained in my comment on the parent issue.

When the lua filter extension is used, it always specifies a sampling decision for requests that it makes. There are two possibilities:

  1. The user specifies a boolean "trace_sampled" property of the table-valued argument to httpCall in Lua code, which is bound to this C++ function. There is no way to specify null (nil), which would indicate "use existing decision."
  2. The user does not specify the "trace_sampled" property. A default-constructed HttpCallOptions is used for the underlying HTTP call. HttpCallOptions contains a RequestOptions, which contains an absl::optional<bool> sampled_ that defaults to true (not null!).

As a result, any HTTP call made by the lua filter extension always results is a non-null sampling decision being specified in RequestOptions, which results in Span::setSampled(bool) always being invoked. Span::setSampled(bool) is documented as overriding any existing sampling decision:

This method overrides any previous sampling decision associated with the trace instance.
If the sampled parameter is false, this span and any subsequent child spans
are not reported to the tracing system.

So, even if Envoy is configured to sample only 10% of traces, any trace that involves an HTTP call made by the lua filter extension will override the sampling decision to "keep," thus increasing the actual sampling rate.

The old behavior of the Datadog tracing extension, which violated the contract of Span::setSampled(bool), prevented this issue by ignoring the specified override whenever a sampling decision had been made previously, such as when a decision had been extracted from the downstream request or when a decision was made per the tracer's configuration.

The Fix

We have a few options for how to correct the offending behavior encountered by @Smeb:

  1. We could modify the lua filter extension to call options.request_options_.setSampled(absl::nullopt) immediately after constructing options, thus changing the effective default value of RequestOptions::sampled_ to null. This could change the trace sampling behavior for all users of the lua filter extension.
  2. We could modify RequestOptions::sampled_ to have a default value of null instead of true. This could change the trace sampling behavior for all Envoy users (relevant usage).
  3. We could modify the contract of Span::setSampled(bool) to permit tracer implementations to ignore the override, and restore the old behavior in the Datadog tracing extension. This would require that we revisit how setSampled is used (for example, in the healthcheck filter).
  4. We could restore the old behavior of the Datadog tracing extension, thus putting it out of spec with respect to Span::setSampled(bool), and defer what else to do, if anything, for later.

This pull request implements option (4). I modify the implementation of Datadog's Span::setSampled(bool) so that it honors the specified sampling decision only if a decision has not already been made. This makes setSampled behave as it did before the Datadog tracing extension rewrite introduced with release v1.27.

Testing

See the included changes to Span's unit test.

Note: I still need to verify that these modifications fix @Smeb's precise use case. It could be that the sampling decision made by the lua filter extension is the first decision encountered, in which case I don't think this fix would be effective. If that turns out to be true, then deeper revision of the Datadog tracing extension will be necessary (or, we could go with one of the other three fixes proposed above).

Other Info

Please label this pull request for backport review. It fixes a bug that affects the v1.27, v1.28, and v1.29 release branches. I'm happy to propose those backports once this pull request is reviewed.

@dgoffredo dgoffredo marked this pull request as draft February 8, 2024 01:41
@dgoffredo
Copy link
Contributor Author

I've changed this PR to a "draft" until I figure out adequate integration testing.

@Smeb, I'm wondering about the interaction, in your setup, between the lua filter extension and trace context propagation.

What concerns me is the following possible scenario:

  • You configure Envoy to keep 100% of traces.
  • You configure the Datadog tracer to keep 10% of traces.
  • A request comes into Envoy, and it has no trace context. So, Envoy creates a new trace using the Datadog tracer.
  • The sampling would be calculated at 10%, but that doesn't happen until either we send an outgoing request or finish the spans.
  • The lua filter makes an HTTP request, but specifies sampled_ = true and so the sampling decision is "keep," i.e. 100%.

As far as I can see, this scenario would occur using both the new code and the old code. I want to create a minimal Envoy setup that demonstrates the difference in behavior between <1.27 and >=1.27, so that I can verify that this fix has the older behavior.

Do you have an Envoy configuration that I could use to test that? I understand that you're working at the Istio layer. Maybe a /config_dump of the underlying proxy would help.

@Smeb
Copy link

Smeb commented Feb 14, 2024

👋 @dgoffredo I've set up a minimal reproduction here. You can see the difference in behaviour by changing the envoy proxy version. The version with working trace sampling gives between 1-2 traces per 20 requests, the newer version gives 20 traces per request.

@dgoffredo
Copy link
Contributor Author

That's very helpful, @Smeb. I'll play around with your example, adding a local build of Envoy that includes the changes proposed here, and see if the problem goes away.

If it all works, then we can re-review this and I'll propose the corresponding backport PRs.

@dgoffredo
Copy link
Contributor Author

dgoffredo commented Feb 16, 2024

Using a modified version of @Smeb's issue reproduction, I've discovered that the changes proposed here do not solve the sampling problem.

The reason that this issue did not occur with the older OpenTracing-based Datadog tracing library is that the rule-based sampling decision is made the first time any span's OpenTracing Context() is accessed. This happens not only when trace context is injected into an outgoing request, but also when a child span is created.

The async HTTP client first creates a child span representing the request operation, then overrides the sampling decision, and finally injects trace context into the request headers.

In the old library, as a unintended consequence of calling OpenTracing's Span::Context() when creating the child span, a sampling decision is made using the configured sampling rules (e.g. 10%), and then the decision is locked. The subsequent "override" and injection do not change this locked sampling decision.

Even with my proposed change to the new code, where the "override" is performed only if there is not already a decision, we still don't have a fix. This is because, in the new code, there is no earlier decision. Creating a child span does not produce a "locked" sampling decision in the new library.

Either I must complicate this Datadog-specific Envoy code further to simulate the previous behavior, or we must consider one of the other mitigations that I describe in the "The Fix" section of this PR's description.

@dgoffredo
Copy link
Contributor Author

I'm going to alter this pull request so that the setSampled behavior is consistent with the old Datadog tracing extension.

Still, let's plan to revisit what the contract of setSampled should be, and how it interacts with Envoy's async HTTP client, especially as used by the lua filter module.

@Smeb
Copy link

Smeb commented Mar 7, 2024

@dgoffredo Is there any update on the above? We'd love to get this resolved for the long term if possible.

@dgoffredo
Copy link
Contributor Author

I have not revisited this for some time, my apologies. I will revisit it sometime next week at the latest.

@dmehala
Copy link
Contributor

dmehala commented Apr 3, 2024

It looks like the root issue has been recently fixed. There's probably scenarios where the old behaviour do not match the new one, I'll investigate more next week.

Copy link

github-actions bot commented May 3, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 3, 2024
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants