deejgregor (Contributor) commented Oct 28, 2025

What Does This Do

This does two main things:

  1. Adds long running traces to the flare report.
  2. Allows flare dumps and individual files from flares to be downloaded via JMX.

Examples:

$ echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" |  \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
         -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
[{"service":"pending-traces-test","name":"step-3","resource":"step-3","trace_id":1110088093037488208,"span_id":3740396906142869284,"parent_id":6982939151275616389,"start":1761670337688000209,"duration":0,"error":0,"metrics":{"step.number":3,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-2","resource":"step-2","trace_id":1110088093037488208,"span_id":6468860803773086654,"parent_id":6982939151275616389,"start":1761670337582715042,"duration":0,"error":0,"metrics":{"step.number":2,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-1","resource":"step-1","trace_id":1110088093037488208,"span_id":1210573307183346962,"parent_id":6982939151275616389,"start":1761670337477268167,"duration":0,"error":0,"metrics":{"step.number":1,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}}]
$ echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
Archive:  /tmp/flare.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      71  Defl:N       46  35% 10-28-2025 09:54 8963e853  flare_info.txt
      26  Defl:N       26   0% 10-28-2025 09:54 39f97d4e  tracer_version.txt
    9229  Defl:N     3316  64% 10-28-2025 09:54 f4c7920b  initial_config.txt
     487  Defl:N      231  53% 10-28-2025 09:54 f0284361  jvm_args.txt
      75  Defl:N       66  12% 10-28-2025 09:54 886a98a0  classpath.txt
     144  Defl:N       73  49% 10-28-2025 09:54 433c143d  library_path.txt
     307  Defl:N      170  45% 10-28-2025 09:54 773992bb  dynamic_config.txt
    1196  Defl:N      374  69% 10-28-2025 09:54 7396b38c  tracer_health.txt
      47  Defl:N       42  11% 10-28-2025 09:54 700f06af  span_metrics.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  pending_traces.txt
    2448  Defl:N      500  80% 10-28-2025 09:54 8b69071d  instrumenter_state.txt
      71  Defl:N       70   1% 10-28-2025 09:54 c84166ad  instrumenter_metrics.txt
     923  Defl:N      272  71% 10-28-2025 09:54 1f7f39aa  long_running_traces.txt
     213  Defl:N      130  39% 10-28-2025 09:54 eed91e78  dynamic_instrumentation.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  tracer.log
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  jmxfetch.txt
--------          -------  ---                            -------
   15237             5322  65%                            16 files

Motivation

While adding custom instrumentation to a complex, asynchronous application, we found it challenging to validate during tests whether end() had been called on every span. dd.trace.debug=true together with dd.trace.experimental.long-running.enabled=true could be used with some post-processing of the debug logs, but that didn't work for us because the application breaks at that level of logging. When dd.trace.experimental.long-running.enabled=true is used, long running traces are sent to Datadog's backend, but they are not searchable until they finish, so we had no good way to find them. This change gives us two ways to access the list of long running traces: via a flare report or via JMX.

I initially started by adding JMX MBeans to retrieve just the pending and long running traces and their counters. Once I had added the long running traces to the flare report, for parity with pending traces, I realized that a more generic mechanism for retrieving flare details over JMX might be useful. After adding a TracerFlare MBean, that seemed like a far more valuable route, so I removed the code I had added for the pending/long running trace MBeans.

Additional Notes

This PR has a number of commits, and I suggest reviewing it commit by commit, paying special attention to the notes in bold below:

Contributor Checklist

Jira ticket: [PROJ-IDENT]

Synchronized accesses to traceArray in LongRunningTracesTracker,
since the flare reporter can now access the array. Blocking shouldn't
be a concern: addTrace and flushAndCompact are the existing calls
from PendingTraceBuffer's run() loop, and getTracesAsJson, called by
the reporter thread, completes fairly quickly.
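
A minimal sketch of that locking pattern (illustrative only, not the
actual LongRunningTracesTracker code; the trace type and the method
bodies are placeholders):

    import java.util.ArrayList;
    import java.util.List;

    class TrackerSketch {
      // Guarded by the instance lock; written by PendingTraceBuffer's
      // thread, read by the flare reporter thread.
      private final List<String> traceArray = new ArrayList<>();

      // Called from PendingTraceBuffer's run() loop.
      synchronized void addTrace(String trace) {
        traceArray.add(trace);
      }

      // Also called from the run() loop; prunes entries that are done.
      synchronized void flushAndCompact() {
        traceArray.removeIf(String::isEmpty);
      }

      // Called from the flare reporter thread; holds the lock only long
      // enough to serialize a snapshot, so the write path blocks briefly.
      synchronized String getTracesAsJson() {
        return "[" + String.join(",", traceArray) + "]";
      }
    }
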
…ature

This allows dumping long running traces via the new JMX flare
feature when the tracer is not connected to a Datadog Agent. A
warning message is logged in this case to indicate that long
running traces will not be sent upstream but are available in a
flare.

Previously the long running traces buffer would always be empty,
even though the feature was enabled with
dd.trace.experimental.long-running.enabled=true. This led to a
good amount of confusion when I was initially developing a feature
to dump long running traces without a local Datadog Agent running.
The JMX telemetry feature is controlled by dd.telemetry.jmx.enabled
and is disabled by default. It enables JMXFetch telemetry (if
JMXFetch is enabled, which it is by default) and also enables a
new tracer flare MBean at datadog.flare:type=TracerFlare. This new
MBean exposes three operations:

java.lang.String listFlareFiles()
- Returns a list of sources and files available from each source.

java.lang.String getFlareFile(java.lang.String p1,java.lang.String p2)
- Returns a single file from a specific reporter (or flare source).
- If the file ends in ".txt", it is returned as-is, otherwise it is
  base64 encoded.

java.lang.String generateFullFlareZip()
- Returns a full flare dump, base64 encoded.
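
For reference, the corresponding MBean interface would look roughly
like this (a sketch reconstructed from the operation list above; the
interface name and parameter names are assumptions):

    public interface TracerFlareMBean {
      // Sources and the files available from each source.
      String listFlareFiles();

      // One file from a specific reporter (flare source); returned
      // as-is for .txt files, base64 encoded otherwise.
      String getFlareFile(String source, String fileName);

      // The full flare zip, base64 encoded.
      String generateFullFlareZip();
    }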

An easy way to enable this for testing is to add these arguments:
    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

To test, you can use jmxterm (https://github.com/jiaqi/jmxterm) like
this:

echo "run -b datadog.flare:type=TracerFlare listFlareFiles" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent

echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    jq .

echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
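
If jmxterm isn't handy, the same operation can be invoked with the
standard javax.management client API. A minimal sketch, assuming the
JMX remote settings shown above (127.0.0.1:9010, no auth/SSL):

    import java.io.FileOutputStream;
    import java.util.Base64;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class FlareClient {
      public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
          MBeanServerConnection conn = connector.getMBeanServerConnection();
          ObjectName name = new ObjectName("datadog.flare:type=TracerFlare");
          String encoded = (String) conn.invoke(
              name, "generateFullFlareZip", new Object[0], new String[0]);
          // The MIME decoder tolerates line breaks in the encoded payload.
          try (FileOutputStream out = new FileOutputStream("/tmp/flare.zip")) {
            out.write(Base64.getMimeDecoder().decode(encoded));
          }
        }
      }
    }
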
…ng priority

This likely isn't an important metric to track, but I noticed these
traces were the only ones not reflected in existing LongRunningTraces
metrics, so I thought it might be good to add for completeness.
deejgregor requested a review from a team as a code owner on October 28, 2025 17:00

aw-dd commented Oct 29, 2025

Jira card for context: APMS-17557

