logging, tracing and OpenTelemetry design
#1949
Replies: 2 comments
Runtime telemetry control: notes & remarks

Separate log level from span level

The current pain may be that one filter is trying to serve two jobs. Local logs should be quiet enough for a person to read. Exported spans should keep enough detail to debug work that crosses RPC, store, mempool, prover, validator, and other paths. The node should split those controls:

```rust
let log_level = config.log_string.unwrap_or_else(|| "info".into());
let env_filter =
    EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new(log_level));
let (log_filter, reload_handle) = reload::Layer::new(env_filter);

let span_level = config.span_level.unwrap_or(Level::INFO);
let span_filter = filter::filter_fn(move |metadata| {
    metadata.is_span() && *metadata.level() <= span_level
});
```

That points to separate knobs:
- Expose a `FilterHandle` backed by `tracing_subscriber::reload`, with `update()` and `get()`.

Runtime filter reload should be small. Keep a handle, validate the new filter, install it, and return the active value.

```rust
#[derive(Clone, Debug)]
pub struct FilterHandle(reload::Handle<EnvFilter, Registry>);

impl FilterHandle {
    pub fn update<S: AsRef<str>>(&self, directives: S) -> Result<(), BoxError> {
        let filter = EnvFilter::try_new(directives)?;
        self.0.reload(filter)?;
        Ok(())
    }

    pub fn get(&self) -> Result<String, BoxError> {
        self.0
            .with_current(|filter| filter.to_string())
            .map_err(Into::into)
    }
}
```

The admin API can stay blunt: the endpoint should reject invalid filters before changing state.

The operator workflow should be:

Do we have metrics?

Span-derived latency can be useful, but only with low-cardinality labels. A subscriber layer can also record the span duration in a metrics DB when the span closes. We may want to consider a Prometheus instance for proper metrics in general; they're not a substitute for spans, but a useful complement.

Follow-up questions

Tracing design in the abstract is hard to slice through for me. I think the main question is what an operator should learn or change without restarting the node. If we want to approach this bottom-up (which I prefer), the cases to test are concrete:
The gap may not be "better levels", but rather missing debug modes. Possible presets:

These presets would show whether the target tree, span names, and fields are good enough.

Two questions should stay separate: if a span is not created, no collector can recover it. If it is created but not exported, volume can still be controlled later.

Questions that I'm asking myself — but it may just be that I'm uneducated about our setup:
A fast way to expose the real gap is to solve three incidents on paper:

For each one, I'd think about the filter change, expected spans, and fields needed. That might show the source of the issue.
In the past, I accomplished this by having all services emit everything (at least everything that I wanted to be dynamically configurable) based on the presence of a specific request header, and then sampling of traces was configured at the service mesh level (this was a k8s cluster, which I had set up with istio + embassy for the mesh). I was using Grafana's Tempo for traces (since that integrated nicely with the other Grafana products in a way that Jaeger itself did not, though Tempo was Jaeger-compatible).

Each service in the mesh had a sidecar that handled mesh communication, and would record trace spans for inbound/outbound requests. To determine whether or not to trace, a request that entered the mesh would have a request header (containing the trace id) conditionally set based on the sampling configuration. When that request header was present, everything related to that request would be traced, so long as the request header was propagated by a service when it made requests to other services in the mesh. The sidecar containers would record basic trace spans for the mesh-level parts, but it was up to each individual service to record additional spans for its internal behavior, if supported.

The nice thing about this is that things were very simple for individual services, and all the dynamic configuration was handled by the mesh (i.e. what requests would be sampled, based on what criteria). It also ensured that for a traced request, we had everything for that request, not just parts of it. The configuration I used was a combination of "trace N% of requests" and "trace request R if R meets some criteria (i.e. contains some request header, comes from a specific user, etc.)" - and these could be controlled on the fly. We'd frequently use this in conjunction with some client side functionality we had to enable tracing of everything a specific user was doing, either app-wide or in specific areas.

Storage is always the tricky element here.
If you don't sample enough requests, then when something goes wrong, the odds are that you won't have a trace for that request. On the other hand, if you sample too much, then the system itself can degrade due to tracing overhead, and you need a lot of storage for all that data. So this aspect really requires experimentation and regular maintenance to observe whether or not tweaks are needed.

Now, in my case, all of this was being surfaced in Grafana, which has rich features for querying traces, so the issue of navigating all the data was basically solved out of the box. We also weren't emitting traces to stdout/stderr as logs, so that just never affected us when running things locally (while we could run things in a local configuration with tracing enabled, our default development configuration did not do this; tracing was just disabled by default).

I don't know if that is useful or not. I mean, our infra is obviously not what I outlined above, but one doesn't need a k8s cluster + service mesh to accomplish more or less the same goals, so long as you have some kind of service proxy/smart load balancer (e.g. embassy) to handle the sampling configuration bit when requests enter the system. Ultimately though, the approach you take really needs to be tailored around how you plan to surface traces for observation (i.e. in my case I knew I was using Grafana, so I could make choices based on its strengths/weaknesses).

On the topic of log

The compiler has a lot of instrumentation output, so using
So basically we have two types of hierarchy:
And we want to be able to filter on both simultaneously, with additional refinements to declutter the output of a specific run. With

To impose this structure a bit more rigidly, we also define a
I'd like some input on managing our tracing (and therefore telemetry) levels and targets. It's a bit murky to me, and while I've got some ideas, I don't know if there is something I'm missing.
I'll give an outline of what I want and then some ideas. Maybe what I want is dumb, either way -- input please :)
Desires
Goal 1: Configurable production trace volumes and levels
I'd like to have an admin endpoint where we can configure the tracing level dynamically. This is possible, we just need to create it. A problem is that we currently have `target=COMPONENT`, which means we don't have much fine-grained control here.

This is coupled with distributed OpenTelemetry traces. We essentially have three "knobs" which we can turn:

1. `tracing` (aka `RUST_LOG`).

We currently do (1) only, and that's set at startup.
Goal 2: Useful local node logs
Our current info logs are what we consider necessary for production, and they're traces. This is waaaaay too much information for local node users who don't need 99% of this.
Ideally we have a better default setting here for normal users. This can be done somewhat trivially if we have better info/debug/trace level standards. Currently we set info for everything we care about in production. Better might be if info is reserved for main operations only, and then for production we set the levels to include debug where we want.
This isn't a big problem because we can simply set the docker compose up with a nicer default `RUST_LOG` (or whatever configuration we choose) for the local node.

What is an issue is if users want more logs in a specific area for their domain, e.g. trace network transaction logs.
Proposed solutions
Tracing targets and levels
Add an admin endpoint which allows setting the trace level string. Even nicer would be a UI populated with all known targets, somehow gathered at compile time... that's an advanced thing though, and not required now.
We also want more fine-grained hierarchical target names, maybe similar to the span names instead. Hierarchical so we can specify broadly, but also be more specific as needed.
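For illustration, hierarchical targets would let a directive string go from broad to narrow without listing every module; the target names below are made up:

```shell
# Quiet default, one subtree at debug, one leaf at trace. EnvFilter matches
# target prefixes, so "node::mempool" also covers its submodules unless a
# more specific directive overrides it.
export RUST_LOG="info,node::mempool=debug,node::mempool::batch=trace"
```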
Something to also be aware of is that one can target specific span names with `target[name]=<level>`, but that only works with exact matches, which is not ergonomic.

We should also come up with a better standard, e.g. the root level span is `info`, children are `debug`, and events / loops are `trace`. For example, adding a transaction would be `info` at the RPC level, and possibly at the `mempool` and `validator` ingress. `debug` would be the internal methods called, and `trace` events might be things that occur within the `mempool`, e.g. `account[N].state A -> B`, `transaction %ID selected for batch`, etc.

Telemetry volume control
In-process sampling is probably not desirable, at least insofar as performance allows. This is because of the distributed nature of the system: we want to collect a single "trace" across multiple services, and they cannot communicate which trace to sample collectively, since useful sampling is only possible after the trace is completed (tail sampling). We can of course use head sampling (predetermine which trace to sample before completion), but this isn't terribly useful if we want to sample all traces which contain errors. I also think head sampling is fairly similar to what `tracing` level control already gives us.

Out-of-process sampling can be done by setting up a separate instance which receives all traces and determines which to keep, before passing these onward. This is fairly simple to set up with basic YAML, e.g. keep all errors, 1% of RPC calls, etc.
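For instance, an OpenTelemetry Collector placed in front of the backend can do the tail sampling; a sketch of its `tail_sampling` processor config for the two policies just mentioned (the names and values are illustrative):

```yaml
processors:
  tail_sampling:
    # Wait for the whole distributed trace before deciding.
    decision_wait: 10s
    policies:
      # Keep every trace that contains an error anywhere.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Otherwise keep a 1% sample of everything else (e.g. RPC calls).
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```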
Telemetry standards
This has been mentioned in issues and PRs already. But we need a consistent property naming structure, i.e. `transaction.id` and not `tx_id`, `transaction_id`, `tx.id` (all of which are currently present).

It's an uphill review battle making these consistent. Part of the problem is that `tracing` is overly flexible and we want "restrictions". I think wrapping `tracing`/`otel` in a separate crate and exposing our own restricted macros is the solution, combined with some nice traits. This is mostly orthogonal to the other things though, just mentioning for completeness.

As an example:
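A dependency-free sketch of the idea (module, macro, and field names are invented; a real wrapper would forward to the `tracing` macros with the canonical dotted field name baked in, so call sites cannot drift):

```rust
/// Canonical property names, defined exactly once (names illustrative).
pub mod fields {
    pub const TRANSACTION_ID: &str = "transaction.id";
    pub const ACCOUNT_ID: &str = "account.id";
}

/// Restricted event macro: callers supply a value, never a key, so the
/// `tx_id` / `transaction_id` variants cannot reappear in review. Here it
/// just renders a string; the real one would call `tracing::info!`.
#[macro_export]
macro_rules! tx_event {
    ($id:expr, $msg:expr) => {
        format!("{}={} {}", $crate::fields::TRANSACTION_ID, $id, $msg)
    };
}
```

The win is that renaming a property is a one-line change in `fields`, and review only needs to police that nobody bypasses the wrapper.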
Open-ish questions