Library crates should propagate errors instead of silently logging them #3210

@svencowart

Description

Problem

The OpenTelemetry Rust SDK currently logs errors on behalf of applications, which is inappropriate for library crates. While ADR-001 provides error handling guidance, it includes a problematic allowance:

"Failures during regular operation should not panic, instead returning errors to the caller where appropriate, or logging an error if not appropriate."

This guidance needs to be updated. For library crates, it is never appropriate to log errors. Per CONTRIBUTING.md, the SDK should either return errors to callers or delegate to a global error handler registered by the application. However, many code paths log directly instead, leaving applications unable to respond to failures.

Example from span_processor.rs:

fn on_end(&self, span: SpanData) {
    let result = self.exporter.lock().map(|mut exporter| {
        exporter.export(vec![span])
    });

    if let Err(err) = result {
        otel_error!(
            name: "BatchSpanProcessor.Export.Error",
            error = format!("{:?}", err)
        );
    }
}

Why this is problematic:

Library crates logging on behalf of applications creates several problems:

  1. No error visibility: Applications cannot detect, count, or respond to failures
  2. No integration: Errors cannot be integrated with the application's monitoring, alerting, or metrics systems
  3. Inconsistent formatting: Library logs don't match the application's logging format, style, or context (request IDs, etc.), causing confusion for operators and breaking log ingestion pipelines
  4. Policy violations: The library makes policy decisions (what to log, when, how) that belong to the application

The standard library and widely used library crates (std, tokio, serde, etc.) return errors and let applications decide how to handle them. OpenTelemetry Rust should follow the same pattern.

Proposed Solution

For synchronous operations with a direct caller:

  • Return OTelSdkResult or appropriate error types defined in opentelemetry_sdk::error
  • Let callers decide whether to log, retry, or propagate errors
  • Aligns with existing SpanExporter, LogExporter, and PushMetricExporter traits which already return OTelSdkResult
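
As a rough illustration of the synchronous case, the sketch below shows an on_end that hands both the lock failure and the export failure back to the caller instead of logging them. Every name in it (SdkError, SdkResult, SpanData, Exporter, SimpleProcessor, FailingExporter) is a hypothetical stand-in written for this example, not the actual opentelemetry_sdk types or trait signatures.

```rust
use std::fmt;
use std::sync::Mutex;

// Hypothetical stand-ins for the SDK's error types; the real definitions
// live in the SDK crate and may look different.
#[derive(Debug, Clone, PartialEq)]
pub enum SdkError {
    InternalFailure(String),
}

impl fmt::Display for SdkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SdkError::InternalFailure(msg) => write!(f, "internal failure: {msg}"),
        }
    }
}

impl std::error::Error for SdkError {}

pub type SdkResult = Result<(), SdkError>;

// Simplified span payload, for illustration only.
pub struct SpanData {
    pub name: String,
}

// Simplified synchronous exporter trait, for illustration only.
pub trait Exporter: Send {
    fn export(&mut self, batch: Vec<SpanData>) -> SdkResult;
}

pub struct SimpleProcessor<E: Exporter> {
    exporter: Mutex<E>,
}

impl<E: Exporter> SimpleProcessor<E> {
    pub fn new(exporter: E) -> Self {
        Self { exporter: Mutex::new(exporter) }
    }

    // Both failure modes reach the caller: a poisoned mutex is mapped to
    // an SdkError, and the exporter's own result is returned unchanged.
    pub fn on_end(&self, span: SpanData) -> SdkResult {
        let mut exporter = self
            .exporter
            .lock()
            .map_err(|_| SdkError::InternalFailure("exporter mutex poisoned".to_string()))?;
        exporter.export(vec![span])
    }
}

// An exporter that always fails, used to show the error reaching the caller.
pub struct FailingExporter;

impl Exporter for FailingExporter {
    fn export(&mut self, _batch: Vec<SpanData>) -> SdkResult {
        Err(SdkError::InternalFailure("collector unreachable".to_string()))
    }
}
```

With this shape, a caller of SimpleProcessor::new(FailingExporter).on_end(..) receives the Err value and can log it in its own format, count it, or retry.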

For background/asynchronous operations without a direct caller:

  • Implement an error callback mechanism via with_error_handler() on processor builders
  • The callback is invoked when background tasks fail
  • Users can then log, emit metrics, trigger alerts, or implement custom strategies
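
One possible shape for the callback mechanism is sketched below, using only the standard library. The with_error_handler() name comes from this proposal; everything else (BatchProcessorBuilder, BatchProcessor, ExportError, the closure-based exporter, batches as Vec<String>) is an assumption made for the example and does not reflect the SDK's existing builders.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

// Hypothetical error type for a failed background export.
#[derive(Debug, Clone, PartialEq)]
pub struct ExportError(pub String);

// The application-supplied callback; shared with the worker thread.
pub type ErrorHandler = Arc<dyn Fn(&ExportError) + Send + Sync + 'static>;

pub struct BatchProcessorBuilder {
    error_handler: Option<ErrorHandler>,
}

impl BatchProcessorBuilder {
    pub fn new() -> Self {
        Self { error_handler: None }
    }

    // Registers the callback invoked when a background export fails.
    pub fn with_error_handler<F>(mut self, handler: F) -> Self
    where
        F: Fn(&ExportError) + Send + Sync + 'static,
    {
        self.error_handler = Some(Arc::new(handler));
        self
    }

    // `export` stands in for an exporter; batches are plain strings here.
    pub fn build<E>(self, mut export: E) -> BatchProcessor
    where
        E: FnMut(Vec<String>) -> Result<(), ExportError> + Send + 'static,
    {
        let (tx, rx) = mpsc::channel::<Vec<String>>();
        let handler = self.error_handler;
        let worker = thread::spawn(move || {
            // The loop ends once every sender has been dropped.
            for batch in rx {
                if let Err(err) = export(batch) {
                    if let Some(handler) = &handler {
                        handler(&err);
                    }
                }
            }
        });
        BatchProcessor { tx: Some(tx), worker: Some(worker) }
    }
}

pub struct BatchProcessor {
    tx: Option<Sender<Vec<String>>>,
    worker: Option<JoinHandle<()>>,
}

impl BatchProcessor {
    // Hands a batch to the worker; fails if the worker is no longer running.
    pub fn enqueue(&self, batch: Vec<String>) -> Result<(), ExportError> {
        match &self.tx {
            Some(tx) => tx
                .send(batch)
                .map_err(|_| ExportError("background worker stopped".to_string())),
            None => Err(ExportError("processor shut down".to_string())),
        }
    }

    // Closes the queue and waits for the worker to drain it.
    pub fn shutdown(mut self) {
        drop(self.tx.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

In this sketch the handler runs on the worker thread, so a slow handler delays later exports. Whether handlers should be invoked inline or on a separate channel is a design choice the proposal leaves open.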

Affected Areas

Traces (High Priority)

  • opentelemetry-sdk/src/trace/span_processor.rs - Remove error logging in batch/simple processors
  • opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs - Add error callback for background exports
  • opentelemetry-sdk/src/trace/provider.rs - Remove redundant error logging in shutdown

Metrics (High Priority)

  • opentelemetry/src/metrics/instruments.rs - InstrumentProvider trait methods should return Result
  • opentelemetry-sdk/src/metrics/meter.rs - Return errors instead of logging and creating no-op instruments
  • opentelemetry-sdk/src/metrics/meter_provider.rs - Propagate shutdown errors per ADR-001 patterns
  • Periodic reader implementations - Expose background export errors via error callback
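
To make the meter change concrete, here is a minimal sketch of instrument creation that returns an error rather than logging and handing back a no-op instrument. Meter, Counter, and InstrumentError are hypothetical types for this example, and the name checks are deliberately simplified and are not the SDK's actual validation rules.

```rust
use std::fmt;

// Hypothetical error returned when an instrument cannot be created.
#[derive(Debug, Clone, PartialEq)]
pub enum InstrumentError {
    InvalidName { name: String, reason: &'static str },
}

impl fmt::Display for InstrumentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InstrumentError::InvalidName { name, reason } => {
                write!(f, "invalid instrument name {name:?}: {reason}")
            }
        }
    }
}

impl std::error::Error for InstrumentError {}

// Simplified counter, for illustration only.
#[derive(Debug)]
pub struct Counter {
    name: String,
    total: u64,
}

impl Counter {
    pub fn add(&mut self, value: u64) {
        self.total += value;
    }

    pub fn total(&self) -> u64 {
        self.total
    }

    pub fn name(&self) -> &str {
        &self.name
    }
}

// Simplified meter, for illustration only.
pub struct Meter;

impl Meter {
    // Returns Err on a bad name instead of logging and returning a no-op.
    // The checks below are a simplified illustration.
    pub fn create_counter(&self, name: &str) -> Result<Counter, InstrumentError> {
        let invalid = |reason: &'static str| InstrumentError::InvalidName {
            name: name.to_string(),
            reason,
        };
        if name.is_empty() {
            return Err(invalid("name is empty"));
        }
        if name.len() > 255 {
            return Err(invalid("name is longer than 255 bytes"));
        }
        if !name.starts_with(|c: char| c.is_ascii_alphabetic()) {
            return Err(invalid("name must start with an ASCII letter"));
        }
        Ok(Counter { name: name.to_string(), total: 0 })
    }
}
```

An application that prefers the old behavior can still fall back to its own no-op on Err, but that becomes its decision rather than the library's.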

Logs (High Priority)

  • opentelemetry-sdk/src/logs/log_processor.rs - Make LogProcessor::emit() fallible
  • opentelemetry-sdk/src/logs/simple_log_processor.rs - Return errors from emit operations
  • opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs - Add error callback for background processing

Other

  • opentelemetry-zipkin/src/exporter/env.rs - Replace eprintln! with proper error returns
  • Update examples to demonstrate proper error handling
  • Update tests to verify error propagation
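
For the environment-variable case, the following sketch shows one way a parse failure could be returned to the caller instead of being printed with eprintln!. The function names, the EnvConfigError type, the millisecond unit, and the 10-second default are assumptions for this example and are not taken from the zipkin exporter's code.

```rust
use std::env::{self, VarError};
use std::num::ParseIntError;
use std::time::Duration;

// Hypothetical error type for invalid exporter configuration.
#[derive(Debug, Clone, PartialEq)]
pub enum EnvConfigError {
    InvalidTimeout { value: String, source: ParseIntError },
    NotUnicode,
}

// Assumed default, used only when the variable is unset.
pub const DEFAULT_TIMEOUT: Duration = Duration::from_millis(10_000);

// Parses a timeout in milliseconds. An unset value yields the default;
// a malformed value is an error for the caller to handle.
pub fn parse_timeout(raw: Option<&str>) -> Result<Duration, EnvConfigError> {
    match raw {
        None => Ok(DEFAULT_TIMEOUT),
        Some(value) => value
            .trim()
            .parse::<u64>()
            .map(Duration::from_millis)
            .map_err(|source| EnvConfigError::InvalidTimeout {
                value: value.to_string(),
                source,
            }),
    }
}

// Reads the named variable and delegates to parse_timeout.
pub fn timeout_from_env(var: &str) -> Result<Duration, EnvConfigError> {
    match env::var(var) {
        Ok(value) => parse_timeout(Some(&value)),
        Err(VarError::NotPresent) => parse_timeout(None),
        Err(VarError::NotUnicode(_)) => Err(EnvConfigError::NotUnicode),
    }
}
```

Keeping the parsing in a function that takes the raw value makes the error path testable without mutating the process environment.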

Implementation Strategy

  1. Phase 1: Traces

    • Remove all otel_error!, otel_warn!, otel_debug! calls that mask export failures
    • Add with_error_handler() to BatchSpanProcessorBuilder
    • Background export errors invoke user-provided callback
    • Synchronous operations return OTelSdkResult
  2. Phase 2: Metrics

    • Update trait definitions to return Result types per ADR-001 guidance
    • Implement error callbacks for periodic readers
    • Update meter implementation to propagate errors from instrument creation
  3. Phase 3: Logs

    • Make LogProcessor::emit() fallible where appropriate
    • Add error callbacks for async log processors
    • Update log appenders to propagate errors
  4. Phase 4: Documentation & Examples

    • Update ADR-001 to remove the allowance for logging errors in library crates
    • Update all examples to demonstrate proper error handling
    • Add migration guide documenting breaking changes
    • Document error callback patterns and best practices

Backward Compatibility

This is a breaking change that will require:

  • Minor version bump (0.x -> 0.y, as the crate is pre-1.0)
  • Migration guide for users updating from previous versions
  • Updated examples and documentation
  • Update to ADR-001 clarifying that library crates must never log errors on behalf of applications

However, the benefits justify the breaking change:

  • Proper library design following Rust best practices and standard library conventions
  • Better error visibility and control for applications
  • Enables custom error handling strategies (retry, metrics, alerting)
  • Improved debuggability and observability in production

Additional Context

Why ADR-001 allows logging:

The allowance for logging when returning an error is "not appropriate" likely stems from background operations where there's no direct caller. However, the solution is not to log, but rather to:

  • Use error callbacks that applications can register
  • Delegate to a global error handler if one is registered
  • Return errors wherever possible

Technical feasibility:

  • The OpenTelemetry specification requires operations like on_end() to be fast and non-blocking, but does not mandate void return types
  • Returning an error does not violate the non-blocking requirement
  • Error callbacks provide a way to handle background failures without blocking the hot path
  • This change aligns with the existing exporter traits (SpanExporter, LogExporter, PushMetricExporter) which already return OTelSdkResult from their methods
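
The claim that returning an error is compatible with a non-blocking hot path can be illustrated with a bounded queue and try_send from the standard library. QueueingProcessor and EnqueueError are hypothetical names for this sketch, and spans are represented as plain strings.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// Hypothetical error returned when a span cannot be queued.
#[derive(Debug, Clone, PartialEq)]
pub enum EnqueueError {
    QueueFull,
    ProcessorShutDown,
}

// Simplified processor that only queues spans for a background consumer.
pub struct QueueingProcessor {
    tx: SyncSender<String>,
}

impl QueueingProcessor {
    // Returns the processor and the receiving end that a background
    // worker would drain.
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    // try_send never waits for space: it either queues the span or
    // returns immediately with the reason it could not.
    pub fn on_end(&self, span: String) -> Result<(), EnqueueError> {
        self.tx.try_send(span).map_err(|err| match err {
            TrySendError::Full(_) => EnqueueError::QueueFull,
            TrySendError::Disconnected(_) => EnqueueError::ProcessorShutDown,
        })
    }
}
```

Here the caller learns about a dropped span at the call site without waiting on the exporter. Failures that occur later, during the export itself, would still need the callback route.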

Related: This aligns with Rust's error handling best practices, the Rust standard library's patterns, and the principle that libraries should be "honest" about failures, letting applications make all policy decisions about logging, retrying, and error handling.
