Description
Problem
The OpenTelemetry Rust SDK currently logs errors on behalf of applications, which is inappropriate for library crates. While ADR-001 provides error handling guidance, it includes a problematic allowance:
> Failures during regular operation should not panic, instead returning errors to the caller where appropriate, or logging an error if not appropriate.
This guidance needs to be updated. For library crates, it is never appropriate to log errors. Per CONTRIBUTING.md, the SDK should either return errors to callers or delegate to a global error handler registered by the application. However, many codepaths are logging directly instead, leaving applications unable to respond to failures.
Example from span_processor.rs:

    fn on_end(&self, span: SpanData) {
        let result = self.exporter.lock().map(|mut exporter| {
            exporter.export(vec![span])
        });
        if let Err(err) = result {
            otel_error!(
                name: "BatchSpanProcessor.Export.Error",
                error = format!("{:?}", err)
            );
        }
    }

Why this is problematic:
Library crates logging on behalf of applications creates several problems:
- No error visibility: Applications cannot detect, count, or respond to failures
- No integration: Errors cannot be integrated with the application's monitoring, alerting, or metrics systems
- Inconsistent formatting: Library logs don't match the application's logging format, style, or context (request IDs, etc.), causing confusion for operators and breaking log ingestion pipelines
- Policy violations: The library makes policy decisions (what to log, when, how) that belong to the application
Standard Rust library crates (std, tokio, serde, etc.) return errors and let applications decide how to handle them. OpenTelemetry Rust should follow the same pattern.
Proposed Solution
For synchronous operations with a direct caller:
- Return OTelSdkResult or appropriate error types defined in opentelemetry-sdk::error
- Let callers decide whether to log, retry, or propagate errors
- Aligns with existing SpanExporter, LogExporter, and PushMetricExporter traits, which already return OTelSdkResult
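A minimal sketch of the synchronous case, assuming simplified stand-in types. `SdkError`, `SdkResult`, `Exporter`, `SimpleProcessor`, and `FailingExporter` are hypothetical names invented for illustration and are not the real opentelemetry-sdk definitions. The point is only that `on_end` can hand the failure back to its caller instead of logging it:

```rust
use std::fmt;
use std::sync::Mutex;

// Hypothetical stand-ins for SDK types, for illustration only.
#[derive(Debug, PartialEq)]
pub enum SdkError {
    ExportFailed(String),
    LockPoisoned,
}

impl fmt::Display for SdkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SdkError::ExportFailed(msg) => write!(f, "export failed: {msg}"),
            SdkError::LockPoisoned => write!(f, "exporter lock poisoned"),
        }
    }
}

impl std::error::Error for SdkError {}

pub type SdkResult = Result<(), SdkError>;

pub struct SpanData {
    pub name: String,
}

pub trait Exporter {
    fn export(&mut self, batch: Vec<SpanData>) -> SdkResult;
}

pub struct SimpleProcessor<E: Exporter> {
    exporter: Mutex<E>,
}

impl<E: Exporter> SimpleProcessor<E> {
    pub fn new(exporter: E) -> Self {
        Self { exporter: Mutex::new(exporter) }
    }

    // Both failure modes (poisoned lock, failed export) are returned to the
    // caller; nothing is logged on the application's behalf.
    pub fn on_end(&self, span: SpanData) -> SdkResult {
        let mut exporter = self.exporter.lock().map_err(|_| SdkError::LockPoisoned)?;
        exporter.export(vec![span])
    }
}

// An exporter that always fails, to exercise the error path.
pub struct FailingExporter;

impl Exporter for FailingExporter {
    fn export(&mut self, _batch: Vec<SpanData>) -> SdkResult {
        Err(SdkError::ExportFailed("collector unreachable".to_string()))
    }
}
```

With this shape, a caller of `on_end` receives the `Err` value and can choose to log it, count it, retry, or propagate it.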
For background/asynchronous operations without a direct caller:
- Implement an error callback mechanism via with_error_handler() on processor builders
- The callback is invoked when background tasks fail
- Users can then log, emit metrics, trigger alerts, or implement custom strategies
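One way the callback mechanism might look, sketched with the standard library only. Everything here is an assumption made for illustration: `BatchProcessorBuilder`, `BatchProcessor`, `ExportError`, the closure-based exporter, and the exact signature of `with_error_handler` are hypothetical and do not reflect an existing SDK API. Spans are plain strings to keep the example short.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

// Hypothetical error type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ExportError(pub String);

type ErrorHandler = Arc<dyn Fn(&ExportError) + Send + Sync + 'static>;

pub struct BatchProcessorBuilder {
    error_handler: Option<ErrorHandler>,
}

impl BatchProcessorBuilder {
    pub fn new() -> Self {
        Self { error_handler: None }
    }

    // Register the application's callback for background export failures.
    pub fn with_error_handler<F>(mut self, handler: F) -> Self
    where
        F: Fn(&ExportError) + Send + Sync + 'static,
    {
        self.error_handler = Some(Arc::new(handler));
        self
    }

    // `export` stands in for a real exporter; it runs on the worker thread.
    pub fn build<X>(self, mut export: X) -> BatchProcessor
    where
        X: FnMut(Vec<String>) -> Result<(), ExportError> + Send + 'static,
    {
        let (tx, rx) = mpsc::channel::<String>();
        let handler = self.error_handler;
        let worker = thread::spawn(move || {
            // The loop ends once every sender has been dropped.
            for span in rx {
                if let Err(err) = export(vec![span]) {
                    // No direct caller exists here, so the failure goes to
                    // the registered callback rather than to a log line.
                    if let Some(handler) = &handler {
                        handler(&err);
                    }
                }
            }
        });
        BatchProcessor { tx: Some(tx), worker: Some(worker) }
    }
}

pub struct BatchProcessor {
    tx: Option<Sender<String>>,
    worker: Option<JoinHandle<()>>,
}

impl BatchProcessor {
    // Enqueue without blocking; enqueue failures are returned to the caller.
    pub fn on_end(&self, span: String) -> Result<(), ExportError> {
        match &self.tx {
            Some(tx) => tx
                .send(span)
                .map_err(|_| ExportError("worker stopped".to_string())),
            None => Err(ExportError("processor shut down".to_string())),
        }
    }

    // Close the queue and wait for the worker to drain it.
    pub fn shutdown(&mut self) {
        self.tx.take();
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

An application could pass a closure that increments a failure counter or forwards the error to its own logger. The hot path (`on_end`) only enqueues, so the callback never runs on the caller's thread in this sketch.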
Affected Areas
Traces (High Priority)
- opentelemetry-sdk/src/trace/span_processor.rs: Remove error logging in batch/simple processors
- opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs: Add error callback for background exports
- opentelemetry-sdk/src/trace/provider.rs: Remove redundant error logging in shutdown
Metrics (High Priority)
- opentelemetry/src/metrics/instruments.rs: InstrumentProvider trait methods should return Result
- opentelemetry-sdk/src/metrics/meter.rs: Return errors instead of logging and creating no-op instruments
- opentelemetry-sdk/src/metrics/meter_provider.rs: Propagate shutdown errors per ADR-001 patterns
- Periodic reader implementations: Expose background export errors via error callback
Logs (High Priority)
- opentelemetry-sdk/src/logs/log_processor.rs: Make LogProcessor::emit() fallible
- opentelemetry-sdk/src/logs/simple_log_processor.rs: Return errors from emit operations
- opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs: Add error callback for background processing
Other
- opentelemetry-zipkin/src/exporter/env.rs: Replace eprintln! with proper error returns
- Update examples to demonstrate proper error handling
- Update tests to verify error propagation
Implementation Strategy
- Phase 1: Traces
  - Remove all otel_error!, otel_warn!, otel_debug! calls that mask export failures
  - Add with_error_handler() to BatchSpanProcessorBuilder
  - Background export errors invoke user-provided callback
  - Synchronous operations return OTelSdkResult
- Phase 2: Metrics
  - Update trait definitions to return Result types per ADR-001 guidance
  - Implement error callbacks for periodic readers
  - Update meter implementation to propagate errors from instrument creation
- Phase 3: Logs
  - Make LogProcessor::emit() fallible where appropriate
  - Add error callbacks for async log processors
  - Update log appenders to propagate errors
- Phase 4: Documentation & Examples
  - Update ADR-001 to remove the allowance for logging errors in library crates
  - Update all examples to demonstrate proper error handling
  - Add migration guide documenting breaking changes
  - Document error callback patterns and best practices
Backward Compatibility
This is a breaking change that will require:
- Minor version bump (0.x -> 0.y, as the crate is pre-1.0)
- Migration guide for users updating from previous versions
- Updated examples and documentation
- Update to ADR-001 clarifying that library crates must never log errors on behalf of applications
However, the benefits justify the breaking change:
- Proper library design following Rust best practices and standard library conventions
- Better error visibility and control for applications
- Enables custom error handling strategies (retry, metrics, alerting)
- Improved debuggability and observability in production
Additional Context
Why ADR-001 allows logging:
The allowance for logging "where errors cannot be returned" likely stems from background operations where there's no direct caller. However, the solution is not to log, but rather to:
- Use error callbacks that applications can register
- Delegate to a global error handler if one is registered
- Return errors wherever possible
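The first two options above can be combined. The following is a rough sketch, assuming a process-wide handler stored in a `std::sync::OnceLock`. The names `SdkError`, `set_global_error_handler`, and `report_background_error` are hypothetical, and the precedence shown (per-processor callback first, then the global handler) is an assumption rather than something ADR-001 or CONTRIBUTING.md prescribes.

```rust
use std::sync::OnceLock;

// Hypothetical error type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct SdkError(pub String);

type Handler = Box<dyn Fn(&SdkError) + Send + Sync>;

// At most one global handler can be registered per process.
static GLOBAL_HANDLER: OnceLock<Handler> = OnceLock::new();

// Called by the application; fails if a handler is already registered.
pub fn set_global_error_handler<F>(handler: F) -> Result<(), SdkError>
where
    F: Fn(&SdkError) + Send + Sync + 'static,
{
    GLOBAL_HANDLER
        .set(Box::new(handler))
        .map_err(|_| SdkError("global error handler already registered".to_string()))
}

// Called by background tasks that have no direct caller to return to.
// Returns true if some application-provided handler received the error,
// false if nothing was registered (the library itself still does not log).
pub fn report_background_error(
    err: &SdkError,
    local: Option<&(dyn Fn(&SdkError) + Send + Sync)>,
) -> bool {
    if let Some(callback) = local {
        callback(err);
        return true;
    }
    if let Some(global) = GLOBAL_HANDLER.get() {
        global(err);
        return true;
    }
    false
}
```

In this sketch the boolean return lets the background task know whether anyone observed the failure, which it could use, for example, to keep an internal dropped-error count that the application can query later.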
Technical feasibility:
- The OpenTelemetry specification requires operations like on_end() to be fast and non-blocking, but does not mandate void return types
- Returning an error does not violate the non-blocking requirement
- Error callbacks provide a way to handle background failures without blocking the hot path
- This change aligns with the existing exporter traits (SpanExporter, LogExporter, PushMetricExporter), which already return OTelSdkResult from their methods
Related: This aligns with Rust's error handling best practices, the Rust standard library's patterns, and the principle that libraries should be "honest" about failures, letting applications make all policy decisions about logging, retrying, and error handling.