Changes from 1 commit
1 change: 1 addition & 0 deletions opentelemetry-otlp/CHANGELOG.md
@@ -11,6 +11,7 @@ Released 2025-Sep-25
- Update `opentelemetry-proto` and `opentelemetry-http` dependency version to 0.31.0
- Add HTTP compression support with `gzip-http` and `zstd-http` feature flags
- Add retry with exponential backoff and throttling support for HTTP and gRPC exporters
This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry flags` on this crate.
Member:

Suggested change
This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry flags` on this crate.
This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry` flags on this crate.

Member:

also mention that there is no ability to customize retry policy.

Member Author (@scottgerring, Oct 22, 2025):

Fixed code formatting.
You can customize the retry policy, e.g. something like this:

use opentelemetry_otlp::{RetryPolicy, SpanExporter};
use opentelemetry_otlp::WithTonicConfig; // or WithHttpConfig

let custom_policy = RetryPolicy {
    max_retries: 5,
    initial_delay_ms: 200,
    max_delay_ms: 3200,
    jitter_ms: 50,
};

let exporter = SpanExporter::builder()
    .with_tonic() // or .with_http()
    .with_retry_policy(custom_policy)
    .build()?;

Updated changelog accordingly.


## 0.30.0

44 changes: 14 additions & 30 deletions opentelemetry-otlp/src/retry.rs
@@ -7,6 +7,11 @@
//! specified retry policy, using exponential backoff and jitter to determine the delay between
//! retries. The function uses error classification to determine retry behavior and can honor
//! server-provided throttling hints.
#[cfg(any(
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
))]
use opentelemetry::otel_info;

#[cfg(any(
feature = "experimental-grpc-retry",
@@ -17,24 +22,23 @@ use opentelemetry::otel_warn;
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
))]
use std::future::Future;
use opentelemetry_sdk::runtime::Runtime;
#[cfg(any(
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
))]
use std::hash::{DefaultHasher, Hasher};
use std::time::Duration;
use std::future::Future;
#[cfg(any(
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
))]
use std::time::SystemTime;

use std::hash::{DefaultHasher, Hasher};
use std::time::Duration;
#[cfg(any(
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
))]
use opentelemetry_sdk::runtime::Runtime;
use std::time::SystemTime;

/// Classification of errors for retry purposes.
#[derive(Debug, Clone, PartialEq)]
@@ -61,26 +65,6 @@ pub struct RetryPolicy {
pub jitter_ms: u64,
}

/// A runtime stub for when experimental_async_runtime is not enabled.
/// This allows retry policy to be configured but no actual retries occur.
#[cfg(not(any(
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
)))]
#[derive(Debug, Clone, Default)]
pub struct NoOpRuntime;

#[cfg(not(any(
feature = "experimental-grpc-retry",
feature = "experimental-http-retry"
)))]
impl NoOpRuntime {
/// Creates a new no-op runtime.
pub fn new() -> Self {
Self
}
}

// Generates a random jitter value up to max_jitter
#[cfg(any(
feature = "experimental-grpc-retry",
@@ -144,13 +128,13 @@ where

match error_type {
RetryErrorType::NonRetryable => {
otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err));
otel_warn!(name: "OtlpRetryNonRetryable", operation = operation_name, error = format!("{:?}", err));
Member:

operation here would always be "export", so this is not quite useful. I guess you meant to use something like signal=logs/traces/metrics, transport=http/grpc?

Member Author:

The operation includes both the client type and signal, e.g.

self.export_http_with_retry(
    batch,
    OtlpHttpClient::build_logs_export_body,
    "HttpLogsClient.Export",
)

match super::tonic_retry_with_backoff(
    #[cfg(feature = "experimental-grpc-retry")]
    Tokio,
    #[cfg(not(feature = "experimental-grpc-retry"))]
    (),
    self.retry_policy.clone(),
    crate::retry_classification::grpc::classify_tonic_status,
    "TonicTracesClient.Export",
We could plumb these through separately as &str to structure this, but for logging this feels a bit overwrought imho.

Member:

otel_warn!(
    name: "Export.Failed.NonRetryable",
    signal: "Logs",
    transport: "HTTP",
    message = "OTLP export failed with non-retryable error - telemetry data will be lost"
);

^ Suggestion. We should probably include minimum info like error_code too, but the detailed error message should be done at an inner level, not by the retry module.

Member Author (@scottgerring, Oct 22, 2025):

I've taken the name style from here, and left the signal/transport out, favouring operation, based on my comment above (it tells you this stuff already).

I am leery of adding error_code here, as it's generally logged anyway and it would make the interfaces around retry more complicated just for extra logging, but it sounds like you agree with that take ("detailed error message should be done at an inner level")?

return Err(err);
}
RetryErrorType::Retryable if attempt < policy.max_retries => {
attempt += 1;
// Use exponential backoff with jitter
otel_warn!(name: "OtlpRetry", message = format!("Retrying operation {:?} due to retryable error: {:?}", operation_name, err));
otel_info!(name: "OtlpRetryRetrying", operation = operation_name, error = format!("{:?}", err));
Member:

the error here is going to be the stringified/formatted one, likely coming from the underlying transport library, partially defeating the purpose of structured logging.

Member:

also, this is a retry attempt and needs to log the delay/jitter etc. to be fully useful.

Member Author:

> the error here is going to be the stringified/formatted one, likely coming from the underlying transport library, partially defeating the purpose of structured logging.

Do you have a suggestion here?

let jitter = generate_jitter(policy.jitter_ms);
let delay_with_jitter = std::cmp::min(delay + jitter, policy.max_delay_ms);
runtime
@@ -161,13 +145,13 @@
RetryErrorType::Throttled(server_delay) if attempt < policy.max_retries => {
attempt += 1;
// Use server-specified delay (overrides exponential backoff)
otel_warn!(name: "OtlpRetry", message = format!("Retrying operation {:?} after server-specified throttling delay: {:?}", operation_name, server_delay));
otel_info!(name: "OtlpRetryThrottled", operation = operation_name, error = format!("{:?}", err), delay = format!("{:?}", server_delay));
runtime.delay(server_delay).await;
// Don't update exponential backoff delay for next attempt since server provided specific timing
}
_ => {
// Max retries reached
otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed after {} attempts: {:?}", operation_name, attempt, err));
otel_warn!(name: "OtlpRetryExhausted", operation = operation_name, error = format!("{:?}", err), attempts = attempt);
return Err(err);
}
}