
Conversation

@scottgerring
Member

Fixes #
Follow-on from #3126

Changes

Please provide a brief description of the changes here.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@scottgerring scottgerring requested a review from a team as a code owner October 21, 2025 06:06
@codecov

codecov bot commented Oct 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.8%. Comparing base (3b2f751) to head (483c4de).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3204   +/-   ##
=====================================
  Coverage   80.8%   80.8%           
=====================================
  Files        128     128           
  Lines      23090   23090           
=====================================
  Hits       18676   18676           
  Misses      4414    4414           


@scottgerring scottgerring force-pushed the chore/otlp-retry-postreview-cleanup branch from aaa5214 to 16d0863 on October 21, 2025 06:15
@scottgerring
Member Author

scottgerring commented Oct 21, 2025

Open to suggestions for where we can put the runtime selection. This strikes me as a broader question about the stabilisation of the runtime mechanism.

https://github.com/scottgerring/opentelemetry-rust/blob/9ba3a06df952d4888009764bebe665281f9e45b7/opentelemetry-otlp/src/exporter/http/mod.rs#L402-L409

match error_type {
    RetryErrorType::NonRetryable => {
        otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err));
        otel_warn!(name: "OtlpRetryNonRetryable", operation = operation_name, error = format!("{:?}", err));
Member

`operation` here would always be "export", so this is not quite useful. I guess you meant to use something like signal=logs/traces/metrics, transport=http/grpc?

Member Author

The operation includes both the client type and signal, e.g.

self.export_http_with_retry(
    batch,
    OtlpHttpClient::build_logs_export_body,
    "HttpLogsClient.Export",
)

match super::tonic_retry_with_backoff(
    #[cfg(feature = "experimental-grpc-retry")]
    Tokio,
    #[cfg(not(feature = "experimental-grpc-retry"))]
    (),
    self.retry_policy.clone(),
    crate::retry_classification::grpc::classify_tonic_status,
    "TonicTracesClient.Export",

We could plumb these through separately as `&str`s to structure this, but for logging that feels a bit overwrought IMHO. A hypothetical sketch of what that would look like follows.
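
For illustration only, a sketch of what that separate plumbing might look like if fully structured fields were wanted — the `RetryContext` type and its fields are hypothetical, not something the crate defines:

use opentelemetry::otel_warn; // internal-logs macro re-exported by the opentelemetry crate

// Hypothetical: carry signal/transport as structured fields rather than a
// pre-joined operation string such as "HttpLogsClient.Export".
#[derive(Clone, Copy, Debug)]
struct RetryContext {
    signal: &'static str,    // "logs" | "traces" | "metrics"
    transport: &'static str, // "http" | "grpc"
}

fn log_non_retryable(ctx: RetryContext, err: &impl std::fmt::Debug) {
    otel_warn!(
        name: "OtlpRetryNonRetryable",
        signal = ctx.signal,
        transport = ctx.transport,
        error = format!("{:?}", err)
    );
}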

attempt += 1;
// Use exponential backoff with jitter
otel_warn!(name: "OtlpRetry", message = format!("Retrying operation {:?} due to retryable error: {:?}", operation_name, err));
otel_info!(name: "OtlpRetryRetrying", operation = operation_name, error = format!("{:?}", err));
Member

the error here is going to be the stringified/formatted one, likely coming from the underlying transport library, partially defeating the purpose of structured logging.

Member

also, this is a retry attempt and needs to log the delay/jitter etc. to be fully useful.
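
For illustration, a sketch of an attempt log that carries the computed backoff — it assumes the retry loop has already computed the jittered delay before sleeping, and the field names are illustrative:

use std::time::Duration;
use opentelemetry::otel_info; // internal-logs macro re-exported by the opentelemetry crate

// Sketch: make each retry attempt self-describing by logging the attempt
// number and the delay (jitter included) that will be slept before it.
fn log_retry_attempt(operation_name: &str, attempt: u32, delay: Duration, err: &impl std::fmt::Debug) {
    otel_info!(
        name: "OtlpRetryRetrying",
        operation = operation_name,
        attempt = attempt,
        delay_ms = delay.as_millis() as u64,
        error = format!("{:?}", err)
    );
}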

Member Author

> the error here is going to be the stringified/formatted one, likely coming from the underlying transport library, partially defeating the purpose of structured logging.

Do you have a suggestion here?
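
One possible direction, sketched under assumptions (the `RetryErrorType` below is a stand-in that only approximates the crate's internal enum): since the retry module already classifies each error before deciding what to do, it could log that classification as its structured field and leave the detailed transport message to the inner layer.

use opentelemetry::otel_warn; // internal-logs macro re-exported by the opentelemetry crate

// Stand-in for the crate's internal classification enum (illustrative only).
#[derive(Debug)]
enum RetryErrorType {
    Retryable,
    NonRetryable,
}

fn log_outcome(operation_name: &str, error_type: &RetryErrorType) {
    // The classification stays machine-readable even when the underlying
    // transport error is only available as a Debug string.
    otel_warn!(
        name: "OtlpRetry",
        operation = operation_name,
        error_type = format!("{:?}", error_type)
    );
}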

- Update `opentelemetry-proto` and `opentelemetry-http` dependency version to 0.31.0
- Add HTTP compression support with `gzip-http` and `zstd-http` feature flags
- Add retry with exponential backoff and throttling support for HTTP and gRPC exporters
This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry flags` on this crate.
Member

Suggested change
This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry flags` on this crate.
This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry` flags on this crate.

Member

also mention that there is no ability to customize retry policy.

Member Author

@scottgerring scottgerring Oct 22, 2025

Fixed code formatting.
You can customize the retry policy, e.g. something like this:

use opentelemetry_otlp::{SpanExporter, RetryPolicy};
use opentelemetry_otlp::WithTonicConfig; // or WithHttpConfig

let custom_policy = RetryPolicy {
    max_retries: 5,
    initial_delay_ms: 200,
    max_delay_ms: 3200,
    jitter_ms: 50,
};

let exporter = SpanExporter::builder()
    .with_tonic() // or .with_http()
    .with_retry_policy(custom_policy)
    .build()?;

Updated changelog accordingly.

match error_type {
    RetryErrorType::NonRetryable => {
        otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err));
        otel_warn!(name: "OtlpRetryNonRetryable", operation = operation_name, error = format!("{:?}", err));
Member

otel_warn!(
    name: "Export.Failed.NonRetryable",
    signal = "Logs",
    transport = "HTTP",
    message = "OTLP export failed with non-retryable error - telemetry data will be lost"
);

^ Suggestion. We should probably include minimal info like `error_code` too, but the detailed error message should be produced at an inner level, not by the retry module.

Member Author

@scottgerring scottgerring Oct 22, 2025

I've taken the name style from here, and left the signal/transport out, favouring operation, based on my comment above (it tells you this stuff already).

I am leery of adding `error_code` here, as it's generally logged anyway and it would make the interfaces around retry more complicated just for extra logging, but it sounds like you agree with that take ("detailed error message should be done at an inner level")?
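
For reference, combining the suggested name style with the existing `operation` field presumably yields something like the sketch below (not verified against the merged code; the wrapper function is illustrative):

use opentelemetry::otel_warn; // internal-logs macro re-exported by the opentelemetry crate

// Sketch: dotted event name per the suggestion, plus `operation`, which
// already encodes client type and signal (e.g. "HttpLogsClient.Export").
fn log_export_failed(operation_name: &str, err: &impl std::fmt::Debug) {
    otel_warn!(
        name: "Export.Failed.NonRetryable",
        operation = operation_name,
        error = format!("{:?}", err),
        message = "OTLP export failed with non-retryable error - telemetry data will be lost"
    );
}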
