-
Notifications
You must be signed in to change notification settings - Fork 586
chore: OTLP retry post-review cleanup #3204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
chore: OTLP retry post-review cleanup #3204
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3204 +/- ##
=====================================
Coverage 80.8% 80.8%
=====================================
Files 128 128
Lines 23090 23090
=====================================
Hits 18676 18676
Misses 4414 4414 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
aaa5214 to
16d0863
Compare
|
Open for suggestions for where we can put the runtime selection. This strikes me as a broader question about the stabilisation of the runtime mechanism. |
opentelemetry-otlp/src/retry.rs
Outdated
| match error_type { | ||
| RetryErrorType::NonRetryable => { | ||
| otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err)); | ||
| otel_warn!(name: "OtlpRetryNonRetryable", operation = operation_name, error = format!("{:?}", err)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operation here would always be "export", so this is not a quite useful. I guess you meant to use something like signal=logs/traces/metrics, transport=http/grpc ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operation includes both the client type and signal, e.g.
opentelemetry-rust/opentelemetry-otlp/src/exporter/http/logs.rs
Lines 8 to 12 in 16d0863
| self.export_http_with_retry( | |
| batch, | |
| OtlpHttpClient::build_logs_export_body, | |
| "HttpLogsClient.Export", | |
| ) |
opentelemetry-rust/opentelemetry-otlp/src/exporter/tonic/trace.rs
Lines 78 to 85 in 16d0863
| match super::tonic_retry_with_backoff( | |
| #[cfg(feature = "experimental-grpc-retry")] | |
| Tokio, | |
| #[cfg(not(feature = "experimental-grpc-retry"))] | |
| (), | |
| self.retry_policy.clone(), | |
| crate::retry_classification::grpc::classify_tonic_status, | |
| "TonicTracesClient.Export", |
We could plumb the things through seperately as &str to structure this, but for logging this feels a bit overwrought imho
opentelemetry-otlp/src/retry.rs
Outdated
| attempt += 1; | ||
| // Use exponential backoff with jitter | ||
| otel_warn!(name: "OtlpRetry", message = format!("Retrying operation {:?} due to retryable error: {:?}", operation_name, err)); | ||
| otel_info!(name: "OtlpRetryRetrying", operation = operation_name, error = format!("{:?}", err)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error here is going to be the stringified/formatted one, likely coming from the underlying transport library, partially defeating structured logging purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this is a retry attempt and need to log the delay/jitter etc. to be fully useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error here is going to be the stringified/formatted one, likely coming from the underlying transport library, partially defeating structured logging purpose.
Do you have a suggestion here?
opentelemetry-otlp/CHANGELOG.md
Outdated
| - Update `opentelemetry-proto` and `opentelemetry-http` dependency version to 0.31.0 | ||
| - Add HTTP compression support with `gzip-http` and `zstd-http` feature flags | ||
| - Add retry with exponential backoff and throttling support for HTTP and gRPC exporters | ||
| This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry flags` on this crate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry flags` on this crate. | |
| This behaviour is opt in via the `experimental-grpc-retry` and `experimental-http-retry` flags on this crate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also mention that there is no ability to customize retry policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed code formatting.
You can customize the retry policy, e.g. something like this:
use opentelemetry_otlp::{SpanExporter, RetryPolicy};
use opentelemetry_otlp::WithTonicConfig; // or WithHttpConfig
let custom_policy = RetryPolicy {
max_retries: 5,
initial_delay_ms: 200,
max_delay_ms: 3200,
jitter_ms: 50,
};
let exporter = SpanExporter::builder()
.with_tonic() // or .with_http()
.with_retry_policy(custom_policy)
.build()?;Updated changelog accordingly.
opentelemetry-otlp/src/retry.rs
Outdated
| match error_type { | ||
| RetryErrorType::NonRetryable => { | ||
| otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err)); | ||
| otel_warn!(name: "OtlpRetryNonRetryable", operation = operation_name, error = format!("{:?}", err)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otel_warn!(
name: "Export.Failed.NonRetryable",
signal: "Logs",
transport: "HTTP",
message = "OTLP export failed with non-retryable error - telemetry data will be lost"
);
^ Suggestion. We should probably include minimum info like erro_code too, but detailed error message should be done at an inner level, not by the retry module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken the name style from here, and left the signal/transport out, favouring operation, based on my comment above (it tells you this stuff already).
I am leery of adding error_code here, as it's generally logged anyway and it would make the interfaces around retry more complicated just for extra logging, but it sounds like you agree with that take ("detailed error message should be done at an inner level") ?
Fixes #
Follow on from #3126
Changes
Please provide a brief description of the changes here.
Merge requirement checklist
CHANGELOG.mdfiles updated for non-trivial, user-facing changes