Default otel reporter timeout of 900s is too long. #1670

Closed
tmc opened this issue Mar 4, 2021 · 18 comments · Fixed by #1842
Labels: bug, exporters, triaged

tmc commented Mar 4, 2021

Since this timeout can occur while a process is closing/exiting, it can mean a significant amount of time during which the process is just retrying before it finally ends.

For example, it's easy to misconfigure the target service:

	status = StatusCode.UNAVAILABLE
	details = "DNS resolution failed for service: opencensus-collector:55680"

which can cause processes to easily hang at exit (and print nothing by default).
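
For illustration, here is a minimal sketch of that failure mode (module paths and class names vary across opentelemetry-python versions, and the endpoint below is deliberately unresolvable, so treat this as an assumption-laden repro rather than exact code):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    # "opencensus-collector:55680" stands in for a host that fails DNS resolution.
    exporter = OTLPSpanExporter(endpoint="opencensus-collector:55680", insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    with trace.get_tracer(__name__).start_as_current_span("demo"):
        pass

    # At interpreter exit the SDK shuts down the provider, which flushes the
    # batch processor; with an unreachable endpoint the exporter's retry loop
    # is what keeps the process alive, silently, until the back-off window runs out.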

tmc added the bug label Mar 4, 2021
owais (Contributor) commented Mar 6, 2021

I agree it is probably a bit too much, but reducing it should not be the solution to services freezing at shutdown. At shutdown, exporters/processors should be given a grace period to shut down cleanly, and if they don't, the pipeline should force them to shut down and exit. We should have a separate issue for that.

lzchen (Contributor) commented Mar 8, 2021

@owais
What is the decision for this issue? Not to increase the timeout, but to have a better shutdown mechanism for exporters/processors?

owais (Contributor) commented Mar 8, 2021

I think these are two separate issues.

  1. Exporter should not hang on shutdown and should have a grace period after which it force-quits, dropping in-memory spans (see the rough sketch below).
  2. Exporter should have a shorter timeout on network ops in general, as 900s is probably too long, unless this is not a real problem and only manifests when network errors happen during shutdown. In that case we only need 1.
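
For (1), something along these lines is the idea; this is a rough, hypothetical sketch (the helper name is invented and this is not the SDK's actual shutdown path), just to show the bounded-grace-period pattern:

    import threading

    def shutdown_with_grace_period(span_processor, grace_period_s=5.0):
        """Run the processor's shutdown in a worker thread and stop waiting
        after grace_period_s seconds. From the caller's point of view any
        spans still being retried are dropped; the daemon thread simply
        dies with the process instead of keeping it alive."""
        worker = threading.Thread(target=span_processor.shutdown, daemon=True)
        worker.start()
        worker.join(timeout=grace_period_s)
        return not worker.is_alive()  # True if shutdown finished in time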

lzchen (Contributor) commented Mar 8, 2021

I think we can address both. Java has a default timeout of 10 seconds; maybe we can do the same?
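
As a side note, the per-request export timeout can already be overridden, either through the OTEL_EXPORTER_OTLP_TIMEOUT environment variable or the exporter's timeout argument. A sketch (import path varies by version, and the units here follow the Python exporter's seconds-based handling; note this is separate from the 900s back-off window discussed elsewhere in this issue):

    import os

    # Cap a single Export() call at 10 seconds via the environment...
    os.environ["OTEL_EXPORTER_OTLP_TIMEOUT"] = "10"

    # ...or equivalently via the constructor argument.
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    exporter = OTLPSpanExporter(timeout=10)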

lzchen (Contributor) commented Mar 8, 2021

Related #346

owais (Contributor) commented Mar 8, 2021

SGTM

srikanthccv (Member) commented:

@lzchen The OTLP Python exporter also has a default timeout of 10 seconds for Export:

    self._timeout = (
        timeout
        or int(environ.get(OTEL_EXPORTER_OTLP_TIMEOUT, 0))
        or 10  # default: 10 seconds
    )

The 900 seconds is there to deal with transient errors by using an exponential back-off retry strategy. This might still be too long, but I just want to make sure we are not confusing one with the other.
    max_value = 900
    for delay in expo(max_value=max_value):

        # Give up once the back-off generator reaches the 900 s cap.
        if delay == max_value:
            return self._result.FAILURE

        try:
            self._client.Export(
                request=self._translate_data(data),
                metadata=self._headers,
                timeout=self._timeout,
            )

            return self._result.SUCCESS

        except RpcError as error:

            if error.code() in [
                StatusCode.CANCELLED,
                StatusCode.DEADLINE_EXCEEDED,
                StatusCode.PERMISSION_DENIED,
                StatusCode.UNAUTHENTICATED,
                StatusCode.RESOURCE_EXHAUSTED,
                StatusCode.ABORTED,
                StatusCode.OUT_OF_RANGE,
                StatusCode.UNAVAILABLE,
                StatusCode.DATA_LOSS,
            ]:

                # If the server sent RetryInfo, honor its delay instead of
                # the exponential back-off value.
                retry_info_bin = dict(error.trailing_metadata()).get(
                    "google.rpc.retryinfo-bin"
                )
                if retry_info_bin is not None:
                    retry_info = RetryInfo()
                    retry_info.ParseFromString(retry_info_bin)
                    delay = (
                        retry_info.retry_delay.seconds
                        + retry_info.retry_delay.nanos / 1.0e9
                    )

                logger.debug(
                    "Waiting %ss before retrying export of span", delay
                )
                sleep(delay)
                continue

            if error.code() == StatusCode.OK:
                return self._result.SUCCESS

            return self._result.FAILURE
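
To put that cap in wall-clock terms, here is a quick sketch (assuming the backoff library's expo generator, which yields 1, 2, 4, ... and then repeats max_value once the cap is reached) of how long the loop above keeps retrying before giving up, comparing the current 900s cap with a 64s one:

    from backoff import expo

    def total_sleep(max_value):
        """Sum the delays the export loop sleeps through before it sees a
        delay equal to max_value and returns FAILURE."""
        total = 0
        for delay in expo(max_value=max_value):
            if delay == max_value:
                return total
            total += delay

    print(total_sleep(900))  # 1023 -> roughly 17 minutes of retrying
    print(total_sleep(64))   # 63   -> about a minute of retrying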

lzchen (Contributor) commented Mar 9, 2021

@lonewolf3739
Thanks for the explanation. It's outlined that it is the same value as the Go implementation. If that's the case, shouldn't we be consistent? Does 900s make sense for the "conventional" use case of actually retrying, as opposed to this specific case where the exporter hangs due to misconfiguration? If so, we should just solve (1) as outlined here.

srikanthccv (Member) commented:

900 seconds is too high for a maximum backoff time. Generally it is either 32 or 64 seconds, and it can sometimes be a little higher depending on the use case. I couldn't find the maximum value recommended by OTLP, but I assume it wouldn't be as high as 900 seconds. We can probably raise an issue on the spec repo.

owais (Contributor) commented Mar 9, 2021

Thanks for digging into this, @lonewolf3739. Agreed, we should clarify with the spec or at least with other SIG maintainers.

And we should probably have separate tickets for possibly reducing the exp-backoff max time and adding a grace period to exporters on shutdown.

github-actions bot commented Apr 9, 2021

This issue was marked stale due to lack of activity. It will be closed in 30 days.

srikanthccv added the discussion, exporters, and triaged labels and removed the backlog label Apr 13, 2021
srikanthccv (Member) commented Apr 13, 2021

I brought this up in a spec SIG meeting some time back, and they agreed this is too long but wouldn't recommend a value right now, as it requires some work. So as of now it is up to language implementations to decide.

lzchen (Contributor) commented Apr 16, 2021

@lonewolf3739
So is the recommendation to leave it as it is?

srikanthccv (Member) commented:

> @lonewolf3739
> So is the recommendation to leave it as it is?

Until the spec is updated, it is up to each language SIG to decide what to do.

lzchen (Contributor) commented Apr 16, 2021

@lonewolf3739
So for now, can we just reduce it to 64s or something?

srikanthccv (Member) commented:

Yes, I think that is a very reasonable value.

lzchen removed the discussion label Apr 19, 2021
srikanthccv (Member) commented:

Assigning self to address this and related issues.

@srikanthccv srikanthccv self-assigned this Apr 22, 2021
aramakrishnan19 commented:

Please let me know when this change is planned; I'm ready to help with it.
