
Trace (and all) exporters should have an export timeout #346

Open
toumorokoshi opened this issue Dec 20, 2019 · 18 comments
Labels: feature-request, help wanted, release:required-for-ga (To be resolved before GA release), sdk (Affects the SDK package), triaged

Comments

@toumorokoshi
Member

The OpenTelemetry SDK should have a timeout on exporters during the flush call, or some way to configure this behavior.

This would ensure that the flush thread doesn't hang indefinitely: https://github.com/open-telemetry/opentelemetry-python/pull/320/files#r360471105

@toumorokoshi added the good first issue label on Dec 26, 2019
@c24t
Member

c24t commented Jan 9, 2020

We probably want a general timeout mechanism in the export layer; hopefully we can configure it once for all exporters.

@mauriciovasquezbernal
Member

I was checking #385, but I have some questions about the issue itself, so I'll comment here.

I don't understand what the scope of this issue is. According to the specification: "Export() must not block indefinitely, there must be a reasonable upper limit after which the call must time out with an error result (typically FailedRetryable)." [1]

The specification also states that the shutdown() and force_flush() methods on the span processor should not block indefinitely, and that developers could make the timeout configurable.

Is this issue about ensuring that export() doesn't block, adding a timeout parameter to shutdown() and force_flush(), both, or something else?

[1] https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/sdk-tracing.md#exportbatch

@dgzlopes
Contributor

From reading this Zipkin review and the comments on this issue, I thought that the right scope for the PR was to:

  • Implement a general timeout mechanism.
  • Use that mechanism on the export layer so export() doesn't block.

But now, after reading your message, I have some mixed feelings about the PR. Maybe @c24t and @toumorokoshi can shed some light on the expected scope of this issue.

Also, thank you for pointing out that part of the specification... I didn't see it!

@c24t
Member

c24t commented Jan 29, 2020

I don't understand what is the scope of this issue. According to the specification: "Export() must not block indefinitely, there must be a reasonable upper limit after which the call must time out with an error result (typically FailedRetryable)."

I think we can treat exporter timeout and processor timeout as separate issues.

Our exporters don't currently support timeouts, which is a violation of the spec as written. Ideally each exporter would have its own (possibly globally-configurable) timeout, and the processors would handle FailedRetryable errors by retrying the export.

Even if we had that, I think it's still a good idea for the span processor to have a hard timeout. We could, for example, keep retrying a FailedRetryable export up until the span processor timeout, at which point we'd kill the export and move on to the next batch.

The timeout interval for an exporter depends on the expected behavior of the exporter since different backends will have different SLAs. The timeout interval for a span processor might only depend on the export interval, for example if we want to guarantee that it exports a batch of spans every X seconds.

  • Use that mechanism on the export layer so export() doesn't block.

export still blocks here, it just doesn't block indefinitely. It's actually intentional that the worker thread blocks on calls to export since we only ever want to have a single ongoing call per exporter, following the spec: "Export() will never be called concurrently for the same exporter instance. Export() can be called again only after the current call returns."

I think it's fine to go ahead with a general purpose timeout in span processors.
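
As a rough sketch of the retry-until-the-processor-deadline idea above (the function and result values here are hypothetical, not the SDK's actual processor code):

from time import monotonic, sleep


def export_with_deadline(exporter, batch, processor_timeout_seconds):
    # Keep retrying a retryable export until the span processor's own
    # deadline is reached, then drop the batch and move on.
    deadline = monotonic() + processor_timeout_seconds
    while monotonic() < deadline:
        # The exporter is expected to enforce its own, shorter timeout here.
        result = exporter.export(batch)
        if result == "success":
            return True
        if result != "failed_retryable":
            return False
        sleep(0.1)  # small backoff before retrying
    return False  # processor deadline reached: give up on this batch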

@ocelotl
Contributor

ocelotl commented Jan 29, 2020

I don't understand what is the scope of this issue. According to the specification: "Export() must not block indefinitely, there must be a reasonable upper limit after which the call must time out with an error result (typically FailedRetryable)."

I think we can treat exporter timeout and processor timeout as separate issues.

Our exporters don't currently support timeouts, which is a violation of the spec as written. Ideally each exporter would have its own (possibly globally-configurable) timeout, and the processors would handle FailedRetryable errors by retrying the export.

Even if we had that, I think it's still a good idea for the span processor to have a hard timeout. We could, for example, keep retrying a FailedRetryable export up until the span processor timeout, at which point we'd kill the export and move on to the next batch.

The timeout interval for an exporter depends on the expected behavior of the exporter since different backends will have different SLAs. The timeout interval for a span processor might only depend on the export interval, for example if we want to guarantee that it exports a batch of spans every X seconds.

  • Use that mechanism on the export layer so export() doesn't block.

export still blocks here, it just doesn't block indefinitely. It's actually intentional that the worker thread blocks on calls to export since we only ever want to have a single ongoing call per exporter, following the spec: "Export() will never be called concurrently for the same exporter instance. Export() can be called again only after the current call returns."

I think it's fine to go ahead with a general purpose timeout in span processors.

I like this implementation because it puts the timeout mechanism in a place that is accessible to every implementation of anything, with a simple from ....utils import timeout. The timeout mechanism for exporters and processors is pretty much the same thing, and leaving its implementation to the exporters can easily result in duplicated code spread around. If the reason this is needed is that exporters may have different SLAs or timeout needs, then we can provide a way to configure that directly in a decorator for the export method, something like this:

def timeout(timeout):
    def inner_0(function):
        def inner_1(*args, timeout=timeout, **kwargs):
            # This sketch only shows how the effective timeout value is
            # resolved (decorator default vs. per-call override); it does
            # not actually enforce a timeout.
            print(timeout)
            function(*args, **kwargs)

        return inner_1

    return inner_0


@timeout(90)
def export_0(first, second):
    print(first)
    print(second)


@timeout(100)
def export_1(first, second):
    print(first)
    print(second)


export_0(1, 2, timeout=9)
export_1(1, 2)

Output:

9
1
2
100
1
2

In this way, different exporters can define their own timeout values simply by changing the argument passed to the timeout decorator, without having to implement the timeout mechanism themselves.

@c24t
Member

c24t commented Jan 30, 2020

The timeout mechanism for exporters and processors is pretty much the same thing

For exporters that use requests (or urllib3 or socket directly) we should pass the timeout down to the underlying socket. We don't have that option at the span processor level.

I think it's a good idea to have a general purpose timeout decorator -- and for exporters to use that decorator if there's not a better option -- I just mean to say that a timeout on the span processor doesn't obviate the need for one on the exporter.
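
For instance, a requests-based exporter could push the timeout all the way down to the socket. A minimal sketch (the class name, endpoint handling, and payload encoding are made up for illustration):

import requests


class HypotheticalHTTPExporter:
    def __init__(self, endpoint, timeout_seconds=30):
        self.endpoint = endpoint
        self.timeout_seconds = timeout_seconds

    def export(self, spans):
        try:
            # requests passes the timeout down to the underlying socket,
            # so the call cannot block indefinitely.
            response = requests.post(
                self.endpoint,
                json=[str(span) for span in spans],  # placeholder encoding
                timeout=self.timeout_seconds,
            )
            return response.ok
        except requests.exceptions.Timeout:
            return False  # report a failure instead of hanging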

@ocelotl
Contributor

ocelotl commented Jan 30, 2020

The timeout mechanism for exporters and processors is pretty much the same thing

For exporters that use requests (or urllib3 or socket directly) we should pass the timeout down to the underlying socket. We don't have that option at the span processor level.

I think it's a good idea to have a general purpose timeout decorator -- and for exporters to use that decorator if there's not a better option -- I just mean to say that a timeout on the span processor doesn't obviate the need for one on the exporter.

Ok, I see your point. What I mean is that the specification seems to be pretty strong on the requirement that export should not hang indefinitely, and from that I understand that there should be a mechanism (preferably provided by us, who implement the specification) that makes sure of that.

So, for example, let's consider a possible export function that passes down the timeout to a socket:

from time import sleep

from socket_library import socket, SocketTimeout

...

# @timeout(60)  # The problem explained below would be fixed by uncommenting this line.
def export(self, timeout=60):
    ...
    try:
        socket.get(..., timeout)
    except SocketTimeout:
        while True:
            sleep(1)

I am exaggerating, of course 🙂. But that is my point: I think the specification requires export not to hang forever, and it would be convenient if we could provide a mechanism that ensures that, regardless of the implementation of the method itself.

@ocelotl
Contributor

ocelotl commented Jan 30, 2020

BTW

@dgzlopes
Contributor

Actually I think using timeout-decorator or pebble is a good idea, but I don't know how I feel about adding a dependency for this.

Following the reviews in #385, I re-implemented the timeout functionality using decorators and multiprocessing instead of signals. I would love it if you could give it a look :)

It could be helpful if this implementation path is chosen.

Note: I took the _target implementation from timeout-decorator. I think it's cleaner than what I had implemented this morning, which was a bit of a hack.

import sys
from time import sleep, time
from multiprocessing import Queue, Process


# Implementation


def _target(queue, function, *args, **kwargs):
    # Run the wrapped function in the worker process and report
    # (success, result-or-exception) back through the queue.
    try:
        queue.put((True, function(*args, **kwargs)))
    except BaseException:
        queue.put((False, sys.exc_info()[1]))


def timeout(timeout):
    def inner_0(function):
        def inner_1(*args, timeout=timeout, **kwargs):
            pqueue = Queue(1)
            # Forward the wrapped function's arguments to the worker process.
            # (Note: with the "spawn" start method the function and its
            # arguments must be picklable.)
            process = Process(
                target=_target,
                args=(pqueue, function, *args),
                kwargs=kwargs,
                daemon=True,
            )
            process.start()
            deadline = timeout + time()
            while process.is_alive():
                sleep(0.5)
                if deadline < time():
                    process.terminate()
                    raise TimeoutError
            flag, load = pqueue.get()
            if flag:
                return load
            raise load

        return inner_1

    return inner_0


# Usage


@timeout(5)
def export():
    for number in range(1, 5):
        print(number)
        sleep(0.5)
    return True


try:
    if export(timeout=2):
        print("Worked")
except TimeoutError:
    print("Failed")

if export():
    print("Worked")

@Oberon00
Member

Oberon00 commented Jan 31, 2020

Personally, I think it should mostly be the exporter's responsibility to honor timeouts. Aborting uncooperative threads has always been a bad idea. What we can do is add an interface (in the general sense of "interface") between the processor and the exporter that tells the exporter how long the timeout is, or whether it was cancelled. Also, the batch span processor should stop passing more batches to the exporter once the total timeout is reached.

If we need any additional threads or processes to handle timeouts, the cost-benefit ratio is wrong IMHO. After all, a well-written exporter should be able to handle timeouts efficiently.
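
A minimal sketch of what such an interface could look like (all names here are hypothetical, not a proposal for the actual SDK API): the processor hands the exporter a deadline, and a cooperative exporter checks it instead of being forcibly aborted.

from time import monotonic


class Deadline:
    # Hypothetical helper object the processor could pass to the exporter.
    def __init__(self, timeout_seconds):
        self._deadline = monotonic() + timeout_seconds
        self.cancelled = False

    def remaining(self):
        return max(0.0, self._deadline - monotonic())

    def expired(self):
        return self.cancelled or self.remaining() == 0.0


def export_batches(exporter, batches, total_timeout_seconds):
    # The batch span processor stops handing out batches once the total
    # timeout is reached; the exporter gets the remaining time per call.
    deadline = Deadline(total_timeout_seconds)
    for batch in batches:
        if deadline.expired():
            break  # stop passing more batches to the exporter
        exporter.export(batch, timeout=deadline.remaining())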

@c24t

We could, for example, keep retrying a FailedRetryable export up until the span processor timeout, at which point we'd kill the export and move on to the next batch.

I'd implement FailedRetryable such that the spans are only retried in the next interval. Immediate retrying IMHO only makes sense for SimpleSpanProcessor.

@ocelotl
Contributor

ocelotl commented Jan 31, 2020

Ok, maybe another random idea here. How about, instead of blocking on export or shutdown or anything else, we call them in a non-blocking way and then wait timeout seconds for them to finish their exporting or shutting down?

So, yes, basically using some kind of join. Can this approach fit our specification needs? 🤔

P.S.: this join has a timeout argument.
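
A small sketch of that join-with-timeout idea (a hypothetical helper, using a plain worker thread): the caller never waits longer than the configured timeout, even if the underlying call misbehaves.

import threading


def call_with_timeout(function, timeout_seconds, *args, **kwargs):
    # Run `function` in a worker thread and wait at most `timeout_seconds`
    # for it to finish; report whether it completed in time.
    worker = threading.Thread(
        target=function, args=args, kwargs=kwargs, daemon=True
    )
    worker.start()
    worker.join(timeout=timeout_seconds)
    # Note: a worker that times out is not killed, only abandoned.
    return not worker.is_alive()


# Usage, e.g. during force_flush:
# flushed_in_time = call_with_timeout(exporter.export, 30, batch)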

@mauriciovasquezbernal
Member

I second @Oberon00. If the exporters honor the timeouts, we could base the timeout logic of the span processors on that.

@mauriciovasquezbernal added the sdk label on Feb 14, 2020
@codeboten added the release:required-for-ga label on Jul 22, 2020
@codeboten self-assigned this on Jul 22, 2020
@codeboten
Contributor

@aabmass will follow up to determine if this is in the spec or not.

@codeboten assigned aabmass and unassigned themselves on Aug 27, 2020
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this issue Nov 1, 2020
@codeboten
Contributor

As per the spec:

Export() must not block indefinitely, there must be a reasonable upper limit after which the call must time out with an error result (Failure).

https://github.com/open-telemetry/opentelemetry-specification/blob/15ac35c174f097b37e4e7b8a3653700bb2e7e763/specification/trace/sdk.md#exportbatch

@codeboten added the 1.0.0rc2 label (release candidate 2 for tracing GA) on Nov 26, 2020
@codeboten
Contributor

@aabmass are you still looking at this issue?

@aabmass
Member

aabmass commented Mar 5, 2021

@codeboten sorry for the delay. I'm going to unassign myself for now in case anyone else has time to pick it up.

cc @lzchen

@aabmass removed their assignment on Mar 5, 2021
@sroy07

sroy07 commented Oct 7, 2024

Hi @srikanthccv, is someone working on this issue? I would like to pick it up. Thanks!

@xrmx removed the good first issue label on Oct 24, 2024