Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry list/download operation depending on the HTTP response #300

Open
orbitz opened this issue Dec 18, 2024 · 10 comments
Open

Retry list/download operation depending on the HTTP response #300

orbitz opened this issue Dec 18, 2024 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@orbitz
Copy link

orbitz commented Dec 18, 2024

Is your feature request related to a problem? Please describe.
We are hosting our own LIST and DOWNLOAD proxies and sometimes an operation might fail due to a temporary problem, for example a 502 or a connection failure in the case of a transient network hiccup. As far as I have seen, tenv does not support this, or at least it does not expose environment variables to configure the behaviour.

Describe the solution you'd like
For our own software we support a few retries with exponential backoff. We have found this works for us, depending on the failure case. I think the ability to do this and configure it via env variables would be useful.

Describe alternatives you've considered
We considered executing the install ourselves and retrying if it fails however the interface we provide to users is meant to not know they are using tenv underneath, they just call tofu, so it is not necessarily feasible to ensure an install is done prior to the execution because we do not know when a customer will issue the command.

Additional context

@kvendingoldo
Copy link
Collaborator

retries with exponential backoff sound like a good idea!

@kvendingoldo kvendingoldo added the enhancement New feature or request label Dec 18, 2024
@dvaumoron
Copy link
Contributor

I am not sure that this feature should be embed in tenv code base, however you can use the tenvlib package to manage this kind of use case.

@orbitz
Copy link
Author

orbitz commented Dec 26, 2024

Any argument as to why? I'm guessing 90%+ usage of tenv is via the tenv CLI. And currently that means there is no good retry mechanism for that usage.

@dvaumoron
Copy link
Contributor

dvaumoron commented Dec 26, 2024

a good retry mecanism for a CLI usage ? I don't get your point, the user can lauch the command again in case of network/server temporary failure

why does it worth the cost of increasing maintenance effort ?

@orbitz
Copy link
Author

orbitz commented Dec 27, 2024

  1. It simplifies the interface for the user, generally these failures are ephemeral and just retrying will resolve it (depending on the failure).
  2. If you use tenv indirectly via tofu or terraform where it fetches and installs as part of the tofu or terraform command, then the user has to distinguish between tenv failing and and tofu or terraform failing, so it's harder to determine if you even should retry, whereas tenv has the knowledge already so it can make that decision. tenv is already doing magic behind the scenes to make using tofu and terraform feel like a seamless experience, but part of doing network calls, I would argue, require including retries in them to maintain that seamless experience, so networks are fickle beasts.
  3. I guess I don't know Go well enough to say, but I would imagine adding retry for these operations is just wrapping it in a retry function? Is there more to it than that?

@dvaumoron
Copy link
Contributor

  1. I disagree on that point, a user without response will wonder why his command hang (this is the reason behind the choice to display tenv output when there is an installation, we don't want the user to kill a "slow" process, instead of being a true proxy as it try to be in other case)
  2. this point seem interesting, however you seem to forget that tofu (or terraform) are designed to be idempotent, so there should not be issue to re run a command. Moreover tenv use colorized output in proxy mode to allows to know where the error come from (proxied output stay unchanged), although I guess some improvement could be done there (better error message and option to change the color to avoid missing the difference when the default display color is the same).
  3. it depends on what you call "wrapping it in a retry function", because "tenv has the knowledge already so it can make that decision" mean you want a logic, although what kind ? network failure ? server failure depending on http code ? Is it useless to launch again a signature check ? Not really with Cosign, some failure are linked to server stability, so there are decision to make to do that properly.

Again, I think that could become a lot of work for a small benefit

@orbitz
Copy link
Author

orbitz commented Jan 2, 2025

Points 1 and 2 are perhaps easy for a human for interpret, but for automation it is harder. Did tofu init fail because of an ephemeral network error or because of a configuration problem?

For point 3, if you believe it is too much complexity for tenv, I'm in no position to disagree.

I think there is a lot of value here. I believe you are mostly considering the use case of humans directly interacting with tenv, however if you're using tenv in automation, you're using it hundreds of times per hour and those ephemeral failures add up. Additionally, tenv downloads multiple network operations, it's not a great experience to have to re-run the whole operation is just one of those fails (listing, downloading main binary, downloading signatures).

That being said, how does this ticket end? Is there further discussion to be had or is the decision made, in which case we can close the ticket?

@dvaumoron
Copy link
Contributor

If the decision was made, the ticket would already been closed. However I am still trying to figure out why it could be interesting...

If you experience a lot of failure within your network, and want to avoid to relaunch a whole pipeline, you can script the retry mecanism there.

You said you want your users to have a better experience with their call, however depending on my answers you talk about CLI usage or automation. Your particular use case is unclear, and I still don't get why you need that feature to be done within tenv.

@orbitz
Copy link
Author

orbitz commented Jan 29, 2025

Thank you, @dvaumoron. The reasons I believe this is a useful feature of tenv are as follows:

  1. tenv provides a transparent experience to users where you can run tofu or terraform and it will install the appropriate version before executing the command.
  2. When running anything that touches the network at scale will result in the occasional ephemeral failure.
  3. When using tenv in an automated environment, the human user is not in a position to make the decision as to whether or not to run the operation again.

Given that tenv allows a transparent experience, when it comes to automation in the face of transient failures, I think there are two options:

  1. The tool can take care of retries itself.
  2. The tool can communicate up to the caller such that it knows if it is a scenario where a retry is viable.

I am advocating for (1) however I think (2) could be viable as well. I think for (2) to be viable, tenv needs to communicate to the automation, somehow, that the error was a transient error. As far as I know, tenv does not do this. There is some logging, but I don't think evaluating logging is a great solution, perhaps tenv could set a specific exit code that is not used by tofu or terraform that could be used to indicate the failure type?

@dvaumoron
Copy link
Contributor

I will take a look to have a specific return code when a failure come from tenv side

@dvaumoron dvaumoron self-assigned this Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants