[Feature] Automatic job retries from failure for transient errors #11251

Open
3 tasks done
lucidviews opened this issue Jan 28, 2025 · 1 comment

Comments

@lucidviews

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Job runs occasionally fail due to transient errors, e.g. a connection timeout.
In such cases, manually retrying the failed job is often enough to fix it, with no changes to dbt or the data platform.

It'd be great if dbt introduced an error taxonomy that classifies errors as 'potentially transient', i.e., worth retrying, or 'guaranteed to persist', i.e., this will 100% fail again if no changes are made.
If a job could then be configured to automatically rerun from the point of failure when a retriable exception occurs (potentially after a configurable delay), it would remove the need for manual action.
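
To make the request concrete, here is a minimal Python sketch of the idea, not an existing dbt API. The names `ErrorClass`, `classify`, `RetryPolicy`, and `run_with_retries` are hypothetical, and the mapping of `TimeoutError`/`ConnectionError` to 'potentially transient' is only an assumption for illustration; a real taxonomy would have to come from dbt and its adapters.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class ErrorClass(Enum):
    TRANSIENT = "potentially transient"    # worth retrying
    PERSISTENT = "guaranteed to persist"   # will fail again without changes


# Assumption: these built-in exception types stand in for adapter-level
# transient errors such as connection timeouts.
TRANSIENT_TYPES = (TimeoutError, ConnectionError)


def classify(exc: BaseException) -> ErrorClass:
    """Map an exception to one of the two classes of the taxonomy."""
    if isinstance(exc, TRANSIENT_TYPES):
        return ErrorClass.TRANSIENT
    return ErrorClass.PERSISTENT


@dataclass
class RetryPolicy:
    max_retries: int = 2          # retries after the initial attempt
    delay_seconds: float = 60.0   # configurable delay before each retry


def run_with_retries(
    job: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `job`, retrying only when the failure is classified as transient."""
    attempt = 0
    while True:
        try:
            return job()
        except Exception as exc:
            # Persistent errors and exhausted retries are re-raised unchanged.
            if classify(exc) is ErrorClass.PERSISTENT or attempt >= policy.max_retries:
                raise
            attempt += 1
            sleep(policy.delay_seconds)
```

With something like this, a job configured with `RetryPolicy(max_retries=2, delay_seconds=300)` would retry a timed-out run up to twice, five minutes apart, while a compilation error would surface immediately.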

Describe alternatives you've considered

  1. One could leverage the job chaining feature with the result selector.
    • Create another job that only runs if the job that you would want to auto-retry fails - docs
    • Use the 'result' selector to only build those resources that have status = skipped/fail/error in the previous job - docs
    • The shortcoming is that this does not differentiate between retriable and non-retriable exceptions, which leads to a lot of false positives...
  2. Use an external scheduler like Airflow and build custom logic to retry jobs upon failure
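
For reference, the second alternative could look roughly like the Python sketch below. It assumes the job is run through the dbt-core CLI (`dbt build`, then `dbt retry` to rerun from the point of failure) rather than as a hosted job; `run_cli` and `build_with_retry` are hypothetical helper names. Like the first alternative, it retries on any non-zero exit code and cannot tell transient errors from persistent ones.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_cli(args: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(list(args)).returncode


def build_with_retry(
    max_retries: int = 2,
    delay_seconds: float = 300.0,
    run: Callable[[Sequence[str]], int] = run_cli,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `dbt build`; on failure, wait and call `dbt retry` up to max_retries times."""
    if run(["dbt", "build"]) == 0:
        return True
    for _ in range(max_retries):
        sleep(delay_seconds)
        # `dbt retry` re-executes the previous command from the point of failure.
        if run(["dbt", "retry"]) == 0:
            return True
    return False
```

A scheduler task would call `build_with_retry()` and mark the run as failed only if it returns `False`.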

Who will this benefit?

Let's take a common scenario:

  1. A Build all job runs at night
  2. It fails due to a transient error
  3. An immediate/slightly delayed retry would result in a successful run
  4. It still takes until the next working day for someone to see the failed job and manually trigger the retry
  5. Data is therefore periodically stale and stakeholders get upset

In the above scenario, everyone would still wake up to fresh data with this feature - yay (:

Are you interested in contributing this feature?

No

Anything else?

No response

@lucidviews lucidviews changed the title from "[Feature] <Automatic job retries from failure for transient errors>" to "Automatic job retries from failure for transient errors" Jan 28, 2025
@lucidviews lucidviews changed the title from "Automatic job retries from failure for transient errors" to "[Feature] Automatic job retries from failure for transient errors" Jan 28, 2025
@amychen1776 amychen1776 transferred this issue from dbt-labs/dbt-adapters Jan 28, 2025
@sarkerg34

+1

2 participants