[Feature] Automatic job retries from failure for transient errors #11251

lucidviews · 2025-01-28T16:41:17Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Occasionally, job runs fail due to transient errors, e.g. a connection time-out.
In such cases, manually retrying the failed job, without making any changes in dbt or in the data platform, is sometimes all it takes to fix the issue.

It'd be great if dbt introduced an error taxonomy that classifies errors as 'potentially transient', i.e., worth retrying, or 'guaranteed to persist', i.e., certain to fail again if no changes are made.
If a job could then be configured to automatically rerun from failure when a retriable exception occurs (potentially with a configurable delay), it would remove the need for manual action.
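
To make the request concrete, here is a minimal sketch of what such a taxonomy plus a retry policy could look like. Everything in it is hypothetical: `ErrorClass`, `TRANSIENT_MARKERS`, `classify_error` and `run_with_retries` are illustrative names, not existing dbt APIs, and matching on message substrings is only a stand-in for whatever classification dbt or its adapters would actually provide.

```python
import time
from enum import Enum
from typing import Callable


class ErrorClass(Enum):
    POTENTIALLY_TRANSIENT = "potentially_transient"  # worth retrying
    PERSISTENT = "persistent"  # will fail again unless something changes


# Assumption: substring matching stands in for a real, adapter-aware taxonomy.
TRANSIENT_MARKERS = ("timeout", "timed out", "connection reset")


def classify_error(message: str) -> ErrorClass:
    """Classify an error message as potentially transient or persistent."""
    lowered = message.lower()
    if any(marker in lowered for marker in TRANSIENT_MARKERS):
        return ErrorClass.POTENTIALLY_TRANSIENT
    return ErrorClass.PERSISTENT


def run_with_retries(
    job: Callable[[], None],
    max_retries: int = 2,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `job`, retrying only on potentially transient errors.

    Returns the number of attempts made. Persistent errors, and transient
    errors that outlast `max_retries`, are re-raised to the caller.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            job()
            return attempts
        except Exception as exc:
            transient = classify_error(str(exc)) is ErrorClass.POTENTIALLY_TRANSIENT
            if not transient or attempts > max_retries:
                raise
            sleep(delay_seconds)  # the configurable delay before the next attempt
```

The point of the split is that a persistent error (say, a compilation error) fails fast, while a time-out gets a bounded number of delayed retries.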

Describe alternatives you've considered

- One could leverage the job chaining feature with the result selector:
  - Create another job that runs only if the job you want to auto-retry fails (docs).
  - Use the 'result' selector to build only those resources that had status skipped/fail/error in the previous job (docs).
  - The shortcoming is that this does not differentiate between retriable and non-retriable exceptions, which leads to a lot of false positives.
- Use an external scheduler like Airflow and build custom logic to retry jobs upon failure.
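
For illustration, the external "custom logic" option could be as small as a wrapper script like the sketch below. It assumes a dbt Core version that ships the `dbt retry` command (which re-runs the previous invocation from the point of failure); `run_cli` and `build_with_retry` are hypothetical helper names, and the delay and retry count are arbitrary. Note that it retries on any non-zero exit code, so it has the same shortcoming as the job chaining approach: it cannot tell retriable from non-retriable failures.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_cli(args: Sequence[str]) -> int:
    """Run a command and return its exit code (non-zero means failure)."""
    return subprocess.run(list(args)).returncode


def build_with_retry(
    run: Callable[[Sequence[str]], int] = run_cli,
    max_retries: int = 2,
    delay_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `dbt build`; on failure, wait and run `dbt retry` up to `max_retries` times.

    Returns the exit code of the last command that was run.
    """
    code = run(["dbt", "build"])
    retries = 0
    while code != 0 and retries < max_retries:
        sleep(delay_seconds)
        code = run(["dbt", "retry"])  # resumes from the failed nodes
        retries += 1
    return code
```

A scheduler task would call `build_with_retry()` and treat a non-zero return value as a failed run.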

Who will this benefit?

Let's take a common scenario:

- A 'Build all' job runs at night
- It fails due to a transient error
- An immediate/slightly delayed retry would result in a successful run
- It still takes until the next working day for someone to see the failed job and manually trigger the retry
- Data is therefore periodically stale and stakeholders get upset

In the above scenario, everyone would still wake up to fresh data with this feature - yay (:

Are you interested in contributing this feature?

No

Anything else?

No response

sarkerg34 · 2025-01-29T20:23:59Z

+1

lucidviews changed the title ~~[Feature] <Automatic job retries from failure for transient errors>~~ Automatic job retries from failure for transient errors Jan 28, 2025

lucidviews changed the title ~~Automatic job retries from failure for transient errors~~ [Feature] Automatic job retries from failure for transient errors Jan 28, 2025

amychen1776 transferred this issue from dbt-labs/dbt-adapters Jan 28, 2025
