Is this your first time submitting a feature request?
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Occasionally, job runs fail due to transient errors, e.g. a connection timeout.
In such cases, manually retrying the failed job is often all it takes to fix the issue, without making any changes in dbt or in the data platform.
It'd be great if dbt would introduce an error taxonomy that classifies errors as 'potentially transient', i.e., worth retrying, and 'guaranteed to persist', i.e., this will 100% fail again if no changes are made.
If a job could then be configured to automatically rerun from failure if a retriable exception occurs (potentially with a configurable delay), it would alleviate the need for manual action.
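To make the request concrete, here is a minimal sketch of what such a taxonomy plus a configurable retry could look like. Everything in it is an assumption for illustration: `ErrorClass`, `TRANSIENT_MARKERS`, `classify`, `JobFailed`, and `run_with_retries` are hypothetical names, not existing dbt APIs, and matching on message substrings is only a stand-in for whatever classification dbt might actually provide.

```python
import time
from enum import Enum
from typing import Callable


class ErrorClass(Enum):
    TRANSIENT = "potentially transient"    # worth retrying
    PERSISTENT = "guaranteed to persist"   # will fail again without changes


# Hypothetical markers; a real taxonomy would likely be defined per adapter.
TRANSIENT_MARKERS = ("timeout", "timed out", "connection reset")


class JobFailed(Exception):
    """Hypothetical exception raised when a job run fails."""


def classify(message: str) -> ErrorClass:
    """Classify an error message by looking for known transient markers."""
    text = message.lower()
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return ErrorClass.TRANSIENT
    return ErrorClass.PERSISTENT


def run_with_retries(
    run_job: Callable[[], None],
    max_retries: int = 2,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a job, retrying only transient failures; return the attempt count."""
    attempt = 0
    while True:
        attempt += 1
        try:
            run_job()
            return attempt
        except JobFailed as exc:
            out_of_retries = attempt > max_retries
            if classify(str(exc)) is ErrorClass.PERSISTENT or out_of_retries:
                raise  # not worth retrying, or retry budget exhausted
            sleep(delay_seconds)  # configurable delay before the next attempt
```

In this sketch a persistent error fails immediately, while a transient one is retried up to `max_retries` times with `delay_seconds` between attempts.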
Describe alternatives you've considered
1. One could leverage the job chaining feature with the result selector:
   - Create another job that only runs if the job that you would want to auto-retry fails (docs)
   - Use the 'result' selector to only build those resources that have status = skipped/fail/error in the previous job (docs)
   - The shortcoming is that this does not differentiate between retriable and non-retriable exceptions, which leads to a lot of false positives.
2. Use an external scheduler like Airflow and build custom logic to retry jobs upon failure.
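For the custom-logic route, one way to cut down on those false positives is to inspect the failed job's `run_results.json` artifact before retrying. The sketch below assumes the artifact's `results` entries carry `status` and `message` fields; `TRANSIENT_MARKERS`, `failed_results`, and `worth_retrying` are hypothetical names, and the substring matching is a rough heuristic rather than a real error taxonomy.

```python
# Hypothetical markers for errors that may go away on retry.
TRANSIENT_MARKERS = ("timeout", "timed out", "connection reset")

# Result statuses that count as a failure for retry purposes.
FAILED_STATUSES = {"error", "fail"}


def failed_results(run_results: dict) -> list:
    """Return the result entries whose status indicates a failure."""
    return [
        result
        for result in run_results.get("results", [])
        if result.get("status") in FAILED_STATUSES
    ]


def worth_retrying(run_results: dict) -> bool:
    """True only if there are failures and every one of them looks transient."""
    failures = failed_results(run_results)
    if not failures:
        return False
    return all(
        any(marker in (result.get("message") or "").lower()
            for marker in TRANSIENT_MARKERS)
        for result in failures
    )
```

A scheduler task could load the artifact with `json.load` and trigger the retry job only when `worth_retrying` returns `True`.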
Who will this benefit?
Let's take a common scenario:
1. A "Build all" job runs at night
2. It fails due to a transient error
3. An immediate or slightly delayed retry would result in a successful run
4. It still takes until the next working day for someone to see the failed job and manually trigger the retry
5. Data is therefore periodically stale and stakeholders get upset
In the above scenario, everyone would still wake up to fresh data with this feature - yay (:
Are you interested in contributing this feature?
No
Anything else?
No response
lucidviews changed the title from "[Feature] <Automatic job retries from failure for transient errors>" to "Automatic job retries from failure for transient errors" on Jan 28, 2025
lucidviews changed the title from "Automatic job retries from failure for transient errors" to "[Feature] Automatic job retries from failure for transient errors" on Jan 28, 2025