I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
currently lookback accepts an integer, representing the [today-n : today] range of the incremental run
in most companies the distribution of delayed data is heavily skewed towards the newer end of the lookback range [citation needed].
i.e. 90+% of delayed data arrives within 1 day, and then comes the long tail.
to improve efficiency, let lookback also accept a list such as [0, 1, n], where n is the greatest possible delay. a regular run would then reprocess only today, yesterday, and day n, skipping the days in between (1 < x < n) and saving significant compute. each day would still be fully updated once it is n days old, in a rolling fashion.
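to make the rolling behaviour concrete, here is a minimal python sketch (not dbt code) of which days a daily run would touch. the function names `days_to_reprocess` and `runs_touching` are hypothetical, and the sketch assumes one run per day and that list entries are day offsets before the run date.

```python
from datetime import date, timedelta


def _offsets(lookback):
    # An int n means the contiguous offsets 0..n; a list means only those offsets.
    if isinstance(lookback, int):
        return list(range(lookback + 1))
    return sorted(set(lookback))


def days_to_reprocess(run_date, lookback):
    """Event dates that a run on `run_date` would reprocess."""
    return [run_date - timedelta(days=k) for k in _offsets(lookback)]


def runs_touching(event_date, lookback):
    """Run dates on which `event_date` gets reprocessed, assuming daily runs."""
    return [event_date + timedelta(days=k) for k in _offsets(lookback)]


run = date(2024, 6, 10)
print(len(days_to_reprocess(run, 7)))              # contiguous range: 8 days per run
print(len(days_to_reprocess(run, [0, 1, 7])))      # range with gaps: 3 days per run
print(runs_touching(date(2024, 6, 3), [0, 1, 7]))  # processed on day 0, day 1 and day 7
```

with n = 7, each run handles 3 days instead of 8, and every event date is still revisited one last time once it is 7 days old.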
Describe alternatives you've considered
we implemented our own version of this a while ago, with a date range macro that accepts either an integer (range without gaps) or an array of integers (range with gaps, or just specific days).
simplified:
{# reprocessing specific days -#}
{% if lookback is sequence -%}
    ({% for day in lookback %}
        {{ event_date }} between
            current_date - {{ day }}
            and current_date - {{ day }}
        {{ 'or' if not loop.last -}}
    {% endfor -%}
    )
{# reprocessing last x days -#}
{% else -%}
    {{ event_date }} between
        current_date - {{ lookback }}
        and current_date
{% endif -%}
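for readers who want to see what the macro renders to, the following python sketch mirrors its two branches and builds the same predicate as a string. `render_lookback_predicate` is a hypothetical helper, not part of dbt or of our macro, and the `current_date - n` arithmetic is copied from the macro as-is, so whether it runs depends on the warehouse dialect.

```python
def render_lookback_predicate(event_date, lookback):
    """Build the SQL date filter that the macro above would produce.

    `event_date` is the column name; `lookback` is an int (contiguous range)
    or a list/tuple of day offsets (specific days only).
    """
    if isinstance(lookback, (list, tuple)):
        # One single-day clause per offset, joined with "or".
        clauses = [
            f"{event_date} between current_date - {day} and current_date - {day}"
            for day in lookback
        ]
        return "(" + " or ".join(clauses) + ")"
    # Contiguous range from `lookback` days ago up to today.
    return f"{event_date} between current_date - {lookback} and current_date"


print(render_lookback_predicate("event_date", 7))
print(render_lookback_predicate("event_date", [0, 1, 7]))
```

the integer form yields one range filter, while the list form yields one single-day filter per offset, combined with `or`.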
Who will this benefit?
any clients...
with large datasets, i.e. computation is a significant cost factor
who "want all data"
whose delayed data has a typical recency skew
Are you interested in contributing this feature?
yes, if it's as easy as our macro ;)
Anything else?
for perspective, this is currently a blocker for us in adopting microbatches. the advantage of calculating daily batches is completely offset by not being able to skip "plot-irrelevant" days.