[Feature] allow gaps in the lookback range for microbatch #11242

Open
3 tasks done
data-blade opened this issue Jan 27, 2025 · 0 comments
Labels
enhancement (New feature or request), triage

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

currently, lookback accepts a single integer n, representing the contiguous [today-n : today] range of the incremental run.

in most companies the distribution of delayed data is very skewed towards the newer end of the lookback range [citation needed].

i.e. 90+% of delayed data is at most 1 day late, and then comes the long tail.

to improve efficiency, let lookback also accept a list of day offsets such as [0, 1, n], where n is the greatest possible delay. a regular run would then reprocess only today, yesterday, and the day n days back, skipping the days in between (1 < x < n) and saving significant compute. the skipped days are not lost: each day is reprocessed once more when it is n days old, so the data is fully updated after n days, in a rolling fashion.
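to make the proposal concrete, here is a minimal python sketch of which daily batches a single run would touch. it is not dbt's implementation: `lookback_offsets` and `batch_dates` are hypothetical helper names, and n = 30 is an assumed value chosen only for illustration.

```python
from datetime import date, timedelta
from typing import Sequence, Union

# Hypothetical type: an int (current behaviour) or a list of day offsets (proposed).
Lookback = Union[int, Sequence[int]]


def lookback_offsets(lookback: Lookback) -> list[int]:
    """Normalize a lookback spec into sorted, de-duplicated day offsets."""
    if isinstance(lookback, int):
        # Integer: the contiguous range [today-n : today], as described above.
        return list(range(lookback + 1))
    # List (proposed, hypothetical): only the listed offsets are reprocessed.
    return sorted(set(lookback))


def batch_dates(today: date, lookback: Lookback) -> list[date]:
    """Dates of the daily batches that one run on `today` would reprocess."""
    return [today - timedelta(days=d) for d in lookback_offsets(lookback)]


today = date(2025, 1, 27)
contiguous = batch_dates(today, 30)       # every day from today-30 to today
gapped = batch_dates(today, [0, 1, 30])   # today, yesterday, and today-30 only

print(len(contiguous), len(gapped))  # 31 3
print(gapped[-1])                    # 2024-12-28
```

under these assumptions, a daily run reprocesses 3 batches instead of 31.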

Describe alternatives you've considered

we implemented our own version of this a while ago, with a date range macro that accepts either an integer (range without gaps) or an array of integers (range with gaps, or just specific days).

simplified:

{# lookback is either an integer (contiguous range) or a list of day offsets -#}
{# reprocessing specific days -#}
{% if lookback is sequence -%}
	({% for day in lookback %}
		{{ event_date }} = current_date - {{ day }}
		{{ 'or' if not loop.last -}}
	{% endfor -%}
	)
{# reprocessing last x days -#}
{% else -%}
	{{ event_date }} between
		current_date - {{ lookback }}
		and current_date
{% endif -%}
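as a sanity check on the rolling behaviour, the macro's two branches can be mirrored in plain python and replayed over consecutive daily runs. this is only a sketch: `matches` and `processing_ages` are hypothetical names, and the 7-day maximum delay and 60-day horizon are assumed values.

```python
from datetime import date, timedelta


def matches(event_date, run_date, lookback):
    """Python mirror of the macro's filter for a run executed on run_date."""
    if isinstance(lookback, (list, tuple)):
        # Specific days: event_date must equal run_date - day for a listed day.
        return any(event_date == run_date - timedelta(days=d) for d in lookback)
    # Last x days: run_date - lookback <= event_date <= run_date.
    return run_date - timedelta(days=lookback) <= event_date <= run_date


def processing_ages(event_date, lookback, horizon_days=60):
    """Ages in days at which daily runs would reprocess event_date."""
    return [
        age
        for age in range(horizon_days + 1)
        if matches(event_date, event_date + timedelta(days=age), lookback)
    ]


event = date(2025, 1, 1)
print(processing_ages(event, [0, 1, 7]))  # [0, 1, 7]
print(processing_ages(event, 7))          # [0, 1, 2, 3, 4, 5, 6, 7]
```

with daily runs, the gapped variant touches each event date three times (fresh, one day old, and once more at the maximum delay), while the integer variant touches it eight times.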

Who will this benefit?

any clients who...

  • have large datasets, i.e. computation is a significant cost factor
  • "want all data"
  • see a typical recency skew in their delayed data

Are you interested in contributing this feature?

yes, if it's as easy as our macro ;)

Anything else?

for perspective, this currently blocks us from implementing microbatches. the advantage of calculating daily batches is completely offset by not being able to skip "plot-irrelevant" days.

data-blade added the enhancement and triage labels on Jan 27, 2025