Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Dates aren't evaluated equally in Microbatching for deletion and view creation #11233

Open
2 tasks done
D3nn3 opened this issue Jan 23, 2025 · 0 comments
Open
2 tasks done
Labels
bug Something isn't working triage

Comments

@D3nn3
Copy link

D3nn3 commented Jan 23, 2025

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Just to preface

  1. This is only a problem when not working with/in the UTC timezone (which is discouraged according to the docs).
  2. This is a problem in Snowflake, can't say anything for different warehouse implementations.

Context

Let's think about micro-batching as a 3 step process:

  1. Create temporary view containing the entries of a batch, e.g. of one day
  2. Delete all entries of that batch (in our example one day) from the target table
  3. Insert the newly created rows from step 1 into the target table

The bug is regarding how comparing dates/timestamps work in step 1 and 2.

Comparing dates/timestamps

For creating the temporary view for e.g. the 2025-01-19, dbt filters the upstream tables based on the event_time column, which looks like the following:

...

some_cte as (
  select
    *
  from
    (
      select
        *
      from
        some_table
      where
        event_time_column >= '2025-01-19 00:00:00+00:00'
        and event_time_column < '2025-01-20 00:00:00+00:00'
    )
)

...

When deleting the rows to re-insert into the table, the generated SQL query looks as follows:

delete from target_table
  where (
    event_time_column >= to_timestamp_tz('2025-01-19 00:00:00+00:00')
    and event_time_column < to_timestamp_tz('2025-01-20 00:00:00+00:00')
  )

As you can see, the comparison is inconsistent: '<timestamp>' vs. to_timestamp_tz('<timestamp>')

If one works with dates and outside the UTC timezone, the comparisons lead to different results, ultimately leading to deleting different rows in the target table than the ones that have been re-calculated. In our case this led to duplicate rows in the target table. See below for a short SQL query demonstrating the different results in the date/timestamp comparisons.

Expected Behavior

Even when working outside the UTC timezone, I'd expect that the delete/insert queries target the same range of rows.

A fix should be fairly simple: Make the comparison in both queries consistent (e.g. use to_timestamp_tz('<timestamp>') in both).

Steps To Reproduce

Here is a short code snippet to try out the different behavior leading to the error described above:

alter session set timezone = "Europe/Berlin";

select
  '2025-01-19'::date as event_time_column,
  event_time_column >= '2025-01-19 00:00:00+00:00' and event_time_column < '2025-01-20 00:00:00+00:00' as _1,
  event_time_column >= to_timestamp_tz('2025-01-19 00:00:00+00:00') and event_time_column < to_timestamp_tz('2025-01-20 00:00:00+00:00') as _2

Relevant log output

Environment

We use the following docker image: ghcr.io/dbt-labs/dbt-snowflake:1.9.0

Which database adapter are you using with dbt?

No response

Additional Context

No response

@D3nn3 D3nn3 added bug Something isn't working triage labels Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant