dbt-athena: Rework iceberg full refresh for high availability #947

Open

alex-antonison wants to merge 24 commits into main

Conversation

@alex-antonison commented Mar 26, 2025

resolves #458
docs dbt-labs/docs.getdbt.com/#

Problem

When dbt-athena does a full refresh on an Iceberg table, it deletes the table and then re-creates it. This leaves the Iceberg table unavailable for however long the full refresh takes. For Iceberg tables that take a long time to build, this is very problematic, as it causes a significant amount of downtime.

Solution

Instead of dropping the Iceberg table at the beginning of the full refresh, first check whether the Iceberg table already exists. If it does, create the new fully refreshed table as a tmp relation and, upon completion, swap it in for the existing Iceberg table. This approach adds high-availability support when doing a full refresh on an Iceberg table.
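A minimal sketch of that flow in dbt macro pseudocode, assuming the standard dbt helpers make_temp_relation, create_table_as, and adapter.rename_relation; this is not the PR's actual macro code, and the exact ordering of the drop and rename may differ:

{% if should_full_refresh() and old_relation is not none %}
    {#-- Build the refreshed data into a tmp relation; the live table stays queryable --#}
    {%- set tmp_relation = make_temp_relation(target_relation) -%}
    {% call statement('create_tmp') -%}
        {{ create_table_as(False, tmp_relation, sql) }}
    {%- endcall %}
    {#-- Swap: the table is only unavailable for the brief drop-and-rename window --#}
    {% do adapter.drop_relation(old_relation) %}
    {% do adapter.rename_relation(tmp_relation, target_relation) %}
{% endif %}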

For this to work, s3_data_naming must be set to schema_table_unique: to keep the s3 locations consistent, the new tmp_relation is created within the same s3 path, with a UUID keeping the two tables' data separate. Once the new fully refreshed tmp_relation is complete, a simple rename changes the tmp_relation into the target_relation.
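For example, a model opting into this behavior would set the config in its SQL file (illustrative model; table_type and s3_data_naming are existing dbt-athena configs):

{{ config(
    materialized='incremental',
    table_type='iceberg',
    s3_data_naming='schema_table_unique'
) }}

select ...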

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@alex-antonison alex-antonison requested a review from a team as a code owner March 26, 2025 08:13
@cla-bot cla-bot bot added the cla:yes The PR author has signed the CLA label Mar 26, 2025
@alex-antonison alex-antonison changed the title Rework iceberg full refresh for high availability dbt-athena: Rework iceberg full refresh for high availability Mar 26, 2025
Contributor

Thank you for your pull request! We could not find a changelog entry for this change in the dbt-athena package. For details on how to document a change, see the Contributing Guide.

@alex-antonison alex-antonison marked this pull request as draft March 26, 2025 08:45
@alex-antonison alex-antonison marked this pull request as ready for review March 26, 2025 11:19
@parsable-alex-antonison

Realized there is an issue with my renaming approach where the underlying s3 location is not being updated. Need to revisit my approach.

@alex-antonison (Author)

With the help of @nicor88, I was able to come up with a solution to the inconsistent s3 paths.

The solution does require that s3_data_naming is schema_table_unique, because it relies on the Iceberg table's s3 data being stored under a unique UUID path within the same s3 location as the existing table. With this approach, I can fully materialize the incremental Iceberg table into a new relation, then do a table name switch, and then drop the existing table relation.
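Illustratively (hypothetical bucket and prefix; the exact layout depends on s3_data_dir), both relations live under the same base path and are kept apart only by their UUID suffix, so the swap is a pure catalog rename and no data has to move:

s3://my-bucket/my-prefix/my_schema/my_table/<uuid-of-existing-table>/
s3://my-bucket/my-prefix/my_schema/my_table/<uuid-of-tmp-relation>/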

-- Running in full refresh, support High Availability for Iceberg table type --
-- Must use s3_data_naming schema_table_unique in order to support high availability --
-- on a full refresh for an incremental iceberg table --
{% elif should_full_refresh() and table_type == 'iceberg' and s3_data_naming == 'schema_table_unique' %}
Contributor
you might want to do a similar check to what was done here

('unique' in s3_data_naming and external_location is none)

pretty much we allow this feature by default for all unique locations, and for locations that are not explicitly set via external_location.
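Applied to the snippet above, the guard would then look roughly like this (illustrative; variable names follow the earlier snippet):

{% elif should_full_refresh() and table_type == 'iceberg'
    and 'unique' in s3_data_naming and external_location is none %}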

Contributor

@nicor88 left a comment

The approach that you proposed looks good.

It would be good to add a functional test for this feature, to verify systematically that everything runs fine.

@colin-rogers-dbt colin-rogers-dbt self-assigned this Apr 2, 2025
Labels
cla:yes (The PR author has signed the CLA)
triage:ready-for-review (In Eng's queue)
Development

Successfully merging this pull request may close these issues.

feature: Add high availability for iceberg in incremental models and full refresh
5 participants