dbt-athena: Rework iceberg full refresh for high availability #947
Thank you for your pull request! We could not find a changelog entry for this change in the dbt-athena package. For details on how to document a change, see the Contributing Guide.
of full refresh
Realized there is an issue with my |
table where it maintains the correct s3 path
With the help of @nicor88, I was able to come up with a solution to the inconsistent S3 paths. The solution does require that s3_data_naming is
-- Running in full refresh, support High Availability for Iceberg table type --
-- Must use s3_data_naming schema_table_unique in order to support high availability --
-- on a full refresh for an incremental iceberg table --
{% elif should_full_refresh() and table_type == 'iceberg' and s3_data_naming == 'schema_table_unique' %}
You might want to do a check similar to what was done here:
('unique' in s3_data_naming and external_location is none)
In short, we allow this feature by default for all unique locations, and for locations that are not explicitly set via external_location.
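To make the suggested condition concrete, here is a minimal Python sketch of the same check. The function name `supports_ha_full_refresh` is hypothetical and does not come from the adapter; the real check would live in the Jinja macro, where `none` is Jinja's spelling of Python's `None`.

```python
def supports_ha_full_refresh(s3_data_naming, external_location=None):
    """Hypothetical helper mirroring the reviewer's suggested condition.

    Returns True when the naming scheme contains 'unique' and no explicit
    external_location has been configured for the model.
    """
    return "unique" in s3_data_naming and external_location is None


# Illustrative calls; the bucket path is a made-up placeholder.
print(supports_ha_full_refresh("schema_table_unique"))
print(supports_ha_full_refresh("schema_table"))
print(supports_ha_full_refresh("schema_table_unique", "s3://example-bucket/path/"))
```

Compared with the strict equality check on `schema_table_unique`, a substring test like this would also accept any other naming scheme whose name contains `unique`, which appears to be the reviewer's intent.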
The approach that you proposed looks good.
It would be good to add a functional test for this feature to verify systematically that everything runs fine.
resolves #458
docs dbt-labs/docs.getdbt.com/#
Problem
When dbt-athena does a full refresh on an Iceberg table, it drops the table and then re-creates it. As a result, the Iceberg table is unavailable for as long as the full refresh takes. For Iceberg tables that take a long time to build, this is very problematic because it causes a significant amount of downtime.
Solution
Rather than dropping the Iceberg table at the beginning of the full refresh, first check whether the table already exists. If it does, build the fully refreshed table as a tmp relation and, upon completion, swap it in for the existing Iceberg table. This approach adds high availability support when doing a full refresh on an Iceberg table.
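The build-then-swap flow described above can be sketched in Python against an in-memory stand-in for the catalog. Everything here is illustrative: `FakeCatalog`, `full_refresh`, and the tmp naming pattern are hypothetical, and the exact swap sequence (drop the old table, then rename the tmp one) is an assumption rather than what the macro necessarily does.

```python
import uuid


class FakeCatalog:
    """In-memory stand-in for a table catalog; not the Athena or Glue API."""

    def __init__(self):
        self.tables = {}

    def exists(self, name):
        return name in self.tables

    def create(self, name, rows):
        self.tables[name] = list(rows)

    def drop(self, name):
        self.tables.pop(name, None)

    def rename(self, old, new):
        self.tables[new] = self.tables.pop(old)


def full_refresh(catalog, target, build_rows):
    """Rebuild `target` while keeping the old table readable during the build."""
    if not catalog.exists(target):
        # First run: nothing to keep available, so build the target directly.
        catalog.create(target, build_rows())
        return
    # Build under a temporary, UUID-suffixed name (hypothetical pattern).
    tmp = f"{target}__tmp_{uuid.uuid4().hex[:8]}"
    catalog.create(tmp, build_rows())  # the old target is still queryable here
    # Swap: assumed sequence, kept as short as possible.
    catalog.drop(target)
    catalog.rename(tmp, target)
```

The point of the sketch is the ordering: the expensive build happens before the existing table is touched, so the window in which the target is missing shrinks from the full build time to the duration of the swap.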
For this to work, s3_data_naming must be set to schema_table_unique. To prevent the S3 locations from becoming inconsistent, the new tmp_relation is created within the same S3 path, using a UUID to keep the two separate. Once the fully refreshed tmp_relation is complete, a simple rename changes the tmp_relation to the target_relation.
Checklist