Added dedup sort example #2235

Open: wants to merge 2 commits into base: devel
33 changes: 33 additions & 0 deletions docs/website/docs/general-usage/incremental-loading.md
@@ -120,6 +120,39 @@ If you use the `merge` write disposition, but do not specify merge or primary ke
The appended data will be inserted from a staging table in one transaction for most destinations in this case.
:::

Example: deduplication with timestamp-based sorting

```py
import dlt

# Sample data: two records per primary key
data = [
    {"id": 1, "metadata_modified": "2024-01-01", "value": "A"},
    {"id": 1, "metadata_modified": "2024-01-02", "value": "B"},
    {"id": 2, "metadata_modified": "2024-01-01", "value": "C"},
    {"id": 2, "metadata_modified": "2024-01-01", "value": "D"},  # Same metadata_modified as above
]

# Define the resource with the dedup_sort column hint
@dlt.resource(
    primary_key="id",
    write_disposition="merge",
    columns={
        "metadata_modified": {"dedup_sort": "desc"}
    },
)
def sample_data():
    # Yield the sample records one by one
    yield from data
```
When this resource is executed, the following deduplication rules are applied:

1. For records with different values in the column marked with the `dedup_sort` hint:
   - The record with the highest value is kept when using `desc`.
   - For example, between the records with `id=1`, the one with `"metadata_modified"="2024-01-02"` is kept.

2. For records with identical values in the column marked with the `dedup_sort` hint:
   - The first occurrence encountered is kept.
   - For example, between the records with `id=2` and identical `"metadata_modified"="2024-01-01"`, the first record (`value="C"`) is kept.
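
The two rules above can be illustrated with a plain-Python sketch. It is not how `dlt` implements deduplication (that happens in the destination during the merge); the helper `dedup_by_sort_column` and its parameters are hypothetical names chosen for this illustration, and the `asc` branch is included only for symmetry.

```py
from typing import Any, Dict, List

Row = Dict[str, Any]


def dedup_by_sort_column(
    rows: List[Row], primary_key: str, sort_column: str, order: str = "desc"
) -> List[Row]:
    """Hypothetical helper: keep one row per primary key, following the rules above."""
    kept: Dict[Any, Row] = {}
    for row in rows:
        key = row[primary_key]
        current = kept.get(key)
        if current is None:
            # First row seen for this key
            kept[key] = row
            continue
        if order == "desc":
            is_better = row[sort_column] > current[sort_column]
        else:
            is_better = row[sort_column] < current[sort_column]
        # Strict comparison: on a tie, the first occurrence stays
        if is_better:
            kept[key] = row
    return list(kept.values())


rows = [
    {"id": 1, "metadata_modified": "2024-01-01", "value": "A"},
    {"id": 1, "metadata_modified": "2024-01-02", "value": "B"},
    {"id": 2, "metadata_modified": "2024-01-01", "value": "C"},
    {"id": 2, "metadata_modified": "2024-01-01", "value": "D"},
]

result = dedup_by_sort_column(rows, "id", "metadata_modified", order="desc")
print([r["value"] for r in result])  # ['B', 'C']
```

The sketch compares the ISO-formatted date strings directly, which works here because such strings sort in chronological order.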

#### Delete records
The `hard_delete` column hint can be used to delete records from the destination dataset. The behavior of the delete mechanism depends on the data type of the column marked with the hint:
1) `bool` type: only `True` leads to a delete—`None` and `False` values are disregarded.