Failed to write partitioned delta dataset #75

Open
TomAugspurger opened this issue Jul 23, 2024 · 0 comments

(Not strictly a stac-geoparquet issue, but just dumping this here for now)

I extracted a week's worth of Sentinel-2 data from the Planetary Computer's STAC API and wrote it out with deltalake.write_deltalake. This worked great.

Next, I wanted to try writing out something that was spatially partitioned, and that failed:

In [1]: import deltalake, httpx

In [2]: token = httpx.get("https://planetarycomputer.microsoft.com/api/sas/v1/token/pcstacitems/items").json()["token"]

In [3]: table = deltalake.DeltaTable("az://items/sentinel-2-delta/data.delta", storage_options={"account_name": "pcstacitems", "sas_token": token})  # this is the table with the unpartitioned assets.

In [4]: ds = table.to_pyarrow_dataset()

In [5]: ds
Out[5]: <pyarrow._dataset.FileSystemDataset at 0x7ff50c54e080>

In [6]: pa_table = table.to_pyarrow_table()

In [7]: deltalake.write_deltalake("/tmp/split.delta/", pa_table, engine="rust", partition_by=["s2:mgrs_tile"])  # eventually killed by my OS

I haven't looked into what's going on yet. There are a couple of upstream issues in delta-rs about memory spikes, but nothing definitive.
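A minimal sketch of one way to bound memory during the partitioned write, assuming the spike comes from materializing and partitioning the whole table in a single call (an unverified assumption, and this has not been run against the dataset above). It reads the distinct `s2:mgrs_tile` values, then appends a few tiles' worth of rows at a time. The names `chunked`, `write_partitioned_in_chunks`, and `tiles_per_write` are hypothetical helpers made up for illustration, not part of deltalake.

```python
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(values)
    while chunk := list(islice(it, size)):
        yield chunk


def write_partitioned_in_chunks(ds, path, column="s2:mgrs_tile",
                                tiles_per_write=50, **write_kwargs):
    """Append `ds` (a pyarrow dataset) to a Delta table a few partitions at a time.

    Hypothetical helper: `tiles_per_write` is an arbitrary starting value, and
    `write_kwargs` is forwarded unchanged to deltalake.write_deltalake.
    """
    # Imported lazily so `chunked` is usable without these packages installed.
    import deltalake
    import pyarrow.compute as pc

    # Read only the partition column to find the distinct tile IDs.
    tiles = pc.unique(ds.to_table(columns=[column])[column]).to_pylist()
    # Rows with a null tile are skipped in this sketch.
    tiles = sorted(t for t in tiles if t is not None)

    for chunk in chunked(tiles, tiles_per_write):
        # Materialize only the rows belonging to this group of tiles.
        part = ds.to_table(filter=pc.field(column).isin(chunk))
        deltalake.write_deltalake(
            path, part, mode="append", partition_by=[column], **write_kwargs
        )
```

With `ds` from `In [4]`, this would be called as `write_partitioned_in_chunks(ds, "/tmp/split.delta/", engine="rust")` on deltalake versions that accept the `engine` argument. Each call to `ds.to_table` rescans the source, so this trades extra reads for a smaller peak footprint.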
