Optionally use pyarrow types in to_geodataframe #31

Merged
merged 3 commits into from
Mar 29, 2024


TomAugspurger
Collaborator

This updates to_geodataframe to optionally use pyarrow types, rather than NumPy. These types let us faithfully represent the actual nested types, rather than casting everything to object. I think this will be a good default in the future. For now, it's just optional.

This also changes some of the actual output values, specifically how optional fields are stored.

If the source STAC documents had some values like

    {
        "a": {
            "href": "a.tif",
        },
        "b": {
            "href": "b.tif",
            "title": "B",
        }
    }

the new output will have a struct type with two fields, `href` and `title`. The value of `a.title` will be None instead of being absent.

@kylebarron
Collaborator

Awesome! Excited to see this!

    for k, v in items2.items():
        if k in DATETIME_COLUMNS:
            items2[k] = pd.arrays.ArrowExtensionArray(
                pa.array(pd.to_datetime(v, format="ISO8601"))
Collaborator Author


I want the output here to be identical to what we're getting in #27.

Right now, the datetime columns from this PR end up with nanosecond precision, while Kyle's PR has microsecond precision. I'm not sure if there's a correct default, but we should try to make them match.

Collaborator Author


Rather than guessing, I've made this a parameter for to_geodataframe. The default is ns, which is compatible with what pandas was doing previously for NumPy dtypes.

We're actually still relying on pandas' to_datetime for parsing strings into timestamps, before casting to Arrow. Apparently pyarrow's pc.strptime doesn't support fractional seconds yet: apache/arrow#20146

@TomAugspurger TomAugspurger merged commit dfd384a into stac-utils:main Mar 29, 2024
1 check passed
@TomAugspurger TomAugspurger deleted the feature/arrow-types branch March 29, 2024 17:50