Optionally use pyarrow types in to_geodataframe #31

TomAugspurger · 2024-03-17T19:25:44Z

This updates to_geodataframe to optionally use pyarrow types, rather than NumPy. These types let us faithfully represent the actual nested types, rather than casting everything to object. I think this will be a good default in the future. For now, it's just optional.

This change also alters some of the output values, specifically how optional fields are stored.

If the source STAC documents had some values like

            {
                "a": {
                    "href": "a.tif"
                },
                "b": {
                    "href": "b.tif",
                    "title": "B",
                }
            }

the new output will have a struct type with two fields href and title. The value of a.title will be None, instead of just being absent.

kylebarron · 2024-03-18T17:44:22Z

Awesome! Excited to see this!

TomAugspurger · 2024-03-24T15:11:22Z

stac_geoparquet/stac_geoparquet.py

+        for k, v in items2.items():
+            if k in DATETIME_COLUMNS:
+                items2[k] = pd.arrays.ArrowExtensionArray(
+                    pa.array(pd.to_datetime(v, format="ISO8601"))


I want the output here to be identical to what we're getting in #27.

Right now, the datetime columns from this PR end up with nanosecond precision, while Kyle's PR has microsecond precision. I'm not sure if there's a correct default, but we should try to get them the same.

Rather than guessing, I've made this a parameter for to_geodataframe. The default is ns, which is compatible with what pandas was doing previously for NumPy dtypes.

We're actually still relying on pandas' to_datetime for parsing strings into timestamps, before casting to Arrow. Apparently pyarrow's pc.strptime doesn't support fractional seconds yet: apache/arrow#20146

Optionally use pyarrow types in to_geodataframe

fb798f4

TomAugspurger force-pushed the feature/arrow-types branch from 73fdfac to fb798f4 on March 17, 2024 19:57

TomAugspurger and others added 2 commits March 24, 2024 15:19

ts resolution

5c646cb

parameter for datetime precision

9c60219

TomAugspurger merged commit dfd384a into stac-utils:main Mar 29, 2024
1 check passed

TomAugspurger deleted the feature/arrow-types branch March 29, 2024 17:50

TomAugspurger mentioned this pull request Apr 2, 2024

Convert all ndararys to lists in to_item_collection #3

Closed

This was referenced Oct 24, 2024

ValueError: Invalid character while parsing year ('N', Index: 0) #79

Closed

Let pyarrow cast strings to dates #80

Merged
