GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

Status: Open. Wants to merge 150 commits into base: main.

@paleolimbot (Member) commented Feb 7, 2025

Rationale for this change

The GEOMETRY and GEOGRAPHY logical types are being proposed as an addition to the Parquet format.

What changes are included in this PR?

This is a continuation of @Kontinuation's initial PR (#43977) implementing apache/parquet-format#240, which included:

  • Added geometry logical types (printing, serialization, deserialization)
  • Added geometry column statistics (serialization, deserialization, writing)
  • Added support for reading/writing Parquet files containing geometry columns
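
To illustrate what the statistics bullet covers, here is a minimal Python sketch (not the C++ code in this PR) of deriving a bounding box and the set of geometry type codes from 2D WKB values. The helper names `read_wkb_xy` and `geospatial_stats` are hypothetical, only Point, LineString, and Polygon are handled, and skipping NaN coordinates is an assumption of the sketch.

```python
import math
import struct


def read_wkb_xy(wkb: bytes):
    """Return (geometry_type_code, [(x, y), ...]) for a 2D WKB Point,
    LineString, or Polygon. Raises ValueError for anything else."""
    if len(wkb) < 5:
        raise ValueError("WKB too short for header")
    if wkb[0] not in (0, 1):
        raise ValueError(f"invalid WKB byte order flag: {wkb[0]}")
    endian = "<" if wkb[0] == 1 else ">"
    pos = 1

    def take(fmt):
        # Read one struct-formatted item, failing cleanly on truncated input.
        nonlocal pos
        size = struct.calcsize(endian + fmt)
        if pos + size > len(wkb):
            raise ValueError("truncated WKB")
        values = struct.unpack_from(endian + fmt, wkb, pos)
        pos += size
        return values

    (geom_type,) = take("I")
    coords = []
    if geom_type == 1:  # Point
        coords.append(take("dd"))
    elif geom_type == 2:  # LineString
        (n,) = take("I")
        coords.extend(take("dd") for _ in range(n))
    elif geom_type == 3:  # Polygon: a sequence of rings
        (n_rings,) = take("I")
        for _ in range(n_rings):
            (n,) = take("I")
            coords.extend(take("dd") for _ in range(n))
    else:
        raise ValueError(f"unsupported WKB geometry type: {geom_type}")
    return geom_type, coords


def geospatial_stats(wkbs):
    """Accumulate geometry type codes and an XY bounding box over WKB values."""
    types = set()
    xmin = ymin = math.inf
    xmax = ymax = -math.inf
    for wkb in wkbs:
        if wkb is None:  # null values contribute nothing
            continue
        geom_type, coords = read_wkb_xy(wkb)
        types.add(geom_type)
        for x, y in coords:
            if math.isnan(x) or math.isnan(y):
                continue  # assumption: NaN coordinates are ignored
            xmin, xmax = min(xmin, x), max(xmax, x)
            ymin, ymax = min(ymin, y), max(ymax, y)
    if xmin > xmax:  # no finite coordinates were seen
        xmin = xmax = ymin = ymax = None
    return {
        "geospatial_types": sorted(types),
        "xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax,
    }
```

Raising `ValueError` on truncated or unknown input keeps the sketch strict; a writer could instead choose to omit statistics for a column chunk that contains unparseable values.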

Changes made since that PR:

  • Rebased on the latest apache/arrow
  • Split the geography and geometry types
  • Synchronized the final parameter names (e.g., no more "encoding"; "edges" -> "algorithm")
  • Simplified geometry_util_internal.h and switched from exceptions to Status, per suggestions from the previous PR

In order to write test files, I also:

  • Implemented conversion to/from the GeoArrow extension type
  • Wired the requisite options to pyarrow so that the files could be written from Python

Those last two are probably a bit much for this particular PR, and I'm happy to move them.

Some things that aren't in this PR (but should be in this one or a future PR):

  • Update the bounding box logic to implement the "wraparound" bounding boxes where xmin > xmax (and generally make sure the stats for geography are written for trivial cases)
  • Test more invalid WKB cases
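
For reference, the wraparound case can be sketched in Python as follows. This assumes the convention that an X interval with xmin > xmax crosses the antimeridian, so it covers x >= xmin OR x <= xmax. The function names are hypothetical and this is not code from this PR.

```python
def x_interval_contains(xmin, xmax, x):
    """Return True if x falls inside the X interval [xmin, xmax].

    Assumption: xmin > xmax marks a wraparound interval.
    """
    if xmin <= xmax:
        return xmin <= x <= xmax
    return x >= xmin or x <= xmax


def x_intervals_intersect(a, b):
    """Return True if two (xmin, xmax) intervals share at least one X value."""
    a_min, a_max = a
    b_min, b_max = b
    a_wraps = a_min > a_max
    b_wraps = b_min > b_max
    if a_wraps and b_wraps:
        return True  # both contain the antimeridian
    if a_wraps:
        return b_max >= a_min or b_min <= a_max
    if b_wraps:
        return a_max >= b_min or a_min <= b_max
    return a_min <= b_max and b_min <= a_max
```

A reader that only understands non-wraparound boxes would wrongly treat (170, -170) as empty, which is why the stats writer and any pruning logic need to agree on this convention.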

Are these changes tested?

Yes!

Are there any user-facing changes?

Yes!

Example from the included Python bindings:

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga  # For registering the extension type
import geopandas

path = "/Users/dewey/gh/parquet-testing/data/geospatial/example-crs_vermont-4326.parquet"
file = parquet.ParquetFile(path, arrow_extensions_enabled=True)
file.schema
#> <pyarrow._parquet.ParquetSchema object at 0x1136ee600>
#> required group field_id=-1 schema {
#>   optional binary field_id=-1 geometry (Geometry(crs=));
#> }
file.metadata.metadata
#> (eventually should contain any CRSes that were dumped there)
geometry_index = len(file.schema.names) - 1
file.metadata.row_group(0).column(geometry_index).geospatial_statistics
#> <pyarrow._parquet.GeospatialStatistics object at 0x117b07f40>
#>   geospatial_types: [3]
#>   xmin: -73.4296726142165
#>   xmax: -71.50351111518535
#>   ymin: 42.72708222103286
#>   ymax: 45.00831248634144
#>   zmin: None
#>   zmax: None
#>   mmin: None
#>   mmax: None

# Type and CRS should propagate through
file.schema_arrow.field("geometry").type
#> WkbType(geoarrow.wkb <OGC:CRS84>)

# GeoPandas should be able to take the result of this and ensure
# the CRS is not lost (and that the geometry column is picked up)
table = file.read()
df = geopandas.GeoDataFrame.from_arrow(table)
df.geometry.crs.name
#> 'WGS 84 (CRS84)'
df.geometry.head(5)
#> 0    POLYGON ((-72.45707 42.72708, -73.28203 42.743...
#> Name: geometry, dtype: geometry
parquet.write_table(table, "foofy.parquet", write_geospatial_logical_types=True)
parquet.read_table("foofy.parquet", arrow_extensions_enabled=True).schema
#> geometry: extension<geoarrow.wkb<WkbType>>
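
One way these statistics might be used downstream is to skip row groups whose bounding box cannot intersect a query window. The sketch below assumes the statistics object exposes the `xmin`/`xmax`/`ymin`/`ymax` attributes printed above and that `geospatial_statistics` may be `None` when nothing was written. `bbox_may_intersect` and `row_groups_to_read` are hypothetical helpers, not part of this PR.

```python
from types import SimpleNamespace


def bbox_may_intersect(stats, query_xmin, query_ymin, query_xmax, query_ymax):
    """Return False only when the stats prove the row group cannot match."""
    if stats is None:
        return True  # no statistics: cannot prune
    if None in (stats.xmin, stats.xmax, stats.ymin, stats.ymax):
        return True  # incomplete statistics: cannot prune
    if stats.xmin > stats.xmax:
        return True  # wraparound box: stay conservative in this sketch
    return (
        stats.xmin <= query_xmax
        and query_xmin <= stats.xmax
        and stats.ymin <= query_ymax
        and query_ymin <= stats.ymax
    )


def row_groups_to_read(metadata, column_index, query_bbox):
    """List row group indices that may contain features inside query_bbox.

    metadata is expected to behave like file.metadata in the example above;
    query_bbox is (xmin, ymin, xmax, ymax).
    """
    keep = []
    for i in range(metadata.num_row_groups):
        stats = metadata.row_group(i).column(column_index).geospatial_statistics
        if bbox_may_intersect(stats, *query_bbox):
            keep.append(i)
    return keep


# Stand-in for the statistics printed above, so this runs without a file.
vermont_stats = SimpleNamespace(
    xmin=-73.4296726142165, xmax=-71.50351111518535,
    ymin=42.72708222103286, ymax=45.00831248634144,
)
print(bbox_may_intersect(vermont_stats, -73.0, 44.0, -72.0, 44.5))  # True
print(bbox_may_intersect(vermont_stats, 0.0, 0.0, 10.0, 10.0))  # False
```

The indices returned by `row_groups_to_read` could then be passed to `ParquetFile.read_row_groups` to read only the candidate row groups.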

@wgtmac (Member) commented Feb 25, 2025

> Create a separate branch for writing example files with arbitrary CRSes

Is it possible to add new examples instead of creating a new branch? It is much simpler to link rapidjson to these example executables. Test files are also reproducible with this approach.

> Keep support for reading into GeoArrow.

I'm fine with this given that the integration is trivial.
