Skip to content

wip: GeoVortex #3213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: develop
Choose a base branch
from
Draft

wip: GeoVortex #3213

wants to merge 21 commits into from

Conversation

a10y
Copy link
Contributor

@a10y a10y commented May 5, 2025

This branch serves as an implementation sketch for a new series of extension types around geospatial vector data. This is largely an exercise to help flush out what's lacking in the currently extension type infra in a more "real" way than our temporal types. Still very raw

  • Provides Vortex encodings that canonicalize into GeoArrow and ability to zero-copy load geoarrow data into Vortex
  • Add pluggability of Arrow <> Vortex type conversion via a similar mechanism to the KernelRef traits
    • This also necessitates that we convert between DType and a Field rather than a DataType, as Field carries the Arrow extension type metadata

Another question this begs: should stats be pluggable in the same way layouts are? One very simple way we can do much better than GeoParquet, which only prunes at the whole-file level, is by implementing a spatial bloom filter (with S2 cell IDs instead of hash functions). Or maybe you don't need a new Stat for this and it's just a Layout?

a10y added 3 commits May 2, 2025 16:47
This branch serves as an implementation sketch for a new series of extension types around geospatial vector data.

It provides new extension type IDs, as well as bindings to GeoArrow so that data can be converted back and forth into the canonical format for sharing with tools that speak GeoArrow.

This is just a sketch and is meant to inform some ideas around what's missing with extension types
Copy link
Contributor

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe you don't need a new Stat for this and it's just a Layout?

Yes that's what I also had in mind for string-y bloom filters too. I think it gets us pretty far? We could always add an "ExtensionStat" later that serializes to [u8] if we do need to plumb through the in-memory format too.

geo/src/array.rs Outdated
use crate::{GeoMetadata, GeometryType};

/// Holder for what is known to be one of the blessed extension array types.
pub enum GeometryArray<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sort of equivalent to Canonical? I wonder if there's value in allowing an ExtensionDType to canonicalize into a custom impl of some dyn ExtensionArray?

It's a bit more complex than Arrow extension types, because it means dtypes must come with a vtable, instead of just some metadata bytes. Not sure yet.

}

/// Type-erased pointer to a [`DTypeConversion`] implementation.
pub struct DTypeConversionRef(ArcRef<dyn DTypeConversion>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the big question right. About whether extension DTypes themselves can extend behaviour, or whether they are just defined as metadata bytes and can extend behaviour through encodings.

If we do allow them to extend behaviour, I think I would want a dyn ExtDType, to avoid the marshalling of metadata bytes all over the place. Same reason we have dyn Array instead of the old ArrayData struct.

Copy link

cloudflare-workers-and-pages bot commented May 7, 2025

Deploying vortex-bench with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6132502
Status: ✅  Deploy successful!
Preview URL: https://ab47689c.vortex-bench.pages.dev
Branch Preview URL: https://aduffy-geovortex.vortex-bench.pages.dev

View logs

@a10y a10y force-pushed the aduffy/geovortex branch from 12fa7d8 to a2139a5 Compare May 7, 2025 21:25
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces an initial implementation sketch for geospatial vector extension types ("GeoVortex") that leverage GeoArrow for canonicalization and zero-copy loading, as well as a pluggable Arrow–Vortex type conversion mechanism.

  • Introduces new Vortex encodings for geospatial data and updates the conversion APIs from to_arrow_dtype to to_arrow.
  • Adds a new geo extension crate ("vortex-geo") with support for Point, LineString, Polygon, and WKB types.
  • Includes minor API updates across several modules and new helper macros for error handling.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
vortex-scalar/src/list.rs Updates parameter signature for list creation to accept types convertible into Arc.
vortex-jni/src/array.rs Updates Arrow type conversion call from to_arrow_dtype() to to_arrow().
vortex-error/src/lib.rs Adds a new convenience macro (vortex_assert!) for simplified assertion-based error checking.
vortex-dtype/src/* Introduces new macros and updates Arrow conversion methods, re-exporting inventory when the "arrow" feature is enabled.
vortex-array/src/* Updates to_arrow conversion calls and canonical conversion tests to use the new API.
pyvortex/src/* Updates Python bindings to utilize the new to_arrow() conversion functions.
extensions/geo/* Introduces geospatial extension types and associated array wrappers, with new conversion logic.
Cargo.toml (root and extensions/geo) Adds the vortex-geo crate to workspace members and the appropriate dependencies.

//! materialize an Arrow ArrayRef at the very end of the processing chain.
//! All of the Arrow primitive types map onto the equivalent Vortex primitives:
//!
//! ```rust
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatesn curious for your thoughts on this design, particularly the ArrowTypeConversion plugin trait.

I realized that I was wrong about needing to change DType::to_arrow's signature, because the only place Arrow is able to reason about extension types anyway is in the context of a full struct Schema. So instead I ended up just making this plugin trait that is used to convert ExtDType children of Vortex Struct types.

I'm also wondering if this should be broadened so that ToArrowTemporal could actually just be replaced by this as well, in which case we'd return not a metadata hashmap but a full arrow Field type.

@a10y a10y requested a review from Copilot May 8, 2025 15:36
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces an implementation sketch for new geospatial extension types (GeoVortex) and updates several conversion APIs to use the unified to_arrow method. Key changes include:

  • Refactoring DType conversion methods from to_arrow_dtype to to_arrow throughout the codebase.
  • Adding new macros, functions, and modules to support geospatial types with extension type conversion via the inventory mechanism.
  • Updating Cargo.toml dependencies and documentation to support geospatial extension types.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
vortex-scalar/src/list.rs Changed parameter to impl Into<Arc> for improved flexibility.
vortex-jni/src/array.rs Updated call to new conversion API (to_arrow) for Arrow interoperability.
vortex-error/src/lib.rs Introduced new vortex_assert macro with accompanying documentation.
vortex-dtype/src/lib.rs Re-exported inventory under the "arrow" feature for extension support.
vortex-dtype/src/arrow.rs Refactored DType conversion and Arrow schema construction; removed non-nullable check for top-level structs.
vortex-dtype/Cargo.toml Added dependency on inventory to support Arrow extension types.
vortex-array/* Updated conversion calls and metadata propagation; cleaned up commented legacy code.
pyvortex/* Updated conversion API usage to reflect the changes from to_arrow_dtype to to_arrow.
extensions/geo/* Introduced geospatial extension types, metadata routines, and extension arrays.
Cargo.toml Added extensions/* membership and updated dependencies, including a minor version bump for chrono.

.all(|field| field.eq_ignore_nullability(PType::F64.into()));
}
Dimension::XYZM => {
vortex_assert!(schema.nfields() == 3);
Copy link
Preview

Copilot AI May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Dimension::XYZM case, the schema should have 4 fields (x, y, z, m) but the assertion expects 3. Please update the check to 'schema.nfields() == 4' to correctly validate the schema.

Suggested change
vortex_assert!(schema.nfields() == 3);
vortex_assert!(schema.nfields() == 4);

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to make a macro or something b/c I keep making this mistake...

Comment on lines 290 to 295
let DataType::Struct(fields) = self.to_arrow()? else {
vortex_bail!(
"Cannot convert non-struct dtype to Arrow schema: {:?}",
self
)
};
Copy link
Preview

Copilot AI May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the non-nullability check for the top-level struct in the Arrow schema conversion is a significant API change. Please confirm this is intentional and consider updating the documentation to explain the new behavior.

Suggested change
let DataType::Struct(fields) = self.to_arrow()? else {
vortex_bail!(
"Cannot convert non-struct dtype to Arrow schema: {:?}",
self
)
};
let data_type = self.to_arrow()?;
let DataType::Struct(fields) = data_type else {
vortex_bail!(
"Cannot convert non-struct dtype to Arrow schema: {:?}",
self
)
};
if !self.is_nullable() {
vortex_bail!(
"Top-level struct in Arrow schema must be nullable: {:?}",
self
)
}

Copilot uses AI. Check for mistakes.

@@ -118,6 +118,7 @@ impl ComputeFnVTable for ToArrow {
let ToArrowArgs { array, arrow_type } = ToArrowArgs::try_from(args)?;
Ok(arrow_type
.map(|arrow_type| DType::from_arrow((arrow_type, array.dtype().nullability())))
// .ok_or_else(|| vortex_err!("inconvertible DType"))
Copy link
Preview

Copilot AI May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a commented-out error fallback in the conversion function. If it is no longer needed, please remove the commented code to improve clarity.

Suggested change
// .ok_or_else(|| vortex_err!("inconvertible DType"))

Copilot uses AI. Check for mistakes.

@kylebarron
Copy link

kylebarron commented May 8, 2025

One very simple way we can do much better than GeoParquet, which only prunes at the whole-file level, is by implementing a spatial bloom filter (with S2 cell IDs instead of hash functions).

I don't really know anything about vortex, and I should also understand bloom filters better than I do, but in case you have any use case for serializing an R-Tree as part of geospatial data to get row-level spatial intersections, you might want to look at https://github.com/kylebarron/geo-index, which defines a zero-copy 2D R-Tree (it's a port of https://github.com/mourner/flatbush, and you can read more about it here if you're curious). I talked to lance about it a while ago (if I understand correctly vortex seems similar), and they were receptive to the idea but I didn't have time to implement it for them.

Comment on lines +31 to +33
geoarrow-geoparquet = { git = "https://github.com/geoarrow/geoarrow-rs.git", rev = "c241034f", features = [
"compression",
] }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way I've been considering a refactor to the geoparquet APIs so that users call upstream parquet APIs directly. geoarrow/geoarrow-rs#1089

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also FYI the crate name changed to just geoparquet. (The geoparquet crate name had been squatted on but crates.io removed it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants