-
Notifications
You must be signed in to change notification settings - Fork 42
wip: GeoVortex #3213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
wip: GeoVortex #3213
Conversation
This branch serves as an implementation sketch for a new series of extension types around geospatial vector data. It provides new extension type IDs, as well as bindings to GeoArrow so that data can be converted back and forth into the canonical format for sharing with tools that speak GeoArrow. This is just a sketch and is meant to inform some ideas around what's missing with extension types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe you don't need a new Stat for this and it's just a Layout?
Yes that's what I also had in mind for string-y bloom filters too. I think it gets us pretty far? We could always add an "ExtensionStat" later that serializes to [u8]
if we do need to plumb through the in-memory format too.
geo/src/array.rs
Outdated
use crate::{GeoMetadata, GeometryType}; | ||
|
||
/// Holder for what is known to be one of the blessed extension array types. | ||
pub enum GeometryArray<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sort of equivalent to Canonical? I wonder if there's value in allowing an ExtensionDType to canonicalize into a custom impl of some dyn ExtensionArray
?
It's a bit more complex than Arrow extension types, because it means dtypes must come with a vtable, instead of just some metadata bytes. Not sure yet.
vortex-dtype/src/arrow.rs
Outdated
} | ||
|
||
/// Type-erased pointer to a [`DTypeConversion`] implementation. | ||
pub struct DTypeConversionRef(ArcRef<dyn DTypeConversion>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the big question right. About whether extension DTypes themselves can extend behaviour, or whether they are just defined as metadata bytes and can extend behaviour through encodings.
If we do allow them to extend behaviour, I think I would want a dyn ExtDType
, to avoid the marshalling of metadata bytes all over the place. Same reason we have dyn Array
instead of the old ArrayData
struct.
Deploying vortex-bench with
|
Latest commit: |
6132502
|
Status: | ✅ Deploy successful! |
Preview URL: | https://ab47689c.vortex-bench.pages.dev |
Branch Preview URL: | https://aduffy-geovortex.vortex-bench.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces an initial implementation sketch for geospatial vector extension types ("GeoVortex") that leverage GeoArrow for canonicalization and zero-copy loading, as well as a pluggable Arrow–Vortex type conversion mechanism.
- Introduces new Vortex encodings for geospatial data and updates the conversion APIs from to_arrow_dtype to to_arrow.
- Adds a new geo extension crate ("vortex-geo") with support for Point, LineString, Polygon, and WKB types.
- Includes minor API updates across several modules and new helper macros for error handling.
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
vortex-scalar/src/list.rs | Updates parameter signature for list creation to accept types convertible into Arc. |
vortex-jni/src/array.rs | Updates Arrow type conversion call from to_arrow_dtype() to to_arrow(). |
vortex-error/src/lib.rs | Adds a new convenience macro (vortex_assert!) for simplified assertion-based error checking. |
vortex-dtype/src/* | Introduces new macros and updates Arrow conversion methods, re-exporting inventory when the "arrow" feature is enabled. |
vortex-array/src/* | Updates to_arrow conversion calls and canonical conversion tests to use the new API. |
pyvortex/src/* | Updates Python bindings to utilize the new to_arrow() conversion functions. |
extensions/geo/* | Introduces geospatial extension types and associated array wrappers, with new conversion logic. |
Cargo.toml (root and extensions/geo) | Adds the vortex-geo crate to workspace members and the appropriate dependencies. |
//! materialize an Arrow ArrayRef at the very end of the processing chain. | ||
//! All of the Arrow primitive types map onto the equivalent Vortex primitives: | ||
//! | ||
//! ```rust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatesn curious for your thoughts on this design, particularly the ArrowTypeConversion
plugin trait.
I realized that I was wrong about needing to change DType::to_arrow
's signature, because the only place Arrow is able to reason about extension types anyway is in the context of a full struct Schema
. So instead I ended up just making this plugin trait that is used to convert ExtDType children of Vortex Struct types.
I'm also wondering if this should be broadened so that ToArrowTemporal
could actually just be replaced by this as well, in which case we'd return not a metadata hashmap but a full arrow Field
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces an implementation sketch for new geospatial extension types (GeoVortex) and updates several conversion APIs to use the unified to_arrow method. Key changes include:
- Refactoring DType conversion methods from to_arrow_dtype to to_arrow throughout the codebase.
- Adding new macros, functions, and modules to support geospatial types with extension type conversion via the inventory mechanism.
- Updating Cargo.toml dependencies and documentation to support geospatial extension types.
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
vortex-scalar/src/list.rs | Changed parameter to impl Into<Arc> for improved flexibility. |
vortex-jni/src/array.rs | Updated call to new conversion API (to_arrow) for Arrow interoperability. |
vortex-error/src/lib.rs | Introduced new vortex_assert macro with accompanying documentation. |
vortex-dtype/src/lib.rs | Re-exported inventory under the "arrow" feature for extension support. |
vortex-dtype/src/arrow.rs | Refactored DType conversion and Arrow schema construction; removed non-nullable check for top-level structs. |
vortex-dtype/Cargo.toml | Added dependency on inventory to support Arrow extension types. |
vortex-array/* | Updated conversion calls and metadata propagation; cleaned up commented legacy code. |
pyvortex/* | Updated conversion API usage to reflect the changes from to_arrow_dtype to to_arrow. |
extensions/geo/* | Introduced geospatial extension types, metadata routines, and extension arrays. |
Cargo.toml | Added extensions/* membership and updated dependencies, including a minor version bump for chrono. |
.all(|field| field.eq_ignore_nullability(PType::F64.into())); | ||
} | ||
Dimension::XYZM => { | ||
vortex_assert!(schema.nfields() == 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Dimension::XYZM case, the schema should have 4 fields (x, y, z, m) but the assertion expects 3. Please update the check to 'schema.nfields() == 4' to correctly validate the schema.
vortex_assert!(schema.nfields() == 3); | |
vortex_assert!(schema.nfields() == 4); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to make a macro or something b/c I keep making this mistake...
vortex-dtype/src/arrow.rs
Outdated
let DataType::Struct(fields) = self.to_arrow()? else { | ||
vortex_bail!( | ||
"Cannot convert non-struct dtype to Arrow schema: {:?}", | ||
self | ||
) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the non-nullability check for the top-level struct in the Arrow schema conversion is a significant API change. Please confirm this is intentional and consider updating the documentation to explain the new behavior.
let DataType::Struct(fields) = self.to_arrow()? else { | |
vortex_bail!( | |
"Cannot convert non-struct dtype to Arrow schema: {:?}", | |
self | |
) | |
}; | |
let data_type = self.to_arrow()?; | |
let DataType::Struct(fields) = data_type else { | |
vortex_bail!( | |
"Cannot convert non-struct dtype to Arrow schema: {:?}", | |
self | |
) | |
}; | |
if !self.is_nullable() { | |
vortex_bail!( | |
"Top-level struct in Arrow schema must be nullable: {:?}", | |
self | |
) | |
} |
Copilot uses AI. Check for mistakes.
@@ -118,6 +118,7 @@ impl ComputeFnVTable for ToArrow { | |||
let ToArrowArgs { array, arrow_type } = ToArrowArgs::try_from(args)?; | |||
Ok(arrow_type | |||
.map(|arrow_type| DType::from_arrow((arrow_type, array.dtype().nullability()))) | |||
// .ok_or_else(|| vortex_err!("inconvertible DType")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a commented-out error fallback in the conversion function. If it is no longer needed, please remove the commented code to improve clarity.
// .ok_or_else(|| vortex_err!("inconvertible DType")) | |
Copilot uses AI. Check for mistakes.
I don't really know anything about vortex, and I should also understand bloom filters better than I do, but in case you have any use case for serializing an R-Tree as part of geospatial data to get row-level spatial intersections, you might want to look at https://github.com/kylebarron/geo-index, which defines a zero-copy 2D R-Tree (it's a port of https://github.com/mourner/flatbush, and you can read more about it here if you're curious). I talked to lance about it a while ago (if I understand correctly vortex seems similar), and they were receptive to the idea but I didn't have time to implement it for them. |
geoarrow-geoparquet = { git = "https://github.com/geoarrow/geoarrow-rs.git", rev = "c241034f", features = [ | ||
"compression", | ||
] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way I've been considering a refactor to the geoparquet APIs so that users call upstream parquet
APIs directly. geoarrow/geoarrow-rs#1089
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also FYI the crate name changed to just geoparquet
. (The geoparquet
crate name had been squatted on but crates.io removed it)
This branch serves as an implementation sketch for a new series of extension types around geospatial vector data. This is largely an exercise to help flush out what's lacking in the currently extension type infra in a more "real" way than our temporal types. Still very raw
Another question this begs: should stats be pluggable in the same way layouts are? One very simple way we can do much better than GeoParquet, which only prunes at the whole-file level, is by implementing a spatial bloom filter (with S2 cell IDs instead of hash functions). Or maybe you don't need a new Stat for this and it's just a Layout?