-
Notifications
You must be signed in to change notification settings - Fork 914
Deprecate RecordBatchOptions::with_match_field_names #7406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
arrow-array/src/record_batch.rs
Outdated
@@ -742,6 +740,7 @@ impl RecordBatch { | |||
#[non_exhaustive] | |||
pub struct RecordBatchOptions { | |||
/// Match field names of structs and lists. If set to `true`, the names must match. | |||
#[deprecated(note = "match_field_names is unsound")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is marked non_exhaustive so the churn should be fairly minimal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we mark it unsound, I think it would help to provide a link / reference with an explanation about why it is unsound
For example, it is not clear to me why mismatched names is unsound 🤔
RecordBatch::try_new_with_options( | ||
schema, | ||
columns, | ||
&RecordBatchOptions::new().with_match_field_names(false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this was here, as the below code will always create the correct types AFAICT - perhaps a workaround for a since fixed bug?
arrow-array/src/record_batch.rs
Outdated
@@ -764,6 +764,8 @@ impl RecordBatchOptions { | |||
} | |||
|
|||
/// Sets the `match_field_names` of `RecordBatchOptions` and returns this [`RecordBatch`] | |||
#[deprecated(note = "match_field_names is unsound")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be there any reason it is unsound so the user can accept exact risks when using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh I've not sat down and worked out a precise exploit chain, if I had this would flagged as a security vulnerability. However, it is breaking a pretty fundamental invariant that is assumed in a number of places. The worst it is probably going to do is cause something to panic, or produce invalid output, but the potential is there and I'd sleep happier not having it being used 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same question above -- if we are going to claim something is unsound I think we should justify why and provide some hints for an alternative
Fair, I have softened the wording. For a simple example of the unpredictable behaviour of this
If one extends this to IPC it gets wilder
This fails with an incomprehensible assertion failure, as the display implementation assumes the field is consistent.
Ultimately lots of places make assumptions about schema consistency, and I struggle to come up with a coherent way to use this API.
I'm not sure there is a viable alternative, this is fundamental property of arrow that we can't really fudge around as appealing as that might be were it possible. |
Thanks @tustvold for experimenting with it. Probably having attached a link to your detailed comment above to the deprecation notice would be explanatory |
Which issue does this PR close?
Closes #.
Rationale for this change
Noticed whilst working on #7405, this option is potentially unsound.
I did a quick scan of downstream projects and couldn't see any usage of this feature and so I think it is fine to just deprecate it. This will also potentially allow removing the rather cumbersome RecordBatchOptions.
The other option would be to make this method unsafe, but given the other checks within the various arrays I struggle to see how this would be usable reliably.
What changes are included in this PR?
Are there any user-facing changes?
Tagging @nevi-me as I think this API was last touched by you 3 or so years ago 😅