Deprecate RecordBatchOptions::with_match_field_names #7406

Open
wants to merge 4 commits into
base: main

tustvold (Contributor) commented Apr 11, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

I noticed whilst working on #7405 that this option is potentially unsound.

I did a quick scan of downstream projects and couldn't see any usage of this feature and so I think it is fine to just deprecate it. This will also potentially allow removing the rather cumbersome RecordBatchOptions.

The other option would be to make this method unsafe, but given the other checks within the various arrays, I struggle to see how it could be used reliably.

What changes are included in this PR?

Are there any user-facing changes?

Tagging @nevi-me as I think this API was last touched by you 3 or so years ago 😅

github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels Apr 11, 2025
@@ -742,6 +740,7 @@ impl RecordBatch {
#[non_exhaustive]
pub struct RecordBatchOptions {
/// Match field names of structs and lists. If set to `true`, the names must match.
#[deprecated(note = "match_field_names is unsound")]
Contributor Author

This is marked non_exhaustive so the churn should be fairly minimal
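
To illustrate why the churn stays small: code in other crates cannot build a `#[non_exhaustive]` struct with a struct literal, so it has to go through the constructor and builder methods, and a deprecated field mostly surfaces through its builder method. Below is a minimal sketch of that pattern. `ToyOptions` and its fields are hypothetical stand-ins for illustration, not the arrow-rs definitions.

```rust
/// Hypothetical stand-in for an options struct; NOT the arrow-rs type.
/// `#[non_exhaustive]` stops other crates from using struct-literal syntax,
/// so they must go through `new()` and the `with_*` builders.
#[non_exhaustive]
#[derive(Debug, Clone)]
pub struct ToyOptions {
    /// Deprecated field: any use outside an `allow(deprecated)` scope warns.
    #[deprecated(note = "match_field_names is unsound")]
    pub match_field_names: bool,
    /// A second, non-deprecated option (assumed for illustration).
    pub row_count: Option<usize>,
}

impl ToyOptions {
    /// The constructor still has to initialise the deprecated field,
    /// so the lint is silenced locally.
    #[allow(deprecated)]
    pub fn new() -> Self {
        Self {
            match_field_names: true,
            row_count: None,
        }
    }

    /// Deprecated builder: callers get a compiler warning, not an error.
    #[deprecated(note = "match_field_names is unsound")]
    #[allow(deprecated)]
    pub fn with_match_field_names(mut self, match_field_names: bool) -> Self {
        self.match_field_names = match_field_names;
        self
    }

    /// Non-deprecated builder, unaffected by the change.
    pub fn with_row_count(mut self, row_count: Option<usize>) -> Self {
        self.row_count = row_count;
        self
    }
}
```

Callers that never touched the deprecated builder compile without warnings. Callers that did see a deprecation warning rather than a build break.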

Contributor

If we mark it unsound, I think it would help to provide a link or reference explaining why it is unsound.

For example, it is not clear to me why mismatched names are unsound 🤔

RecordBatch::try_new_with_options(
schema,
columns,
&RecordBatchOptions::new().with_match_field_names(false),
Contributor Author

I'm not sure why this was here, as the code below will always create the correct types AFAICT. Perhaps it was a workaround for a since-fixed bug?

@@ -764,6 +764,8 @@ impl RecordBatchOptions {
}

/// Sets the `match_field_names` of `RecordBatchOptions` and returns this [`RecordBatch`]
#[deprecated(note = "match_field_names is unsound")]
Contributor

Should there be a stated reason why it is unsound, so that users can weigh the exact risks of using it?

Contributor Author

tustvold commented Apr 11, 2025

Tbh I've not sat down and worked out a precise exploit chain; if I had, this would be flagged as a security vulnerability. However, it breaks a pretty fundamental invariant that is assumed in a number of places. The worst it is probably going to do is cause something to panic or produce invalid output, but the potential is there and I'd sleep happier not having it used 😆

Contributor

I had the same question above -- if we are going to claim something is unsound, I think we should justify why and provide some hints for an alternative.

tustvold (Contributor Author) commented Apr 11, 2025

Fair, I have softened the wording. Here is a simple example of the unpredictable behaviour of this option:

let schema = Arc::new(Schema::new(vec![Field::new_list(
    "a",
    Field::new("item", DataType::Boolean, true),
    true,
)]));
let col = Arc::new(ListArray::new_null(
    Arc::new(Field::new("bananas", DataType::Boolean, true)),
    2,
));

RecordBatch::try_new(schema.clone(), vec![col.clone()]).unwrap_err();

let options = RecordBatchOptions::default().with_match_field_names(false);
let batch =
    RecordBatch::try_new_with_options(schema.clone(), vec![col.clone()], &options).unwrap();

// This panics
batch.project(&[0]).unwrap();

// This panics
StructArray::from(batch).to_data().validate().unwrap()

If one extends this to IPC, it gets wilder:

let mut buf = Vec::new();
let mut writer = crate::writer::FileWriter::try_new(&mut buf, batch.schema_ref()).unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();

let mut reader = FileReader::try_new(std::io::Cursor::new(buf), None).unwrap();
let out = reader.next().unwrap().unwrap();
assert_eq!(batch, out);

This fails with an incomprehensible assertion failure in which both sides print identically, because the display implementation assumes the field is consistent.

assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: List(Field { name: "item", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  null,
  null,
]], row_count: 2 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: List(Field { name: "item", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  null,
  null,
]], row_count: 2 }

Ultimately lots of places make assumptions about schema consistency, and I struggle to come up with a coherent way to use this API.
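
One way to picture the invariant at stake is with a toy model in which a list type carries a named child field, so the child's name is part of the type. The sketch below is illustrative only. `ToyType`, `ToyField`, `strictly_equal`, and `equal_ignoring_names` are hypothetical names and are not the arrow-rs definitions or its actual comparison logic. It shows how a relaxed check can accept a column whose type a strict equality check elsewhere would still reject.

```rust
/// Toy model of a nested data type; NOT the arrow-rs `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Boolean,
    Int32,
    /// A list carries a named child field, so the name is part of the type.
    List(Box<ToyField>),
}

/// Toy model of a field; NOT the arrow-rs `Field`.
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
    nullable: bool,
}

impl ToyField {
    fn new(name: &str, data_type: ToyType, nullable: bool) -> Self {
        Self {
            name: name.to_string(),
            data_type,
            nullable,
        }
    }
}

/// Strict check: nested field names must match exactly.
fn strictly_equal(a: &ToyType, b: &ToyType) -> bool {
    a == b
}

/// Relaxed check: compares structure and nullability but ignores
/// the names of nested fields.
fn equal_ignoring_names(a: &ToyType, b: &ToyType) -> bool {
    match (a, b) {
        (ToyType::List(x), ToyType::List(y)) => {
            x.nullable == y.nullable && equal_ignoring_names(&x.data_type, &y.data_type)
        }
        // Leaf types, or mismatched variants, fall back to plain equality.
        _ => a == b,
    }
}
```

With a schema type of `List("item": Boolean)` and a column type of `List("bananas": Boolean)`, the relaxed check passes while the strict one fails. Any later code path that relies on strict equality then sees a batch it considers inconsistent.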

> provide some hints for an alternative

I'm not sure there is a viable alternative. This is a fundamental property of Arrow that we can't really fudge around, as appealing as that might be were it possible.

comphead (Contributor) commented Apr 11, 2025

Thanks @tustvold for experimenting with it. Attaching a link to your detailed comment above to the deprecation notice would probably be explanation enough.
