Skip to content

Replace RecordBatch::with_schema_unchecked with RecordBatch::new_unchecked #7405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 13, 2025

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Apr 11, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

These schema checks exist for a reason, it should not be possible to bypass them without using unsafe, as allows constructing invalid arrays, via conversion to StructArray, which in turn could easily lead to UB.

What changes are included in this PR?

Are there any user-facing changes?

No this method was added yesterday and has not been released - #7402

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 11, 2025
@tustvold
Copy link
Contributor Author

tustvold commented Apr 11, 2025

Tbh I am pretty lukewarm on this API at all, if schema management is a hot path in your codebase something has gone very wrong... I figured making it unsafe would be less controversial than simply removing it, but that would be my preference...

@berkaysynnada
Copy link
Contributor

Tbh I am pretty lukewarm on this API at all, if schema management is a hot path in your codebase something has gone very wrong... I figured making it unsafe would be less controversial than simply removing it, but that would be my preference...

I truly agree. This should be an unsafe method, at least

/// # Safety
///
/// `schema` must be a superset of the current schema as determined by [`Schema::contains`]
pub unsafe fn with_schema_unchecked(self, schema: SchemaRef) -> Result<Self, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this API seems awkward and not consistent with the other APIs in this crate

What I suggest is:

  1. Remove this API
  2. Add APIs for destructing a RecordBatch into its constituent parts:
  3. Add an unsafe construction API (inline with the other parts of arrow)

So something like

impl RecordBatch {
  /// Creates a RecordBatch without any consistency checks
  pub unsafe fn new_unchecked(schema: Schema, arrays: vec<ArrayRef>, row_count) -> Self {
   ...
  }
  
  /// Return the inner fields of this RecordBatch, consuming sekf
  pub fn into_parts(self) -> (Schema, Vec<ArrayRef>, usize) {
   ...
  }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, there was a previous attempt to add something like this, but we were not able to merge it.

#6855

@comphead created his PR independently of my efforts, further demonstrating the need for some kind of solution in downstream projects.

Copy link
Contributor Author

@tustvold tustvold Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this as a performance optimisation, and therefore different from wanting to remove the constraints full stop as #6855, which was rejected for being unsound

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, if arrow users want to make RecordBatches and think they know when they can avoid the existing checks, offering an unsafe API to do it is the right thing to do. We should caveat it, etc, but at the end of the day I don't think we should be standing in the way of allowing people to make RecordBatches as they see fit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it is no different from people relying on undocumented behaviour, it's ultimately their choice to make. However, we should not compromise on soundness, and so any such API must be unsafe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I also think we should add the following API. Can be a follow on

  /// Return the inner fields of this RecordBatch, consuming sekf
  pub fn into_parts(self) -> (Schema, Vec<ArrayRef>, usize) {
   ...
  }

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tustvold and @alamb for making it unsafe is awesome idea to double emphasize risks. Similar to https://doc.rust-lang.org/std/primitive.slice.html#method.get_unchecked where bounds checks are not done by STD lib and moved to user caller site instead

@tustvold
Copy link
Contributor Author

As discussed in #7405 (comment) I have reverted with_schema_unchecked and added a new_unchecked constructor

@tustvold tustvold changed the title Make with_schema_unchecked unsafe Replace with_schema_unchecked with new_unchecked Apr 11, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@tustvold
Copy link
Contributor Author

FYI I also considered using RecordBatchOptions instead of just row_count, however, the only other option is irrelevant to an unchecked constructor - and is also something we probably should remove - #7406

@tustvold tustvold merged commit 7f3907e into apache:main Apr 13, 2025
26 checks passed
@alamb alamb changed the title Replace with_schema_unchecked with new_unchecked Replace RecordBatch::with_schema_unchecked with RecordBatch::new_unchecked Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants