-
Notifications
You must be signed in to change notification settings - Fork 915
Replace RecordBatch::with_schema_unchecked
with RecordBatch::new_unchecked
#7405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tbh I am pretty lukewarm on this API at all, if schema management is a hot path in your codebase something has gone very wrong... I figured making it unsafe would be less controversial than simply removing it, but that would be my preference... |
I truly agree. This should be an unsafe method, at least |
arrow-array/src/record_batch.rs
Outdated
/// # Safety | ||
/// | ||
/// `schema` must be a superset of the current schema as determined by [`Schema::contains`] | ||
pub unsafe fn with_schema_unchecked(self, schema: SchemaRef) -> Result<Self, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this API seems awkward and not consistent with the other APIs in this crate
What I suggest is:
- Remove this API
- Add APIs for destructing a RecordBatch into its constituent parts:
- Add an
unsafe
construction API (inline with the other parts of arrow)
So something like
impl RecordBatch {
/// Creates a RecordBatch without any consistency checks
pub unsafe fn new_unchecked(schema: Schema, arrays: vec<ArrayRef>, row_count) -> Self {
...
}
/// Return the inner fields of this RecordBatch, consuming sekf
pub fn into_parts(self) -> (Schema, Vec<ArrayRef>, usize) {
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this as a performance optimisation, and therefore different from wanting to remove the constraints full stop as #6855, which was rejected for being unsound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, if arrow users want to make RecordBatches and think they know when they can avoid the existing checks, offering an unsafe
API to do it is the right thing to do. We should caveat it, etc, but at the end of the day I don't think we should be standing in the way of allowing people to make RecordBatches as they see fit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it is no different from people relying on undocumented behaviour, it's ultimately their choice to make. However, we should not compromise on soundness, and so any such API must be unsafe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I also think we should add the following API. Can be a follow on
/// Return the inner fields of this RecordBatch, consuming sekf
pub fn into_parts(self) -> (Schema, Vec<ArrayRef>, usize) {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tustvold and @alamb for making it unsafe is awesome idea to double emphasize risks. Similar to https://doc.rust-lang.org/std/primitive.slice.html#method.get_unchecked where bounds checks are not done by STD lib and moved to user caller site instead
As discussed in #7405 (comment) I have reverted with_schema_unchecked and added a new_unchecked constructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
FYI I also considered using RecordBatchOptions instead of just row_count, however, the only other option is irrelevant to an unchecked constructor - and is also something we probably should remove - #7406 |
RecordBatch::with_schema_unchecked
with RecordBatch::new_unchecked
Which issue does this PR close?
Closes #.
Rationale for this change
These schema checks exist for a reason, it should not be possible to bypass them without using unsafe, as allows constructing invalid arrays, via conversion to StructArray, which in turn could easily lead to UB.
What changes are included in this PR?
Are there any user-facing changes?
No this method was added yesterday and has not been released - #7402