fix: Switch Kernel to use Arrow LargeStringArray as default string representation #1427
Conversation
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1427      +/-   ##
==========================================
- Coverage   84.85%   84.78%   -0.08%
==========================================
  Files         119      119
  Lines       30862    30923      +61
==========================================
+ Hits        26188    26218      +30
- Misses       3395     3425      +30
- Partials     1279     1280       +1
```
Thanks. Mostly looks good. It would be great if we could reduce the "try large, otherwise try small" patterns by using some generics, but it's certainly not trivial.
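A minimal sketch of that generics idea, assuming arrow-rs's `GenericStringBuilder<O>` (which underlies both `StringBuilder` and `LargeStringBuilder`); `build_string_array` is a hypothetical helper name, not code from this PR:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, GenericStringBuilder, OffsetSizeTrait};

// Hypothetical helper: one code path generic over the offset width, so the
// caller decides once between i32 (StringArray) and i64 (LargeStringArray)
// instead of trying large and falling back to small.
fn build_string_array<O: OffsetSizeTrait>(vals: &[Option<&str>]) -> ArrayRef {
    let mut builder = GenericStringBuilder::<O>::new();
    for v in vals {
        builder.append_option(*v);
    }
    Arc::new(builder.finish())
}

// let small = build_string_array::<i32>(&vals); // DataType::Utf8
// let large = build_string_array::<i64>(&vals); // DataType::LargeUtf8
```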
```diff
  Float(val) => append_val_as!(array::Float32Builder, *val),
  Double(val) => append_val_as!(array::Float64Builder, *val),
- String(val) => append_val_as!(array::StringBuilder, val),
+ String(val) => {
```
If we know that the only place that calls into this is above in `to_array` (which it is, afaict), and `to_array` figures the type out by doing `let data_type = ArrowDataType::try_from_kernel(&self.data_type())?;`, then we know the builder is always of a large type if that's what our schema conversion does, so we should never need to handle the small case?

Alternately, if we want to have an option to do both, let's check what type we converted into in `to_array` and pass an extra `is_large` argument or something that lets us do this without having to just try and see what works.
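A rough sketch of that `is_large` alternative, with hypothetical names and error handling (in `to_array`, one could derive the flag once after the schema conversion, e.g. `let is_large = matches!(arrow_type, ArrowDataType::LargeUtf8);`):

```rust
use arrow::array::{ArrayBuilder, LargeStringBuilder, StringBuilder};
use arrow::error::ArrowError;

// Hypothetical: the flag is computed once in to_array and threaded down, so
// the append code picks the builder deterministically instead of guessing.
fn append_string(
    builder: &mut dyn ArrayBuilder,
    val: &str,
    is_large: bool,
) -> Result<(), ArrowError> {
    if is_large {
        builder
            .as_any_mut()
            .downcast_mut::<LargeStringBuilder>()
            .ok_or_else(|| ArrowError::SchemaError("expected LargeStringBuilder".into()))?
            .append_value(val);
    } else {
        builder
            .as_any_mut()
            .downcast_mut::<StringBuilder>()
            .ok_or_else(|| ArrowError::SchemaError("expected StringBuilder".into()))?
            .append_value(val);
    }
    Ok(())
}
```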
```diff
  let array_ref = apply_schema_to(&array_ref, output_type)?;
- let arrow_type = ArrowDataType::try_from_kernel(output_type)?;
- let schema = ArrowSchema::new(vec![ArrowField::new("output", arrow_type, true)]);
+ // Use the actual data type of the array, not the converted kernel type
```
When does using `output_type` cause an issue? I feel like we should not do this, because we'll get unexpected behavior where the output doesn't actually match what we asked for.
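For illustration, a standalone sketch of the mismatch being discussed, using only arrow-rs APIs (the data and field name are made up; this is not code from the PR):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, LargeStringArray};
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Suppose the caller asked for Utf8 but evaluation produced LargeUtf8.
    let array: ArrayRef = Arc::new(LargeStringArray::from(vec!["a", "b"]));
    let requested = DataType::Utf8;

    // A schema derived from the requested type no longer matches the array...
    let from_requested = Schema::new(vec![Field::new("output", requested, true)]);
    // ...while one derived from the array always matches, but silently differs
    // from what the caller asked for (the behavior the comment warns about).
    let from_actual =
        Schema::new(vec![Field::new("output", array.data_type().clone(), true)]);

    assert_ne!(from_requested, from_actual);
}
```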
kernel/src/engine/arrow_data.rs (outdated)
```rust
// Try both i32 (StringArray) and i64 (LargeStringArray) offsets
if let Some(sarry) = arry.as_string_opt::<i32>() {
    sarry.value(index).to_string()
} else if let Some(sarry) = arry.as_string_opt::<i64>() {
    sarry.value(index).to_string()
} else {
    String::new()
}
```
I feel like we should be able to do:
```diff
- // Try both i32 (StringArray) and i64 (LargeStringArray) offsets
- if let Some(sarry) = arry.as_string_opt::<i32>() {
-     sarry.value(index).to_string()
- } else if let Some(sarry) = arry.as_string_opt::<i64>() {
-     sarry.value(index).to_string()
- } else {
-     String::new()
- }
+ let sarry = arry.as_string::<OffsetSize>();
+ sarry.value(index).to_string()
```
But I see that causes some tests to fail. We might be able to sort it out by changing things at the call-site. I can have a look as well when I have some time.
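One possible call-site shape, sketched with arrow-rs's `AsArray` and `OffsetSizeTrait` (the `string_at` name is hypothetical): make the accessor generic over the offset width so `as_string::<O>()` is statically typed, at the cost of panicking on a mismatched array, just like the suggestion above.

```rust
use arrow::array::{Array, AsArray, OffsetSizeTrait};

// Hypothetical helper: generic over the offset width instead of probing both
// i32 and i64 at runtime. Like `as_string`, it panics if the array's offset
// width doesn't match `O`, which is presumably why some tests failed.
fn string_at<O: OffsetSizeTrait>(arry: &dyn Array, index: usize) -> String {
    arry.as_string::<O>().value(index).to_string()
}

// Callers pick the width that matches the schema they produced:
// let s = string_at::<i64>(array.as_ref(), 0); // LargeStringArray path
// let s = string_at::<i32>(array.as_ref(), 0); // StringArray path
```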
What changes are proposed in this pull request?
Arrow's standard StringArray uses i32 offsets to index into the underlying byte buffer, which limits the total string data to 2GB per array. Delta tables with large string columns cause overflow errors when processed by delta-kernel-rs. See delta-io/delta-rs#3790 for details.
To address this, we change the default string type from Utf8 to LargeUtf8 in the Arrow conversion code.
A best effort was made to keep the code generic across Utf8 and LargeUtf8, for a future where we can dynamically select between the two types depending on client choice.
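For context, a minimal standalone illustration of the two Arrow string types involved (example data made up; this is not code from the PR):

```rust
use arrow::array::{Array, LargeStringArray, StringArray};
use arrow::datatypes::DataType;

fn main() {
    // StringArray (DataType::Utf8) indexes its byte buffer with i32 offsets,
    // so total string bytes per array cannot exceed i32::MAX (~2 GiB).
    let small = StringArray::from(vec!["hello"]);
    assert_eq!(small.data_type(), &DataType::Utf8);

    // LargeStringArray (DataType::LargeUtf8) uses i64 offsets, lifting that
    // cap; this PR makes it the kernel's default string representation.
    let large = LargeStringArray::from(vec!["hello"]);
    assert_eq!(large.data_type(), &DataType::LargeUtf8);
}
```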
How was this change tested?
Existing unit tests