
Conversation


@DrakeLin DrakeLin commented Oct 24, 2025

What changes are proposed in this pull request?

Arrow's standard StringArray uses i32 offsets to index into the underlying byte buffer, which limits the total string data to 2GB per array. Delta tables with large string columns therefore cause overflow errors when processed by delta-kernel-rs. See delta-io/delta-rs#3790 for details.

To address this, we change the default string type from Utf8 to LargeUtf8 in the arrow conversion code.

A best effort was made to keep the code generic across Utf8 and LargeUtf8, for a future where we can dynamically select between the two types depending on client choice.
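
For illustration, a minimal sketch (assuming the arrow crate; this is not code from the PR) of the distinction being changed here:

use arrow::array::{Array, LargeStringArray, StringArray};
use arrow::datatypes::DataType;

fn main() {
    // StringArray = GenericStringArray<i32>: offsets into the shared value
    // buffer are i32, so total string bytes per array cannot exceed i32::MAX.
    let small = StringArray::from(vec!["hello", "world"]);
    // LargeStringArray = GenericStringArray<i64>: i64 offsets lift that cap.
    let large = LargeStringArray::from(vec!["hello", "world"]);
    assert_eq!(small.data_type(), &DataType::Utf8);
    assert_eq!(large.data_type(), &DataType::LargeUtf8);
}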

How was this change tested?

Existing unit tests

@DrakeLin DrakeLin force-pushed the drake-lin_data/large-string-overflow branch from 7379c44 to de8c5a2 Compare October 27, 2025 21:40

codecov bot commented Oct 27, 2025

Codecov Report

❌ Patch coverage is 42.85714% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.78%. Comparing base (2b49385) to head (de8c5a2).
⚠️ Report is 2 commits behind head on main.

Files with missing lines            | Patch %  | Lines
kernel/src/engine/arrow_utils.rs    | 44.68%   | 24 Missing and 2 partials ⚠️
kernel/src/engine/arrow_get_data.rs | 0.00%    | 5 Missing ⚠️
kernel/src/engine/arrow_data.rs     | 75.00%   | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1427      +/-   ##
==========================================
- Coverage   84.85%   84.78%   -0.08%     
==========================================
  Files         119      119              
  Lines       30862    30923      +61     
==========================================
+ Hits        26188    26218      +30     
- Misses       3395     3425      +30     
- Partials     1279     1280       +1     


@DrakeLin DrakeLin force-pushed the drake-lin_data/large-string-overflow branch 3 times, most recently from d81b046 to 02f6519 Compare October 28, 2025 17:47
@DrakeLin DrakeLin force-pushed the drake-lin_data/large-string-overflow branch from 02f6519 to 63e8150 Compare October 28, 2025 21:19
@DrakeLin DrakeLin changed the title from [WIP] Fix overflow to fix: Switch Kernel to use Arrow LargeStringArray as default string representation Oct 28, 2025

@nicklan nicklan left a comment


Thanks, mostly looks good. It would be great if we could reduce the "try large, otherwise try small" patterns by using some generics, but it's certainly not trivial.
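
For example, a hedged sketch of what such a generic might look like (total_len is an illustrative helper, not code from this PR):

use arrow::array::{GenericStringArray, OffsetSizeTrait};

// One implementation serves StringArray (O = i32) and LargeStringArray
// (O = i64), so there is no "try large, otherwise try small" branch at all.
fn total_len<O: OffsetSizeTrait>(arr: &GenericStringArray<O>) -> usize {
    arr.iter().flatten().map(str::len).sum()
}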

Float(val) => append_val_as!(array::Float32Builder, *val),
Double(val) => append_val_as!(array::Float64Builder, *val),
String(val) => append_val_as!(array::StringBuilder, val),
String(val) => {

If we know that the only place that calls into this is above in to_array (which it is, afaict), and to_array figures the type out by doing let data_type = ArrowDataType::try_from_kernel(&self.data_type())?;, then we know that the builder is always of a large type if that's what our schema conversion does, so we should never need to handle the small case.

Alternatively, if we want the option to do both, let's check what type we converted into in to_array and pass an extra is_large argument or something that lets us do this without having to just try and see what works.
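
A hedged sketch of that alternative (append_string and its names are illustrative, not code from this PR): decide the offset width once from the converted Arrow type, then branch explicitly instead of downcasting speculatively.

use arrow::array::{ArrayBuilder, LargeStringBuilder, StringBuilder};

// `is_large` would be computed once, where ArrowDataType::try_from_kernel
// runs, and threaded through instead of guessing at each append.
fn append_string(builder: &mut dyn ArrayBuilder, val: &str, is_large: bool) {
    if is_large {
        builder
            .as_any_mut()
            .downcast_mut::<LargeStringBuilder>()
            .expect("schema conversion produced LargeUtf8")
            .append_value(val);
    } else {
        builder
            .as_any_mut()
            .downcast_mut::<StringBuilder>()
            .expect("schema conversion produced Utf8")
            .append_value(val);
    }
}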

let array_ref = apply_schema_to(&array_ref, output_type)?;
let arrow_type = ArrowDataType::try_from_kernel(output_type)?;
let schema = ArrowSchema::new(vec![ArrowField::new("output", arrow_type, true)]);
// Use the actual data type of the array, not the converted kernel type

When does using output_type cause an issue? I feel like we should not do this, because we'll get unexpected behavior where the output doesn't actually match what we asked for.
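
A sketch of the invariant being asked for here, reusing the identifiers from the quoted code (the debug_assert_eq! check is illustrative, not code from this PR):

let array_ref = apply_schema_to(&array_ref, output_type)?;
let arrow_type = ArrowDataType::try_from_kernel(output_type)?;
// Fail fast in debug builds if the produced array drifts from the request.
debug_assert_eq!(array_ref.data_type(), &arrow_type);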

Comment on lines 103 to 110
// Try both i32 (StringArray) and i64 (LargeStringArray) offsets
if let Some(sarry) = arry.as_string_opt::<i32>() {
sarry.value(index).to_string()
} else if let Some(sarry) = arry.as_string_opt::<i64>() {
sarry.value(index).to_string()
} else {
String::new()
}

I feel like we should be able to do:

Suggested change:

-// Try both i32 (StringArray) and i64 (LargeStringArray) offsets
-if let Some(sarry) = arry.as_string_opt::<i32>() {
-    sarry.value(index).to_string()
-} else if let Some(sarry) = arry.as_string_opt::<i64>() {
-    sarry.value(index).to_string()
-} else {
-    String::new()
-}
+let sarry = arry.as_string::<OffsetSize>();
+sarry.value(index).to_string()

But I see that causes some tests to fail. We might be able to sort it out by changing things at the call-site. I can have a look as well when I have some time.
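
A hedged sketch of how the call-site change might look so the suggested as_string::<OffsetSize>() works (extract_string is an illustrative wrapper, not code from this PR):

use arrow::array::{Array, AsArray, OffsetSizeTrait};

// The caller must already know the offset width (e.g. from schema
// conversion); as_string::<O> panics on a mismatch rather than falling back.
fn extract_string<O: OffsetSizeTrait>(arry: &dyn Array, index: usize) -> String {
    arry.as_string::<O>().value(index).to_string()
}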
