
Conversation


@DrakeLin DrakeLin commented Oct 24, 2025

What changes are proposed in this pull request?

Arrow's standard StringArray uses i32 offsets to index into the underlying byte buffer, which limits the total string data to 2GB per array. Delta tables with large string columns therefore cause overflow errors when processed by delta-kernel-rs. See delta-io/delta-rs#3790 for details.

To address this, we change the default string type from Utf8 to LargeUtf8 in the arrow conversion code.

A best effort was made to keep the code generic across Utf8 and LargeUtf8, for a future where we can dynamically select between the two types depending on client choice.
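
For illustration, a minimal sketch (assuming the arrow crate; this is not code from the PR) of the distinction being changed here:

use arrow::array::{Array, LargeStringArray, StringArray};
use arrow::datatypes::DataType;

fn main() {
    // StringArray = GenericStringArray<i32>: offsets into the shared value
    // buffer are i32, so total string bytes per array cannot exceed i32::MAX.
    let small = StringArray::from(vec!["hello", "world"]);
    // LargeStringArray = GenericStringArray<i64>: i64 offsets lift that cap.
    let large = LargeStringArray::from(vec!["hello", "world"]);
    assert_eq!(small.data_type(), &DataType::Utf8);
    assert_eq!(large.data_type(), &DataType::LargeUtf8);
}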

How was this change tested?

Existing unit tests

@DrakeLin DrakeLin force-pushed the drake-lin_data/large-string-overflow branch from 7379c44 to de8c5a2 Compare October 27, 2025 21:40

codecov bot commented Oct 27, 2025

Codecov Report

❌ Patch coverage is 42.85714% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.78%. Comparing base (2b49385) to head (de8c5a2).
⚠️ Report is 2 commits behind head on main.

Files with missing lines            | Patch %  | Lines
kernel/src/engine/arrow_utils.rs    | 44.68%   | 24 Missing and 2 partials ⚠️
kernel/src/engine/arrow_get_data.rs | 0.00%    | 5 Missing ⚠️
kernel/src/engine/arrow_data.rs     | 75.00%   | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1427      +/-   ##
==========================================
- Coverage   84.85%   84.78%   -0.08%     
==========================================
  Files         119      119              
  Lines       30862    30923      +61     
==========================================
+ Hits        26188    26218      +30     
- Misses       3395     3425      +30     
- Partials     1279     1280       +1     


@DrakeLin DrakeLin force-pushed the drake-lin_data/large-string-overflow branch 3 times, most recently from d81b046 to 02f6519 Compare October 28, 2025 17:47
@DrakeLin DrakeLin force-pushed the drake-lin_data/large-string-overflow branch from 02f6519 to 63e8150 Compare October 28, 2025 21:19
@DrakeLin DrakeLin changed the title from [WIP] Fix overflow to fix: Switch Kernel to use Arrow LargeStringArray as default string representation Oct 28, 2025

@nicklan nicklan left a comment


Thanks, mostly looks good. It would be great if we could reduce the "try large, otherwise try small" patterns by using some generics, but it's certainly not trivial.
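
For example, a hedged sketch of what such a generic might look like (total_len is an illustrative helper, not code from this PR):

use arrow::array::{GenericStringArray, OffsetSizeTrait};

// One implementation serves StringArray (O = i32) and LargeStringArray
// (O = i64), so there is no "try large, otherwise try small" branch at all.
fn total_len<O: OffsetSizeTrait>(arr: &GenericStringArray<O>) -> usize {
    arr.iter().flatten().map(str::len).sum()
}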

Float(val) => append_val_as!(array::Float32Builder, *val),
Double(val) => append_val_as!(array::Float64Builder, *val),
String(val) => append_val_as!(array::StringBuilder, val),
String(val) => {

If we know that the only place that calls into this is above in to_array (which it is, afaict), and to_array figures the type out by doing let data_type = ArrowDataType::try_from_kernel(&self.data_type())?;, then we know that the builder is always of a large type if that's what our schema conversion does, so we should never need to handle the small case.

Alternatively, if we want the option to do both, let's check what type we converted into in to_array and pass an extra is_large argument or something that lets us do this without having to just try and see what works.
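
A hedged sketch of that alternative (append_string and its names are illustrative, not code from this PR): decide the offset width once from the converted Arrow type, then branch explicitly instead of downcasting speculatively.

use arrow::array::{ArrayBuilder, LargeStringBuilder, StringBuilder};

// `is_large` would be computed once, where ArrowDataType::try_from_kernel
// runs, and threaded through instead of guessing at each append.
fn append_string(builder: &mut dyn ArrayBuilder, val: &str, is_large: bool) {
    if is_large {
        builder
            .as_any_mut()
            .downcast_mut::<LargeStringBuilder>()
            .expect("schema conversion produced LargeUtf8")
            .append_value(val);
    } else {
        builder
            .as_any_mut()
            .downcast_mut::<StringBuilder>()
            .expect("schema conversion produced Utf8")
            .append_value(val);
    }
}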

let array_ref = apply_schema_to(&array_ref, output_type)?;
let arrow_type = ArrowDataType::try_from_kernel(output_type)?;
let schema = ArrowSchema::new(vec![ArrowField::new("output", arrow_type, true)]);
// Use the actual data type of the array, not the converted kernel type

When does using output_type cause an issue? I feel like we should not do this, because we'll get unexpected behavior where the output doesn't actually match what we asked for.
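
A sketch of the invariant being asked for here, reusing the identifiers from the quoted code (the debug_assert_eq! check is illustrative, not code from this PR):

let array_ref = apply_schema_to(&array_ref, output_type)?;
let arrow_type = ArrowDataType::try_from_kernel(output_type)?;
// Fail fast in debug builds if the produced array drifts from the request.
debug_assert_eq!(array_ref.data_type(), &arrow_type);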

Comment on lines 103 to 110
// Try both i32 (StringArray) and i64 (LargeStringArray) offsets
if let Some(sarry) = arry.as_string_opt::<i32>() {
sarry.value(index).to_string()
} else if let Some(sarry) = arry.as_string_opt::<i64>() {
sarry.value(index).to_string()
} else {
String::new()
}

I feel like we should be able to do:

Suggested change:

-// Try both i32 (StringArray) and i64 (LargeStringArray) offsets
-if let Some(sarry) = arry.as_string_opt::<i32>() {
-    sarry.value(index).to_string()
-} else if let Some(sarry) = arry.as_string_opt::<i64>() {
-    sarry.value(index).to_string()
-} else {
-    String::new()
-}
+let sarry = arry.as_string::<OffsetSize>();
+sarry.value(index).to_string()

But I see that causes some tests to fail. We might be able to sort it out by changing things at the call-site. I can have a look as well when I have some time.
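
A hedged sketch of how the call-site change might look so the suggested as_string::<OffsetSize>() works (extract_string is an illustrative wrapper, not code from this PR):

use arrow::array::{Array, AsArray, OffsetSizeTrait};

// The caller must already know the offset width (e.g. from schema
// conversion); as_string::<O> panics on a mismatch rather than falling back.
fn extract_string<O: OffsetSizeTrait>(arry: &dyn Array, index: usize) -> String {
    arry.as_string::<O>().value(index).to_string()
}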
