feat: Support Utf8View in JSON reader #7263

Merged on Mar 13, 2025 (9 commits)

zhuqi-lucas

Which issue does this PR close?

Closes #7244

Rationale for this change

Support Utf8View in JSON reader

What changes are included in this PR?

Support Utf8View in JSON reader

Are there any user-facing changes?

Support Utf8View in JSON reader

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 11, 2025

@alamb alamb left a comment

Thank you very much @zhuqi-lucas!

I think this PR just needs

  1. some adjustment on data allocation

  2. A performance benchmark. Perhaps you could extend this one:

https://github.com/apache/arrow-rs/blob/a75da00eed762f8ab201c6cb4388921ad9b67e7e/arrow/benches/json_reader.rs#L45-L44

assert_eq!(col1.null_count(), 2);
assert_eq!(col1.value(0), "1");
assert_eq!(col1.value(1), "hello");
assert_eq!(col1.value(2), "\nfoobar😀asfgÿ");

this value has more than 12 bytes so it will exercise the longer string view path
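
To make the 12-byte point concrete: the threshold is measured in UTF-8 bytes, not characters. The sketch below is only an illustration; `MAX_INLINE_LEN` and `needs_data_buffer` are hypothetical names, not part of arrow-rs.

```rust
/// Strings of at most this many UTF-8 bytes fit inline in a view
/// (hypothetical constant name, used here for illustration only).
const MAX_INLINE_LEN: usize = 12;

/// Returns true when `s` is too long to be inlined.
/// `str::len` counts bytes, so multi-byte characters count more than once.
fn needs_data_buffer(s: &str) -> bool {
    s.len() > MAX_INLINE_LEN
}
```

The test string above has 13 characters but 17 bytes, because the emoji takes 4 bytes and `ÿ` takes 2, so it lands on the long-string path, while "hello" stays inline.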

TapeElement::String(idx) => {
data_capacity += tape.get_string(idx).len();
}
TapeElement::Null => { /* do not increase capacity */ }

Would it be possible to use English in the comments?

fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> {
let coerce = self.coerce_primitive;
let mut data_capacity = 0;
for &p in pos {

StringView differs from StringArray in that only "long" strings (longer than 12 bytes) contribute to the data buffer.

Thus I think these calculations should be adjusted:

  1. only increase capacity for strings if the data is over 12 bytes
  2. don't increase for boolean
  3. For I32 we can probably use zero as well (the longest such value is -2147483648, which is 11 bytes and so less than 12)
  4. For I64 maybe we could be more sophisticated and only add data length if the value is over 999999999999 etc.
  5. For F32 and F64 I am not sure what the maximum length of a string representation is, so we should probably keep the existing estimate

More details on the layout are here: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
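
The rules in the list above could be sketched roughly as follows. This is not the code from the PR: `Element` is a hypothetical stand-in for the decoder's tape elements, floats are left out because the review keeps their existing estimate, and the I64 arm uses the formatted length (sign included) rather than the 999999999999 threshold.

```rust
/// Hypothetical stand-in for the decoder's tape elements (not the arrow-json type).
#[allow(dead_code)]
enum Element<'a> {
    Null,
    Bool(bool),
    Str(&'a str),
    I32(i32),
    I64(i64),
}

/// Values of at most this many bytes are inlined and need no data buffer space.
const MAX_INLINE_LEN: usize = 12;

/// Number of bytes in the decimal representation of `n`, including a leading '-'.
fn decimal_len(n: i64) -> usize {
    let mut digits = 1;
    let mut rest = n.unsigned_abs();
    while rest >= 10 {
        rest /= 10;
        digits += 1;
    }
    digits + usize::from(n < 0)
}

/// Estimates how many data buffer bytes the elements will need.
fn estimate_data_capacity(elements: &[Element<'_>]) -> usize {
    let mut capacity = 0;
    for e in elements {
        capacity += match e {
            // Only long strings are written to the data buffer.
            Element::Str(s) if s.len() > MAX_INLINE_LEN => s.len(),
            // Only integers whose text exceeds 12 bytes count.
            Element::I64(n) if decimal_len(*n) > MAX_INLINE_LEN => decimal_len(*n),
            // Nulls, booleans, i32 values (at most 11 bytes) and short values are inline.
            Element::Null
            | Element::Bool(_)
            | Element::I32(_)
            | Element::Str(_)
            | Element::I64(_) => 0,
        };
    }
    capacity
}
```

Since the result is only a capacity hint, a slightly low or high estimate costs at most a reallocation or some unused space.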

Good explanation and guidance, thank you @alamb!

_ => unreachable!(),
},
TapeElement::I32(n) if coerce => {
builder.append_value(n.to_string());

This allocation is quite unfortunate (to_string() allocates a string) but I see this is consistent with the other JSON string implementation:

https://github.com/apache/arrow-rs/blob/a0c3186c55ac8ed3f6b8a15d1305548fd6305ebb/arrow-json/src/reader/string_array.rs#L111-L110
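
One common way to avoid the per-value allocation mentioned above is to format into a single reusable `String`. The sketch below shows the general pattern only; `format_all` and its `sink` callback are hypothetical and stand in for whatever consumes the text (for example a builder's append call).

```rust
use std::fmt::Write;

/// Formats each value into one reusable buffer and passes the text to `sink`.
/// `sink` is a hypothetical stand-in for the consumer of the formatted text.
fn format_all(values: &[i64], mut sink: impl FnMut(&str)) {
    let mut buf = String::new();
    for v in values {
        // `clear` resets the length but keeps the allocated capacity,
        // so later iterations usually do not allocate.
        buf.clear();
        write!(buf, "{v}").expect("writing to a String cannot fail");
        sink(&buf);
    }
}
```

Compared with calling `to_string()` per value, the buffer is allocated once and grows only when a longer value shows up.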

alamb commented Mar 11, 2025

(BTW the MSRV test is fixed on main thanks to @tustvold -- you should be able to merge up from main and the CI will pass)

@zhuqi-lucas

Thank you @alamb for the review, I have addressed the comments now.

The benchmark result for Utf8View also looks slightly better (about 6.04 µs versus 6.25 µs, roughly 3% faster):

small_bench_primitive   time:   [6.2293 µs 6.2540 µs 6.2810 µs]
                        change: [+0.6034% +1.1278% +1.6922%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe
small_bench_primitive_with_utf8view
                        time:   [6.0220 µs 6.0420 µs 6.0649 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

@zhuqi-lucas zhuqi-lucas requested a review from alamb March 12, 2025 10:16

@alamb alamb left a comment

Thank you very much @zhuqi-lucas -- I think this is a really nice contribution.

I left some small code comment suggestions, but I can also add them in a follow-on PR

Thanks again!

let coerce = self.coerce_primitive;
let mut data_capacity = 0;
for &p in pos {
match tape.get(p) {

I think a little context on the rationale here would be helpful:

Suggested change
match tape.get(p) {
// note that StringView is different than StringArray in that only
// "long" strings (longer than 12 bytes) are stored in the buffer.
// "short" strings are inlined into a fixed length structure.
match tape.get(p) {
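
As a rough illustration of the "fixed length structure" mentioned in the suggested comment, the sketch below encodes a value as a 16-byte view following the Arrow view layout: a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus a buffer index and an offset. `encode_view` is a hypothetical helper written for this example, not an arrow-rs function, and the `buffer_index` and `offset` arguments are assumed to be supplied by the caller.

```rust
/// Encodes `data` as a 16-byte view (little-endian fields).
/// `buffer_index` and `offset` are only used for values longer than 12 bytes
/// and say where the full value lives in the data buffers.
fn encode_view(data: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    // Bytes 0..4: length of the value.
    view[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        // Short value: the bytes are stored inline, zero padded.
        view[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long value: 4-byte prefix, then buffer index and offset.
        view[4..8].copy_from_slice(&data[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}
```

This is why short strings add nothing to the data capacity estimate: their bytes live entirely inside the view.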

data_capacity += s.len();
}
}
// For I64, only add capacity if the absolute value is greater than 999,999,999,999

Suggested change
// For I64, only add capacity if the absolute value is greater than 999,999,999,999
// For I64, only add capacity if the absolute value is greater than 999,999,999,999
// (the largest number that can fit in 12 bytes)

TapeElement::Null => {
// Do not increase capacity for null values
}
// For booleans, do not increase capacity

Suggested change
// For booleans, do not increase capacity
// For booleans, do not increase capacity (both "true" and "false" are less than
// 12 bytes)

TapeElement::I32(low) => {
let val = ((high as i64) << 32) | (low as u32) as i64;
tmp_buf.clear();
// Reuse the temporary buffer instead of allocating a new String

nice!
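
The `val` expression in the snippet above rebuilds a 64-bit integer from two 32-bit halves. A minimal sketch of that step, assuming both halves arrive as `i32` values as the snippet suggests (`combine_halves` and `split_halves` are hypothetical names):

```rust
/// Rebuilds an i64 from its high and low 32-bit halves.
/// The low half goes through `u32` first so it is zero-extended;
/// casting a negative `i32` straight to `i64` would sign-extend it
/// and overwrite the high bits.
fn combine_halves(high: i32, low: i32) -> i64 {
    ((high as i64) << 32) | (low as u32) as i64
}

/// Inverse operation, included only to check the round trip.
fn split_halves(value: i64) -> (i32, i32) {
    ((value >> 32) as i32, value as i32)
}
```

For example, a high half of 0 with a low half of -1 (all bits set) yields 4294967295 rather than -1.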

@zhuqi-lucas

Thank you @alamb for the review! I have also addressed the suggestions in the latest PR update, thanks!

@alamb alamb merged commit c26e427 into apache:main Mar 13, 2025
23 checks passed

alamb commented Mar 13, 2025

Thanks again @zhuqi-lucas

PinkCrow007 pushed a commit to PinkCrow007/arrow-rs that referenced this pull request Mar 20, 2025
* feat: Support Utf8View in JSON reader

* Add code

* Fix fmt

* Address comments

* Add benchmark

* Add benchmark

* Fix lint

* Clean up comments