-
Notifications
You must be signed in to change notification settings - Fork 918
feat: Support Utf8View in JSON reader #7263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @zhuqi-lucas!
I think this PR just needs
-
some adjustment on data allocation
-
A performance benchmark. Perhaps you could extend this one:
assert_eq!(col1.null_count(), 2); | ||
assert_eq!(col1.value(0), "1"); | ||
assert_eq!(col1.value(1), "hello"); | ||
assert_eq!(col1.value(2), "\nfoobar😀asfgÿ"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this value has more than 12 bytes so it will exercise the longer string view path
TapeElement::String(idx) => { | ||
data_capacity += tape.get_string(idx).len(); | ||
} | ||
TapeElement::Null => { /* 不增加容量 */ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to use english in the comments?
fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> { | ||
let coerce = self.coerce_primitive; | ||
let mut data_capacity = 0; | ||
for &p in pos { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringView is different that StringArray in that only "long" strings (longer than 12 bytes) contributed to the data
Thus I think these calculations should be adjusted:
- only increase capacity for strings if the data is over 12 bytes
- don't increase for boolean
- For I32 probably we can use zero as well (as the longest such value is
-2147483647
whcih is less than 12 bytes) - For I64 maybe we could be more sophisticated and only add data length if the value is over
999999999999
etc. - For F32 and F64 I am not sure what hte maximum length of a string representation is so we should probably keep the existing estimate
More details on the layout are here: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good explain and guide, thank you @alamb!
_ => unreachable!(), | ||
}, | ||
TapeElement::I32(n) if coerce => { | ||
builder.append_value(n.to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allocation is quite unfortunate (to_string()
allocates a string) but I see this is consistent with the other JSON string implementation:
(BTW the MSRV test is fixed on main thanks to @tustvold -- you should be able to merge up from main and the CI will pass) |
Thank you @alamb for review, addressed the comments now. And also the benchmark result for utf8view seems better: small_bench_primitive time: [6.2293 µs 6.2540 µs 6.2810 µs]
change: [+0.6034% +1.1278% +1.6922%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
7 (7.00%) high mild
3 (3.00%) high severe small_bench_primitive_with_utf8view
time: [6.0220 µs 6.0420 µs 6.0649 µs]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @zhuqi-lucas -- I think this is a really nice contribution.
I left some small code comment suggestions but I can also add them as a follow on PR as well
Thanks again!
let coerce = self.coerce_primitive; | ||
let mut data_capacity = 0; | ||
for &p in pos { | ||
match tape.get(p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a little context on the rationale here would be helpful:
match tape.get(p) { | |
// note that StringView is different that StringArray in that only | |
// "long" strings (longer than 12 bytes) are stored in the buffer. | |
// "short" strings are inlined into a fixed length structure. | |
match tape.get(p) { |
data_capacity += s.len(); | ||
} | ||
} | ||
// For I64, only add capacity if the absolute value is greater than 999,999,999,999 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// For I64, only add capacity if the absolute value is greater than 999,999,999,999 | |
// For I64, only add capacity if the absolute value is greater than 999,999,999,999 | |
// (the largest number that can fit in 12 bytes) |
TapeElement::Null => { | ||
// Do not increase capacity for null values | ||
} | ||
// For booleans, do not increase capacity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// For booleans, do not increase capacity | |
// For booleans, do not increase capacity (both "true" and "false" are less than | |
// 12 bytes) |
TapeElement::I32(low) => { | ||
let val = ((high as i64) << 32) | (low as u32) as i64; | ||
tmp_buf.clear(); | ||
// Reuse the temporary buffer instead of allocating a new String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
Thank you @alamb for review! I also addressed the suggestions in latest PR, thanks! |
Thanks again @zhuqi-lucas |
* feat: Support Utf8View in JSON reader * Add code * Fix fmt * Address comments * Add benchmark * Add benchmark * Fix lint * Clean up comments
Which issue does this PR close?
Closes #7244
Rationale for this change
Support Utf8View in JSON reader
What changes are included in this PR?
Support Utf8View in JSON reader
Are there any user-facing changes?
Support Utf8View in JSON reader