Improve speed of row converter by skipping utf8 checks #6058
Comments
take |
Hi @alamb and @XiangpengHao, I have some observations for this issue. The utf8 validation only happens when row.config.validate_utf8 is true. validate_utf8 is only set to true when the rows are initialized from a RowParser (lines 781 to 788 in 49840ec).
I find the only usage of Line 759 in 49840ec
|
That's an excellent observation, I also double checked the RowConverter in DataFusion and also did not find any reference to I think utf-8 is not being validated (and is expected) in DataFusion, so we are not slowed by utf-8 validation. But we probably have to keep that utf-8 check because other users may use |
Perfect. Thank you for the investigation @xinlifoobar and the confirmation @XiangpengHao |
@alamb @XiangpengHao Is utf8 validation in parquet reader necessary? I found a large proportion of |
I think it depends on how much you trust your input files to be valid. If you trust the files to contain only valid utf8 data, then disabling UTF8 validation is certainly an option. However, I think disabling this check would be somewhat cheating on benchmarks, as real systems should be validating all user supplied input for safety. Here is a ticket describing a proposal to turn it off |
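To illustrate why validating untrusted input matters, here is a minimal sketch using only the Rust standard library. The helper name decode_untrusted is hypothetical and is not part of arrow-rs or the parquet reader; it only shows that std::str::from_utf8 rejects byte sequences that are not valid UTF-8.

```rust
/// Hypothetical helper: decode bytes that came from an untrusted source,
/// such as a user-supplied file. Validation happens inside `from_utf8`.
fn decode_untrusted(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

fn main() {
    let valid = "héllo".as_bytes();
    // The byte 0xff never appears in well-formed UTF-8.
    let invalid = [0x66, 0x6f, 0xff, 0x6f];

    assert_eq!(decode_untrusted(valid), Ok("héllo"));
    assert!(decode_untrusted(&invalid).is_err());
}
```

Skipping this check on the second input would produce a string that violates Rust's UTF-8 invariant, which is why turning validation off is only reasonable when the files are trusted.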
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5374
@XiangpengHao implemented optimized row format --> ByteView (StringView / BinaryView) encoding/decoding in #5945 / #6044
It also adds benchmarks so we can test🎉
However, as mentioned in https://github.com/apache/arrow-rs/pull/6044/files#r1676804033 if we know that the Row value was created from valid utf8 values, re-validating utf8 is unnecessary.
Describe the solution you'd like
Consider an API that would allow skipping utf8 validation
This would need to be justified by performance benchmarks showing it made a significant difference in performance
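One possible shape for such an API, sketched with the standard library only. SketchDecoder and its methods are hypothetical names chosen for illustration; they are not the arrow-rs RowConverter API. The safe path validates, while the unsafe path moves the UTF-8 guarantee onto the caller.

```rust
/// Hypothetical stand-in for a row decoder; NOT the arrow-rs `RowConverter`.
struct SketchDecoder;

impl SketchDecoder {
    /// Safe path: validates the bytes before building a `String`.
    fn decode(&self, bytes: &[u8]) -> Result<String, std::str::Utf8Error> {
        std::str::from_utf8(bytes).map(str::to_owned)
    }

    /// Fast path: skips validation entirely.
    ///
    /// # Safety
    /// `bytes` must be valid UTF-8, for example because they were
    /// produced by encoding a value that was already a valid string.
    unsafe fn decode_unchecked(&self, bytes: &[u8]) -> String {
        // SAFETY: the caller guarantees that `bytes` is valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(bytes) }.to_owned()
    }
}

fn main() {
    let dec = SketchDecoder;
    let bytes = "row value".as_bytes();

    let checked = dec.decode(bytes).expect("input is valid UTF-8");
    // Sound here because `bytes` came from a `&str`.
    let unchecked = unsafe { dec.decode_unchecked(bytes) };

    assert_eq!(checked, unchecked);
}
```

Whether the unchecked path is worth exposing would still depend on benchmarks showing a meaningful difference between the two.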
Describe alternatives you've considered
Perhaps it would be an unsafe option on the RowConverter
Additional context