Improve speed of row converter by skipping utf8 checks #6058

alamb · 2024-07-15T11:09:26Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5374

@XiangpengHao implemented optimized row format --> ByteView (StringView / BinaryView) encoding/decoding in #5945 / #6044

It also adds benchmarks so we can test🎉

However, as mentioned in https://github.com/apache/arrow-rs/pull/6044/files#r1676804033, if we know that the Row value was created from valid utf8 values, re-validating utf8 is unnecessary.

Describe the solution you'd like

Consider an API that would allow skipping utf8 validation.

This would need to be justified by benchmarks showing that it makes a significant difference in performance.

Describe alternatives you've considered

Perhaps it would be an unsafe option on the RowConverter

let converter = RowConverter::new(...);

// Safety: only decoding Rows that came from valid String arrays
let converter = unsafe {
  converter.with_validate_utf8(false)
};
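To make the idea concrete, here is a minimal sketch of what such an opt-out could look like, using only the standard library. `ToyConverter` and its methods are hypothetical stand-ins made up for illustration. They are not the arrow-row API, and `with_validate_utf8` is only the name proposed above, not an existing method.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for a row converter; NOT the arrow-row type.
#[derive(Debug, Clone, Copy)]
pub struct ToyConverter {
    validate_utf8: bool,
}

impl ToyConverter {
    pub fn new() -> Self {
        // Validation stays on unless the caller explicitly opts out.
        Self { validate_utf8: true }
    }

    /// # Safety
    /// With `validate == false`, the caller must only decode bytes that
    /// are already known to be valid UTF-8.
    pub unsafe fn with_validate_utf8(mut self, validate: bool) -> Self {
        self.validate_utf8 = validate;
        self
    }

    /// Decode bytes as a string slice, checking UTF-8 only if configured to.
    pub fn decode<'a>(&self, bytes: &'a [u8]) -> Result<&'a str, Utf8Error> {
        if self.validate_utf8 {
            std::str::from_utf8(bytes)
        } else {
            // SAFETY: upheld by whoever called `with_validate_utf8(false)`.
            Ok(unsafe { std::str::from_utf8_unchecked(bytes) })
        }
    }
}
```

Making the setter `unsafe` puts the burden of proof on the caller at the point where validation is turned off, rather than at every decode call.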

Additional context


xinlifoobar · 2024-08-05T09:28:41Z

take

xinlifoobar · 2024-08-08T08:39:04Z

Hi @alamb and @XiangpengHao, I have some observations for this issue.

The utf8 validation only happens when row.config.validate_utf8 is true.

https://github.com/apache/arrow-rs/blob/49840ec0f110da5e9a21ce97affd32313d0b720f/arrow-row/src/lib.rs#L1302C1-L1303C1

validate_utf8 is only set to true when the config is initialized from a RowParser:

arrow-rs/arrow-row/src/lib.rs

Lines 781 to 788 in 49840ec

    
           fn new(fields: Arc<[SortField]>) -> Self { 
        
               Self { 
        
                   config: RowConfig { 
        
                       fields, 
        
                       validate_utf8: true, 
        
                   }, 
        
               } 
        
           }

The only usages of RowParser I can find are here in RowConverter and in the tests. Does this mean validate_utf8 is never set to true in the current implementation of arrow-rs unless a RowParser is used, so we would not have the additional validation?

arrow-rs/arrow-row/src/lib.rs

Line 759 in 49840ec

pub fn parser(&self) -> RowParser {
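A simplified model of the behaviour described above, for readers who do not want to open the linked source. All names here (`ToyRowConfig`, `ToyRow`, `row_from_converter`, `row_from_parser`, `decode_rows`) are hypothetical, and the "validate the batch if any row is untrusted" policy is an assumption of this sketch rather than a statement about the arrow-row internals.

```rust
use std::str::Utf8Error;

/// Hypothetical, simplified mirror of the per-row config discussed above.
#[derive(Debug, Clone, Copy)]
struct ToyRowConfig {
    validate_utf8: bool,
}

/// A row is just some bytes plus the config it was created with.
struct ToyRow<'a> {
    data: &'a [u8],
    config: ToyRowConfig,
}

/// Rows produced by the converter came from valid arrays, so they are trusted.
fn row_from_converter(data: &[u8]) -> ToyRow<'_> {
    ToyRow { data, config: ToyRowConfig { validate_utf8: false } }
}

/// Rows parsed from arbitrary bytes are untrusted and must be checked.
fn row_from_parser(data: &[u8]) -> ToyRow<'_> {
    ToyRow { data, config: ToyRowConfig { validate_utf8: true } }
}

/// Decode a batch, validating only if at least one row asks for it
/// (an assumption of this sketch).
fn decode_rows<'a>(rows: &[ToyRow<'a>]) -> Result<Vec<&'a str>, Utf8Error> {
    let validate = rows.iter().any(|r| r.config.validate_utf8);
    rows.iter()
        .map(|r| {
            if validate {
                std::str::from_utf8(r.data)
            } else {
                // SAFETY: every row in the batch came from `row_from_converter`.
                Ok(unsafe { std::str::from_utf8_unchecked(r.data) })
            }
        })
        .collect()
}
```

In this model, code that never goes through the parser path never pays for validation, which matches the observation in this comment.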

XiangpengHao · 2024-08-08T09:25:49Z

did this mean the validate_utf8 will never set to true

That's an excellent observation. I also double-checked how RowConverter is used in DataFusion and did not find any reference to .parser() or RowParser, for example in GroupValuesRows: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/group_values/row.rs#L74-L80

I think utf-8 is not being validated in DataFusion (which is expected), so we are not slowed down by utf-8 validation. But we probably have to keep the utf-8 check because other users may use RowParser.

alamb · 2024-08-08T11:25:48Z

Perfect. Thank you for the investigation @xinlifoobar and the confirmation @XiangpengHao

wForget · 2025-03-28T11:24:57Z

@alamb @XiangpengHao Is utf8 validation in parquet reader necessary? I found a large proportion of parquet::arrow::buffer::offset_buffer::OffsetBuffer<I>::check_valid_utf8 when profiling datafusion-comet native scan.

alamb · 2025-03-28T15:16:10Z

@alamb @XiangpengHao Is utf8 validation in parquet reader necessary? I found a large proportion of parquet::arrow::buffer::offset_buffer::OffsetBuffer<I>::check_valid_utf8 when profiling datafusion-comet native scan.

I think it depends on how much you trust your input files to be valid. If you trust the files to contain only valid utf8 data, then disabling UTF8 validation is certainly an option.

However, I think disabling this check would be somewhat cheating on benchmarks as real systems should be validating all user supplied input for safety.

Here is a ticket describing a proposal to turn it off:

Proposal: Add unsafe option to disable UTF8 validation on parquet read #6701
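One way to picture the trade-off between trusting input and validating it: check untrusted bytes once at the boundary, then skip the check on later reads. The sketch below uses only the standard library. `ValidatedBytes` is a hypothetical type invented for illustration and is not part of arrow-rs or the parquet crate.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper around bytes that were checked once on entry.
pub struct ValidatedBytes(Vec<u8>);

impl ValidatedBytes {
    /// Untrusted input (for example, a user-supplied file) pays for
    /// validation here, exactly once.
    pub fn ingest(raw: Vec<u8>) -> Result<Self, Utf8Error> {
        std::str::from_utf8(&raw)?;
        Ok(Self(raw))
    }

    /// Later reads skip the check.
    pub fn as_str(&self) -> &str {
        // SAFETY: `ingest` is the only constructor, it validated the bytes,
        // and the inner Vec is never mutated afterwards.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}
```

Skipping validation entirely is only sound when something upstream plays the role of `ingest`; otherwise invalid bytes can reach code that assumes valid UTF-8.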

alamb added the enhancement label Jul 15, 2024

This was referenced Jul 15, 2024

[EPIC] Implement StringViewArray and BinaryViewArray #5374

Closed

Directly decode String/BinaryView types from arrow-row format #6044

Merged

alamb mentioned this issue Jul 31, 2024

[Epic] A collection of StringView / BinaryView improvements #6163

Open

9 tasks

alamb closed this as not planned Aug 8, 2024

alamb added the arrow label Aug 8, 2024
