[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456
Open · 1 of 7 tasks

Labels: `enhancement` (any new improvement worthy of an entry in the changelog), `parquet` (changes to the parquet crate)
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
When evaluating filters on data stored in parquet, you can:

1. Use the `with_row_filter` API to apply predicates during the scan
2. Decode the columns first, then apply the `filter` kernel afterwards

Currently, it is faster to use `with_row_filter` for some predicates and `filter` for others. In DataFusion we have a configuration setting, `filter_pushdown`, to choose between the strategies (see apache/datafusion#3463), but that is a bad UX: it means the user must somehow know which strategy to choose, and the best strategy varies by query and data.
In general, the queries that are slower when `with_row_filter` is used are those whose filters are not selective (the predicate matches most rows).

More Background:

The predicates are provided as a `RowFilter` (see its docs for more details).

**Describe the solution you'd like**
I would like the evaluation of predicates in a `RowFilter` (aka pushed-down predicates) to never be worse than decoding the columns first and then filtering them with the `filter` kernel.

We have added a benchmark in #7401, which hopefully can be used to measure progress toward this goal.
**Describe alternatives you've considered**

This goal will likely require several changes to the codebase. Here are some options:

- `skip` from `RowSelector` (#7450)
- `RowSelection::and_then` is slow (#7458)
- `Arc<dyn Array>` in parquet record batch reader (#4864)