[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456


Open · 1 of 7 tasks
alamb opened this issue Apr 29, 2025 · 2 comments
alamb commented Apr 29, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When evaluating filters on data stored in parquet, you can:

  1. Use the with_row_filter API to apply predicates during the scan
  2. Read the data and apply the predicate using the filter kernel afterwards
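
As an illustration, here is a minimal sketch of both strategies using the parquet and arrow crates. The file name `data.parquet`, the leaf column index, and the `x > 10` predicate are hypothetical; `with_row_filter`, `ArrowPredicateFn`, and `filter_record_batch` are the actual APIs involved.

```rust
use std::fs::File;

use arrow::array::Int64Array;
use arrow::compute::filter_record_batch;
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Strategy 1: push the predicate into the scan with `with_row_filter`.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    // The predicate decodes only the first leaf column (hypothetical index).
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        gt(batch.column(0), &Int64Array::new_scalar(10))
    });
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        let _batch = batch?; // contains only the rows that passed
    }

    // Strategy 2: decode all rows, then apply the `filter` kernel.
    let file = File::open("data.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        let batch = batch?;
        let keep = gt(batch.column(0), &Int64Array::new_scalar(10))?;
        let _filtered = filter_record_batch(&batch, &keep)?;
    }
    Ok(())
}
```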

Currently, it is faster to use with_row_filter for some predicates and the filter kernel for others. In DataFusion we have a configuration setting to choose between the strategies (filter_pushdown, see apache/datafusion#3463), but that is bad UX: it means the user must somehow know which strategy to choose, and the best strategy varies with the query and data.

In general, queries are slower with with_row_filter when:

  1. The predicates are not very selective (e.g. they pass more than 1% of the rows)
  2. The filters are applied to columns that are also used in the query result (e.g. a filter column is also in the projection)

More Background:

The predicates are provided as a RowFilter (see the docs for more details)

RowFilter applies predicates in order, decoding only the columns each predicate requires. As earlier predicates eliminate rows, fewer rows of subsequent columns need to be decoded, potentially reducing both IO and decode work.
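
For example, a RowFilter with two ordered predicates could be built as in the sketch below. The column indices and thresholds are made up; each predicate's ProjectionMask controls which leaf columns are decoded to evaluate it.

```rust
use std::fs::File;

use arrow::array::Int64Array;
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{
    ArrowPredicate, ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

/// Builds a RowFilter with two ordered predicates over hypothetical
/// leaf columns 0 and 1.
fn two_stage_filter(builder: &ParquetRecordBatchReaderBuilder<File>) -> RowFilter {
    // Runs first, decoding only leaf column 0.
    let pred_a = ArrowPredicateFn::new(
        ProjectionMask::leaves(builder.parquet_schema(), [0]),
        |batch| gt(batch.column(0), &Int64Array::new_scalar(10)),
    );
    // Runs second, decoding leaf column 1 only for rows that survived
    // `pred_a`. Its projected batch has a single column, hence `column(0)`.
    let pred_b = ArrowPredicateFn::new(
        ProjectionMask::leaves(builder.parquet_schema(), [1]),
        |batch| gt(batch.column(0), &Int64Array::new_scalar(0)),
    );
    let predicates: Vec<Box<dyn ArrowPredicate>> =
        vec![Box::new(pred_a), Box::new(pred_b)];
    RowFilter::new(predicates)
}
```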

Describe the solution you'd like

I would like the evaluation of predicates in RowFilter (aka pushed-down predicates) to never be worse than decoding the columns first and then filtering them with the filter kernel.

We have added a benchmark in #7401, which can hopefully be used to track progress:

cargo bench --all-features --bench arrow_reader_row_filter

Describe alternatives you've considered
This goal will likely require several changes to the codebase. Here are some options:

alamb added the enhancement (Any new improvement worthy of an entry in the changelog) and parquet (Changes to the parquet crate) labels Apr 29, 2025
alamb commented Apr 30, 2025

I just spoke with @XiangpengHao -- from my perspective the current status is:

  1. Parquet decoder / decoded page cache #7363: blocked on getting benchmark results that show the decoded page cache improves performance; then we can proceed to merge the page cache change
  2. In parallel / afterwards, we can move on to a better representation for RowFilter (Adaptive Parquet Predicate Pushdown Evaluation #5523 / Consider removing skip from RowSelector #7450 / RowSelection::and_then is slow #7458)

alamb commented May 8, 2025

Fascinatingly, ClickHouse recently published a blog post about their Parquet pushdown work:

https://clickhouse.com/blog/clickhouse-and-parquet-a-foundation-for-fast-lakehouse-analytics

Possibly even more interesting, they link to a master's thesis from Peter Boncz's group on how to quickly evaluate predicates during Parquet decoding: https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf

This thesis directly addresses some of the work we are considering (though it only considers selection masks (bitmasks) and selection vectors (selected indices)).
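
To make the distinction concrete, here is a sketch of the same row selection in three encodings: the run-based RowSelector/RowSelection representation arrow-rs uses today, plus the selection mask and selection vector forms the thesis considers. The specific rows chosen are made up.

```rust
use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // The same selection (rows 2, 3, 4, and 7 of 8) in three encodings.

    // 1. Run-based, as arrow-rs's RowSelection uses today:
    let runs = RowSelection::from(vec![
        RowSelector::skip(2),
        RowSelector::select(3),
        RowSelector::skip(2),
        RowSelector::select(1),
    ]);

    // 2. Selection mask (bitmask): one bit per row.
    let mask = BooleanArray::from(vec![
        false, false, true, true, true, false, false, true,
    ]);

    // 3. Selection vector: the indices of the selected rows.
    let indices: Vec<u32> = vec![2, 3, 4, 7];

    assert_eq!(runs.row_count(), mask.true_count());
    assert_eq!(mask.true_count(), indices.len());
}
```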
