refactor filter pushdown APIs #16642


Merged
merged 15 commits into apache:main from pushdown-hash-join
Jul 8, 2025

Conversation

adriangb
Contributor

@adriangb adriangb commented Jul 1, 2025

Closes #16188.

As discussed in that issue, one reason to hold off on refactoring the APIs was to wait until they were used in more places, so we could get a better picture of what was needed. Working on #16445 necessitated adding new APIs, which led me to want to do this refactor.

The refactor of the APIs focused on two key things:

  • Removing thin helper methods that could easily be implemented by the caller. This adds a bit of boilerplate but makes the code more transparent and reduces the number of APIs and structs to juggle (I completely binned PredicateSupports in favor of Vec<PredicateSupport>). I think the fact that the PR is net negative LOC indicates that removing these helpers and abstractions was worth it.
  • Bringing the logic that would otherwise have been repeated in the HashJoinExec and FilterExec implementations of gather_filters_for_pushdown into a single place, which simplifies away several APIs and lifts a lot of complexity from ExecutionPlan implementations into the pushdown module itself. Notably, instead of complex logic with projections and such, we have a simple two-step approach: (1) check that the filter references only columns present in the child, and (2) reassign column indexes to match the child's schema using the existing utility function.

All of this makes implementing parent filter pushdown for joins only a couple of lines of code (which I will do in a follow-up PR).
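The two-step approach described above can be sketched with toy types. Everything here (Expr, references_only, reassign_indexes) is an illustrative stand-in, not the actual DataFusion API, which works on Arc<dyn PhysicalExpr> and Arrow schemas:

```rust
use std::collections::HashMap;

// Toy expression type standing in for a physical filter expression.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column { name: String, index: usize },
    And(Box<Expr>, Box<Expr>),
}

// Step 1: does the filter reference only columns present in the child schema?
fn references_only(expr: &Expr, child_columns: &HashMap<String, usize>) -> bool {
    match expr {
        Expr::Column { name, .. } => child_columns.contains_key(name),
        Expr::And(l, r) => {
            references_only(l, child_columns) && references_only(r, child_columns)
        }
    }
}

// Step 2: reassign column indexes so they match the child's schema.
fn reassign_indexes(expr: &Expr, child_columns: &HashMap<String, usize>) -> Expr {
    match expr {
        Expr::Column { name, .. } => Expr::Column {
            name: name.clone(),
            index: child_columns[name],
        },
        Expr::And(l, r) => Expr::And(
            Box::new(reassign_indexes(l, child_columns)),
            Box::new(reassign_indexes(r, child_columns)),
        ),
    }
}

fn main() {
    // Child schema: column "a" at index 0, "b" at index 1.
    let child: HashMap<String, usize> =
        [("a".to_string(), 0), ("b".to_string(), 1)].into_iter().collect();
    // A parent filter that refers to "a" at the parent's index 3.
    let filter = Expr::Column { name: "a".to_string(), index: 3 };
    assert!(references_only(&filter, &child));
    let pushed = reassign_indexes(&filter, &child);
    // After pushdown the column index matches the child's schema.
    assert_eq!(pushed, Expr::Column { name: "a".to_string(), index: 0 });
}
```

A filter that fails step 1 (references a column the child does not have) simply stays at the current node as unsupported; only filters passing both steps are handed to the child.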

@adriangb adriangb requested review from alamb and berkaysynnada and removed request for alamb July 1, 2025 16:06
@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Jul 1, 2025
@adriangb adriangb requested review from xudong963 and kosiew July 1, 2025 16:06
@adriangb adriangb changed the title Pushdown hash join implement filter passthrough for HashJoinExec and refactor filter pushdown APIs Jul 1, 2025
@adriangb adriangb force-pushed the pushdown-hash-join branch from 5c6c119 to 2b0c438 Compare July 2, 2025 02:56
@adriangb adriangb changed the title implement filter passthrough for HashJoinExec and refactor filter pushdown APIs refactor filter pushdown APIs Jul 2, 2025
Contributor

@alamb alamb left a comment

Thank you @adriangb -- this seems like a clear improvement to me from an API perspective: Allows more features with less code 👍 . Thank you for driving this along

cc @ozankabak and @berkaysynnada in case you would like to review the changes to the filter APIs

return Ok(FilterDescription::new_with_child_count(1)
    .all_parent_filters_supported(parent_filters)
    .with_self_filter(filter));
Contributor

I agree this looks much better

Contributor

@kosiew kosiew left a comment

Great job on the refactor!

Left a few comments for your consideration.

Comment on lines +178 to +201
let child_column_names: HashSet<&str> = child_schema
.fields()
.iter()
.map(|f| f.name().as_str())
.collect();
Contributor

We could be redundantly rebuilding the entire HashSet of column names for the exact same ExecutionPlan node multiple times during the optimization process.

Here’s a scenario to illustrate:

Imagine a complex query plan. A specific node, let's say ParquetScan(file1.parquet), exists as a shared reference (Arc) and might be evaluated in different contexts as the optimizer walks the plan tree.

  1. First Encounter: The optimizer analyzes a FilterNode that has ParquetScan(file1.parquet) as a child. It calls
    ChildFilterDescription::from_child. This function iterates through the schema of the parquet scan and builds a HashSet of its column names for the first time.

  2. Second Encounter: Later in the same optimization pass, another node—perhaps a JoinNode—also needs to analyze what can be pushed down to that very same ParquetScan(file1.parquet) instance.

Without caching, the from_child function would be called again for the same ParquetScan node, and it would re-build the exact same HashSet of column names from scratch.

Perhaps consider implementing a caching mechanism?
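One possible shape for the suggested cache: memoize the column-name set per plan node, keyed by the Arc's pointer identity so a shared node is only processed once. PlanNode and ColumnSetCache are hypothetical stand-ins, not DataFusion types:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

// Stand-in for a shared ExecutionPlan node with a schema.
struct PlanNode {
    column_names: Vec<String>,
}

#[derive(Default)]
struct ColumnSetCache {
    // Keyed by the Arc's pointer address: same Arc => same entry.
    cache: HashMap<usize, Arc<HashSet<String>>>,
    builds: usize, // counts how many times a set was actually built
}

impl ColumnSetCache {
    fn column_set(&mut self, node: &Arc<PlanNode>) -> Arc<HashSet<String>> {
        let key = Arc::as_ptr(node) as usize;
        if let Some(set) = self.cache.get(&key) {
            return Arc::clone(set); // cache hit: no rebuild
        }
        self.builds += 1;
        let set: Arc<HashSet<String>> =
            Arc::new(node.column_names.iter().cloned().collect());
        self.cache.insert(key, Arc::clone(&set));
        set
    }
}

fn main() {
    let scan = Arc::new(PlanNode {
        column_names: vec!["a".into(), "b".into()],
    });
    let mut cache = ColumnSetCache::default();
    // First encounter (e.g. from a FilterNode) builds the set...
    let s1 = cache.column_set(&scan);
    // ...second encounter (e.g. from a JoinNode) reuses it.
    let s2 = cache.column_set(&scan);
    assert_eq!(cache.builds, 1);
    assert!(s1.contains("a") && s2.contains("b"));
}
```

Note that pointer-identity keys are only valid while the Arc is alive, which is one reason caching like this is easier to get right at a level that owns the whole plan tree.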

Contributor Author

I agree that performance of optimizer rules and planning is a concern, but I think that needs to be solved at a higher level (e.g. caching of plan trees or subtrees).

Comment on lines 139 to +159
/// Create a new [`FilterPushdownPropagation`] with the specified filter support.
- pub fn with_filters(filters: PredicateSupports) -> Self {
+ pub fn with_filters(filters: Vec<PredicateSupport>) -> Self {
Contributor

The old .unsupported(...) helper was very explicit. Now, to produce an “all unsupported” result we must write:

FilterPushdownPropagation::with_filters(
    filters.into_iter().map(PredicateSupport::Unsupported).collect()
)

While this is idiomatic Rust, it forces the developer to think about the mechanism (map, collect) rather than the intent ("all of these filters were rejected").

How about adding a helper function that makes the intent crystal clear at the call site?

    FilterPushdownPropagation::all_rejected(filters)

This is better because:
  • It's self-documenting. The name of the function tells you exactly what's happening.
  • It's higher-level. It allows developers to work at the level of "what they want to do" rather than "how to do it."
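A minimal sketch of what the proposed helper might look like, with simplified stand-ins for the real types (Arc<String> in place of Arc<dyn PhysicalExpr>; the struct here models only the filters field):

```rust
use std::sync::Arc;

// Toy stand-in for DataFusion's predicate support marker.
#[derive(Clone, Debug, PartialEq)]
enum PredicateSupport {
    Supported(Arc<String>),
    Unsupported(Arc<String>),
}

struct FilterPushdownPropagation {
    filters: Vec<PredicateSupport>,
}

impl FilterPushdownPropagation {
    fn with_filters(filters: Vec<PredicateSupport>) -> Self {
        Self { filters }
    }

    // The suggested intent-revealing wrapper over the map/collect idiom:
    // "all of these filters were rejected".
    fn all_rejected(filters: Vec<Arc<String>>) -> Self {
        Self::with_filters(
            filters.into_iter().map(PredicateSupport::Unsupported).collect(),
        )
    }
}

fn main() {
    let filters = vec![
        Arc::new("a = 1".to_string()),
        Arc::new("b > 2".to_string()),
    ];
    let prop = FilterPushdownPropagation::all_rejected(filters);
    assert_eq!(prop.filters.len(), 2);
    assert!(prop
        .filters
        .iter()
        .all(|f| matches!(f, PredicateSupport::Unsupported(_))));
}
```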

Contributor Author

On balance, we had way too many methods before. I think I'd rather go with fewer methods for now; we can have focused PRs in the future to add helper methods where it makes sense.

Contributor Author

This is also mostly used in default implementations; I think it's unlikely actual implementers will write code like this (if you're implementing these methods, you probably want to allow some or all filters through, hence FilterPushdownPropagation::transparent).

Contributor

I see your reasons for this and do not disagree.

Comment on lines 525 to 535
for _child in self.children() {
    let child_filters = parent_filters
        .iter()
        .map(|f| PredicateSupport::Unsupported(Arc::clone(f)))
        .collect();
    desc = desc.with_child(ChildFilterDescription {
        parent_filters: child_filters,
        self_filters: vec![],
    });
}
Contributor

Opportunity to reduce duplication (DRY).

This loop is effectively a broadcast of “unsupported” to every child. That’s exactly what FilterDescription::from_children could do if you passed a helper that always returns Unsupported.
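The suggested from_children-style broadcast could look roughly like this, with toy types standing in for the real FilterDescription API (the closure parameter is the hypothetical part):

```rust
use std::sync::Arc;

// Toy stand-ins for the real filter-pushdown types.
type Filter = Arc<String>;

#[derive(Debug)]
enum PredicateSupport {
    Unsupported(Filter),
}

#[derive(Debug, Default)]
struct ChildFilterDescription {
    parent_filters: Vec<PredicateSupport>,
    self_filters: Vec<Filter>,
}

#[derive(Debug, Default)]
struct FilterDescription {
    children: Vec<ChildFilterDescription>,
}

impl FilterDescription {
    // Hypothetical constructor: apply one support-mapping closure per child,
    // replacing the explicit loop at the call site.
    fn from_children(
        num_children: usize,
        parent_filters: &[Filter],
        support: impl Fn(&Filter) -> PredicateSupport,
    ) -> Self {
        let children = (0..num_children)
            .map(|_| ChildFilterDescription {
                parent_filters: parent_filters.iter().map(&support).collect(),
                self_filters: vec![],
            })
            .collect();
        Self { children }
    }
}

fn main() {
    let filters = vec![Arc::new("x > 0".to_string())];
    // Broadcast "unsupported" to both children.
    let desc = FilterDescription::from_children(2, &filters, |f| {
        PredicateSupport::Unsupported(Arc::clone(f))
    });
    assert_eq!(desc.children.len(), 2);
    assert!(matches!(
        desc.children[0].parent_filters[0],
        PredicateSupport::Unsupported(_)
    ));
    let _ = &desc.children[0].self_filters; // empty by construction
}
```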

Contributor Author

Would you suggest adding an optional helper, or making a new method?

Contributor Author

This loop only happens in one place at the moment; I refactored it a bit to hoist code out of the loop. For things like this that have a single call site, I think we should err on the side of not adding new public APIs, and wait until people actually implementing these methods on their ExecutionPlans request helper methods.

Contributor

👍

@alamb
Contributor

alamb commented Jul 3, 2025

Looks like it has one more failure (and maybe we can merge up too)

@adriangb adriangb force-pushed the pushdown-hash-join branch from 1e49157 to 2f4817a Compare July 3, 2025 21:17
@adriangb
Contributor Author

adriangb commented Jul 3, 2025

> Looks like it has one more failure (and maybe we can merge up too)

done!

@alamb
Contributor

alamb commented Jul 7, 2025

Is this PR ready to merge?

@adriangb
Contributor Author

adriangb commented Jul 7, 2025

Yes by me!

@alamb
Contributor

alamb commented Jul 7, 2025

@kosiew are you happy with this PR too?

@kosiew
Contributor

kosiew commented Jul 8, 2025

yep

@kosiew kosiew merged commit a089eff into apache:main Jul 8, 2025
27 checks passed
@alamb
Contributor

alamb commented Jul 8, 2025

🎉 thank you @adriangb and @kosiew

Successfully merging this pull request may close these issues.

Simplify Filter Pushdown APIs for Better Maintainability and Developer Experience
3 participants