[SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources #49961
Conversation
override def build(): Scan = new PythonScan(ds, shortName, outputSchema, options)
    options: CaseInsensitiveStringMap)
  extends ScanBuilder
  with SupportsPushDownFilters {
Because the Python data source is new, we should use `SupportsPushDownV2Filters` first.
We decided to only support V1 filters to keep the Python API simple.
cc @cloud-fan
For DS v2, the scan workflow is:
The Python data source API is a bit different from DS v2 and there is no 1-1 mapping. I think the current mapping is
Now to push down filters, we need to create the Python batch reader earlier, which means one more round of Python worker communication in the optimizer. I'm wondering, once we finish pushdown, shall we do the planning work immediately and keep
// Optionally called by DSv2 once to push down filters before the scan is built.
override def pushFilters(filters: Array[Filter]): Array[Filter] = {
  if (!SQLConf.get.pythonFilterPushDown) {
It means we should always push down filters if `PythonScanBuilder` supports `SupportsPushDownFilters`.
Why do we need this config?
We'd like to avoid the new code path (serializing filters, running a new Python worker, ...) for existing Python data sources that don't implement pushdown. So in case there's a crash or a performance issue in the new code path, its impact is limited.
However, we currently don't have a good way to detect whether the user has implemented `pushFilters()` in the Python `DataSourceReader` before `ScanBuilder.pushFilters()` is called. This is because we don't know whether it's a streaming read or a batch read at this point (the optimizer knows, but the data source doesn't get this info), so it's not safe to call the Python `DataSource.reader()` to get the batch reader instance.
So we instead add a conf to turn off the new code path. But if the user implements `pushFilters()` and this conf is disabled, then we throw an error to let the user know that they must turn on the conf to enable filter pushdown.
In the future, if we figure out how to check whether the Python reader implements `pushFilters`, we can enable this conf by default and deprecate it.
Thank you for the explanation.
Good idea to avoid the extra round of worker communication. I think that would require some refactoring of the plan_read worker, so I'll implement that in a new PR since this PR is already very large. Also, when we add column pruning support, we should get partitions in the column pruning worker rather than the filter pushdown worker.
LGTM except some minor comments.
Looks good! Left some minor comments
@@ -4673,6 +4673,13 @@ object SQLConf {
    .booleanConf
    .createWithDefault(false)

  val PYTHON_FILTER_PUSHDOWN_ENABLED = buildConf("spark.sql.python.filterPushdown.enabled")
    .doc("When true, enable filter pushdown to Python datasource, at the cost of running " +
      "Python worker one additional time during planning.")
Do we still have additional planning now?
Currently this is still true. I have a separate PR to combine filter pushdown & plan read.
Commits:
- add docstring
- port fixes from serialization pr
- monkey patch data source to avoid changing existing code
- remove worker_main
- improve documentation and rename to be consistent with DSv2
- add comments
- fix lint
- add pushed filter info to plan description
- update error message
- conf for filter pushdown
- check that pushFilters is called from explain()
- address review
- Update sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScan.scala (Co-authored-by: Jiaan Geng <[email protected]>)
- Update sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonDataSourceV2.scala (Co-authored-by: Jiaan Geng <[email protected]>)
- use withSQLConf
- revert PythonScan description change
Thanks, merging to master.
…ter pushdown

Follow up of #49961

### What changes were proposed in this pull request?

This PR adds the serialization and deserialization required to pass V1 data source filters from JVM to Python. Also adds the equivalent Python dataclass representation of the filters. To ensure that filter values are properly converted to Python values, we use `VariantVal` to serialize catalyst values into binary, then deserialize into Python `VariantVal`, then convert to Python values.

#### Examples

Supported filters

| SQL filter | Representation |
|---------------------|--------------------------------------------|
| `a.b.c = 1` | `EqualTo(("a", "b", "c"), 1)` |
| `a = 1` | `EqualTo(("a",), 1)` |
| `a = 'hi'` | `EqualTo(("a",), "hi")` |
| `a = array(1, 2)` | `EqualTo(("a",), [1, 2])` |
| `a` | `EqualTo(("a",), True)` |
| `not a` | `Not(EqualTo(("a",), True))` |
| `a <> 1` | `Not(EqualTo(("a",), 1))` |
| `a > 1` | `GreaterThan(("a",), 1)` |
| `a >= 1` | `GreaterThanOrEqual(("a",), 1)` |
| `a < 1` | `LessThan(("a",), 1)` |
| `a <= 1` | `LessThanOrEqual(("a",), 1)` |
| `a in (1, 2, 3)` | `In(("a",), (1, 2, 3))` |
| `a is null` | `IsNull(("a",))` |
| `a is not null` | `IsNotNull(("a",))` |
| `a like 'abc%'` | `StringStartsWith(("a",), "abc")` |
| `a like '%abc'` | `StringEndsWith(("a",), "abc")` |
| `a like '%abc%'` | `StringContains(("a",), "abc")` |

Unsupported filters

- `a = b`
- `f(a, b) = 1`
- `a % 2 = 1`
- `a[0] = 1`
- `a < 0 or a > 1`
- `a like 'c%c%'`
- `a ilike 'hi'`
- `a = 'hi' collate zh`

### Why are the changes needed?

The base PR #49961 only supported `EqualTo` int. This PR adds support for many other useful filter types, making the Python Data Source filter pushdown API actually useful.

### Does this PR introduce _any_ user-facing change?

Yes. Python Data Source now supports more pushdown filter types.

### How was this patch tested?

End-to-end tests in `test_python_datasource.py`.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50252 from wengh/pyds-filter-serialization.

Authored-by: Haoyu Weng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
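For illustration, a rough sketch of what the Python-side dataclass representations listed in the table could look like; the actual definitions live in `pyspark.sql.datasource` and may differ in naming and detail:

```python
# Sketch only: a plausible shape for the V1-style filter dataclasses described
# in the table above. Not the authoritative pyspark.sql.datasource definitions.
from dataclasses import dataclass
from typing import Any, Tuple

ColumnPath = Tuple[str, ...]  # ("a", "b", "c") refers to the nested column a.b.c


@dataclass(frozen=True)
class Filter:
    """Base class for filters passed to DataSourceReader.pushFilters()."""


@dataclass(frozen=True)
class EqualTo(Filter):
    attribute: ColumnPath
    value: Any


@dataclass(frozen=True)
class GreaterThan(Filter):
    attribute: ColumnPath
    value: Any


@dataclass(frozen=True)
class In(Filter):
    attribute: ColumnPath
    value: Tuple[Any, ...]


@dataclass(frozen=True)
class IsNull(Filter):
    attribute: ColumnPath


@dataclass(frozen=True)
class Not(Filter):
    child: Filter


# Mirroring the table:
#   a.b.c = 1       ->  EqualTo(("a", "b", "c"), 1)
#   a in (1, 2, 3)  ->  In(("a",), (1, 2, 3))
#   not a           ->  Not(EqualTo(("a",), True))
```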
… workers

Follow up of #49961

### What changes were proposed in this pull request?

As pointed out by #49961 (comment), at the time of filter pushdown we already have enough information to also plan read partitions. So this PR changes the filter pushdown worker to also get partitions, reducing the number of exchanges between Python and Scala.

Changes:
- Extract the part of `plan_data_source_read.py` that is responsible for sending the partitions and the read function to the JVM.
- Use the extracted logic to also send the partitions and read function when doing filter pushdown in `data_source_pushdown_filters.py`.
- Update the Scala code accordingly.

### Why are the changes needed?

To improve Python Data Source performance when the filter pushdown configuration is enabled.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests in `test_python_datasource.py`

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50340 from wengh/pyds-combine-pushdown-plan.

Authored-by: Haoyu Weng <[email protected]>
Signed-off-by: Allison Wang <[email protected]>
What changes were proposed in this pull request?
This PR adds support for filter pushdown to Python Data Source batch read, with an API similar to the `SupportsPushDownFilters` interface. The user can implement `DataSourceReader.pushFilters` to receive filters that may be pushed down, decide which filters to push down, remember them, and return the remaining filters to be applied by Spark.

Note that filter pushdown is only supported for batch read, not for streaming read. This is also the case for the Scala API. Therefore the new API is added to `DataSourceReader` and not to `DataSource` or `DataSourceStreamReader`.

To keep the Python API simple, we will only support V1 filters that have a column, a boolean operator, and a literal value. The filter serialization is a placeholder and will be implemented in a future PR.
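For illustration, a minimal sketch (not code from this PR) of how a reader might use the new hook; `MyReader`, its constructor, and the `read()` body are hypothetical, and the filter classes follow the description above:

```python
# Hypothetical reader showing the DataSourceReader.pushFilters() flow.
from typing import Iterable, List

from pyspark.sql.datasource import DataSourceReader, EqualTo, Filter


class MyReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options
        self.pushed: List[Filter] = []  # filters we agree to apply in the source

    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:
        for f in filters:
            if isinstance(f, EqualTo):
                # Remember the filter so read() can apply it at the source.
                self.pushed.append(f)
            else:
                # Anything we cannot handle goes back to Spark.
                yield f

    def read(self, partition) -> Iterable[tuple]:
        # Stub: scan the underlying data, skip rows that fail the pushed
        # filters, and yield tuples matching self.schema.
        ...
```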
Roadmap
Suggested review order (from high level to details):
- `datasource.py`: add filter pushdown to Python Data Source API
- `test_python_datasource.py`: tests for filter pushdown
- `PythonScanBuilder.scala`: implement filter pushdown API in Scala
- `UserDefinedPythonDataSource.scala`, `data_source_pushdown_filters.py`: communication between Python and Scala and filter pushdown logic

Changes to interactions between Python and Scala
Original sequence: (sequence diagram)
Updated sequence (new interactions are highlighted in yellow): (sequence diagram)
Why are the changes needed?
Filter pushdown allows reducing the amount of data produced by the reader by filtering rows directly in the data source scan. The reduction in the amount of data can improve query performance. This PR implements filter pushdown for the Python Data Source API using the existing Scala DS filter pushdown API. An upcoming PR will implement the actual filter types and the serialization of filters.
Does this PR introduce any user-facing change?
Yes. New APIs are added. See `datasource.py` for details.

The new API is optional to implement. If not implemented, the reader will behave as before.
The feature is also controlled by the new `spark.sql.python.filterPushdown.enabled` configuration, which is disabled by default.

If the conf is enabled, the new code path for filter pushdown is used. Otherwise, the code path is skipped, and we throw an exception if the user implements `DataSourceReader.pushFilters()` so that it's not silently ignored.
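As a usage sketch (assuming an active `spark` session and a hypothetical registered Python data source named `my_source`):

```python
# Enable the conf so that pushFilters() is actually invoked during planning.
spark.conf.set("spark.sql.python.filterPushdown.enabled", "true")

# "my_source" is a hypothetical data source registered via spark.dataSource.register().
df = spark.read.format("my_source").load().filter("a = 1")
df.explain()  # pushed filters appear in the scan node's plan description
```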
How was this patch tested?
Tests added to `test_python_datasource.py` to check that:

Was this patch authored or co-authored using generative AI tooling?
No