[SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources #49961

Open · wengh wants to merge 6 commits into master from pyds-filter-pushdown

Conversation

@wengh wengh (Contributor) commented Feb 14, 2025

What changes were proposed in this pull request?

Suggested review order (from high level to details):

  1. datasource.py: add filter pushdown to the Python Data Source API (a hedged sketch of the new reader hook follows this list)
  2. test_python_datasource.py: tests for filter pushdown
  3. PythonScanBuilder.scala: implement the filter pushdown API in Scala
  4. UserDefinedPythonDataSource.scala (UserDefinedPythonDataSourceFilterPushdownRunner), data_source_pushdown_filters.py: communication between Python and Scala
    • Note that the current filter serialization is a placeholder. An upcoming PR will implement the actual serialization.
  5. remaining changes in sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python and in python/pyspark/sql: changes to the sequence of interactions between Python and Scala to accommodate filter pushdown
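
To give a feel for the shape of the new API, here is a minimal sketch of a reader that opts into pushdown. It assumes the new hook is `DataSourceReader.pushFilters(filters)` and that it returns the filters the source cannot handle; see datasource.py in this PR for the authoritative definition. `MyReader` and its data are made up for illustration.

```python
# Hedged sketch of a reader that opts into filter pushdown; not the actual
# implementation from this PR.
from pyspark.sql.datasource import DataSourceReader


class MyReader(DataSourceReader):
    def __init__(self):
        self.pushed = []  # filters this source will apply itself

    def pushFilters(self, filters):
        # Keep simple equality filters; hand everything else back to Spark,
        # which will evaluate the returned filters after the scan.
        unsupported = []
        for f in filters:
            if type(f).__name__ == "EqualTo":
                self.pushed.append(f)
            else:
                unsupported.append(f)
        return unsupported

    def read(self, partition):
        # Emit rows; a real reader would use self.pushed to skip rows early.
        yield (1, "a")
        yield (2, "b")
```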

Changes to interactions between Python and Scala

Original sequence:
[Sequence diagram: pyds old-2025-02-12-001557]

Updated sequence (new interactions are highlighted in yellow):

  • Note that, to allow for filter pushdown, we split the data source -> (partitions, read function) transformation (plan_data_source_read.py) into two steps: data source -> reader (data_source_get_reader.py), then reader -> (partitions, read function) (plan_data_source_read.py). A sketch of the two steps follows below.
    [Sequence diagram: pyds new-2025-02-14-001300]
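
Roughly, the new split can be pictured as follows (an illustrative sketch only; the function names are stand-ins for the worker entry points in data_source_get_reader.py and plan_data_source_read.py, not their actual signatures):

```python
# Illustrative sketch of the two-step flow; not the actual worker code.

def get_reader(data_source, schema):
    # Step 1 (data_source_get_reader.py): create the reader up front so that
    # Scala can push filters into it before the read is planned.
    return data_source.reader(schema)


def plan_read(reader):
    # Step 2 (plan_data_source_read.py): after pushdown, ask the reader for
    # its partitions and build the function used to read each partition.
    partitions = reader.partitions()
    return partitions, reader.read
```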

Why are the changes needed?

Filter pushdown reduces the amount of data produced by the reader by filtering rows directly in the data source scan, which can improve query performance. This PR implements filter pushdown for the Python Data Sources API on top of the existing Scala DS filter pushdown API. An upcoming PR will implement the actual filter types and the serialization of filters.

Does this PR introduce any user-facing change?

Yes. New APIs are added; see datasource.py for details. A usage sketch follows below.
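
For illustration, a user-defined data source could participate in pushdown roughly as follows (a hedged, hypothetical sketch: `my_source`, `MyDataSource`, and `MyReader` are made-up names, and which predicates are actually pushed depends on the filter types supported once serialization lands):

```python
# Hypothetical end-to-end usage; the short name, schema, and data are made up.
# Assumes an active SparkSession named `spark`.
from pyspark.sql.datasource import DataSource


class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my_source"

    def schema(self):
        return "id int, value string"

    def reader(self, schema):
        return MyReader()  # the reader sketched earlier, which implements pushFilters


spark.dataSource.register(MyDataSource)

# The equality predicate is a candidate to be pushed into MyReader.pushFilters.
df = spark.read.format("my_source").load().filter("id = 1")
df.show()
```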

How was this patch tested?

Tests added to test_python_datasource.py.

Was this patch authored or co-authored using generative AI tooling?

No

@wengh wengh force-pushed the pyds-filter-pushdown branch 4 times, most recently from 95b2256 to 9565c2d on February 15, 2025 00:33
@wengh wengh force-pushed the pyds-filter-pushdown branch from 9565c2d to d061a7e on February 15, 2025 00:41
@wengh wengh force-pushed the pyds-filter-pushdown branch from 9d3bbc2 to d4a757a on February 15, 2025 01:02
@wengh wengh changed the title [WIP][PYTHON] Add filter pushdown API to Python Data Sources [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources Feb 20, 2025
@wengh wengh marked this pull request as ready for review February 20, 2025 21:48