[SPARK-51713][PYTHON] Add column pruning API to Python Data Sources #50472

wengh · 2025-04-01T00:45:03Z

Related: #49961

What changes were proposed in this pull request?

This PR adds support for column pruning to the Python Data Source batch reader.

class DataSourceReader(ABC):
    ...
    def pruneColumns(self, requiredSchema: StructType) -> Optional[StructType]:
        """
        Returns the actual schema after pruning. :meth:`DataSourceReader.read` must return data
        following this schema.

        The returned schema must be a superset of the required schema and a subset of the full
        schema.

        This method is called once during query planning. By default, it returns None,
        not performing any pruning. Subclasses can override this to implement column pruning.

        Implementation should try its best to prune the unnecessary columns or nested fields, but
        it's also OK to do the pruning partially, e.g., a data source may not be able to prune
        nested fields, and only prune top-level columns.

        Parameters
        ----------
        requiredSchema : :class:`StructType`
            The schema of the data source that is required by the query.

            This is a subset of the full schema.
            All fields that are not in this schema are unnecessary and can be pruned.

        Returns
        -------
        :class:`StructType` or None
            The pruned schema, or None if pruning is not supported.

        Side effects
        ------------
        This method is allowed to modify `self`. The object must remain picklable.
        Modifications to `self` are visible to the `partitions()` and `read()` methods.

        Examples
        --------
        Implement pruneColumns to support top-level column pruning, and save all required
        columns in `self.required` for later use:

        >>> def pruneColumns(self, requiredSchema):
        ...     self.required = requiredSchema.fieldNames()
        ...     required = set(requiredSchema.fieldNames())
        ...     schema = StructType([f for f in self.schema.fields if f.name in required])
        ...     return schema

        Implement pruneColumns to support nested column pruning:

        >>> def pruneColumns(self, requiredSchema):
        ...     self.schema = requiredSchema
        ...     return self.schema
        """
        return None
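
To make the contract in the docstring concrete, here is a minimal sketch that runs without Spark. It is not the PR's implementation: `Schema` is a hypothetical stand-in for `StructType` (a plain list of top-level column names), and `SketchReader`, `ROWS`, and `check_pruned` are illustrative names that do not exist in PySpark. It shows a reader that prunes top-level columns, a `read()` that follows the pruned schema, and a check that the pruned schema is a superset of the required schema and a subset of the full schema.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical stand-in for StructType: an ordered list of top-level column names.
Schema = List[str]

FULL_SCHEMA: Schema = ["id", "name", "payload"]

# Illustrative in-memory data; a real source would scan files or call a service.
ROWS = [
    {"id": 1, "name": "a", "payload": "xxx"},
    {"id": 2, "name": "b", "payload": "yyy"},
]


class SketchReader:
    """Mimics the shape of DataSourceReader without importing pyspark."""

    def __init__(self, schema: Schema) -> None:
        self.schema = list(schema)

    def pruneColumns(self, requiredSchema: Schema) -> Optional[Schema]:
        required = set(requiredSchema)
        # Keep the full schema's column order and drop what the query does not need.
        self.schema = [name for name in self.schema if name in required]
        return self.schema

    def read(self) -> Iterator[Tuple]:
        # read() must emit rows that follow the pruned schema.
        for row in ROWS:
            yield tuple(row[name] for name in self.schema)


def check_pruned(full: Schema, required: Schema, pruned: Schema) -> bool:
    """True when required is a subset of pruned and pruned is a subset of full."""
    return set(required) <= set(pruned) <= set(full)


reader = SketchReader(FULL_SCHEMA)
pruned = reader.pruneColumns(["name", "id"])
rows = list(reader.read())
```

Because `pruneColumns` mutates `self.schema`, the later `read()` call sees the pruned column list, which mirrors the "Side effects" note in the docstring.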

Why are the changes needed?

Column pruning improves query performance by reducing the amount of data scanned and processed.
This PR adds the API to allow custom data sources to implement column pruning.

Does this PR introduce any user-facing change?

Yes. A new API is added. See datasource.py for details.

The new API is optional to implement. If not implemented, the reader will behave as before.

The feature is also controlled by the spark.sql.python.filterPushdown.enabled configuration, which is disabled by default.
If the conf is enabled, the new code path for filter pushdown is used. Otherwise, that code path is skipped, and an exception is thrown if the user implements DataSourceReader.pruneColumns(), so that the override is not silently ignored.
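
The gating described above can be sketched as follows. This is only an illustration of the behavior, not the code in this PR: `DataSourceReaderSketch`, `overrides_prune_columns`, and `plan_pruning` are hypothetical names, the conf is modeled as a plain boolean, and the real check may be implemented differently.

```python
class DataSourceReaderSketch:
    """Hypothetical stand-in for DataSourceReader with the default no-op method."""

    def pruneColumns(self, requiredSchema):
        return None


class PlainReader(DataSourceReaderSketch):
    """A reader that does not implement column pruning."""


class PruningReader(DataSourceReaderSketch):
    """A reader that overrides pruneColumns."""

    def pruneColumns(self, requiredSchema):
        return requiredSchema


def overrides_prune_columns(reader) -> bool:
    # The subclass attribute differs from the base function only when overridden.
    return type(reader).pruneColumns is not DataSourceReaderSketch.pruneColumns


def plan_pruning(reader, required_schema, conf_enabled: bool):
    """Return the pruned schema, or None when no pruning takes place."""
    if not conf_enabled:
        if overrides_prune_columns(reader):
            # Fail loudly rather than silently ignoring the user's override.
            raise RuntimeError(
                "pruneColumns() is implemented, but "
                "spark.sql.python.filterPushdown.enabled is disabled"
            )
        return None
    return reader.pruneColumns(required_schema)
```

With the conf off, `PlainReader` behaves as before and `PruningReader` raises; with the conf on, the reader's own `pruneColumns` result is used.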

How was this patch tested?

Tests were added to test_python_datasource.py.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions · 2025-07-19T00:29:53Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added SQL CORE PYTHON labels Apr 1, 2025

wengh added 5 commits April 1, 2025 09:13

column pruning

1df1dda

combine workers

c52183b

fix lint

057ed0e

fix lint

f697717

remove new conf

b85b32a

wengh force-pushed the pyds-column-pruning branch from 9c63c08 to b85b32a Compare April 1, 2025 17:29

wengh added 7 commits April 1, 2025 11:11

add tests

66d9397

more tests

0997f8a

fix

3203660

improve errors

bbe50b5

add tests for combined workers

2b0c385

remove unused imports in data_source_prune_columns.py

0ab7817

remove unused import in serializers.py

dfbe2bc

github-actions bot removed the CORE label Apr 1, 2025

wengh added 3 commits April 1, 2025 16:09

fix error message

ed2279f

comment

3ae2409

add comments

1dbcc55

wengh force-pushed the pyds-column-pruning branch from 88e4db1 to 1dbcc55 Compare April 2, 2025 23:36

wengh changed the title ~~[WIP] Add column pruning API to Python Data Sources~~ [WIP][SPARK-51713][PYTHON] Add column pruning API to Python Data Sources Apr 3, 2025

wengh added 3 commits April 4, 2025 08:33

match DSV2 API

d38e9ae

fix lint

f247de7

remove requiredSchemaNormalized

5427986

wengh changed the title ~~[WIP][SPARK-51713][PYTHON] Add column pruning API to Python Data Sources~~ [SPARK-51713][PYTHON] Add column pruning API to Python Data Sources Apr 8, 2025

wengh added 2 commits April 8, 2025 16:02

add docstring and ensure partitions is not called twice

fcd7c47

more comments

4ba8eda

wengh marked this pull request as ready for review April 8, 2025 23:32

fix typo

d720836

fix typo

cc74836

github-actions bot added the Stale label Jul 19, 2025

github-actions bot closed this Jul 20, 2025
