[SPARK-51713][PYTHON] Add column pruning API to Python Data Sources #50472

Closed
wants to merge 22 commits into from

Conversation

wengh
Contributor

@wengh wengh commented Apr 1, 2025

Related: #49961

What changes were proposed in this pull request?

This PR adds support for column pruning in Python Data Source batch reader.

class DataSourceReader(ABC):
    ...
    def pruneColumns(self, requiredSchema: StructType) -> Optional[StructType]:
        """
        Returns the actual schema after pruning. :meth:`DataSourceReader.read` must return data
        following this schema.

        The returned schema must be a superset of the required schema and a subset of the full
        schema.

        This method is called once during query planning. By default, it returns None,
        not performing any pruning. Subclasses can override this to implement column pruning.

        Implementation should try its best to prune the unnecessary columns or nested fields, but
        it's also OK to do the pruning partially, e.g., a data source may not be able to prune
        nested fields, and only prune top-level columns.

        Parameters
        ----------
        requiredSchema : :class:`StructType`
            The schema of the data source that is required by the query.

            This is a subset of the full schema.
            All fields that are not in this schema are unnecessary and can be pruned.

        Returns
        -------
        :class:`StructType` or None
            The pruned schema, or None if pruning is not supported.

        Side effects
        ------------
        This method is allowed to modify `self`. The object must remain picklable.
        Modifications to `self` are visible to the `partitions()` and `read()` methods.

        Examples
        --------
        Implement pruneColumns to support top-level column pruning, and save all required
        columns in `self.required` for later use:

        >>> def pruneColumns(self, requiredSchema):
        ...     self.required = requiredSchema.fieldNames()
        ...     required = set(requiredSchema.fieldNames())
        ...     schema = StructType([f for f in self.schema.fields if f.name in required])
        ...     return schema

        Implement pruneColumns to support nested column pruning:

        >>> def pruneColumns(self, requiredSchema):
        ...     self.schema = requiredSchema
        ...     return self.schema
        """
        return None
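
The contract described in the docstring can be sketched without Spark. The example below is a minimal illustration, not the PR's implementation: `Field`, `Schema`, and `PrunableReader` are hypothetical stand-ins for `StructField`, `StructType`, and a `DataSourceReader` subclass, and only top-level pruning is shown. The returned schema keeps the full schema's column order, and the state saved on `self` is a plain list of strings so the reader stays picklable.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class Field:
    """Hypothetical stand-in for pyspark.sql.types.StructField."""

    def __init__(self, name: str, dataType: str) -> None:
        self.name = name
        self.dataType = dataType


class Schema:
    """Hypothetical stand-in for pyspark.sql.types.StructType."""

    def __init__(self, fields: List[Field]) -> None:
        self.fields = list(fields)

    def fieldNames(self) -> List[str]:
        return [f.name for f in self.fields]


class PrunableReader:
    """Hypothetical reader that prunes top-level columns only."""

    def __init__(self, schema: Schema, rows: List[Dict[str, object]]) -> None:
        self.schema = schema
        self.rows = rows
        # Until pruning happens, every column is read.
        self.required = schema.fieldNames()

    def pruneColumns(self, requiredSchema: Schema) -> Optional[Schema]:
        required = set(requiredSchema.fieldNames())
        # Keep the full schema's order; drop everything the query does not need.
        pruned = Schema([f for f in self.schema.fields if f.name in required])
        # Plain list of strings: visible to read() and safe to pickle.
        self.required = pruned.fieldNames()
        return pruned

    def read(self, partition: object = None) -> Iterator[Tuple[object, ...]]:
        # Rows must follow the schema returned by pruneColumns.
        for row in self.rows:
            yield tuple(row[name] for name in self.required)


full = Schema([Field("id", "int"), Field("name", "string"), Field("payload", "binary")])
data = [
    {"id": 1, "name": "a", "payload": b"x"},
    {"id": 2, "name": "b", "payload": b"y"},
]
reader = PrunableReader(full, data)
pruned = reader.pruneColumns(Schema([Field("payload", "binary"), Field("id", "int")]))
rows_out = list(reader.read())
```

Here the query asks for `payload` and `id`, so `name` is dropped and `read()` yields two-element tuples in the order given by the pruned schema.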

Why are the changes needed?

Column pruning improves query performance by reducing the amount of data scanned and processed.
This PR adds the API to allow custom data sources to implement column pruning.

Does this PR introduce any user-facing change?

Yes. A new API is added. See datasource.py for details.

The new API is optional to implement. If not implemented, the reader will behave as before.

The feature is also controlled by the spark.sql.python.filterPushdown.enabled configuration, which is disabled by default.
If the conf is enabled, the new code path for filter pushdown is used. Otherwise, the code path is skipped, and an exception is thrown if the user implements DataSourceReader.pruneColumns(), so that the override is not silently ignored.

How was this patch tested?

Tests added to test_python_datasource.py

Was this patch authored or co-authored using generative AI tooling?

No

@wengh wengh force-pushed the pyds-column-pruning branch from 9c63c08 to b85b32a Compare April 1, 2025 17:29
@github-actions github-actions bot removed the CORE label Apr 1, 2025
@wengh wengh force-pushed the pyds-column-pruning branch from 88e4db1 to 1dbcc55 Compare April 2, 2025 23:36
@wengh wengh changed the title [WIP] Add column pruning API to Python Data Sources [WIP][SPARK-51713][PYTHON] Add column pruning API to Python Data Sources Apr 3, 2025
@wengh wengh changed the title [WIP][SPARK-51713][PYTHON] Add column pruning API to Python Data Sources [SPARK-51713][PYTHON] Add column pruning API to Python Data Sources Apr 8, 2025
@wengh wengh marked this pull request as ready for review April 8, 2025 23:32
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 19, 2025
@github-actions github-actions bot closed this Jul 20, 2025