[SPARK-51330][PYTHON] Enable spark.sql.execution.pythonUDTF.arrow.enabled by default #50096

Closed
wants to merge 1 commit into from

HyukjinKwon
Member

What changes were proposed in this pull request?

This PR enables spark.sql.execution.pythonUDTF.arrow.enabled by default.

Why are the changes needed?

We enabled Arrow optimization in #49482 and #50036. We should enable it for UDTFs too.

Does this PR introduce any user-facing change?

It will fall back to the non-optimized code path, so the impact will be minimized. Users will leverage Arrow optimization by default.
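
For readers who want to pin the behaviour rather than rely on the default, here is a minimal sketch of setting the Arrow path explicitly. `SquareNumbers` and `run_with_explicit_arrow_setting` are hypothetical names used only for illustration, and the sketch assumes an existing `SparkSession` is passed in as `spark`. It uses the session config named in this PR together with the `useArrow` argument of `pyspark.sql.functions.udtf`.

```python
class SquareNumbers:
    """Plain Python UDTF handler: yields one (num, squared) row per integer in [start, end]."""

    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)


def run_with_explicit_arrow_setting(spark, use_arrow: bool):
    """Register SquareNumbers as a UDTF with an explicit Arrow setting and collect its rows."""
    # Imported lazily so the handler class above stays usable without PySpark installed.
    from pyspark.sql.functions import lit, udtf

    # Session-wide switch: the config whose default this PR proposes to change.
    spark.conf.set(
        "spark.sql.execution.pythonUDTF.arrow.enabled", str(use_arrow).lower()
    )

    # Per-UDTF setting: when useArrow is given, it is used instead of the session config.
    square_udtf = udtf(
        SquareNumbers, returnType="num: int, squared: int", useArrow=use_arrow
    )
    return square_udtf(lit(1), lit(3)).collect()
```

Passing `use_arrow=False` keeps a given UDTF on the non-Arrow path regardless of the session default, which may be a reasonable choice for UDTFs that emit only a few rows.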

How was this patch tested?

Existing tests in the CI.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon
Member Author

cc @allisonwang-db

Contributor

@allisonwang-db allisonwang-db left a comment

For consistency with Python UDFs, yes, we should enable this. But the Arrow code path does not necessarily improve performance (it can even lead to perf regressions). It only helps when the output table size is large:
https://spark.apache.org/docs/latest/api/python/user_guide/sql/python_udtf.html#arrow-optimization
I think https://issues.apache.org/jira/browse/SPARK-44856 needs to be worked on first to make the Arrow code path more performant for small output sizes.
cc @ueshin @wengh

@HyukjinKwon
Member Author

Made a draft PR for SPARK-44856: #50099

@HyukjinKwon HyukjinKwon marked this pull request as draft February 28, 2025 00:06
@HyukjinKwon HyukjinKwon marked this pull request as ready for review February 28, 2025 05:58
HyukjinKwon added a commit that referenced this pull request Feb 28, 2025
…pInPandas/mapInArrow batched in byte size

### What changes were proposed in this pull request?

This PR is a followup of #50096 that reverts unrelated changes and marks mapInPandas/mapInArrow as batched in byte size.

### Why are the changes needed?

To make the original change self-contained, and to mark mapInPandas/mapInArrow as batched in byte size for consistency.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50111 from HyukjinKwon/SPARK-51316-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 28, 2025
(cherry picked from commit 5b45671)
Signed-off-by: Hyukjin Kwon <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Feb 28, 2025
@HyukjinKwon
Member Author

Let me close this for now. I think the behaviour diff is too much.

@HyukjinKwon HyukjinKwon closed this Mar 5, 2025
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025