[SPARK-48516][PYTHON][CONNECT] Turn on Arrow optimization for Python UDFs by default #49482
base: master
Conversation
Reviewed diff (context):

```py
@@ -71,6 +71,15 @@
    pass

has_arrow: bool = False
```
I think we can use `from pyspark.testing.utils import have_pyarrow`.
Can we address this comment, @xinrong-meng?
Good point! Resolved, thank you
There will be a circular import if we do that. Let me follow up with a separate PR instead.
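For illustration, here is a minimal sketch of the local availability probe that a flag like `has_arrow` presumably relies on, assuming the standard try/except pattern; this avoids importing `have_pyarrow` from `pyspark.testing.utils` and thereby the circular import. Only `has_arrow` comes from the diff above; the rest is illustrative.

```py
# Sketch (assumed, not the exact diff): probe PyArrow locally so this module
# does not need to import have_pyarrow from pyspark.testing.utils, which
# would create a circular import here.
has_arrow: bool = False
try:
    import pyarrow  # noqa: F401

    has_arrow = True
except ImportError:
    pass
```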
It seems that there are many unit test failures unfortunately, @xinrong-meng.

Feel free to take your time. I believe the community agrees that this is worthwhile for Apache Spark 4.0.0. We can backport this when the PR is ready.

Let's make CIs pass first.

Gentle ping, @xinrong-meng. If this is targeting Apache Spark 4.0, we had better have this before February 1st.
Thanks @dongjoon-hyun, I just got some free cycles for this and will resolve it ASAP.

A quick update: the PR is blocked by UDT support in Arrow Python UDFs, which I'm currently working on.

Thank you for the updated context.
Given the status, I believe this contribution is better suited for Apache Spark 4.1.0, because we need more testing and community verification.
Thanks for your attention, @dongjoon-hyun! The current proposal is to fall back to the existing (non-Arrow-optimized) Python UDF when a UDT is involved. My understanding is that no further testing is needed and the code change is minimal (just an if-else, as sketched below), but I respect the community's decision.
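For illustration, a minimal sketch of the if-else fallback being proposed; `contains_udt` and `resolve_use_arrow` are hypothetical names, and the actual check and warning live on the Spark side (the `ExtractPythonUDFs` rule, per the logs in the commit message below).

```py
from pyspark.sql.types import ArrayType, DataType, MapType, StructType, UserDefinedType


def contains_udt(dt: DataType) -> bool:
    """Recursively check whether a type, or any type nested in it, is a UDT."""
    if isinstance(dt, UserDefinedType):
        return True
    if isinstance(dt, StructType):
        return any(contains_udt(f.dataType) for f in dt.fields)
    if isinstance(dt, ArrayType):
        return contains_udt(dt.elementType)
    if isinstance(dt, MapType):
        return contains_udt(dt.keyType) or contains_udt(dt.valueType)
    return False


def resolve_use_arrow(use_arrow: bool, input_types, return_type) -> bool:
    """Fall back to the pickled (non-Arrow) UDF path when UDTs are involved."""
    if use_arrow and (contains_udt(return_type) or any(contains_udt(t) for t in input_types)):
        # In Spark, this is where the ExtractPythonUDFs warning is logged.
        return False
    return use_arrow
```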
I marked it as WIP because I wanted to file a separate PR for the fallback mechanism with tests. Once that PR is in, this PR will be unblocked immediately.

Let's make this ready ASAP for 4.0.
…t and output types

### What changes were proposed in this pull request?

Introduce a fallback mechanism for Arrow-optimized Python UDFs when either the input or return types contain User-Defined Types (UDTs). If UDTs are detected, the system logs a warning and switches to the currently default, non-Arrow-optimized UDF.

### Why are the changes needed?

To unblock enabling Arrow-optimized Python UDFs by default; see [pr](#49482).

### Does this PR introduce _any_ user-facing change?

Yes. UDT input and output types will no longer fail Arrow Python UDFs, as shown below:

```py
>>> import pyspark.sql.functions as F
>>> from pyspark.sql import Row
>>> from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

# UDT input
>>> from pyspark.sql.types import *
>>> row = Row(label=1.0, point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>>
>>> udf1 = F.udf(lambda p: p.y, DoubleType(), useArrow=True)
>>> df.select(udf1(df.point)).show()
25/02/03 17:49:57 WARN ExtractPythonUDFs: Arrow optimization disabled due to UDT input or return type. Falling back to non-Arrow-optimized UDF execution.
+---------------+
|<lambda>(point)|
+---------------+
|            2.0|
+---------------+

# UDT output
>>> row = Row(value=3.0)
>>> df = spark.createDataFrame([row])
>>> udf_with_udt_output = F.udf(lambda v: ExamplePoint(v, v + 1), ExamplePointUDT(), useArrow=True)
>>> df.select(udf_with_udt_output(df.value)).show()
25/02/03 17:51:43 WARN ExtractPythonUDFs: Arrow optimization disabled due to UDT input or return type. Falling back to non-Arrow-optimized UDF execution.
+---------------+
|<lambda>(value)|
+---------------+
|     (3.0, 4.0)|
+---------------+
```

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49786 from xinrong-meng/udt_arrow_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
(cherry picked from commit ac2f3a4)
Force-pushed from 00d4e03 to e25b0eb.
The Arrow fallback PR is in, so this PR should be unblocked. I'll keep an eye on testing and make it ready ASAP! Thank you @HyukjinKwon @dongjoon-hyun.

Can we address #49482 (comment), @xinrong-meng?
Force-pushed from ac09220 to e25b0eb.
Can you take a look at the test failures to make sure? I think those failures look related.
### What changes were proposed in this pull request?

Turn on Arrow optimization for Python UDFs by default.

### Why are the changes needed?

Arrow optimization was introduced in 3.4.0; see SPARK-40307 for more context.

Arrow-optimized Python UDFs are approximately 1.6 times faster than the original pickled Python UDFs. More details can be found in this blog post.

In version 4.0.0, we propose enabling the optimization by default. If PyArrow is not installed, Spark falls back to the original pickled Python UDF.
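For illustration, a small sketch of how the new default interacts with the existing knobs, assuming the `useArrow=None` default defers to the session conf `spark.sql.execution.pythonUDF.arrow.enabled`; the `plus_one` names are illustrative, and `spark` is an active SparkSession as in the examples above.

```py
# Illustrative sketch: the per-UDF useArrow flag vs. the session conf.
# useArrow=None (the default) defers to the conf, which this PR flips to
# true; useArrow=False opts a single UDF out.
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")  # new default

plus_one = F.udf(lambda x: x + 1, IntegerType())                          # Arrow-optimized
plus_one_pickled = F.udf(lambda x: x + 1, IntegerType(), useArrow=False)  # pickled path
```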
### Does this PR introduce _any_ user-facing change?

Yes, Arrow optimization is now enabled for Python UDFs by default.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.