
[SPARK-48516][PYTHON][CONNECT] Turn on Arrow optimization for Python UDFs by default #49482


Closed · wants to merge 5 commits

Conversation

xinrong-meng
Member

What changes were proposed in this pull request?

Turn on Arrow optimization for Python UDFs by default

Why are the changes needed?

Arrow optimization was introduced in 3.4.0. See [SPARK-40307](https://issues.apache.org/jira/browse/SPARK-40307) for more context.

Arrow-optimized Python UDFs are approximately 1.6 times faster than the original pickled Python UDFs. More details can be found in [this blog post](https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35).

In version 4.0.0, we propose enabling the optimization by default. If PyArrow is not installed, it will fall back to the original pickled Python UDF.
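The fallback behavior described above can be sketched in plain Python. `resolve_use_arrow` is a hypothetical helper for illustration only; in Spark the decision is made inside the UDF machinery, gated by the `spark.sql.execution.pythonUDF.arrow.enabled` configuration:

```python
import warnings

def resolve_use_arrow(arrow_enabled: bool) -> bool:
    """Decide whether a Python UDF should run Arrow-optimized.

    Hypothetical helper: if Arrow optimization is requested but PyArrow is
    not installed, warn and fall back to the pickled (non-Arrow) path.
    """
    if not arrow_enabled:
        return False
    try:
        import pyarrow  # noqa: F401
    except ImportError:
        warnings.warn("PyArrow is not installed; falling back to the pickled Python UDF.")
        return False
    return True
```

The key design point is that enabling the optimization by default never hard-fails an environment without PyArrow; it only changes which execution path is taken.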

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

No

@@ -71,6 +71,15 @@
pass


has_arrow: bool = False
Contributor

I think we can use `from pyspark.testing.utils import have_pyarrow`.

Member

Can we address this comment, @xinrong-meng?

Member Author

Good point! Resolved, thank you

Member Author

There will be a circular import if we do that. Let me follow up with a separate PR instead.

Contributor

Probably we can import it inside the function/class body.
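The suggestion above, deferring the import into the function body, can be sketched as follows. Because the import is resolved only at call time, the enclosing module no longer participates in a circular import cycle at load time. `have_pyarrow` is the flag from `pyspark.testing.utils` mentioned in the thread; the `ImportError` fallback is an assumption for environments without PySpark installed:

```python
def has_arrow() -> bool:
    """Return whether PyArrow is available, checked lazily at call time."""
    try:
        # Deferred import: evaluated when the function runs, not when the
        # enclosing module is imported, so the circular dependency is broken.
        from pyspark.testing.utils import have_pyarrow
        return bool(have_pyarrow)
    except ImportError:
        return False
```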

Member

@dongjoon-hyun dongjoon-hyun left a comment


It seems that there are many unit test failures, unfortunately, @xinrong-meng.

Take your time. I believe the community agrees that this is worthwhile for Apache Spark 4.0.0. We can backport it once this PR is ready.

Let's make the CIs pass first.

@dongjoon-hyun
Member

Gentle ping, @xinrong-meng .

If this is targeting Apache Spark 4.0, we had better have this before February 1st.

@xinrong-meng
Member Author

Thanks @dongjoon-hyun! I just got some free cycles for this and will resolve it ASAP.

@xinrong-meng
Member Author

A quick update: the PR is blocked by UDT support in Arrow Python UDFs, which I’m currently working on.

@dongjoon-hyun
Member

Thank you for the updated context.

@xinrong-meng xinrong-meng changed the title [SPARK-48516][PYTHON][CONNECT] Turn on Arrow optimization for Python UDFs by default [WIP][SPARK-48516][PYTHON][CONNECT] Turn on Arrow optimization for Python UDFs by default Feb 4, 2025
Member

@dongjoon-hyun dongjoon-hyun left a comment


Given the status, I believe this contribution is better suited for Apache Spark 4.1.0, because we need more testing and community verification.

@xinrong-meng
Member Author

Thanks @dongjoon-hyun for the attention! The current proposal is to fall back to the existing (non-Arrow-optimized) Python UDF when a UDT is involved. My understanding is that no further testing is needed and the code change is minimal (just an if-else), but I respect the community’s decision.

@xinrong-meng
Member Author

I marked it as WIP because I wanted to file a separate PR for the fallback mechanism with tests. Once that PR is in, this PR will be unblocked immediately.

@HyukjinKwon
Member

Let's make this ready ASAP for 4.0

xinrong-meng added a commit that referenced this pull request Feb 5, 2025
…t and output types

### What changes were proposed in this pull request?
Introduce a fallback mechanism for Arrow-optimized Python UDFs when either the input or return types contain User-Defined Types (UDTs). If UDTs are detected, the system logs a warning and switches to the currently default, non-Arrow-optimized UDF.

### Why are the changes needed?
To unblock enabling Arrow-optimized Python UDFs by default, see [pr](#49482)

### Does this PR introduce _any_ user-facing change?
Yes. UDT input and output types will not fail Arrow Python UDF anymore, as shown below:

```py
>>> import pyspark.sql.functions as F
>>> from pyspark.sql import Row
>>> from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

# UDT input
>>> from pyspark.sql.types import *
>>> row = Row(label=1.0, point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>>
>>> udf1 = F.udf(lambda p: p.y, DoubleType(), useArrow=True)
>>> df.select(udf1(df.point)).show()
25/02/03 17:49:57 WARN ExtractPythonUDFs: Arrow optimization disabled due to UDT input or return type. Falling back to non-Arrow-optimized UDF execution.
+---------------+
|<lambda>(point)|
+---------------+
|            2.0|
+---------------+

# UDT output
>>> row = Row(value=3.0)
>>> df = spark.createDataFrame([row])
>>> udf_with_udt_output = F.udf(lambda v: ExamplePoint(v, v + 1), ExamplePointUDT(), useArrow=True)
>>> df.select(udf_with_udt_output(df.value)).show()
25/02/03 17:51:43 WARN ExtractPythonUDFs: Arrow optimization disabled due to UDT input or return type. Falling back to non-Arrow-optimized UDF execution.
+---------------+
|<lambda>(value)|
+---------------+
|     (3.0, 4.0)|
+---------------+
```
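The detection step behind this fallback might look like the following recursive walk over a data type. The classes here are minimal stand-ins for illustration only (the real ones live in `pyspark.sql.types`), and `contains_udt` is a hypothetical helper, not the actual Spark implementation:

```python
# Minimal stand-ins for pyspark.sql.types, for illustration only.
class DataType: ...
class UserDefinedType(DataType): ...

class ArrayType(DataType):
    def __init__(self, element_type: DataType):
        self.element_type = element_type

class StructField:
    def __init__(self, name: str, data_type: DataType):
        self.name, self.data_type = name, data_type

class StructType(DataType):
    def __init__(self, fields: list):
        self.fields = fields

def contains_udt(dt: DataType) -> bool:
    """Recursively check whether a data type contains a UDT anywhere."""
    if isinstance(dt, UserDefinedType):
        return True
    if isinstance(dt, ArrayType):
        return contains_udt(dt.element_type)
    if isinstance(dt, StructType):
        return any(contains_udt(f.data_type) for f in dt.fields)
    return False
```

If such a check returns true for any input or return type, execution would fall back to the non-Arrow path and log the warning shown in the output above.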

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49786 from xinrong-meng/udt_arrow_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
xinrong-meng added a commit that referenced this pull request Feb 5, 2025
…t and output types

Closes #49786 from xinrong-meng/udt_arrow_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
(cherry picked from commit ac2f3a4)
Signed-off-by: Xinrong Meng <[email protected]>
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-48516][PYTHON][CONNECT] Turn on Arrow optimization for Python UDFs by default [SPARK-48516][PYTHON][CONNECT] Turn on Arrow optimization for Python UDFs by default Feb 5, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review February 5, 2025 02:00
@xinrong-meng
Member Author

The Arrow fallback PR is in, so this PR should be unblocked. I’ll keep an eye on the tests and make it ready ASAP!

Thank you @HyukjinKwon @dongjoon-hyun

@HyukjinKwon
Member

Can we address #49482 (comment), @xinrong-meng?

@HyukjinKwon
Member

Can you take a look at the test failure to make sure? I think those failures look related.

zecookiez pushed a commit to zecookiez/spark that referenced this pull request Feb 6, 2025
…t and output types

Closes apache#49786 from xinrong-meng/udt_arrow_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
@github-actions github-actions bot added the ML label Feb 6, 2025
@xinrong-meng
Member Author

xinrong-meng commented Feb 7, 2025

The failed tests seem unrelated:

[info] MySQLNamespaceSuite:
[info] org.apache.spark.sql.jdbc.v2.MySQLNamespaceSuite *** ABORTED *** (10 seconds, 369 milliseconds)
[info]   com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"driver failed programming external connectivity on endpoint condescending_lumiere (7a051f139d30436d8a1e231e3f4aeb991784c60ddc1f391ee8c1fc587c8ce2ca): Error starting userland proxy: listen tcp4 0.0.0.0:39901: bind: address already in use"}

Retriggering tests https://github.com/xinrong-meng/spark/runs/36869403425

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM, since the failed Oracle JDBC CI looks unrelated.

Merged to master/4.0.

dongjoon-hyun pushed a commit that referenced this pull request Feb 10, 2025
…UDFs by default

Closes #49482 from xinrong-meng/arrow_on.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 59dd406)
Signed-off-by: Dongjoon Hyun <[email protected]>
@xinrong-meng
Member Author

Thank you @dongjoon-hyun !

HyukjinKwon added a commit that referenced this pull request Feb 21, 2025
…Arrow-optimized Python UDF enabled by default

### What changes were proposed in this pull request?

This PR is a followup of #49482 that updates migration guide.

### Why are the changes needed?

In order for users to migrate to Spark 4.0 seamlessly

### Does this PR introduce _any_ user-facing change?

No, it fixes the migration guide documentation.

### How was this patch tested?

Manually checked.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50034 from HyukjinKwon/SPARK-48510-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 21, 2025
…Arrow-optimized Python UDF enabled by default

Closes #50034 from HyukjinKwon/SPARK-48510-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 45900c4)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon
Member

Some subtle diffs were found. I will revert this for now and enable it again later after further improvements.

HyukjinKwon added a commit that referenced this pull request Mar 6, 2025
…t found but Arrow-optimized Python UDFs enabled

### What changes were proposed in this pull request?

This PR extracts a legitimate improvement from #49482: fall back to a regular Python UDF when Arrow is not found but Arrow-optimized Python UDFs are enabled.

### Why are the changes needed?

To minimize end user impact.

### Does this PR introduce _any_ user-facing change?

Yes, it falls back to a regular Python UDF when Arrow is not found but Arrow-optimized Python UDFs are enabled.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50160 from HyukjinKwon/SPARK-51393.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Mar 6, 2025
…t found but Arrow-optimized Python UDFs enabled

Closes #50160 from HyukjinKwon/SPARK-51393.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 74293cc)
Signed-off-by: Hyukjin Kwon <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…Arrow-optimized Python UDF enabled by default

Closes apache#50034 from HyukjinKwon/SPARK-48510-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…t found but Arrow-optimized Python UDFs enabled

Closes apache#50160 from HyukjinKwon/SPARK-51393.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
anoopj pushed a commit to anoopj/spark that referenced this pull request Mar 15, 2025
…t found but Arrow-optimized Python UDFs enabled

Closes apache#50160 from HyukjinKwon/SPARK-51393.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>