[SPARK-52877][PYTHON] Improve Python UDF Arrow Serializer Performance #51225

asl3 · 2025-06-19T20:33:30Z

What changes were proposed in this pull request?

This PR removes the pandas <> Arrow conversion from Arrow-optimized Python UDFs by using PyArrow directly, which improves their performance and memory usage.

Why are the changes needed?

The Arrow serializer for Python UDFs carries a lot of overhead: it converts each Arrow batch into pandas Series and then converts the UDF results back into a pandas DataFrame.

Instead, we can convert the Python objects directly to Arrow and avoid the expensive pandas conversion.

Does this PR introduce any user-facing change?

Yes, the type coercion of some Python return values changes, as the tables below show.

Legacy type coercion:

    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa
    # |SQL Type \ Python Value(Type)|None(NoneType)|        True(bool)|              1(int)|a(str)|    1970-01-01(date)|1970-01-01 00:00:00(datetime)|  1.0(float)|array('i', [1])(array)|         [1](list)|       (1,)(tuple)|bytearray(b'ABC')(bytearray)|          1(Decimal)|{'a': 1}(dict)|  # noqa
    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa
    # |                      boolean|          None|              True|                True|     X|                   X|                            X|        True|                     X|                 X|                 X|                           X|                   X|             X|  # noqa
    # |                      tinyint|          None|                 1|                   1|     X|                   X|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                     smallint|          None|                 1|                   1|     X|                   X|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                          int|          None|                 1|                   1|     X|                   0|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                       bigint|          None|                 1|                   1|     X|                   X|                            0|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                       string|          None|            'True'|                 '1'|   'a'|        '1970-01-01'|         '1970-01-01 00:00...|       '1.0'|     "array('i', [1])"|             '[1]'|            '(1,)'|         "bytearray(b'ABC')"|                 '1'|    "{'a': 1}"|  # noqa
    # |                         date|          None|                 X|                   X|     X|datetime.date(197...|         datetime.date(197...|           X|                     X|                 X|                 X|                           X|datetime.date(197...|             X|  # noqa
    # |                    timestamp|          None|                 X|datetime.datetime...|     X|                   X|         datetime.datetime...|           X|                     X|                 X|                 X|                           X|datetime.datetime...|             X|  # noqa
    # |                        float|          None|               1.0|                 1.0|     X|                   X|                            X|         1.0|                     X|                 X|                 X|                           X|                 1.0|             X|  # noqa
    # |                       double|          None|               1.0|                 1.0|     X|                   X|                            X|         1.0|                     X|                 X|                 X|                           X|                 1.0|             X|  # noqa
    # |                       binary|          None|bytearray(b'\x00')|  bytearray(b'\x00')|     X|                   X|                            X|           X|  bytearray(b'\x01\...|bytearray(b'\x01')|bytearray(b'\x01')|           bytearray(b'ABC')|                   X|             X|  # noqa
    # |                decimal(10,0)|          None|                 X|                   X|     X|                   X|                            X|Decimal('1')|                     X|                 X|                 X|                           X|        Decimal('1')|             X|  # noqa
    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa

New type coercion:

    # +-----------------------------+--------------+----------+--------------------+------+--------------------+-----------------------------+--------------------+----------------------+---------+-----------+----------------------------+--------------------+--------------+  # noqa
    # |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|              1(int)|a(str)|    1970-01-01(date)|1970-01-01 00:00:00(datetime)|          1.0(float)|array('i', [1])(array)|[1](list)|(1,)(tuple)|bytearray(b'ABC')(bytearray)|          1(Decimal)|{'a': 1}(dict)|  # noqa
    # +-----------------------------+--------------+----------+--------------------+------+--------------------+-----------------------------+--------------------+----------------------+---------+-----------+----------------------------+--------------------+--------------+  # noqa
    # |                      boolean|          None|      True|                True|     X|                   X|                            X|                True|                     X|        X|          X|                           X|                   X|             X|  # noqa
    # |                      tinyint|          None|         X|                   1|     X|                   X|                            X|                   1|                     X|        X|          X|                           X|                   1|             X|  # noqa
    # |                     smallint|          None|         X|                   1|     X|                   X|                            X|                   1|                     X|        X|          X|                           X|                   1|             X|  # noqa
    # |                          int|          None|         X|                   1|     X|                   0|                            X|                   1|                     X|        X|          X|                           X|                   1|             X|  # noqa
    # |                       bigint|          None|         X|                   1|     X|                   X|                            0|                   1|                     X|        X|          X|                           X|                   1|             X|  # noqa
    # |                       string|          None|    'true'|                 '1'|   'a'|        '1970-01-01'|         '1970-01-01 00:00...|               '1.0'|     "array('i', [1])"|    '[1]'|     '(1,)'|         "bytearray(b'ABC')"|                 '1'|    "{'a': 1}"|  # noqa
    # |                         date|          None|         X|datetime.date(197...|     X|datetime.date(197...|         datetime.date(197...|datetime.date(197...|                     X|        X|          X|                           X|datetime.date(197...|             X|  # noqa
    # |                    timestamp|          None|         X|                   X|     X|                   X|         datetime.datetime...|                   X|                     X|        X|          X|                           X|                   X|             X|  # noqa
    # |                        float|          None|       1.0|                 1.0|     X|                   X|                            X|                 1.0|                     X|        X|          X|                           X|                 1.0|             X|  # noqa
    # |                       double|          None|       1.0|                 1.0|     X|                   X|                            X|                 1.0|                     X|        X|          X|                           X|                 1.0|             X|  # noqa
    # |                       binary|          None|         X|                   X|     X|                   X|                            X|                   X|                     X|        X|          X|           bytearray(b'ABC')|                   X|             X|  # noqa
    # |                decimal(10,0)|          None|         X|                   X|     X|                   X|                            X|                   X|                     X|        X|          X|                           X|        Decimal('1')|             X|  # noqa
    # +-----------------------------+--------------+----------+--------------------+------+--------------------+-----------------------------+--------------------+----------------------+---------+-----------+----------------------------+--------------------+--------------+  # noqa

Table diff:
Only differing cells are shown, as <legacy value> vs. <new value>.

# +-----------------------------+--------------+--------------------+----------------------------+------+----------------+-----------------------------+------------------------+----------------------+--------------------+--------------------+----------------------------+----------+--------------+  # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|          True(bool)|                      1(int)|a(str)|1970-01-01(date)|1970-01-01 00:00:00(datetime)|              1.0(float)|array('i', [1])(array)|           [1](list)|         (1,)(tuple)|bytearray(b'ABC')(bytearray)|1(Decimal)|{'a': 1}(dict)|  # noqa
# +-----------------------------+--------------+--------------------+----------------------------+------+----------------+-----------------------------+------------------------+----------------------+--------------------+--------------------+----------------------------+----------+--------------+  # noqa
# |                      tinyint|              |             1 vs. X|                            |      |                |                             |                        |                      |                    |                    |                            |          |              |  # noqa
# |                     smallint|              |             1 vs. X|                            |      |                |                             |                        |                      |                    |                    |                            |          |              |  # noqa
# |                          int|              |             1 vs. X|                            |      |                |                             |                        |                      |                    |                    |                            |          |              |  # noqa
# |                       bigint|              |             1 vs. X|                            |      |                |                      X vs. 0|                        |                      |                    |                    |                            |          |              |  # noqa
# |                       string|              |   'True' vs. 'true'|                            |      |                |                             |                        |                      |                    |                    |                            |          |              |  # noqa
# |                         date|              |                    |    X vs. datetime.date(...)|      |                |                             |X vs. datetime.date(...)|                      |                    |                    |                            |          |              |  # noqa
# |                    timestamp|              |                    |datetime.datetime(...) vs. X|      |                |                             |                        |                      |                    |                    |                            |          |              |  # noqa
# |                       binary|              |bytearray(...) vs. X|        bytearray(...) vs. X|      |                |                             |                        |                      |bytearray(...) vs. X|bytearray(...) vs. X|                            |          |              |  # noqa
# |                decimal(10,0)|              |                    |                            |      |                |                             |      Decimal('1') vs. X|                      |                    |                    |                            |          |              |  # noqa
# +-----------------------------+--------------+--------------------+----------------------------+------+----------------+-----------------------------+------------------------+----------------------+--------------------+--------------------+----------------------------+----------+--------------+  # noqa
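For UDF authors, several of the cells that differ involve a Python value whose type does not match the declared return type (for example a `bool` returned from an integer UDF, or a `float` returned from a `decimal(10,0)` UDF). One way to stay independent of either coercion path is to convert explicitly inside the UDF body. The helpers below are a minimal sketch with hypothetical names; they are not part of this PR or of PySpark.

```python
from decimal import Decimal


def bool_as_int(flag):
    """For integer return types: the table shows bool -> 1 on the legacy path only."""
    return None if flag is None else int(flag)


def bool_as_text(flag):
    """For string return types: keep Python's 'True'/'False' spelling explicitly."""
    return None if flag is None else str(flag)


def float_as_decimal(value, scale=0):
    """For decimal return types: build the Decimal yourself at the declared scale."""
    if value is None:
        return None
    # scale=0 -> Decimal('1'), scale=2 -> Decimal('0.01')
    quantum = Decimal(1).scaleb(-scale)
    return Decimal(str(value)).quantize(quantum)


print(bool_as_int(True), bool_as_text(True), float_as_decimal(1.0))
```

Returning a value that already has the declared type means the result falls in a cell that is identical in both tables (for example `1(int)` for `int`, or `1(Decimal)` for `decimal(10,0)`).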

How was this patch tested?

Correctness:
Added tests covering both the legacy and the new code paths for Arrow batch evaluation.

Memory usage improvement:

A manual benchmark shows that the new path uses roughly 1.25x less memory than the legacy pandas <> Arrow conversion path.

Sample output from memory_profiler:

Legacy path:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    70    220.9 MiB    220.9 MiB           1       @profile
    71                                             def process_data_profiled(values, metadata):
    72    220.9 MiB      0.0 MiB           1           return complex_computation(values, metadata)


New path:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    93    175.1 MiB    175.1 MiB           1       @profile
    94                                             def process_data_profiled(values, metadata):
    95    175.1 MiB      0.0 MiB           1           return complex_computation(values, metadata)
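The ~1.25x figure can be checked against the two profiler samples above. The short calculation below uses only the "Mem usage" values printed in those samples; the variable names are illustrative.

```python
legacy_mib = 220.9  # "Mem usage" reported for the legacy path
new_mib = 175.1     # "Mem usage" reported for the new path

ratio = legacy_mib / new_mib
saved_fraction = (legacy_mib - new_mib) / legacy_mib

print(f"legacy/new ratio: {ratio:.2f}x")        # legacy/new ratio: 1.26x
print(f"memory saved:     {saved_fraction:.1%}")  # memory saved:     20.7%
```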

Benchmark details here
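The benchmark script itself is not included in this description. For readers who want a quick, dependency-free way to take a similar measurement, the sketch below uses the standard-library `tracemalloc` module instead of `memory_profiler`. This is a swapped-in technique, and `complex_computation` here is a hypothetical stand-in for the function named in the profiler output, whose real body is not shown.

```python
import tracemalloc


def complex_computation(values, metadata):
    # Hypothetical stand-in: the real function body is not part of this description.
    scale = metadata.get("scale", 1)
    return [v * scale for v in values]


def measure_peak_mib(func, *args):
    """Run func(*args) and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


result, peak_mib = measure_peak_mib(
    complex_computation, list(range(100_000)), {"scale": 2}
)
print(f"peak traced allocation: {peak_mib:.1f} MiB")
```

Note that `tracemalloc` only sees allocations made through Python's memory allocator, so native Arrow buffers would likely not be counted; its numbers are not directly comparable to the process-level figures that `memory_profiler` reports above.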

Was this patch authored or co-authored using generative AI tooling?

No

HyukjinKwon

From a cursory look, this seems to make sense.

python/pyspark/worker.py

python/pyspark/sql/tests/arrow/test_arrow_python_udf.py

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

python/pyspark/sql/pandas/serializers.py

viirya

I think we may need to update the corresponding docs, especially for the type coercion differences. Maybe the migration guide.

viirya · 2025-07-21T07:12:44Z

For the type coercion table, I remember we have documented it in the Python code docs. We may also need to update it.

python/pyspark/sql/pandas/serializers.py

python/pyspark/worker.py

HyukjinKwon · 2025-07-23T00:01:21Z

Merged to master.

asl3 added 12 commits June 18, 2025 23:07

wrap_arrow_udf and serializer

9dfe326

evaltype

779fea9

nit

a3cded6

test

33f6f67

tmp

8440f9c

skip variant tests

05d87c8

arrow batch serializer

1dd0ceb

refactor

ffa562a

rename

e075c0d

refactor

93b44ca

nit

b1f8965

spacing

8ef0726

github-actions bot added SQL DOCS CORE PYTHON labels Jun 19, 2025

asl3 changed the title ~~[DRAFT][PYTHON] Improve Python UDTF arrow serializer performance~~ [DRAFT][PYTHON] Improve Python UDF arrow serializer performance Jun 19, 2025

asl3 changed the title ~~[DRAFT][PYTHON] Improve Python UDF arrow serializer performance~~ [DRAFT][PYTHON] Improve Python UDF Arrow Serializer Performance Jun 19, 2025

asl3 added 7 commits June 22, 2025 11:52

fmt

b871ab1

update test

d9570a1

scalar arrow

8f40352

spacing

fc844c2

spacing

8f9420c

comment

915f919

sql scalar arrow iter udf

fc26618

HyukjinKwon marked this pull request as draft June 23, 2025 02:52

HyukjinKwon reviewed Jun 23, 2025

python/pyspark/worker.py (outdated, resolved)

asl3 added 2 commits June 22, 2025 21:07

whitespace

03cac02

restore

81f977a

extend arrowstreamarrowudfserializer

3e02a1f

asl3 changed the title ~~[PYTHON] Improve Python UDF Arrow Serializer Performance~~ [SPARK-52877][PYTHON] Improve Python UDF Arrow Serializer Performance Jul 19, 2025

asl3 marked this pull request as ready for review July 19, 2025 02:09

lint

157e07a

github-actions bot removed the DOCS label Jul 21, 2025

zhengruifeng reviewed Jul 21, 2025

python/pyspark/sql/tests/arrow/test_arrow_python_udf.py (resolved)

zhengruifeng approved these changes Jul 21, 2025

parity tests

d8530e5

github-actions bot added the CONNECT label Jul 21, 2025

HyukjinKwon approved these changes Jul 21, 2025

viirya reviewed Jul 21, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (resolved)

python/pyspark/sql/pandas/serializers.py (resolved)

viirya reviewed Jul 21, 2025

python/pyspark/sql/pandas/serializers.py (outdated, resolved)

viirya approved these changes Jul 21, 2025

ueshin reviewed Jul 21, 2025

python/pyspark/worker.py (outdated, resolved)

python/pyspark/worker.py (outdated, resolved)

asl3 added 2 commits July 21, 2025 12:23

docs

a6dfcff

reformat

ba28a76

asl3 requested a review from ueshin July 21, 2025 20:32

migration guide

48c6397

github-actions bot added the DOCS label Jul 21, 2025

asl3 added 5 commits July 21, 2025 15:41

Merge remote-tracking branch 'origin/master' into arrowudf

e1ea064

batch.itercolumns

90a1b71

scalastyle

c7888b8

lint

8dbccee

error class

00cde8f

HyukjinKwon closed this in 4de8661 Jul 23, 2025

asl3 mentioned this pull request Jul 25, 2025

[SPARK-52952][PYTHON] Add PySpark UDF Type Coercion Dev Script #51663

Open
