[SPARK-53029][PYTHON] Support return type coercion for Arrow Python UDTFs #52140
Conversation
""" | ||
|
||
def __init__(self, table_arg_offsets=None): | ||
def __init__(self, table_arg_offsets=None, arrow_cast=False): |
the default value should be True?
yep! I changed it to True and added a SQLConf to gate it.
    yield pa.Table.from_struct_array(pa.array([{}] * 3))
-   assertDataFrameEqual(EmptyResultUDTF(), [Row(), Row(), Row()])
+   assertDataFrameEqual(EmptyResultUDTF(), [None, None, None])
I guess this change is unexpected?
Good catch! I have reverted it and created an empty batch with the number of rows set.
Thanks for supporting this!
if batch.num_columns == 0:
    # When batch has no column, it should still create
    # an empty batch with the number of rows set.
    struct = pa.array([{}] * batch.num_rows)
    coerced_batch = pa.RecordBatch.from_arrays([struct], ["_0"])
I don't think we need to handle this case? cc @ueshin
This is to ensure the test case "test_arrow_udtf_with_empty_column_result" works. Please refer to #52140 (comment) for the unexpected behavior change.
I guess this will be done in super().dump_stream(), too?
""" | ||
|
||
def __init__(self, table_arg_offsets=None): | ||
def __init__(self, table_arg_offsets=None, arrow_cast=True): |
Let's enable arrow_cast by default for ArrowUDTFs (it's a new feature) so we don't need a flag here.
if arr.type == arrow_type:
    return arr
elif self._arrow_cast:
    return arr.cast(target_type=arrow_type, safe=True)
What's the difference between safe=True vs safe=False?
safe=True only allows casts that are guaranteed not to lose information: truncation (floats to ints), narrowing (int64 → int8), and precision loss are all rejected. Will add a comment.
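For illustration (not code from this PR), a standalone PyArrow sketch of the difference:

import pyarrow as pa

arr = pa.array([1.5, 2.7], type=pa.float64())

# safe=False truncates silently and yields [1, 2]
print(arr.cast(pa.int32(), safe=False))

# safe=True refuses any lossy conversion and raises ArrowInvalid
try:
    arr.cast(pa.int32(), safe=True)
except pa.ArrowInvalid as e:
    print(e)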
cc @zhengruifeng is this the same behavior as Arrow UDFs?
Arrow UDF has the same implementation. cc: @zhengruifeng please keep me honest.
Update: we now changed it to RecordBatch.cast, which should have the same behavior as arr.cast but is more performant. cc: @ueshin
assert isinstance(
    batch, pa.RecordBatch
), f"Expected pa.RecordBatch, got {type(batch)}"
I think we already check this in worker.py, so no need to duplicate this check :)
raise PySparkRuntimeError(
    errorClass="UDTF_RETURN_SCHEMA_MISMATCH",
    messageParameters={
        "expected": str(len(arrow_return_type)),
        "actual": str(batch.num_columns),
        "func": "ArrowUDTF",
    },
)
Ditto. I think we already check whether the returned columns mismatch the expected return schema in worker.py. Would you mind double checking?
Do you mean verify_arrow_result in worker.py? I removed it since verify_arrow_result requires the return type to strictly match arrow_return_type in the conversion pa.Table.from_batches([result], schema=pa.schema(list(arrow_return_type))).
verify_arrow_result(
pa.Table.from_batches([result], schema=pa.schema(list(arrow_return_type))),
assign_cols_by_name=False,
expected_cols_and_types=[
(col.name, to_arrow_type(col.dataType)) for col in return_type.fields
],
)
The column length is checked before it. Please take a look at:
if result.num_columns != return_type_size:
    ...
in verify_result.
if arr.type == arrow_type:
    return arr
elif self._arrow_cast:
    return arr.cast(target_type=arrow_type, safe=True)
Also, it would be great to list the type coercion rules here!
Added a comment.
with self.assertRaisesRegex(PythonException, "Schema at index 0 was different"):
    result_df = MismatchedSchemaUDTF()
    result_df.collect()
if self.spark.conf.get("spark.sql.execution.pythonUDTF.typeCoercion.enabled").lower() == "false":
You can use with self.sql_conf(...) here.
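A minimal sketch of that suggestion inside the existing test class (assuming PySpark's standard sql_conf test helper, which sets the conf for the block and restores it on exit):

with self.sql_conf({"spark.sql.execution.pythonUDTF.typeCoercion.enabled": False}):
    with self.assertRaisesRegex(PythonException, "Schema at index 0 was different"):
        MismatchedSchemaUDTF().collect()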
val PYTHON_TABLE_UDF_TYPE_CORERION_ENABLED =
  buildConf("spark.sql.execution.pythonUDTF.typeCoercion.enabled")
Let's enable Arrow cast for Arrow Python UDTFs by default so we don't need this config :)
sure, on it
Update: done
{
    "wrong_col": pa.array([1], type=pa.int32()),
    "another_wrong_col": pa.array([2.5], type=pa.float64()),
    "col_with_arrow_cast": pa.array([1], type=pa.int32()),
What if we have input to be int64 and output to be int32? Does arrow cast throw an exception in this case?
Yes, it will. We have a test case "test_return_type_coercion_overflow" for it.
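As a standalone illustration of that overflow behavior with a safe Arrow cast (not the PR's test itself):

import pyarrow as pa

big = pa.array([2**40], type=pa.int64())
try:
    big.cast(pa.int32(), safe=True)  # 2**40 does not fit into int32
except pa.ArrowInvalid as e:
    print(e)  # integer value not in range for int32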
if arr.type == arrow_type:
    return arr
elif self._arrow_cast:
    return arr.cast(target_type=arrow_type, safe=True)
cc @zhengruifeng is this the same behavior as Arrow UDFs?
        result_df = MismatchedSchemaUDTF()
        result_df.collect()
else:
    with self.assertRaisesRegex(PythonException, "Failed to parse string: 'wrong_col' as a scalar of type int32"):
Hmm looks like without arrow cast, the error message looks better.
I added a try-catch block to polish the error message with the arrow cast
for packed in iterator:
    batch, arrow_return_type = packed
nit:
-   for packed in iterator:
-       batch, arrow_return_type = packed
+   for batch, arrow_return_type in iterator:
.booleanConf
.createWithDefault(false)
nit: revert this?
python/pyspark/worker.py
ser = ArrowStreamArrowUDTFSerializer(
    table_arg_offsets=table_arg_offsets
)
Looks like an unnecessary change?
pass
ditto.
Could you run ./dev/reformat-python to make the linter happy?
raise PySparkRuntimeError(
    errorClass="UDTF_RETURN_SCHEMA_MISMATCH",
    messageParameters={
        "expected": str(len(arrow_return_type)),
        "actual": str(batch.num_columns),
        "func": "ArrowUDTF",
    },
)
The column length is checked before it. Please take a look at:
if result.num_columns != return_type_size:
    ...
in verify_result.
if should_write_start_length:
    write_int(SpecialLengths.START_ARROW_STREAM, stream)
    should_write_start_length = False

yield coerced_batch
These are done in super().dump_stream(). What we should do here is just the type-casting.
I'm just wondering whether we can use RecordBatch.cast for this instead of casting each column?
Makes sense! I changed it to RecordBatch.cast.
yield coerced_batch, arrow_return_type

return super(ArrowStreamArrowUDTFSerializer, self).dump_stream(
nit: super().dump_stream ...
if expected_type and actual_type:
    error_msg = f"Expected: {expected_type}, but got: {actual_type} in field '{expected_field.name}'."
else:
    error_msg = f"Expected: {target_schema}, but got: {batch.schema}."
I guess this case is enough as an error message?
Allison was asking for a better error message :)
#52140 (comment)
error_msg = f"Expected: {target_schema}, but got: {batch.schema}."

raise PySparkTypeError(
    "Arrow UDTFs require the return type to match the expected Arrow type."
nit: "... Arrow type. " to have a space between this and error_msg.
LGTM, pending tests.
if batch.num_columns == 0:
    coerced_batch = batch  # skip type coercion
else:
    expected_field_names = [field.name for field in arrow_return_type]
nit: we can use arrow_return_type.names instead?
coerced_array = self._create_array(original_array, field.type)
coerced_arrays.append(coerced_array)
coerced_batch = pa.RecordBatch.from_arrays(
    coerced_arrays, names=arrow_return_type.names
nit: expected_field_names or actual_field_names?
they are the same here :)
    result_table = pa.table(
        {
-           "id": pa.array(["abc", "def", "xyz"], type=pa.string()),
+           "id": pa.array(["1", "2", "xyz"], type=pa.string()),
Does this work if it's pa.array(["1", "2", "3"])? Shall we have a test to confirm?
Casting from "1" to 1 should work. I added test_arrow_udtf_type_coercion_string_to_int_safe.
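A quick standalone PyArrow check of both outcomes (illustrative only, not the PR's test):

import pyarrow as pa

print(pa.array(["1", "2", "3"]).cast(pa.int32()))  # parses cleanly to [1, 2, 3]
try:
    pa.array(["1", "2", "xyz"]).cast(pa.int32())
except pa.ArrowInvalid as e:
    print(e)  # Failed to parse string: 'xyz' as a scalar of type int32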
raise PySparkTypeError(
    "Arrow UDTFs require the return type to match the expected Arrow type. "
    f"Expected: {arrow_type}, but got: {arr.type}."
Nit: Can we use an error class here?
Added :) The exception thrown is not structured though. I use assertRaisesRegex to check the error:
An exception was thrown from the Python worker. Please see the stack trace below.
expected_field_names = arrow_return_type.names
actual_field_names = batch.schema.names

if expected_field_names != actual_field_names:
    raise PySparkTypeError(
        "Target schema's field names are not matching the record batch's "
        "field names. "
        f"Expected: {expected_field_names}, but got: {actual_field_names}."
    )
Hmm, we didn't check this in verify_result? It's a much better error message than before, but we should take a look at how to merge this with verify_result.
Do you mean verify_arrow_result in worker.py? I removed it since verify_arrow_result requires the return type to strictly match arrow_return_type in the conversion pa.Table.from_batches([result], schema=pa.schema(list(arrow_return_type))).
verify_arrow_result(
pa.Table.from_batches([result], schema=pa.schema(list(arrow_return_type))),
assign_cols_by_name=False,
expected_cols_and_types=[
(col.name, to_arrow_type(col.dataType)) for col in return_type.fields
],
)
Can I refactor verify_arrow_result in a follow-up PR and integrate the result verification there, to unblock this PR? The current verify_arrow_result is primarily for Arrow UDFs.
original_array = batch.column(i)
coerced_array = self._create_array(original_array, field.type)
coerced_arrays.append(coerced_array)
Can we directly use batch.cast(arrow_return_type)? And the default safe parameter should be True.
I discussed this with @ueshin offline. Unfortunately, RecordBatch.cast isn’t available in the currently required minimal PyArrow version. We’ll need to bump the minimum requirement to support it.
Got it. Can we add a comment here mentioning why we don't use RecordBatch.cast and what's the minimum PyArrow version to support it?
Thanks for supporting this!
},
"RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF": {
    "message": [
        "Column names of the returned pyarrow.Table do not match specified schema. Expected: <expected> Actual: <actual>"
This is not necessarily a pyarrow.Table (it can be a columnar batch). How about we just say, "Column names of the returned table do not match ..."?
original_array = batch.column(i)
coerced_array = self._create_array(original_array, field.type)
coerced_arrays.append(coerced_array)
Got it. Can we add a comment here mentioning why we don't use RecordBatch.cast and what's the minimum PyArrow version to support it?
# when safe is True, the cast will fail if there's an overflow or other
# unsafe conversion.
# RecordBatch.cast(...) isn't used as the minimum PyArrow version
# required for RecordBatch.cast(...) is v21.0.0
RecordBatch.cast is available since 16.0.
https://arrow.apache.org/docs/16.0/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch.cast
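For reference, a minimal example of that API on PyArrow >= 16.0 (illustrative):

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2], type=pa.int32())], names=["x"])
target = pa.schema([pa.field("x", pa.int64())])
print(batch.cast(target, safe=True))  # widening int32 -> int64 is lossless; lossy casts raise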
Thanks! Merging to master.
What changes were proposed in this pull request?
Support return type coercion for Arrow Python UDTFs by doing arrow_cast by default.
Why are the changes needed?
Consistent behavior across Arrow UDFs and Arrow UDTFs
Does this PR introduce any user-facing change?
No, Arrow UDTF is not a public API yet
How was this patch tested?
New and existing UTs
Was this patch authored or co-authored using generative AI tooling?
No