@allisonwang-db allisonwang-db commented Aug 28, 2025

What changes were proposed in this pull request?

This PR adds more tests covering table argument support in Arrow Python UDTFs.
Writing these tests also exposed some existing issues that need to be fixed:

  • SPARK-53387: Support PARTITION BY clause with Python Arrow UDTF
  • SPARK-53426: Support named table argument with asTable() API

Why are the changes needed?

To improve test coverage

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

Yes

@allisonwang-db allisonwang-db changed the title from "[SPARK-53425][PYTHON][TESTS] Add more able argument tests for Arrow Python UDTFs" to "[SPARK-53425][PYTHON][TESTS] Add more table argument tests for Arrow Python UDTFs" on Aug 28, 2025
Comment on lines 811 to 1637
# TODO(SPARK-53426): Support named table argument with DataFrame API
# input_df = self.spark.range(3) # [0, 1, 2]
# result_df = NamedArgsUDTF(table_data=input_df.asTable(), multiplier=lit(5))
Contributor Author

Member

Fix: #52171

Member

@xinrong-meng xinrong-meng Aug 29, 2025

Thank you @ueshin for the fix!

Contributor Author

Thanks, the test passes now!

@allisonwang-db allisonwang-db force-pushed the spark-53425-tbl-arg-tests branch from 4b03475 to 709ab71 Compare September 12, 2025 22:55
@allisonwang-db allisonwang-db force-pushed the spark-53425-tbl-arg-tests branch from 709ab71 to d13c7ce Compare September 22, 2025 21:30
Member

@ueshin ueshin left a comment

Otherwise, LGTM.

Comment on lines +1439 to +1445
result_df = self.spark.sql(
    """
    SELECT * FROM partition_sum_udtf(
        TABLE(partition_test_data) PARTITION BY category
    ) ORDER BY partition_key
    """
)
Member

This is also potentially flaky, like the tests in the previous PR. Use terminate to make it more stable?

Contributor Author

Thanks for pointing this out. Fixed.
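
For readers following the thread, here is a minimal sketch of why emitting from terminate can be more stable. It is plain Python and does not use the Spark API. It assumes the flakiness comes from one partition reaching eval in more than one batch, which the thread does not state explicitly. All names (PerBatchSum, TerminateSum, run_partition) are hypothetical and are not taken from the PR.

```python
from typing import Iterator, List, Optional, Tuple

Row = Tuple[str, int]  # (category, value); a stand-in for one Arrow batch row


class PerBatchSum:
    """Emits one row per eval call, so the output depends on batch boundaries."""

    def eval(self, batch: List[Row]) -> Iterator[Row]:
        yield batch[0][0], sum(value for _, value in batch)


class TerminateSum:
    """Accumulates in eval and emits a single row from terminate."""

    def __init__(self) -> None:
        self.key: Optional[str] = None
        self.total = 0

    def eval(self, batch: List[Row]) -> Iterator[Row]:
        for key, value in batch:
            self.key = key
            self.total += value
        return iter(())  # nothing is emitted per batch

    def terminate(self) -> Iterator[Row]:
        if self.key is not None:
            yield self.key, self.total


def run_partition(udtf_cls, rows: List[Row], batch_size: int) -> List[Row]:
    """Hypothetical driver: feeds one partition to a fresh instance in batches."""
    udtf = udtf_cls()
    out: List[Row] = []
    for start in range(0, len(rows), batch_size):
        out.extend(udtf.eval(rows[start:start + batch_size]))
    if hasattr(udtf, "terminate"):
        out.extend(udtf.terminate())
    return out


partition_a = [("A", 10), ("A", 20), ("A", 30)]

# The per-batch variant changes its output when the batch size changes.
one_batch = run_partition(PerBatchSum, partition_a, batch_size=3)
two_batches = run_partition(PerBatchSum, partition_a, batch_size=2)

# The terminate variant gives the same single row either way.
stable_one = run_partition(TerminateSum, partition_a, batch_size=3)
stable_two = run_partition(TerminateSum, partition_a, batch_size=2)
```

Under that assumption, a test that asserts on per-batch output passes or fails depending on how the engine happens to split the partition, while a test that asserts on the terminate output does not.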

Comment on lines 1485 to 1492
result_df = self.spark.sql(
    """
    SELECT * FROM dept_status_count_udtf(
        TABLE(SELECT * FROM employee_data)
        PARTITION BY (department, status)
    ) ORDER BY dept, status
    """
)
Member

ditto.

@allisonwang-db
Contributor Author

Thanks! Merging to master

@zhengruifeng
Contributor

late LGTM

4 participants