[SPARK-52593][PS] Avoid CAST_INVALID_INPUT of Series.dot and DataFrame.dot in ANSI mode #51310
base: master
@ueshin may I get a review please?
if sorted(ps.Index(self.index).tolist()) != sorted(
    ps.Index(other.index).tolist()
):
I guess this means index.sort_values() has a CAST_INVALID_INPUT issue?
It's equals that fails:
>>> psidx1 = ps.Index(['x', 'y', 'z'])
>>> psidx2 = ps.Index([1, 2, 3])
>>> psidx1.sort_values().equals(psidx2.sort_values())
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] The value 'x' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
== DataFrame ==
"__eq__" was called from
<stdin>:1
I know this collects the index to the driver which is not ideal, but I haven’t found an alternative yet
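For illustration, a minimal sketch of the driver-side comparison used in the diff above. Plain Python lists stand in for the result of ps.Index(...).tolist(), and the helper name index_values_match is hypothetical, not part of the PR. Because the comparison happens in plain Python, no implicit STRING to BIGINT cast is attempted:

```python
def index_values_match(left_values, right_values):
    """Return True when both collected indexes hold the same labels, ignoring order.

    Hypothetical helper: the arguments stand in for ps.Index(...).tolist().
    """
    # Each list is sorted on its own, so string labels are never compared
    # against integer labels during sorting. (A single index that mixes
    # str and int labels would still make sorted() raise TypeError.)
    # List inequality in Python simply reports "not equal" for 'x' vs 1;
    # nothing is cast, so there is no CAST_INVALID_INPUT to trigger.
    return sorted(left_values) == sorted(right_values)


print(index_values_match(["x", "y", "z"], [1, 2, 3]))  # False
print(index_values_match([3, 1, 2], [1, 2, 3]))        # True
```

The trade-off is the one noted above: both indexes have to be collected to the driver before they can be compared.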
Let's fix equals, then. We should fix it anyway.
The other comparisons as well? Like !=, <, <=, >, and >=?
What changes were proposed in this pull request?
Avoid CAST_INVALID_INPUT of Series.dot and DataFrame.dot in ANSI mode.
Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52556.
Does this PR introduce any user-facing change?
Yes. Series.dot raises the expected error in ANSI mode, for example:
FROM
TO
SAME AS ANSI OFF
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No