[SPARK-36771][PYTHON][3.2] Fix pop of Categorical Series #34063

Closed

Member

@xinrong-meng xinrong-meng commented Sep 21, 2021

What changes were proposed in this pull request?

Fix pop of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.

This is a backport of #34052.

Why are the changes needed?

As reported in databricks/koalas#2198, pandas API on Spark behaves differently from pandas when calling pop on a Categorical Series.

Does this PR introduce any user-facing change?

Yes. pop on a Categorical Series now returns the category value (for example 'a') instead of an integer (for example 0).

From

>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

To

>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
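
The old result `0` for both `pop(0)` and `pop(3)` is consistent with the integer category code being returned instead of the decoded value, since both rows hold 'a'. The following is a minimal pure-Python sketch of that difference, not the actual pandas-on-Spark implementation; the names `pop_code` and `pop_value` and the dict-based storage are hypothetical and exist only for illustration.

```python
# Hypothetical model of a categorical column: integer codes plus a category list.
# This is an illustration only, not how pandas API on Spark is implemented.
categories = ["a", "b", "c"]
codes = {0: 0, 1: 1, 2: 2, 3: 0}  # index label -> code for ["a", "b", "c", "a"]


def pop_code(stored, label):
    """Remove the entry and return the raw stored code (the old-looking result)."""
    return stored.pop(label)


def pop_value(stored, cats, label):
    """Remove the entry and decode its code through the categories (pandas-like result)."""
    return cats[stored.pop(label)]


old = pop_code(dict(codes), 0)
new = pop_value(dict(codes), categories, 0)
print(old, new)  # 0 a
```

In both variants the category list itself is left untouched, which matches the unchanged `Categories (3, object): ['a', 'b', 'c']` line in the outputs above.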

How was this patch tested?

Unit tests.
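
The tests themselves are not shown in this conversation. As a sketch of the pandas reference behavior that the fix is meant to match, the same sequence can be run against plain pandas (assuming pandas 1.3.2 or later is installed); comparing these results with the pandas-on-Spark results is an assumption about how such a test might be structured.

```python
import pandas as pd

# Same data as in the description above, built with plain pandas.
pser = pd.Series(["a", "b", "c", "a"], dtype="category")

first = pser.pop(0)  # removes the row labelled 0 and returns its value
last = pser.pop(3)   # removes the row labelled 3 and returns its value

print(first, last)
print(list(pser))
print(list(pser.cat.categories))
```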

@xinrong-meng
Member Author

CC @ueshin @HyukjinKwon @itholic

@ueshin ueshin changed the title from "[3.2][SPARK-36771][PYTHON] Fix pop of Categorical Series" to "[SPARK-36771][PYTHON][3.2] Fix pop of Categorical Series" Sep 21, 2021
Member

@ueshin ueshin left a comment

LGTM, pending tests.

@SparkQA

SparkQA commented Sep 21, 2021

Test build #143487 has finished for PR 34063 at commit e9d11ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47998/

@xinrong-meng xinrong-meng changed the title [SPARK-36771][PYTHON][3.2] Fix pop of Categorical Series [SPARK-36771][PYTHON][3.2] Fix pop of Categorical Series Sep 21, 2021
@SparkQA

SparkQA commented Sep 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47998/

@ueshin
Member

ueshin commented Sep 22, 2021

Thanks! merging to branch-3.2.

ueshin pushed a commit that referenced this pull request Sep 22, 2021
Closes #34063 from xinrong-databricks/backport_cat_pop.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
@ueshin ueshin closed this Sep 22, 2021