fix: Filter out StringDType even when the backing array is not NumpyExtensionArray #10559

Open
wants to merge 14 commits into
base: main

ilan-gold
Contributor

@ilan-gold ilan-gold commented Jul 22, 2025

This PR resolves the linked issue. More broadly, the underlying problem (letting in string dtypes whose backing array is not a NumpyExtensionArray) was bound to come up at some point, so this PR fixes it more comprehensively and tries to centralize our "whitelist".

@keewis keewis added the run-upstream (Run upstream CI) label Jul 22, 2025
@keewis
Collaborator

keewis commented Jul 22, 2025

I've added the run-upstream tag so it runs on all commits, but for future PRs you can also run the nightly CI by appending [test-upstream] to the first line of a commit message.

@ilan-gold ilan-gold changed the title from "fix: keep dtype as object for pd.StringDtype in safe_cast_to_index" to "fix: Filter out StringDType even when the backing array is not NumpyExtensionArray" Jul 23, 2025
@github-actions github-actions bot added the topic-arrays (related to flexible array support) label Jul 23, 2025
@@ -7293,7 +7293,7 @@ def from_dataframe(cls, dataframe: pd.DataFrame, sparse: bool = False) -> Self:
         arrays = []
         extension_arrays = []
         for k, v in dataframe.items():
-            if not is_extension_array_dtype(v) or isinstance(
+            if not is_allowed_extension_array(v) or isinstance(
                 v.array, UNSUPPORTED_EXTENSION_ARRAY_TYPES
Contributor Author

I didn't add UNSUPPORTED_EXTENSION_ARRAY_TYPES to the new is_allowed_extension_array function because, I think, we do allow those types as backing arrays for Index objects. Maybe this should get a test?

Contributor

I think we should merge these checks.

Contributor Author

I think we need to account for internal duck array support, i.e. the support that preserves the dtype of extension array indices whose types are in the UNSUPPORTED_EXTENSION_ARRAY_TYPES whitelist. See the note here:

# This does not use the UNSUPPORTED_EXTENSION_ARRAY_TYPES whitelist because
# we do support extension arrays from datetime, for example, that need
# duck array support internally via this class.
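To make the discussion above concrete, here is a minimal sketch of what a centralized allow-list check could look like. The name `is_allowed_extension_array_sketch` and its body are assumptions for illustration only; the PR's actual `is_allowed_extension_array` may be implemented differently. It deliberately leaves UNSUPPORTED_EXTENSION_ARRAY_TYPES out of the check, mirroring the reasoning in this thread.

```python
import pandas as pd


def is_allowed_extension_array_sketch(values) -> bool:
    """Hypothetical allow-list check, not the PR's implementation.

    Accepts pandas extension dtypes but rejects pd.StringDtype based on the
    dtype alone, so the type of the backing array does not matter.
    """
    dtype = getattr(values, "dtype", None)
    if dtype is None:
        return False
    return pd.api.types.is_extension_array_dtype(dtype) and not isinstance(
        dtype, pd.StringDtype
    )


categorical = pd.Series(["a", "b"], dtype="category")
strings = pd.Series(["a", "b"], dtype="string")
plain = pd.Series([1.0, 2.0])

allowed = {
    "categorical": is_allowed_extension_array_sketch(categorical),
    "strings": is_allowed_extension_array_sketch(strings),
    "plain": is_allowed_extension_array_sketch(plain),
}
```

Checking the dtype rather than the class of `v.array` is the design choice that matters here: a string column is filtered out regardless of which array class backs it.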

@dcherian
Contributor

Sadly these upstream failures seem related. @kmuehlbauer may be able to help reason through them

@ilan-gold
Contributor Author

@dcherian I think they're only related insofar as they come from pandas>3.0. I tracked down the issue, and it comes from these lines:

as_series = pd.Series(values.ravel(), copy=False)
result = np.asarray(as_series).reshape(values.shape)

It appears (I have no experience with this, so correct me if I'm wrong) that, for certain netcdf4 variables, the "real" type is encoded in dtype("O").metadata, and this dtype.metadata is not preserved after a pd.Series call. I added a fix for it, but in a separate PR, as I'd first like to ensure that this PR does what it advertises.
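The round trip described above can be reproduced with a small standalone sketch. The `{"vlen": str}` metadata payload and the "copy into an array of the original dtype" workaround are assumptions for illustration; they are not taken from the separate fix PR, and whether the metadata survives the `pd.Series` call may depend on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Object dtype carrying extra metadata. The {"vlen": str} payload is an
# illustrative assumption, not necessarily what the failing variables use.
tagged = np.dtype("O", metadata={"vlen": str})
values = np.empty((2, 2), dtype=tagged)
values[...] = [["a", "b"], ["c", "d"]]

# The round trip quoted in the comment above.
as_series = pd.Series(values.ravel(), copy=False)
result = np.asarray(as_series).reshape(values.shape)

# Record whether the metadata survived; this may differ between pandas versions.
metadata_survived = dict(result.dtype.metadata or {}) == dict(values.dtype.metadata)

# Hypothetical workaround: copy the data into a fresh array allocated with the
# original dtype, so the metadata is attached again regardless of the above.
restored = np.empty(values.shape, dtype=values.dtype)
restored[...] = result
```

Allocating the destination with the original dtype avoids relying on how a cast between two equivalent object dtypes treats metadata.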

Labels
run-upstream (Run upstream CI), topic-arrays (related to flexible array support), topic-indexing
Development

Successfully merging this pull request may close these issues.

String coords broken in pandas git tip
3 participants