Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String encoding in structured metadata arrays #3097

Open
benjeffery opened this issue Mar 7, 2025 · 5 comments
Open

String encoding in structured metadata arrays #3097

benjeffery opened this issue Mar 7, 2025 · 5 comments
Milestone

Comments

@benjeffery
Copy link
Member

#3091 introduced returning a numpy structured array from a compatible metadata buffer. The StructCodec allows the specification of string encoding in the schema, however numpy only supports bytes and utf-32 in S and U dtypes. #3091 therefore returns structured arrays with only the S dtype for each string field.

At the cost of a copy and some shuffling, it would be possible to decode these to the users specification in the schema using numpy.char.decode then reassigning the encoded string array back into the structured array. This could be implemented as an option with a boolean flag decode_strings to structured_array_from_buffer, however as ts.X_metadata is a property this couldn't be set when retrieving the array, so either an additional property on the ts (ts.X_metadata_string_decode?) or leaving the user to do this via the lower-level code.

@jeromekelleher
Copy link
Member

Does the numpy 2.0 StringDtype alter this logic much? I'm considering implementing access to the ancestral_state etc arrays via some low-level C code using the StringDtype (which we can conditionally compile for numpy 2). I figure it's OK to have new features depend on numpy 2, once old code continues to work, and string handling is something that numpy 2 should be significantly better at.

@benjeffery
Copy link
Member Author

From the docs I can't see that StringDtype changes anything about the encoding.

@jeromekelleher
Copy link
Member

OK, can we park strings for a bit so and just raise an error if the schema as a string field? I want to play with this myself on the SC2 data to make sure it all scales.

@benjeffery
Copy link
Member Author

I'm not sure we need to error - the current code deals with strings, it's just that they have the S dtype and give a bytes object when you do something like table.node_metadata['sting_name'][26].

@jeromekelleher
Copy link
Member

Right but we'd like that to be a string rather than bytes. Raising an error makes sure we don't forget this and release accidentally.

@benjeffery benjeffery added this to the Python 0.6.1 milestone Mar 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants