Determining the resultant metadata of a function #450

Closed
jakirkham opened this issue Jun 7, 2022 · 6 comments
Labels
RFC Request for comments. Feature requests and proposed changes.

Comments

@jakirkham
Member

When wrapping other array libraries (as Dask and Xarray do), there is a need to determine what the result of an operation will look like in terms of its metadata. This typically happens before any real computation has begun.

For example, given a.sum(axis=0), we would like to determine the data type, shape, etc. of the resulting array without computing it. Currently this is done by carrying around an a._meta attribute holding a sample array that has similar characteristics but is much smaller and easier to operate on. This a._meta object is then passed through operations (like a._meta.sum(axis=0)), and the result is inspected to ascertain what a.sum(axis=0) would likely produce. This isn't perfect, and some cases with UDFs can get tricky (like apply_along_axis). However, it still works reasonably well for common use cases.
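A minimal sketch of the sample-array idea described above, written against plain NumPy rather than Dask's actual implementation. The helper names `make_meta` and `infer_meta` are hypothetical, and the sketch assumes that a zero-size sample is enough to recover the dtype and number of dimensions:

```python
import numpy as np

def make_meta(dtype, ndim):
    """Build a zero-size sample array with the given dtype and dimensionality."""
    return np.empty((0,) * ndim, dtype=dtype)

def infer_meta(func, meta, *args, **kwargs):
    """Apply func to the tiny sample and report the result's dtype and ndim."""
    result = np.asarray(func(meta, *args, **kwargs))
    return result.dtype, result.ndim

# A stand-in for a large 2-D float32 array: same dtype and ndim, no data.
meta = make_meta(np.float32, ndim=2)
dtype, ndim = infer_meta(lambda x: x.sum(axis=0), meta)
print(dtype, ndim)
```

The sample carries no information about the real extents, so a wrapper built this way would still have to track the actual shape separately.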

That said, it would be nice to have an API solution that does not rely on these sample computations. Admittedly there may not be an easy answer to this use case, but I wanted to raise it for discussion, given that this could be quite helpful when reasoning about applying operations to large arrays.

Note: While this comes up with Arrays, there is similar logic for DataFrames as well.

@rgommers
Member

rgommers commented Jun 8, 2022

This would be quite interesting, but I think also complex to implement? It reminds me of the meta backend in PyTorch. NumPy has some functions for parts of this, like broadcast_shapes and result_type. But doing the whole thing for all functions the API supports isn't possible with NumPy primitives AFAIK.
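For reference, the two NumPy helpers mentioned above can be used as follows. This only illustrates the existing primitives (`np.broadcast_shapes` requires NumPy 1.20 or later); it is not a complete metadata-inference API:

```python
import numpy as np

# Result shape of an element-wise binary operation, from the shapes alone.
out_shape = np.broadcast_shapes((8, 1, 5), (7, 1))
print(out_shape)  # (8, 7, 5)

# Result dtype under NumPy's promotion rules, from the dtypes alone.
out_dtype = np.result_type(np.int32, np.float32)
print(out_dtype)  # float64
```

Note that the second result reflects NumPy's own promotion rules; other libraries may promote mixed integer and floating-point inputs differently.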

> Currently this is done by carrying around an a._meta attribute with a sample array that has similar characteristics, but is much smaller and easier to operate on.

That sounds like a decent implementation choice for Dask, although you probably need a bunch of logic to deal with corner cases like 0-D arrays (?). It probably doesn't make sense for other libraries; I'd think we'd really need a classification of operations by shape behavior ("element-wise", "reduction", etc., plus one-offs) as well as casting rules (maybe as ufunc-like signatures, ii -> f?), and then from-first-principles calculations.
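A rough sketch of what such a classification could look like for element-wise binary operations. Everything here is hypothetical: the `RULES` table, the dtype names as strings, and the deliberately tiny promotion order (same-kind promotion only) are assumptions made for illustration, not part of any library:

```python
from itertools import zip_longest

# Hypothetical promotion order within one dtype kind; mixed kinds are rejected.
_ORDER = {
    "int": ["int8", "int16", "int32", "int64"],
    "float": ["float32", "float64"],
}

def _kind(dtype):
    for kind, names in _ORDER.items():
        if dtype in names:
            return kind
    raise TypeError(f"unsupported dtype: {dtype}")

def promote(a, b):
    """Return the wider of two dtype names of the same kind."""
    kind = _kind(a)
    if _kind(b) != kind:
        raise TypeError(f"no promotion defined for {a} and {b}")
    names = _ORDER[kind]
    return names[max(names.index(a), names.index(b))]

def broadcast_shape(*shapes):
    """Shape rule for element-wise operations: right-aligned broadcasting."""
    result = []
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        non_unit = {d for d in dims if d != 1}
        if len(non_unit) > 1:
            raise ValueError(f"shapes {shapes} are not broadcastable")
        result.append(non_unit.pop() if non_unit else 1)
    return tuple(reversed(result))

# Hypothetical classification: operation name -> (shape rule, dtype rule).
RULES = {
    "add": ("elementwise", "promote"),
    "multiply": ("elementwise", "promote"),
    "equal": ("elementwise", "bool"),
}

def infer_binary(op, shape_a, dtype_a, shape_b, dtype_b):
    """Infer (shape, dtype) of a binary operation from metadata alone."""
    shape_rule, dtype_rule = RULES[op]
    if shape_rule != "elementwise":
        raise NotImplementedError(shape_rule)
    shape = broadcast_shape(shape_a, shape_b)
    dtype = "bool" if dtype_rule == "bool" else promote(dtype_a, dtype_b)
    return shape, dtype

print(infer_binary("add", (8, 1, 5), "int8", (7, 1), "int32"))
print(infer_binary("equal", (3,), "float32", (1,), "float64"))
```

Reductions and one-off functions would need their own shape rules in the same table, which is where most of the real work would be.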

It'd be very nice to see an implementation if anyone has something like this floating around somewhere.

@rgommers rgommers added the RFC Request for comments. Feature requests and proposed changes. label Jun 8, 2022
@jakirkham
Member Author

> Maybe one possible implementation would be to include an array implementation for the meta array object? This way it would be straightforward to thread that object through computations to see how it would behave to infer how an actual array would behave.

Indeed. The trickiest corner cases are UDFs. We can try things there, but it is easy to miss some subtlety of the UDF, which is better captured by the user specifying what the output should look like.

How Dask approached this problem with _meta is mostly captured in the PR dask/dask#4543. There were some follow-on PRs that made various adjustments, but that PR should still give some idea of how this works.

@szha
Member

szha commented Jun 23, 2022

Some shape inference will not be possible without actually performing the computation, such as for boolean masking and unique.

@jakirkham
Member Author

Yeah, value-dependent operations would yield an undefined shape. Fortunately, the spec provides a way to handle the shape in that case:

> An array dimension must be None if and only if a dimension is unknown.
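As an illustration of how `None` dimensions could flow through shape inference, here is a small sketch. The function names are hypothetical, and the masking rule assumes NumPy-style semantics where the masked leading axes collapse into a single axis whose length depends on the data:

```python
def boolean_mask_shape(x_shape, mask_ndim):
    """Shape of x[mask]: masked leading axes become one axis of unknown length."""
    if mask_ndim > len(x_shape):
        raise ValueError("mask has more dimensions than the array")
    return (None,) + tuple(x_shape[mask_ndim:])

def unique_values_shape(x_shape):
    """Unique values form a 1-D array whose length is data dependent."""
    return (None,)

def num_elements(shape):
    """Total element count, or None as soon as any dimension is unknown."""
    total = 1
    for dim in shape:
        if dim is None:
            return None
        total *= dim
    return total

print(boolean_mask_shape((10, 3), mask_ndim=1))  # (None, 3)
print(unique_values_shape((4, 5)))               # (None,)
print(num_elements((None, 3)))                   # None
```

Downstream rules then only need to treat `None` as "unknown" and propagate it, rather than fail.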

@asmeurer
Member

asmeurer commented Sep 5, 2022

For data types, most functions do type promotion, but there are a few exceptions, like equal, which always returns bool. These categories could be spelled out in the signatures package (#411).

One challenge is that the spec only specifies a minimal set of required dtypes. It doesn't prevent libraries from supporting additional dtypes in functions.

Shape is harder, I think, because the result shape depends on things like axis keyword arguments, so you'd really need a function that determines it given a specific function and its input keyword arguments.
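To make the dependence on keyword arguments concrete, here is a sketch of a shape rule for reductions such as sum. The name `reduced_shape` is hypothetical, and the `axis` and `keepdims` handling is an assumption modeled on the usual reduction signature:

```python
def reduced_shape(shape, axis=None, keepdims=False):
    """Result shape of a reduction over `axis` (int, tuple of ints, or None)."""
    ndim = len(shape)
    if axis is None:
        axes = set(range(ndim))  # reduce over every axis
    else:
        axes = set()
        for ax in (axis if isinstance(axis, tuple) else (axis,)):
            if not -ndim <= ax < ndim:
                raise ValueError(f"axis {ax} is out of bounds for {ndim} dimensions")
            axes.add(ax % ndim)  # normalize negative axes
    if keepdims:
        return tuple(1 if i in axes else d for i, d in enumerate(shape))
    return tuple(d for i, d in enumerate(shape) if i not in axes)

print(reduced_shape((4, 5, 6), axis=0))                  # (5, 6)
print(reduced_shape((4, 5, 6), axis=-1, keepdims=True))  # (4, 5, 1)
print(reduced_shape((4, 5, 6)))                          # ()
```

The same input shape yields three different results here, which is why a per-function rule taking the keyword arguments seems unavoidable.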

I think ideally all this stuff would be encoded in the type annotations somehow.

@kgryte
Contributor

kgryte commented Jun 29, 2023

I'll go ahead and close this issue. Feel free to reopen if anyone feels like this should be discussed further. At the moment, we don't have plans to provide APIs satisfying the request in the OP.

@kgryte kgryte closed this as completed Jun 29, 2023