Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor missing methods #2058

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Refactor missing methods #2058

wants to merge 10 commits into from

Conversation

aulemahal
Copy link
Collaborator

@aulemahal aulemahal commented Jan 23, 2025

Pull Request Checklist:

What kind of change does this PR introduce?

Refactor of the Missing objects. I tried to follow a more orthodox OOP approach. In the new way:

  • Objects are initialized with their options, called with the data (+ freq, src_timestep, indexer)
  • Subclasses should override:
    • __init__ to explicitly override the signature and document their options, but this method should not do anything.
    • validate, a static method, which returns False on invalid options (this is the same as before).
    • is_missing, which receives null, count and freq. It does the same as before.
    • (optionnaly) _validate_src_timestep, to validate the src_timestep at call time. Only useful for MissingWMO which is restricted to daily inputs.
  • Before, input validation was done in a few places, now it is almost only done in __call__, which is not meant to be overriden.
  • The methods do not receivenull as a DataArrayResample object anymore, but as a normal DataArray. This allows a bit more flexibility, which I use to optimise MissingWMO by using resample_map on the longest_run condition. Benchmarking to come.
  • New MissingTwoSteps subclass used by MissingPct and AtLeastNValid (and MissingWMO, but not in a new way). This adds a subfreq option which can be used to divide the mask computation in two steps.
    1. Compute the mask at subfreq using the given method
    2. Merge the sub-groups at the target freq using the "any" method.

Does this PR introduce a breaking change?

Yes, MissingBase and all its children have been modified in breaking ways. However, these were not exposed in the public API. The convenience functions should work as they did before.

Some users, though, might have implemented custom missing methods. These will break, sorry. I hope the new way makes more sense.

Other information:

I have yet to run mypy and tools in the like to see if I really fixed #2000.

Also, I'll had some benchmarking to see if my change impacted performance. In preliminary tests, missing_wmo ran at least 10x faster on a dataset of 100 years x 50 points. And it had 1000x fewer dask tasks.

@Zeitsperre
Copy link
Collaborator

Also, I'll had some benchmarking to see if my change impacted performance. In preliminary tests, missing_wmo ran at least 10x faster on a dataset of 100 years x 50 points. And it had 1000x fewer dask tasks.

Nice!

@aulemahal
Copy link
Collaborator Author

aulemahal commented Jan 24, 2025

Benchmarking done with synthetic data of 36525x50 (10 years) with 4-year x 10 chunks. It starts at 15 keys in the dask graph.

Test is:

N := len(xc.core.missing.missing_METH(da, 'YS').__dask_graph__())
T := %timeit xc.core.missing.missing_METH(da, 'YS').load()

Except for the old missing_wmo, that executed only once. I modified the new wmo method, the values in brackets are from before that change.

           BEFORE      AFTER         IMPROVEMENT
T   ANY    0.529       0.550           ~=
N          2089        2089            ~=
T   WMO    136 (1x)    8.24 (7.94)     x15
N          320556      6720 (5955)     x45
T   PCT    0.591       0.587           ~=
N          2929        2929            ==
T   ALN    0.520       0.282           x2
N          2400        1285            x2

Thus : No change of "ANY" and "PCT". A small improvement for "at least n" and a huge one for "WMO".

@huard I realized that the WMO method had some issues:

  1. It had a test like : count != nullr.count(dim="time"), but that means that any missing value would mask the month. I think we want to add missing and invalid values and then check against the nm argument.
  2. The "longest run" check won't count holes in the time coordinate as missing values, it will only count runs of invalid (NaN) values. Considering point 1, I thinks this needs to be fixed.

@aulemahal aulemahal marked this pull request as ready for review January 24, 2025 22:05
@aulemahal aulemahal requested a review from huard January 24, 2025 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants