Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return period calculation based on a Gumbel-distribution fit to sample data #29

Draft
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

chpolste
Copy link
Member

@chpolste chpolste commented Jan 6, 2025

Adds distribution and extreme_values submodules to earthkit.plots.stats for computing recurrence statistics based on a Gumbel distribution fit to a data sample. Can be used, e.g., to compute return periods of extreme precipitation of flooding events in a given time interval.

The implementation is class-based, since the fitted distribution parameters need to be stored for subsequent invocations of the statistics calculations. The interface for the distributions so far matches that of scipy's rv_continuous and the distributions are only reimplemented here to allow for fitting of multiple distributions in a vectorised fashion along a given axis (scipy only fits to 1D samples).

The Gumbel fit makes use of scipy's lmoment function, which was only added in the latest 1.15 release, so a vectorized implementation by @corentincarton is included for environments with older versions of scipy. The included implementation only works along the first array axis, but that covers the common use-case of fitting a distribution to a (time)series of fields. The scipy implementation supports application along any axis and is therefore preferred, despite being a little slower.

chpolste and others added 6 commits December 19, 2024 18:38
The plan is to exclusively rely on scipy for the functionality in the future,
but for now to provide a substitute that doesn't rely on a version of scipy
that is still under development (as of Dec 2024).

Co-authored-by: Christopher Polster <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.72%. Comparing base (756dcf2) to head (0a84c03).
Report is 10 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #29      +/-   ##
===========================================
+ Coverage    98.68%   98.72%   +0.04%     
===========================================
  Files            8        8              
  Lines          684      708      +24     
  Branches        26       26              
===========================================
+ Hits           675      699      +24     
  Misses           7        7              
  Partials         2        2              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@chpolste chpolste marked this pull request as draft January 22, 2025 15:25
axis: int
The axis along which to compute the parameters.
"""
try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be done only once not every time fit() is called?

@chpolste
Copy link
Member Author

I've been going over the interface again, after a discussion with @corentincarton. We had agreed to convert the currently class-based MaximumStatistics into a set of functions, but I'm getting doubts about this interface as I'm implementing it. In short, I still see value in the class with regard to a metadata-aware interface.

import earthkit.meteo as ekm

E.g., changing the method MaximumStatistics.return_period_from_threshold into a function return_period gives

def return_period(dist, value, freq=1.0):
    return freq / dist.cdf(value)

to be used in the following way:

dist = ekm.stats.MaxGumbel(data)
rp = ekm.stats.return_period(dist, 30., freq=np.timedelta64(500, "s"))

I don't like how this separates the information about the frequency from the data. The frequency is a property of the data and I would see this information being automatically extracted from, e.g., a pandas DataFrame in the eventual metadata-aware interface. This makes more sense in the current class-based interface:

stats = ekm.stats.MaximumStatistics(data, freq=np.timedelta64(500, "s"))
rp = stats.return_period_of_threshold(30.)

In the future, this could just be

stats = ekm.stats.MaximumStatistics(df)
rp = stats.return_period_of_threshold(30.)

with freq being inferred from the index of the DataFrame in the constructor. Of course, we could consider carrying the frequency with the distribution object for the functional interface, but the two are not really related.

@tlmquintino
Copy link
Member

@chpolste thanks for looking into the comments from @corentincarton
I do think that the functional implementation is still more in line with the architecture of earthkit.
I see keeping the frequency a together with the data more a responsibility of the client code that calls earthkit.
Earthkit-meteo is more about implementing algorithms and making them easily accessible, and then let then client code manage their data structures.

@corentincarton
Copy link
Collaborator

Thanks @tlmquintino for clarifying! @chpolste, I'm now wondering if we shouldn't keep the frequency close to the sample, i.e. in the distribution class. Wouldn't that make more sense? Let's assume we would build a new distribution from sigma and mu (without the sample), then we would still need the frequency of the sample to use it, right? So we could build the distribution (storing mu, sigma, and freq), then we can create a series of functions taking the distribution as input. What do you think?

return np.expand_dims(arr, axis=list(range(-ndim, 0)))


class MaximumValueDistribution:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we call it GumbelDistribution instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to call it by what it means for a meteorologist rather than what it is mathematically. A meteorologist might not know what a Gumbel distribution is but they will know that they have a time series of maximum values they want to do statistics with. Before, there was an additional layer between the distribution class and return period function that did this translation of terms, but now users instantiate the distribution directly. I decided to keep the maths in the docs and the application in the class name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the Gumbel distribution is not the only distribution we can use. So we could implement another approach that does the same thing in the future which will make the name confusing. If we have another approach, MaximumValueDistribution would rather be the name of the base class and GumbelDistribution and OtherDistribution (placeholder) the inherited classes. Unless we go for a distribution_function as an argument in the constructor, but we'll probably end up with a bunch of if statements all over the place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we match scipy and distinguish between right and left-skewed Gumbel distributions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GumbelDistribution for maximum values and, e.g., MirroredGumbelDistribution for minimum values also works.

return self.mu - self.sigma * np.log(-np.log(1.0 - p))


def return_period(dist, value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe value_to_return_period() and return_period_to_value() for the symmetry?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants