-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return period calculation based on a Gumbel-distribution fit to sample data #29
base: develop
Are you sure you want to change the base?
Conversation
The plan is to exclusively rely on scipy for the functionality in the future, but for now to provide a substitute that doesn't rely on a version of scipy that is still under development (as of Dec 2024). Co-authored-by: Christopher Polster <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #29 +/- ##
===========================================
+ Coverage 98.68% 98.72% +0.04%
===========================================
Files 8 8
Lines 684 708 +24
Branches 26 26
===========================================
+ Hits 675 699 +24
Misses 7 7
Partials 2 2 ☔ View full report in Codecov by Sentry. |
axis: int | ||
The axis along which to compute the parameters. | ||
""" | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be done only once not every time fit()
is called?
I've been going over the interface again, after a discussion with @corentincarton. We had agreed to convert the currently class-based import earthkit.meteo as ekm E.g., changing the method def return_period(dist, value, freq=1.0):
return freq / dist.cdf(value) to be used in the following way: dist = ekm.stats.MaxGumbel(data)
rp = ekm.stats.return_period(dist, 30., freq=np.timedelta64(500, "s")) I don't like how this separates the information about the frequency from the data. The frequency is a property of the data and I would see this information being automatically extracted from, e.g., a pandas DataFrame in the eventual metadata-aware interface. This makes more sense in the current class-based interface: stats = ekm.stats.MaximumStatistics(data, freq=np.timedelta64(500, "s"))
rp = stats.return_period_of_threshold(30.) In the future, this could just be stats = ekm.stats.MaximumStatistics(df)
rp = stats.return_period_of_threshold(30.) with |
@chpolste thanks for looking into the comments from @corentincarton |
Thanks @tlmquintino for clarifying! @chpolste, I'm now wondering if we shouldn't keep the frequency close to the sample, i.e. in the distribution class. Wouldn't that make more sense? Let's assume we would build a new distribution from sigma and mu (without the sample), then we would still need the frequency of the sample to use it, right? So we could build the distribution (storing mu, sigma, and freq), then we can create a series of functions taking the distribution as input. What do you think? |
return np.expand_dims(arr, axis=list(range(-ndim, 0))) | ||
|
||
|
||
class MaximumValueDistribution: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we call it GumbelDistribution instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to call it by what it means for a meteorologist rather than what it is mathematically. A meteorologist might not know what a Gumbel distribution is but they will know that they have a time series of maximum values they want to do statistics with. Before, there was an additional layer between the distribution class and return period function that did this translation of terms, but now users instantiate the distribution directly. I decided to keep the maths in the docs and the application in the class name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the Gumbel distribution is not the only distribution we can use. So we could implement another approach that does the same thing in the future which will make the name confusing. If we have another approach, MaximumValueDistribution would rather be the name of the base class and GumbelDistribution and OtherDistribution (placeholder) the inherited classes. Unless we go for a distribution_function as an argument in the constructor, but we'll probably end up with a bunch of if statements all over the place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we match scipy and distinguish between right and left-skewed Gumbel distributions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GumbelDistribution
for maximum values and, e.g., MirroredGumbelDistribution
for minimum values also works.
return self.mu - self.sigma * np.log(-np.log(1.0 - p)) | ||
|
||
|
||
def return_period(dist, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe value_to_return_period() and return_period_to_value() for the symmetry?
Adds distribution and extreme_values submodules to earthkit.plots.stats for computing recurrence statistics based on a Gumbel distribution fit to a data sample. Can be used, e.g., to compute return periods of extreme precipitation of flooding events in a given time interval.
The implementation is class-based, since the fitted distribution parameters need to be stored for subsequent invocations of the statistics calculations. The interface for the distributions so far matches that of scipy's rv_continuous and the distributions are only reimplemented here to allow for fitting of multiple distributions in a vectorised fashion along a given axis (scipy only fits to 1D samples).
The Gumbel fit makes use of scipy's lmoment function, which was only added in the latest 1.15 release, so a vectorized implementation by @corentincarton is included for environments with older versions of scipy. The included implementation only works along the first array axis, but that covers the common use-case of fitting a distribution to a (time)series of fields. The scipy implementation supports application along any axis and is therefore preferred, despite being a little slower.