Return period calculation based on a Gumbel-distribution fit to sample data #29

chpolste · 2025-01-06T14:52:01Z

Adds distributions and extreme_values submodules to earthkit.meteo.stats for computing recurrence statistics based on a Gumbel distribution fit to a data sample. Can be used, e.g., to compute return periods of extreme precipitation or flooding events in a given time interval.

The implementation is class-based, since the fitted distribution parameters need to be stored for subsequent invocations of the statistics calculations. The interface for the distributions so far matches that of scipy's rv_continuous and the distributions are only reimplemented here to allow for fitting of multiple distributions in a vectorised fashion along a given axis (scipy only fits to 1D samples).

The Gumbel fit makes use of scipy's lmoment function, which was only added in the latest 1.15 release, so a vectorized implementation by @corentincarton is included for environments with older versions of scipy. The included implementation only works along the first array axis, but that covers the common use-case of fitting a distribution to a (time)series of fields. The scipy implementation supports application along any axis and is therefore preferred, despite being a little slower.
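
As a rough illustration of the vectorised fit described above, the sketch below estimates Gumbel parameters from the first two sample L-moments along the first array axis, using only NumPy. The names `lmoments_first_axis` and `fit_gumbel` are hypothetical and not taken from the PR. The estimator is the standard unbiased one based on probability-weighted moments, which may differ in detail from the included implementation.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def lmoments_first_axis(sample):
    """First two sample L-moments along axis 0 (hypothetical helper)."""
    x = np.sort(np.asarray(sample, dtype=float), axis=0)
    n = x.shape[0]
    # Weight (i - 1) / (n - 1) for the i-th order statistic, broadcast
    # over any trailing (field) dimensions.
    w = (np.arange(n) / (n - 1)).reshape((n,) + (1,) * (x.ndim - 1))
    b0 = x.mean(axis=0)
    b1 = (w * x).mean(axis=0)
    return b0, 2.0 * b1 - b0


def fit_gumbel(sample):
    """Gumbel location and scale from L-moments, one fit per field point."""
    l1, l2 = lmoments_first_axis(sample)
    sigma = l2 / np.log(2.0)       # lambda_2 = sigma * ln(2)
    mu = l1 - EULER_GAMMA * sigma  # lambda_1 = mu + gamma * sigma
    return mu, sigma


rng = np.random.default_rng(0)
# 2000 time steps of a 3 x 4 field drawn from a Gumbel(mu=20, sigma=5)
data = rng.gumbel(loc=20.0, scale=5.0, size=(2000, 3, 4))
mu, sigma = fit_gumbel(data)
print(mu.shape, sigma.shape)
```

Each point of the 3 x 4 field gets its own pair of parameters from a single sort and two reductions, which is the use case of fitting to a time series of fields mentioned above.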

The plan is to exclusively rely on scipy for the functionality in the future, but for now to provide a substitute that doesn't rely on a version of scipy that is still under development (as of Dec 2024). Co-authored-by: Christopher Polster <[email protected]>

codecov-commenter · 2025-01-06T14:54:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.72%. Comparing base (756dcf2) to head (0a84c03).
Report is 10 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #29      +/-   ##
===========================================
+ Coverage    98.68%   98.72%   +0.04%     
===========================================
  Files            8        8              
  Lines          684      708      +24     
  Branches        26       26              
===========================================
+ Hits           675      699      +24     
  Misses           7        7              
  Partials         2        2

sandorkertesz · 2025-01-24T07:52:38Z

src/earthkit/meteo/stats/array/distributions.py

+        axis: int
+            The axis along which to compute the parameters.
+        """
+        try:


Maybe this should be done only once not every time fit() is called?

chpolste · 2025-01-27T10:03:31Z

I've been going over the interface again, after a discussion with @corentincarton. We had agreed to convert the currently class-based MaximumStatistics into a set of functions, but I'm getting doubts about this interface as I'm implementing it. In short, I still see value in the class with regard to a metadata-aware interface.

E.g., changing the method MaximumStatistics.return_period_of_threshold into a function return_period gives

def return_period(dist, value, freq=1.0):
    return freq / dist.cdf(value)

to be used in the following way:

import numpy as np
import earthkit.meteo as ekm

dist = ekm.stats.MaxGumbel(data)
rp = ekm.stats.return_period(dist, 30., freq=np.timedelta64(500, "s"))

I don't like how this separates the information about the frequency from the data. The frequency is a property of the data and I would see this information being automatically extracted from, e.g., a pandas DataFrame in the eventual metadata-aware interface. This makes more sense in the current class-based interface:

stats = ekm.stats.MaximumStatistics(data, freq=np.timedelta64(500, "s"))
rp = stats.return_period_of_threshold(30.)

In the future, this could just be

stats = ekm.stats.MaximumStatistics(df)
rp = stats.return_period_of_threshold(30.)

with freq being inferred from the index of the DataFrame in the constructor. Of course, we could consider carrying the frequency with the distribution object for the functional interface, but the two are not really related.
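
To make the "frequency is a property of the data" argument concrete, here is a minimal sketch of a constructor that derives the sampling interval from the time coordinate. It uses a NumPy `datetime64` array as a stand-in for a DataFrame index; `SampleWithFreq` and `infer_freq` are hypothetical names and not part of earthkit-meteo.

```python
import numpy as np


def infer_freq(times):
    """Return the constant spacing of a regular time axis as timedelta64."""
    steps = np.diff(np.asarray(times, dtype="datetime64[s]"))
    if steps.size == 0 or not np.all(steps == steps[0]):
        raise ValueError("time axis must be regular with at least 2 entries")
    return steps[0]


class SampleWithFreq:
    """Keeps the sample and the frequency derived from its time axis together."""

    def __init__(self, values, times):
        self.values = np.asarray(values, dtype=float)
        self.freq = infer_freq(times)


# Ten daily time stamps with made-up values
times = np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]")
sample = SampleWithFreq(np.linspace(10.0, 30.0, times.size), times)
print(sample.freq)  # spacing of one day, expressed in seconds
```

With this shape of interface the caller never passes `freq` explicitly, which is the convenience argued for above. The cost is that the statistics code has to understand the time axis.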

tlmquintino · 2025-01-30T15:20:28Z

@chpolste thanks for looking into the comments from @corentincarton.
I do think that the functional implementation is still more in line with the architecture of earthkit.
I see keeping the frequency together with the data as more of a responsibility of the client code that calls earthkit.
Earthkit-meteo is more about implementing algorithms and making them easily accessible, and then letting the client code manage its data structures.

corentincarton · 2025-01-30T15:51:49Z

Thanks @tlmquintino for clarifying! @chpolste, I'm now wondering if we shouldn't keep the frequency close to the sample, i.e. in the distribution class. Wouldn't that make more sense? Let's assume we would build a new distribution from sigma and mu (without the sample), then we would still need the frequency of the sample to use it, right? So we could build the distribution (storing mu, sigma, and freq), then we can create a series of functions taking the distribution as input. What do you think?
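
The proposal above (a distribution object that stores mu, sigma and freq, plus free functions that take it as input) could look roughly like the sketch below. `GumbelParams` and `exceedance_probability` are hypothetical names. The exceedance probability uses the textbook right-skewed Gumbel CDF, which is an assumption about the convention and not the PR's code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class GumbelParams:
    """Gumbel parameters plus the sampling frequency of the fitted sample."""
    mu: float
    sigma: float
    freq: float = 1.0  # e.g. 1.0 for one maximum per year


def exceedance_probability(dist, value):
    """P(X > value) for a right-skewed Gumbel distribution."""
    z = (np.asarray(value, dtype=float) - dist.mu) / dist.sigma
    # 1 - exp(-exp(-z)), written with expm1 for accuracy at small probabilities
    return -np.expm1(-np.exp(-z))


def return_period(dist, value):
    """Average time between exceedances of `value`, in units of dist.freq."""
    return dist.freq / exceedance_probability(dist, value)


# Built directly from parameters, without access to the original sample.
dist = GumbelParams(mu=20.0, sigma=5.0, freq=1.0)
rp = return_period(dist, 30.0)
print(rp)
```

Because `freq` travels with the parameters, the distribution can be rebuilt from stored values of mu and sigma and still produce return periods in the right units.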

corentincarton · 2025-02-24T08:23:31Z

src/earthkit/meteo/stats/array/extreme_values.py

+    return np.expand_dims(arr, axis=list(range(-ndim, 0)))
+
+
+class MaximumValueDistribution:


Shouldn't we call it GumbelDistribution instead?

chpolste · 2025-02-24

I wanted to call it by what it means for a meteorologist rather than what it is mathematically. A meteorologist might not know what a Gumbel distribution is but they will know that they have a time series of maximum values they want to do statistics with. Before, there was an additional layer between the distribution class and return period function that did this translation of terms, but now users instantiate the distribution directly. I decided to keep the maths in the docs and the application in the class name.

corentincarton · 2025-02-24

Yes, but the Gumbel distribution is not the only distribution we can use. So we could implement another approach that does the same thing in the future which will make the name confusing. If we have another approach, MaximumValueDistribution would rather be the name of the base class and GumbelDistribution and OtherDistribution (placeholder) the inherited classes. Unless we go for a distribution_function as an argument in the constructor, but we'll probably end up with a bunch of if statements all over the place.

chpolste · 2025-02-24

Do we match scipy and distinguish between right and left-skewed Gumbel distributions?

chpolste · 2025-02-24

GumbelDistribution for maximum values and, e.g., MirroredGumbelDistribution for minimum values also works.

corentincarton · 2025-02-24T08:23:36Z

src/earthkit/meteo/stats/array/extreme_values.py

+        return self.mu - self.sigma * np.log(-np.log(1.0 - p))
+
+
+def return_period(dist, value):


Maybe value_to_return_period() and return_period_to_value() for the symmetry?
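
The symmetry suggested here can be checked with a round trip. The sketch below uses the proposed function names on bare parameters. Passing `mu`, `sigma` and `freq` directly is a simplification for illustration, since the function in the diff takes a distribution object. The quantile expression from the diff is reused, with `p` read as an exceedance probability, which is an assumption about the PR's convention.

```python
import numpy as np


def value_to_return_period(value, mu, sigma, freq=1.0):
    """Return period of exceeding `value` for a right-skewed Gumbel."""
    z = (np.asarray(value, dtype=float) - mu) / sigma
    p_exceed = -np.expm1(-np.exp(-z))  # 1 - CDF(value)
    return freq / p_exceed


def return_period_to_value(return_period, mu, sigma, freq=1.0):
    """Inverse mapping: the value exceeded once per `return_period`."""
    p_exceed = freq / np.asarray(return_period, dtype=float)
    return mu - sigma * np.log(-np.log(1.0 - p_exceed))


values = np.array([25.0, 30.0, 40.0])
periods = value_to_return_period(values, mu=20.0, sigma=5.0)
roundtrip = return_period_to_value(periods, mu=20.0, sigma=5.0)
print(np.allclose(roundtrip, values))
```

Naming the two functions as a matching pair makes it clear that each undoes the other, which a lone `return_period` does not convey.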

chpolste and others added 6 commits December 19, 2024 18:38

Add recurrence statistics estimation for maxima based on Gumbel distribution

909f63d

Fixes and proper array broadcasting

3aeea78

Improve docstrings

973efad

Add unit tests for MaximumStatistics

49f89d1

Docs and input type fixes

2dbcdbb

chpolste requested review from sandorkertesz and corentincarton January 6, 2025 14:52

chpolste marked this pull request as draft January 22, 2025 15:25

sandorkertesz reviewed Jan 24, 2025

chpolste added 2 commits February 20, 2025 22:39

Redo distribution and return period interface

2e6f0f7

Add return period example to docs

e04512b

corentincarton reviewed Feb 24, 2025

Rename distribution and return period functions

0a84c03
