selecting terminology: steps, horizons, and intervals #1

Closed
cboettig opened this issue Aug 17, 2022 · 23 comments

@cboettig
Contributor

Thanks @m-mohr for starting this, looks like an excellent beginning here. Hope it's okay to open a thread to discuss some terminology. I really like how you have both forecast:datetime and forecast:step, with one of the two being required.

I'm not sure that step is the ideal term. The term 'step' to me at least implies a uniform interval, i.e. that the forecast is using a 3H step size, rather than that the forecast:datetime occurs 3H after the forecast:start_datetime. Also some forecasts, like NOAA's Global Ensemble Forecast System (GEFS), use a variable step size, moving from 3H to 6H interval after the first 10 days. I think it is common to use the term "horizon" as the difference between start time and current time, and step as the difference between subsequent observations.

More substantively, it's not clear how to report forecasts of values that are defined only over intervals of time rather than instants. For instance, in GEFS, some values are forecast at the step, whereas others refer to interval predictions, such as the amount of rain accumulating in a 3H-6H interval, or the average, min, or max value during an interval. GEFS calls this column "forecast valid", and uses a text-based set of values like "3hr forecast", "0 -3hr acc", or "0-3 hr ave" to distinguish, which I agree is not very machine readable (see https://www.nco.ncep.noaa.gov/pmb/products/gens/gep01.t00z.pgrb2a.0p50.f003.shtml). Still, we certainly want to be able to express things like "rainfall accumulation" in this standard. (While such values could be converted to a rate, that's really not the same, and obviously NOAA has put some thought into this.) Maybe an additional field is necessary in these contexts.

@aaronspring

The CF standard name table (http://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html) uses forecast_period as an attribute:

Forecast period is the time interval between the forecast reference time and the validity time. A period is an interval of time, or the time-period of an oscillation.

In https://climpred.readthedocs.io/en/stable/index.html we call this dimension lead, in s2s-ai https://s2s-ai-challenge.github.io/#data lead_time.

@aaronspring

The cf conventions also allow for intervals via bnds/ boundaries http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#cell-boundaries

@m-mohr
Contributor

m-mohr commented Aug 18, 2022

I'm happy to adopt other names, I really had no clue when choosing them and I was inspired by an ECMWF implementation with step.

Maybe it would help (me) to clarify the different types of temporal information that are available. Do I understand correctly that:

  1. There's a date/time at which the forecast is made / "captured" (here: datetime). Example: today at 0:00. Q: Can this be an interval? So does it make sense to allow start_datetime and end_datetime?
  2. There's a date/time for which the forecast predicts something. This can be an interval. Example: today at 12:00 or today from 12:00 until 18:00. For a specific instance this is currently forecast:datetime or forecast:step, but we may want to add a new field or changed functionality for an interval?
  3. The "valid" datetime is currently expires. Does this need to have a specific field or is that always equal to one of the other times?

@aaronspring

aaronspring commented Aug 18, 2022

  1. is equivalent to netcdf forecast_reference_time (when forecast was done/initialised/started) and is only datetime, not interval
  2. is valid_time= forecast_reference_time+step/forecast_period when forecast predicts something. Can be datetime for instantaneous temperature or interval for precip accumulation
  3. EDITED: forecast_period means time since forecast_reference_time and can be instantaneous Timedelta (1 day) or timedelta bounds [1 day, 2 day].
  4. Haven’t seen anything like valid in terms of expired or not

m-mohr added a commit that referenced this issue Aug 18, 2022
@m-mohr
Contributor

m-mohr commented Aug 18, 2022

Thanks, I've updated the extension a bit. I renamed the step to horizon and added forecast:accumulation_period. Not sure whether it's the best term and I'm happy to change it. Does this reflect at least some of the discussions here?

@aaronspring I did not understand your point 3, sorry. Can you give an example? How can there be a time before the forecast is started?

Disclaimer: I'm from another domain and have basically no clue about forecasts...

@aaronspring

Edited 3.

Example instantaneous:
valid_time = forecast_reference_time + forecast_period
02-01-2000 = 01-01-2000 + 1 day
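
The relationship above can be sketched with Python's standard library. This is only an illustration: the variable names follow the CF terms used in this thread, the dates are read as day-month-year (1 January 2000 plus one day), and the bounds example mirrors the "[1 day, 2 day]" case from point 3.

```python
from datetime import datetime, timedelta, timezone

# Time at which the forecast was initialised (CF: forecast_reference_time).
forecast_reference_time = datetime(2000, 1, 1, tzinfo=timezone.utc)

# Time elapsed since initialisation (CF: forecast_period).
forecast_period = timedelta(days=1)

# Instantaneous case: a single valid time.
valid_time = forecast_reference_time + forecast_period
print(valid_time.isoformat())  # 2000-01-02T00:00:00+00:00

# Interval case: forecast_period given as bounds [1 day, 2 days],
# e.g. for an accumulation such as precipitation.
period_bounds = (timedelta(days=1), timedelta(days=2))
valid_bounds = tuple(forecast_reference_time + p for p in period_bounds)
print([t.isoformat() for t in valid_bounds])
```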

@aaronspring

https://confluence.ecmwf.int/display/COPSRV/Guide+to+NetCDF+encoding+for+C3S+providers under time coordinates is a similar overview

@cboettig
Contributor Author

Thanks @m-mohr , this is looking very good. I really like your documentation as well.

If I follow correctly, you're mapping the standard STAC notion of datetime to the "forecast_reference_time", i.e. the date-time the forecast was produced. I think it would make more sense to map the existing stac term of datetime (i.e. that we use for observational data), to the "valid time", or what is currently called forecast:datetime in the proposal. The date the forecast was produced would then be the timescale which gets a new name under the standard, e.g. maybe forecast:reference_datetime?

This would more closely match the CF conventions @aaronspring cites above -- in which the valid time is called time, to distinguish it from the forecast_reference_time:

The forecast reference time in NWP is the "data time", the time of the analysis from which the forecast was made. It is not the time for which the forecast is valid; the standard name of time should be used for that time.

Perhaps more importantly, this would mean that the datetime in a forecast matches the datetime on the observational data you would compare the forecast against (once the observation was available). Mechanically, in a relational data sense, you could imagine literally doing a JOIN on the observation data and the forecast by the datetime column.
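
A rough sketch of that join, assuming datetime holds the valid time on both sides. The records, values, and field names below are made up for illustration and are not taken from any real dataset or from the extension itself.

```python
from datetime import datetime, timezone

def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)

# Hypothetical observations, keyed by the time they were measured.
observations = {
    utc(2022, 8, 17, 6): 14.2,
    utc(2022, 8, 17, 12): 19.8,
}

# Hypothetical forecasts; "datetime" is the valid time, as proposed above.
forecasts = [
    {"reference_datetime": utc(2022, 8, 17, 0), "datetime": utc(2022, 8, 17, 6), "value": 13.5},
    {"reference_datetime": utc(2022, 8, 17, 0), "datetime": utc(2022, 8, 17, 12), "value": 21.0},
    {"reference_datetime": utc(2022, 8, 17, 0), "datetime": utc(2022, 8, 17, 18), "value": 17.1},
]

# Inner join on datetime: keep only forecasts with a matching observation.
joined = [
    {**f, "observed": observations[f["datetime"]],
     "error": f["value"] - observations[f["datetime"]]}
    for f in forecasts
    if f["datetime"] in observations
]

for row in joined:
    print(row["datetime"].isoformat(), row["value"], row["observed"])
```

The 18:00 forecast drops out of the result because no observation exists for that time yet, which is the behaviour an inner join on the valid time would give.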

@cboettig
Contributor Author

Not to nitpick, but I'm not sold on the term forecast:accumulation_period in the proposal. In the GEFS metadata, "accumulation" is taken to mean explicitly a sum over an interval (as in "0 -3hr acc" of rainfall, vs a "0 -3hr max" of temp). And from what @aaronspring says, it sounds like there's no CF term for this concept, but the term forecast_period has already been used in CF to indicate what I was calling "horizon" (i.e. time - forecast_reference_time, as Aaron notes above).

I do really like how you note that this field is formatted as an ISO 8601 duration; that makes it clear what the units are. Could we call it forecast:duration instead of forecast:accumulation_period, or is that too ambiguous?

Minor additional comment: I think the standard should be explicit about whether this period applies to the interval before or after the valid time (forecast:datetime). I would argue that we standardize on "before", i.e. following NOAA GEFS, "18 -21hr acc" would become "horizon": "PT21H" (or datetime), and "duration": "PT3H", yes?
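
To make the "before" mapping concrete, here is a small sketch that turns GEFS-style labels into a horizon and a duration. The function name parse_gefs_label and the regular expression are hypothetical and only cover the label spellings quoted in this thread; real GEFS inventories may use other forms.

```python
import re

# Matches e.g. "3hr forecast", "0 -3hr acc", "0-3 hr ave", "18 -21hr acc".
_LABEL = re.compile(r"^\s*(?:(\d+)\s*-\s*)?(\d+)\s*(?:hr|hour)\s+(\w+)\s*$")

def parse_gefs_label(label):
    """Map a GEFS-style label to horizon/duration under the 'before' convention."""
    m = _LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognised label: {label!r}")
    start, end, kind = m.groups()
    end_hours = int(end)
    result = {"horizon": f"PT{end_hours}H", "duration": None, "kind": kind}
    if start is not None:
        # The interval ends at the horizon and extends backwards in time.
        result["duration"] = f"PT{end_hours - int(start)}H"
    return result

print(parse_gefs_label("18 -21hr acc"))
print(parse_gefs_label("3hr forecast"))
```

Under this convention the horizon always points at the end of the interval, so instantaneous and aggregated bands in the same asset share one horizon value.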

@m-mohr
Contributor

m-mohr commented Aug 22, 2022

  • I agree with the datetime issue you raised and we can happily change that.
  • For the new field for the reference datetime I was wondering whether forecast:created would be something that would be easier to understand across domains and created is already defined in STAC as the time when the data has been created. It might be just me as a non-native speaker, but without additional context, I would not know what a "reference datetime" is.
  • forecast:accumulation_period was just a placeholder until someone comes up with something new, so I'm happy for new proposals. forecast:duration works for me; another option I just had in mind is "forecast:aggregation_period".
  • Another question is whether we want to re-use the CF terminology forecast:period instead of forecast:horizon?
  • The last point is a good thing to clarify. I'm not sure what is commonly done in this domain, but my intuitive thought was it would need to be PT18H and you add the duration on top. Then it is always additive: forecast:created + forecast:period = beginning of the forecast time / datetime and then datetime + forecast:aggregation_period = end of the forecast time. As someone who is not usually working in this domain it feels intuitive to me, but might not be the common practice. If there's another common practice, I'm happy to adopt that.

@cboettig
Contributor Author

cboettig commented Aug 22, 2022

  • 👍 re datetime

  • agree that reference is not particularly clear, though it does have the existing definition in the CF conventions. created has some ambiguity between the time the forecast model starts (i.e. horizon=0), vs the time a given asset was serialized. Dublin Core, schema.org etc. have a notion of created, which is typically meant to refer to when a particular asset is serialized to disk. This could potentially be quite different than the time at which the forecast starts (especially for "forecasts" that may start in the past).

  • Cool, let's hear more voices on forecast:duration or something else. aggregation_period is also fine with me.

  • Yup, I'm happy to stick with CF forecast:period, though personally it sounds more ambiguous to me (i.e. might be mistaken for the 'aggregation period' or whatever we're calling it). Again would be great to hear from others.

  • I do see where you are coming from with "plus" being the natural operation, e.g. datetime + aggregation_period. However, my intuition on going with "before" rather than "after" is because that's what NOAA is already doing, e.g. with GEFS: note that instantaneous values are for the current interval ("3 hour fcst"), while aggregations are for the previous period ("0-3 hour acc") for the different bands in the same asset. This makes sense if you think that the asset overall is indexed by its "horizon". The asset with the longest horizon for instantaneous values has a horizon of 840 hrs, but if I wanted to define the accumulations after (i.e. with +, ranging from datetime to datetime + aggregation) instead of before (with -, a range of datetime - aggregation : datetime), then I'd need the accumulations to extend another interval beyond the maximum datetime produced (i.e. beyond the maximum horizon, to 846 hrs). I think it doesn't really make sense to say that the forecast of aggregations should extend farther into the future than the forecast of instantaneous values. Does that make sense?

@aaronspring

agree that reference is not particularly clear, though it does have the existing definition in the CF conventions. created has some ambiguity between the time the forecast model starts (i.e. horizon=0), vs the time a given asset was serialized. (i.e. dublin core, schema.org etc have notion of created, which is typically meant to refer to when a particular asset is serialized to disk. This could potentially be quite different than the time at which the forecast starts (especially for "forecasts" that may start in the past).

So created could also mean the file creation timestamp.

aggregation

what I am still thinking about is multiple ways to express and think of aggregation. I think for aggregation we should add or include bounds as in 3-6 hour acc (which could also be a coordinate lead_bnds(lead) in xarray). Just wondering whether this string format fits or another keyword is needed. Otherwise with just 6h acc it could mean 0-6 hour acc or 3-6 hour acc if the previous timestep was 0-3 hours acc

Cool, let's hear more voices on forecast:duration or something else. aggregation_period is also fine with me.

Regarding aggregation_period: is this set to Null or None for instantaneous variables like inst. temp? Does aggregation_period also apply to daily temp, or should it rather be averaging_period in that case? If this creates too many keywords, forecast_period=duration gracefully includes all (agg, avg, inst) IMO. I think an example dataset #4 or even a few would help our discussion here.

beyond the maximum horizon, to 846 hrs

what about monthly lead forecasts such as http://iridl.ldeo.columbia.edu/expert/SOURCES/.Models/.NMME/?

@rqthomas

I am working with @cboettig on the ecological forecasting applications of these standards. Some thoughts

  • regarding the "before" vs "after" for the duration. The "after" concept does not mean that the forecast of non-instantaneous variables (e.g. precipitation rate) extends beyond the maximum horizon of the instantaneous variables (e.g. temperature). Rather it means that there will be an "NA" for the last duration of any non-instantaneous variables. NOAA uses the concept of "before" and has the NA on the first time-step (actually, they just exclude the non-instantaneous variables from the horizon = 0 grib file - which means the bands are different). So choosing "before" vs. "after" is a matter of choosing where to place the NA on non-instantaneous variables and just needs to be clearly defined.

@cboettig
Contributor Author

cboettig commented Aug 23, 2022

Good points everyone. Regarding 'before' or 'after', @rqthomas is right that it's just about where the NA goes, sorry. As different forecast assets may already follow different conventions, a metadata standard may be better placed to have the vocabulary to describe both rather than insisting on one?

Using @m-mohr 's ISO-8601 duration format, maybe we could simply add a + or - to indicate lead or lag? e.g. "duration": "+PT30M" or "-PT30M"? @aaronspring I don't think the GEFS's purely string patterns like "0-6 hour acc" are as useful, since there are plenty of tools that can parse an ISO-8601 duration string into a difftime, but arbitrary strings would have to be handled rather manually? I don't think this is incompatible with serializations like xarray that could encode this as lead_bnds(lead) like you say -- seems okay to me for stac's JSON-based metadata to describe the concept with an ISO-8601 duration while the asset itself can embed a richer data type?
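
As a sketch of how a signed duration could be interpreted, the snippet below turns "+PT30M" or "-PT30M" into an interval around the valid time. Whether a leading sign is valid ISO 8601 is left open in this thread, so the sign handling is an assumption. The helper names parse_signed_duration and accumulation_interval are hypothetical, and the parser only handles days, hours, minutes, and seconds.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified subset of ISO 8601 durations, with an optional leading sign.
_DURATION = re.compile(
    r"^(?P<sign>[+-])?P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)
_FIELDS = ("days", "hours", "minutes", "seconds")

def parse_signed_duration(text):
    """Return a timedelta; a leading '-' yields a negative value."""
    m = _DURATION.match(text)
    if m is None or all(m.group(f) is None for f in _FIELDS):
        raise ValueError(f"unsupported duration: {text!r}")
    delta = timedelta(**{f: int(m.group(f) or 0) for f in _FIELDS})
    return -delta if m.group("sign") == "-" else delta

def accumulation_interval(valid_time, duration_text):
    """Return (start, end) of the interval described by a signed duration."""
    other_end = valid_time + parse_signed_duration(duration_text)
    return (min(valid_time, other_end), max(valid_time, other_end))

valid = datetime(2022, 8, 23, 12, 0, tzinfo=timezone.utc)
print(accumulation_interval(valid, "-PT30M"))  # interval ends at the valid time
print(accumulation_interval(valid, "+PT30M"))  # interval starts at the valid time
```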

Re aggregation for instantaneous variables, I like @m-mohr 's suggestion of using PT0S (zero seconds) for instantaneous aggregations, though this field would also be optional (which would be taken to imply instantaneous values).

@aaronspring makes a good point about daily (or even longer intervals), which are implicitly durations / intervals. I think existing conventions are fine here -- e.g. datetimes would probably be listed to date-precision only. I imagine such assets would just omit the duration field (though they could arguably set it to the 'step size', P1D to be more explicit). More precisely, the convention is that 'instantaneous' is assumed to mean a duration equal to the precision of the prediction datetime unless otherwise specified.

I think we're still deciding on the term for this "duration" field? I agree with @aaronspring 's idea that "forecast_period" would seem to cover all the aggregations (agg, ave, min, max, or any other aggregating function) -- except as Aaron points out, CF has already given a different definition to that very term, as datetime - reference_datetime (aka horizon).

I think we're also still deciding on if we're ok with reference_datetime as the date the forecast is made. I think @m-mohr 's draft suggests forecast:start_time could also be used? I actually kinda like that, it does seem a bit more intuitive than reference_datetime ....

@aaronspring

aaronspring commented Aug 28, 2022

For the planetary computer (also in a STAC catalog) they already have step and this format P0DT1H0M0S

https://planetarycomputer.microsoft.com/dataset/era5-pds

@m-mohr
Contributor

m-mohr commented Aug 29, 2022

We're discussing a bit too much at the same time here, which is a bit confusing to me, but I'll try to wade through it now.

The first thing I've done is to swap the definitions for forecast and reference datetime as proposed (also in #7). I've also changed forecast:datetime to forecast:reference_datetime at the same time and clarified how start_datetime and end_datetime should be used; I think there was some confusion around it (also it doesn't have a forecast: prefix). I now also understand that created and the forecast:reference_datetime are not necessarily the same, so let's not consider it anymore.

I agree that forecast:period is somewhat unclear, especially as there are two periods to be defined (right now the "horizon" and the "accumulation period"). We should ensure that these are easy to distinguish.

The forecast:accumulation_period seemed not so well received so I updated that to forecast:duration for now, but I'm not sure whether that was actually the consensus. I'm happy to change that again. Regarding the question whether it should/must be set to null (or None): No, it is simply not provided at all. If for whatever technical reasons you can't omit it, you can set it to PT0S though.

I must admit I did not fully get the "before" vs "after" discussion yet. I need to read through it again tomorrow when it's not so close to midnight. ;-)

Regarding the granularity of datetimes and the maximum value for the horizon or duration: I don't know where this limitation comes from, but it's actually not in the spec. The datetimes have a millisecond precision and the horizon/durations (i.e. the ISO durations) can pretty much be any length up to thousands of years if you like. ;-) So I don't see an issue there.

For the planetary computer (also in a STAC catalog) they already have step and this format P0DT1H0M0S

It's exactly the same format that we use here, they just provide some optional 0's. You can also simplify to PT1H.

Did I forget anything or understood something wrong? Thoughts? Looking forward to them!

Thanks for all the discussions, I appreciate all of them!

@cboettig
Contributor Author

🎉 thanks @m-mohr , I think this is looking really good.

The forecast-specific terms, forecast:reference_datetime, forecast:horizon, and forecast:duration feel good to me. Your description makes it clear that forecast:horizon is a time interval that goes from reference_datetime to reference_datetime + horizon. I think we should state that duration is defined relative to the datetime of the observation. I also propose that duration should include a + or - sign, e.g. "-PT6H" would mean an accumulation period (duration) of datetime to datetime - PT6H, i.e. accumulation is defined as starting prior to the datetime (the way GEFS does it). Other forecasts define duration as starting at the datetime and going forward, and so would use + (or possibly just be unsigned?). Does that make sense? Is it okay to tack a sign onto an ISO8601 duration? (It seems to work in the software I tested, but I'm not sure if that is a real convention.)

I think start_datetime / end_datetime are a little confusing to me -- i.e. maybe not obvious how these are distinct from the notion of an accumulation period in a forecast. Are these just another way of saying the same thing? (i.e. it must always be true that start_datetime == datetime, and end_datetime == start_datetime + duration?)

Okay, good call about datetimes with different precision really just being equivalent. So wrt to Aaron's question about 'daily' forecasts, etc, then I guess a duration of P1D would be required? (e.g. to distinguish between a forecast made 'for that day', e.g. an average temperature, vs for that precise moment of midnight that day).

Overall this is great, feeling pretty good about the metadata descriptions for all these temporal components. (Though it will definitely help to flesh out some examples, as in #4.)

@chris-little

Just to give a bit of background: most forecast naming conventions, whether from ECMWF, NOAA, CF-NetCDF, WMO, etc., have been devised for meteorologists communicating with meteorologists.

When we devised the OGC WMS Time and Elevation Best Practice about 10 years ago, it took some effort to realise that non-expert end users wanted a forecast/hindcast/nowcast for a datetime, or perhaps a specific interval, and are not really interested in whether it was a 2 day or 30 hour or a 15 minute forecast, started at a specific time. The primary datetime for them is the valid time.

Generally, only the expert providers of the "best data" are concerned about all the other times, and think in terms of T0+hh.

And of course the valid time could be confused with validity period, widely used in aviation, indicating when the forecast can be legally used - not too late, and not too soon either.

HTH, and apologies if the background is well-known.

@m-mohr
Contributor

m-mohr commented Aug 30, 2022

I think we should state that duration is defined relative to the datetime of the observation. I also propose that duration should include a + or - sign, e.g. "-PT6H" would mean an accumulation period (duration) of datetime to datetime - PT6H, i.e. accumulation is defined as starting prior to the datetime (the way GEFS does it). Other forecasts define duration as starting at the datetime and going forward, and so would use + (or possibly just be unsigned?) Does that make sense? Is it okay to tack a sign onto an ISO8601 duration? (it seems to work in the software I tested, but not sure if that is a real convention).

I don't think it's a good idea to add a +/- if it's not part of the ISO8601 spec (which I don't know). On the other hand, it is also not required I think. I think this comes from not enough documentation around datetime and start/end_datetime.

The extension right now is meant to always use start/end_datetime if the duration is specified and the datetime is set to null (as of now, see #9). Having the start and end datetimes, you don't need these +/- because you already have the start and end defined, right?

I think start_datetime / end_datetime are a little confusing to me -- i.e. maybe not obvious how these are distinct from the notion of an accumulation period in a forecast. are these just another way of saying the same thing?

Yes, basically duration is meant to express the difference between the start and end datetimes. So basically start_datetime + duration = end_datetime. datetime itself is not set yet, but we can discuss whether we should set it to something useful, e.g. the start_datetime. In STAC, whenever start and end datetimes are set, the individual domains that author an extension need to check what could be a good use for the datetime field and propose its usage. If there's nothing that is useful, it is set to null. For example, for satellite imagery that has a long capture time, datetime is usually set to the center datetime. I hope you get the idea, but maybe it needs more details to be easier to understand. Let me know please.
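
The relation start_datetime + duration = end_datetime can be checked mechanically. The sketch below uses a made-up set of Item properties (the timestamps and the PT6H value are illustrative, not from a real catalog) and only handles whole-hour durations; parse_rfc3339 is a hypothetical helper.

```python
from datetime import datetime

# Hypothetical Item properties using the field names discussed in this thread.
properties = {
    "datetime": None,
    "start_datetime": "2022-08-30T18:00:00Z",
    "end_datetime": "2022-08-31T00:00:00Z",
    "forecast:duration": "PT6H",
}

def parse_rfc3339(text):
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(text.replace("Z", "+00:00"))

start = parse_rfc3339(properties["start_datetime"])
end = parse_rfc3339(properties["end_datetime"])
elapsed = end - start

# Derive the duration from the timestamps (whole hours only in this sketch).
derived = f"PT{int(elapsed.total_seconds() // 3600)}H"
consistent = derived == properties["forecast:duration"]
print(derived, consistent)

# One possible choice for datetime, by analogy with the satellite example:
center = start + elapsed / 2
print(center.isoformat())
```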

Okay, good call about datetimes with different precision really just being equivalent. So wrt to Aaron's question about 'daily' forecasts, etc, then I guess a duration of P1D would be required?

If I understand it correctly, yes.

I can't fully follow @chris-little's comments, but it feels like we already follow the proposal because we set the valid time to be datetime or start/end_datetime, which are the most prominent/used datetimes in STAC. Right?

@chris-little

@m-mohr

it feels like we already follow the proposal because we set the valid time to be datetime or start/end_datetime, which are the most prominent/used datetimes in STAC.

Right!

Another suggestion: ISO8601 is basically a notation overlaid with calendar rules. It has many options, so the IETF restrictive profile for timestamps, RFC3339, is more useful. But RFC3339 very very carefully does not mention durations. There are sound mathematical and semantic reasons for specifying intervals using the start and end points, rather than one endpoint plus or minus a duration. That is because 'subtracting' timestamps to get a duration, or adding a duration to a startpoint to get an endpoint, relies on software to have implemented calendar algorithms completely correctly. Also, ontologies used for reasoning can work better with intervals than with durations.

When one uses durations in, for example, a forecast: T+00, +3 hours, +6, ...+72 hours, etc, technically, that is a different timescale, with origin at T+00, and a Unit of Measure of hours and normal rules of arithmetic. datetime, as in ISO8601/IETF RFC3339, is calendar based and a different 'thing', a social construct.

I find keeping that distinction between temporal 'coordinates' and Calendars helpful.
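
A small illustration of that distinction, using only the standard library. Lead times counted in hours from T+00 follow ordinary arithmetic, while a calendar unit such as "one month" has no fixed length, so an interval is only pinned down by its start and end points. The dates are arbitrary examples.

```python
from datetime import date, datetime, timedelta, timezone

# Temporal 'coordinate': hours since T+00, with normal arithmetic.
t0 = datetime(2022, 8, 31, 0, 0, tzinfo=timezone.utc)
leads = [timedelta(hours=h) for h in range(0, 73, 3)]  # T+00, +3, ..., +72
valid_times = [t0 + lead for lead in leads]
print(valid_times[-1].isoformat())

# Calendar: "one month" covers a different number of days depending on
# where it starts, so it cannot be treated as a fixed-length duration.
january = (date(2022, 2, 1) - date(2022, 1, 1)).days
february = (date(2022, 3, 1) - date(2022, 2, 1)).days
print(january, february)  # 31 28
```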

@m-mohr
Contributor

m-mohr commented Aug 31, 2022

@chris-little Thanks a lot. The STAC datetimes are all RFC3339 and I think they should be the main source if you want the exact values. The reason why I added a representation of the duration is that it's very useful for search and that's a main focus of STAC. So if you want to search for a forecast with 6h duration it's much easier at the API level to search for forecast:duration==PT6H (although that has other challenges, e.g. one could also specify "P0DT6H0M0S") instead of trying to do math with the timestamps. If the ISO durations are problematic we could also try to go with numerical values, but I'm not sure what the best unit would be. Hours? Seconds? Either way, both seem to have issues and we need to figure out the best way forward. The most obvious choice was ISO periods and that's why I chose them for now, but I'm happy to change it.
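
One way around the "PT6H" vs "P0DT6H0M0S" problem would be to normalise durations before comparing them. The sketch below is an assumption about how an implementation might do that, not part of the extension: duration_seconds and canonical_duration are hypothetical helpers, only day/hour/minute/second components are supported, and a day is treated as exactly 24 hours.

```python
import re

# Simplified subset of ISO 8601 durations: PnDTnHnMnS (no weeks/months/years).
_ISO_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")

def duration_seconds(text):
    """Convert a duration string to a number of seconds (1 day = 24 h)."""
    m = _ISO_DURATION.match(text)
    if m is None or not any(m.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def canonical_duration(text):
    """Rewrite a duration in a single spelling, dropping zero components."""
    hours, rest = divmod(duration_seconds(text), 3600)
    minutes, seconds = divmod(rest, 60)
    parts = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    return "PT" + (parts or "0S")

print(canonical_duration("P0DT6H0M0S"))  # PT6H
print(duration_seconds("PT6H"))          # 21600
```

Storing either the canonical string or the number of seconds would let an API compare durations by simple equality, at the cost of fixing one unit or spelling in the specification.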

@chris-little

@m-mohr My suggestion then is to offer both approaches, log the queries, then when stats seem reliable, configure to optimise for the popular queries. And when in doubt, keep it simple, as I am not sure who would want to search for a 6 hour forecast!

@m-mohr
Contributor

m-mohr commented Aug 31, 2022

We'll likely offer both, but we can't log it as this is just a specification and we don't know where and how this will be implemented in the end.

We are mostly done with this issue, right? I'd like to focus on more concrete/smaller issues as they are easier to discuss instead of having one issue that tackles multiple concerns. So I'm closing this, but please open new issues if anything else needs to be discussed or improved!

@m-mohr m-mohr closed this as completed Aug 31, 2022

5 participants