Missing output schema definition and inconsistencies within a specific version. #122

Open
observingClouds opened this issue Nov 11, 2024 · 5 comments

@observingClouds

Describe the bug
Hi everyone,

I wanted to give some of the datasets indexed in the anemoi-catalog a spin. At first I thought they were not compatible with xarray, but after testing a few of them, some turn out to be compatible. However, the dataset version attribute does not seem to indicate whether or not a dataset is compatible. Are the dataset schema changes (or the dataset schema in the first place) documented somewhere? Or is the dataset version not the right indicator?

Version number
N/A

To Reproduce
Examples:

>>> import xarray as xr
>>> xr.open_dataset("/home/mlx/ai-ml/datasets/experimental/cerra-rr-an-oper-0001-mars-5p5km-1984-2020-6h-v2-hmsi.zarr")  # version 0.20 according to the catalog
KeyError: 'Zarr object is missing the attribute `_ARRAY_DIMENSIONS` and the NCZarr metadata, which are required for xarray to determine variable dimensions.'

>>> xr.open_dataset("/home/mlx/ai-ml/datasets/stable/aifs-ea-an-oper-0001-mars-n320-1979-2023-6h-v2-precipitations.zarr")
KeyError: 'Zarr object is missing the attribute `_ARRAY_DIMENSIONS` and the NCZarr metadata, which are required for xarray to determine variable dimensions.'

>>> xr.open_dataset("/home/mlx/ai-ml/datasets/aifs-rd-an-wave-idz8-mars-n320-1979-2023-6h-v3-wave.zarr")
<xarray.Dataset> Size: 2TB
Dimensions:                             (variable: 14, time: 65744,
                                         ensemble: 1, cell: 542080)
Dimensions without coordinates: variable, time, ensemble, cell
Data variables: (12/36)
    count                               (variable) float64 112B ...
    data                                (time, variable, ensemble, cell) float32 2TB ...
    dates                               (time) datetime64[s] 526kB ...
    has_nans                            (variable) object 112B ...
    latitudes                           (cell) float64 4MB ...
    longitudes                          (cell) float64 4MB ...
    ...                                  ...
    statistics_tendencies_6h_minimum    (variable) float64 112B ...
    statistics_tendencies_6h_squares    (variable) float64 112B ...
    statistics_tendencies_6h_stdev      (variable) float64 112B ...
    statistics_tendencies_6h_sums       (variable) float64 112B ...
    stdev                               (variable) float64 112B ...
    sums                                (variable) float64 112B ...
Attributes: (12/27)
    allow_nans:              False
    attribution:             ECMWF
    data_request:            {'area': [89.785, 0.0, -89.785, 359.719], 'grid'...
    description:             Additional wave fields from the ERA5-forced wave...
    end_date:                2023-12-31T18:00:00
    ensemble_dimension:      1
    ...                      ...
    total_number_of_files:   65875
    total_size:              644578521631
    uuid:                    4ecaab42-0079-4a28-bcaa-a2edb9693706
    variables:               ['cdww', 'ci', 'h1012', 'h1214', 'h1417', 'h1721...
    variables_with_nans:     ['swh', 'cdww', 'mwp', 'mwd', 'wmb', 'ci', 'icet...
    version:                 0.20

>>> xr.open_dataset("/home/mlx/ai-ml/datasets/aifs-rd-an-oper-i6aj-mars-n400-2010-2022-6h-v1-ecland.zarr")
<xarray.Dataset> Size: 4TB
Dimensions:                             (variable: 58, time: 18991,
                                         ensemble: 1, cell: 843490)
Dimensions without coordinates: variable, time, ensemble, cell
Data variables: (12/36)
    count                               (variable) float64 464B ...
    data                                (time, variable, ensemble, cell) float32 4TB ...
    dates                               (time) datetime64[s] 152kB ...
    has_nans                            (variable) object 464B ...
    latitudes                           (cell) float64 7MB ...
    longitudes                          (cell) float64 7MB ...
    ...                                  ...
    statistics_tendencies_6h_minimum    (variable) float64 464B ...
    statistics_tendencies_6h_squares    (variable) float64 464B ...
    statistics_tendencies_6h_stdev      (variable) float64 464B ...
    statistics_tendencies_6h_sums       (variable) float64 464B ...
    stdev                               (variable) float64 464B ...
    sums                                (variable) float64 464B ...
Attributes: (12/30)
    allow_nans:              False
    attribution:             ECMWF
    constant_fields:         ['gwus']
    data_request:            {'area': [89.828, 0.0, -89.828, 359.775], 'grid'...
    description:             Dataset containing land surface based ecland mod...
    end_date:                2022-12-31T18:00:00
    ...                      ...
    total_size:              620386108419
    uuid:                    b3df63f7-27a6-43b0-8d1f-720ed8ee2d5f
    variables:               ['10u', '10v', '2d', '2t', 'aco2gpp', 'aco2nee',...
    variables_metadata:      {'10u': {'mars': {'class': 'rd', 'date': 2010010...
    variables_with_nans:     True
    version:                 0.30
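
The `KeyError` above refers to the `_ARRAY_DIMENSIONS` attribute that xarray looks for on each Zarr array. As a rough sketch, the check below lists which arrays in a store lack it, without opening the data itself. It assumes the store is a Zarr v2 directory on a local filesystem (each array directory holding `.zarray` and `.zattrs` files); the function name `arrays_missing_dimensions` is made up for this illustration and is not part of anemoi or xarray.

```python
import json
from pathlib import Path


def arrays_missing_dimensions(store_path):
    """Return the names of Zarr v2 arrays that lack `_ARRAY_DIMENSIONS`."""
    store = Path(store_path)
    missing = []
    # Assumption: Zarr v2 directory layout, where every array directory
    # contains a `.zarray` file and, optionally, a `.zattrs` file with
    # the user attributes (this is where `_ARRAY_DIMENSIONS` is stored).
    for zarray in sorted(store.rglob(".zarray")):
        array_dir = zarray.parent
        zattrs = array_dir / ".zattrs"
        attrs = json.loads(zattrs.read_text()) if zattrs.exists() else {}
        if "_ARRAY_DIMENSIONS" not in attrs:
            missing.append(array_dir.relative_to(store).as_posix())
    return missing
```

An empty list would suggest that xarray can at least determine the dimensions of every array; a non-empty list names the arrays that trigger the error.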

URL to sample input data
Provide a URL to a sample input data, or attach a file to that report if it is small enough.

Expected behavior
I expected the output format to remain the same between dataset versions.

@observingClouds observingClouds changed the title Datasets of same output schema version differ in structure. Datasets with same output schema version differ in structure. Nov 11, 2024
@floriankrb
Member

Opening an anemoi dataset with xarray is not documented and not supported.
The recommended interface to read the dataset is anemoi.datasets.open_dataset(...), which is documented and supported.

This being said, when a dataset has version 0.30, I would expect xarray to be able to read it. When it is lower than 0.30, the behaviour is undefined. Hopefully, this (undocumented) functionality will still work in the future and you will be able to read the datasets with higher version numbers. However, as this is not currently tested, it can break silently.
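
The rule of thumb above (0.30 and higher is expected to be readable, anything lower is undefined) could be written down as a small helper. This is only a sketch: it assumes the `version` attribute is a string of dot-separated integers, as the `0.20` and `0.30` values printed in this issue suggest, and the function names are hypothetical. Since the xarray path is untested, a `True` result is an expectation rather than a guarantee.

```python
def parse_version(version):
    """Split a version string such as "0.30" into a tuple of ints, (0, 30)."""
    # Assumption: the attribute is a string. A float 0.3 would be parsed
    # as (0, 3) and compare incorrectly against (0, 30).
    return tuple(int(part) for part in str(version).split("."))


def xarray_read_expected(version, minimum="0.30"):
    """Return True if the dataset version is at or above the given minimum."""
    return parse_version(version) >= parse_version(minimum)


print(xarray_read_expected("0.20"))  # False
print(xarray_read_expected("0.30"))  # True
```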

@floriankrb
Member

Is there any specific reason why anemoi.datasets.open_dataset(...) cannot do what you need to do? What are your actual requirements when using the dataset? Do you have a use case that is not currently covered?

@observingClouds
Author

Hi @floriankrb, thanks for your quick response. Please see my answers to your additional questions below:

Specific reason not to use anemoi.datasets.open_dataset(...)

  • familiarity with the popular xarray package
    • most operations seem to mimic or reinvent xarray operations that I am more familiar with
  • The API of anemoi.datasets.open_dataset does not seem to be defined. I cannot find docstrings, API documentation or explicit keyword arguments.
  • easy usage with other packages from the xarray/pangeo software stack, e.g. for plotting, manipulating, ...
  • using the datasets for other analyses outside of the anemoi framework (e.g. using the recipes to retrieve MARS data in zarr format)

Requirements

  • support for well-established software stacks, like xarray
  • dataset schema is well defined and as such predictable, e.g. if version==0.30, _ARRAY_DIMENSIONS are given. This would then make it possible to create downstream applications. Maybe one could create a test to ensure that future changes to anemoi.datasets do adhere to one schema version?
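
A test of the kind suggested in the last bullet might look roughly like the sketch below. The expected dimension names are copied from the xarray output earlier in this issue and are not an official anemoi schema; `EXPECTED_DIMENSIONS` and `schema_violations` are hypothetical names, and the store is assumed to be a local Zarr v2 directory.

```python
import json
from pathlib import Path

# Hypothetical schema for illustration only: the dimension names are taken
# from the xarray output shown in this issue, not from any anemoi specification.
EXPECTED_DIMENSIONS = {
    "data": ["time", "variable", "ensemble", "cell"],
    "dates": ["time"],
    "latitudes": ["cell"],
    "longitudes": ["cell"],
}


def schema_violations(store_path, expected=EXPECTED_DIMENSIONS):
    """Return readable mismatches between a Zarr v2 store and `expected`."""
    store = Path(store_path)
    problems = []
    for name, dims in expected.items():
        # Assumption: Zarr v2 layout, attributes stored in `<array>/.zattrs`.
        zattrs = store / name / ".zattrs"
        if not zattrs.exists():
            problems.append(f"{name}: no .zattrs file found")
            continue
        found = json.loads(zattrs.read_text()).get("_ARRAY_DIMENSIONS")
        if found != dims:
            problems.append(f"{name}: expected {dims}, found {found}")
    return problems
```

A test suite could then assert that `schema_violations(path)` is empty for every dataset built with a given schema version.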

This is more of a comment, but I would like to see anemoi make more use of the available software stack and document its API and schemas, so that the package is easier to use and potentially contribute to.

@floriankrb
Member

> API of anemoi.datasets.open_dataset does not seem to be defined. I cannot find docstrings, API documentation or explicit keyword arguments

It may be easier to look at the documentation; there are quite a few pages dedicated to open_dataset: https://anemoi-datasets.readthedocs.io/en/latest/using/opening.html

Thanks for the detailed feedback. I understand that plotting is a lot easier with a multi-purpose tool such as xarray, and I also appreciate that it is easier to stick to one tool when you are used to it. This is something to keep in mind when shaping the future of anemoi.

@observingClouds
Author

observingClouds commented Feb 24, 2025

I would like to reopen this, as the original cause, a missing schema definition, has not been addressed and already seems to be causing inconsistencies within versions.

@observingClouds observingClouds changed the title Datasets with same output schema version differ in structure. Missing output schema definition and inconsistencies within a specific version. Feb 24, 2025
@mchantry mchantry reopened this Mar 5, 2025