@@ -22,13 +22,13 @@ If the xsearch version is incremented you will receive a warning on import:
`xsearch` will create a file in your home directory to support version checking.

-Alternatively, you can download the contents of this repository and import the package (or the `search.py` file).
+Alternatively, you can download the contents of this repository and import the package (or the `search.py` file). In this case, we suggest you watch the repository (in GitHub) so that you know when there are version changes.

`xsearch` is built around metadata logic documented in the [CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CV’s document](https://goo.gl/v1drZl). This logic underpins the filenames and directories written by [CMOR](https://cmor.llnl.gov/) and other CMIP-writing libraries. The [CMIP6 CMOR Tables](https://github.com/PCMDI/cmip6-cmor-tables/tree/master/Tables) follow the same logic (e.g., in Amon, A designates the atmospheric realm and mon the monthly frequency); similar logic underpinned the [CMIP5](https://github.com/PCMDI/cmip5-cmor-tables/tree/master/Tables) and [CMIP3](https://github.com/PCMDI/cmip3-cmor-tables/tree/master/Tables) phases.

### Support

-Note: this software library is designed to help PCMDI users search for CMIP data. If the search works correctly, but points to a dataset that has problems (e.g., a corrupted or incomplete dataset) these issues should not be logged as an issue / bug in this repository.
+Note that this software library is designed to help PCMDI users search for CMIP data. If the search works correctly, but points to a dataset that has problems (e.g., a corrupted or incomplete dataset) these issues should not be logged as an issue / bug in this repository.

Contributions from users of this utility are critical (issue reports and contributed code to improve the package or address bugs).
@@ -47,7 +47,7 @@ Contributions from users of this utility are critical (issue reports and contrib
The search warns that the returned paths include data that spans multiple realms and multiple tables. You can add optional facets to select one realm/table:
The user can also specify a number of optional arguments:
-
-* `deduplicate` (default `True`): flag whether to de-duplicate so that a single dataset path is returned for each unique combination of model and member. If one of multiple possible datasets are chosen, the alternates are listed in `alternate_paths` (see `fullMetadata` option).
-* criteria: The ordering of criteria used to deduplicate. The default is `["version", "timepoints", "nc_creation_date", "esgf_publish", "gr"]`, which orders by a) newest version, number of timesteps, the creation_date embedded in netCDF file metadata, and a prioritization for data that has been published on ESGF and has the glabel `gr`.
-* `printDuplicates` (default `False`): If True, all potential paths will be printed and paths that are chosen will be denoted with an asterisk
-* `verbose` (default `True`): flag to print out information during the search
-* `lcpath` (default `False`): flag to adjust the data directories to reflect the LC (Livermore Computing/LLNL HPC) mountpoint
-* `filterRetired` (default `True`): flag to filter out datasets that are retired (datasets are retired if they are moved)
-* `filterRetracted` (default `True`): flag to filter out datasets that are retracted
-* `filterIgnored` (default `True`): flag to filter out datasets that have been marked as ignored
-* `jsonDir` (`str`): This defaults to the current location for the json metadata files, but the user can modify this
-
-By default, `xsearch` will simply return a list of paths. `xsearch` can also return the full metadata for selected paths by setting `fullMetadata=True`:
+Note that, by default, `dpaths` is a dictionary and that each key is a dataset path. Each dictionary entry contains dataset metadata:
The user can also specify a number of optional arguments:
+
+* `deduplicate` (default `True`): flag whether to de-duplicate so that a single dataset path is returned for each unique combination of model and member. If one of multiple possible datasets is chosen, the alternates are listed in `alternate_paths` (see `fullMetadata` option). Make sure to read the note on de-duplication below.
+* criteria: The ordering of criteria used to deduplicate. The default is `["version", "timepoints", "nc_creation_date", "esgf_publish", "gr"]`, which orders by a) newest version, b) number of timesteps, c) the creation_date embedded in netCDF file metadata, and a prioritization for data that d) has been published on ESGF and e) has the grid label `gr`.
+* `printDuplicates` (default `False`): If `True`, all potential paths will be printed and paths that are chosen will be denoted with an asterisk
+* `verbose` (default `True`): flag to print out information during the search
+* `lcpath` (default `False`): flag to adjust the data directories to reflect the LC (Livermore Computing/LLNL HPC) mountpoint
+* `filterRetired` (default `True`): flag to filter out datasets that are retired (datasets are retired if they are moved)
+* `filterRetracted` (default `True`): flag to filter out datasets that are retracted
+* `filterIgnored` (default `True`): flag to filter out datasets that have been marked as ignored
+* `jsonDir` (`str`): This defaults to the current location for the json metadata files, but the user can modify this
+* `fullMetadata` (default `True`): flag to specify the datatype returned. If `True`, `xsearch` returns a dictionary with complete metadata for the selected paths. `xsearch` can also return a list of paths by setting `fullMetadata=False`.
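As a rough illustration of the two return types described for `fullMetadata`, the sketch below builds a small dictionary by hand rather than calling `xsearch`. The paths, the `model` and `member` field names, and the idea that the list form corresponds to the dictionary keys are assumptions made for this example, not confirmed details of the library.

```python
# Hypothetical stand-in for the dictionary form of a search result.
# The paths and metadata field names are invented for illustration only.
dpaths = {
    "/hypothetical/data/ModelB/r1i1p1f1/v20200101/": {
        "model": "ModelB",
        "member": "r1i1p1f1",
        "alternate_paths": [],
    },
    "/hypothetical/data/ModelA/r1i1p1f1/v20200101/": {
        "model": "ModelA",
        "member": "r1i1p1f1",
        "alternate_paths": ["/hypothetical/data/ModelA/r1i1p1f1/v20190308/"],
    },
}

# Assumed equivalent of the list form: the dictionary keys (sorted for a stable order).
path_list = sorted(dpaths)

# The dictionary form keeps the metadata, so paths can be grouped by any field.
paths_by_model = {}
for path, metadata in dpaths.items():
    paths_by_model.setdefault(metadata["model"], []).append(path)

for model in sorted(paths_by_model):
    print(model, paths_by_model[model])
```

The dictionary form is the more flexible of the two, since a list of paths can always be derived from it while the reverse is not possible.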
+
+### A note on de-duplication
+
+Note that `xsearch` is configured to de-duplicate for each set of paths related to each model and model realization. For example, if you have several versions of data for **Model A** / **r1i1p1f1**, `xsearch` will select the most recent and complete version for you.
+
+But for a given model and realization, datasets can differ in more than just the dataset version. For example, if we search for all variables beginning with `ta` for `E3SM-1-0`:
This isn't typically a problem if you specify an exact experiment, variable, and frequency.
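The ordered de-duplication criteria described above can be pictured as a lexicographic ranking. The following is a minimal sketch under stated assumptions, not the `xsearch` implementation: the function `choose_path`, the candidate paths, and the numeric encoding of each criterion (larger is preferred) are all hypothetical.

```python
# Default ordering of criteria as quoted in the option list above.
DEFAULT_CRITERIA = ("version", "timepoints", "nc_creation_date", "esgf_publish", "gr")

# Hypothetical candidates for one model/member; all values are invented.
candidates = {
    "/hypothetical/ModelA/r1i1p1f1/a/": {
        "version": 20190308, "timepoints": 1980,
        "nc_creation_date": 20190310, "esgf_publish": 1, "gr": 0,
    },
    "/hypothetical/ModelA/r1i1p1f1/b/": {
        "version": 20200101, "timepoints": 1200,
        "nc_creation_date": 20200105, "esgf_publish": 1, "gr": 0,
    },
    "/hypothetical/ModelA/r1i1p1f1/c/": {
        "version": 20200101, "timepoints": 1980,
        "nc_creation_date": 20200105, "esgf_publish": 0, "gr": 0,
    },
}


def choose_path(candidates, criteria=DEFAULT_CRITERIA):
    """Return (chosen, alternates), ranking paths by the criteria in order."""
    def rank(path):
        # Earlier criteria dominate; later ones only break ties.
        return tuple(candidates[path][name] for name in criteria)

    ordered = sorted(candidates, key=rank, reverse=True)
    return ordered[0], ordered[1:]


chosen, alternates = choose_path(candidates)
print("chosen:", chosen)
print("alternates:", alternates)

# Changing the order of the criteria can change which path is chosen.
published_first, _ = choose_path(candidates, ("esgf_publish", "version", "timepoints"))
print("chosen when publication status comes first:", published_first)
```

With the default ordering, paths `b` and `c` tie on version and `c` wins on the number of timesteps; putting `esgf_publish` first selects `b` instead, which is why the order of the criteria matters.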
+
+
### Acknowledgements
Much of the de-duplication logic and many of the ideas in this package come from the [durolib](https://github.com/durack1/durolib) library, created by Paul Durack [@durack1](https://github.com/durack1) to search for CMIP data. durolib handled [CDAT](https://cdat.llnl.gov/) xml files, which could be used to read in multi-file datasets. Logic to produce the xml files was refactored in [xagg](https://github.com/pochedls/xagg). xagg xmls are being phased out in favor of the current approach: rapidly searchable json files, which allow users to locate datasets and read them in with xarray-based tools. Stephen Po-Chedley [@pochedls](https://github.com/pochedls/) produced the initial version of `xsearch`.