Commit d0c939d

update to default to full metadata; increment version

1 parent: 2dda585

File tree: LICENSE, README.md, __init__.py, search.py

4 files changed (+61, -34 lines)

LICENSE

File mode changed from 100644 to 100755.

README.md

File mode changed from 100644 to 100755. +57, -30 lines.
@@ -22,13 +22,13 @@ If the xsearch version is incremented you will receive a warning on import:

 `xsearch` will create a file in your home directory to support version checking.

-Alternatively, you can download the contents of this repository and import the package (or the `search.py` file).
+Alternatively, you can download the contents of this repository and import the package (or the `search.py` file). In this case, we suggest you watch the repository (in GitHub) so that you know when there are version changes.

 `xsearch` is built around metadata logic documented in the [CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CV’s document](https://goo.gl/v1drZl). This logic underpins the filenames and directory structure written by [CMOR](https://cmor.llnl.gov/) and other CMIP-writing libraries, and is encoded in the [CMIP6 CMOR Tables](https://github.com/PCMDI/cmip6-cmor-tables/tree/master/Tables) (e.g., Amon designates A = atmospheric realm and mon = monthly frequency); similar logic underpinned the [CMIP5](https://github.com/PCMDI/cmip5-cmor-tables/tree/master/Tables) and [CMIP3](https://github.com/PCMDI/cmip3-cmor-tables/tree/master/Tables) phases.

 ### Support

-Note: this software library is designed to help PCMDI users search for CMIP data. If the search works correctly, but points to a dataset that has problems (e.g., a corrupted or incomplete dataset) these issues should not be logged as an issue / bug in this repository.
+Note that this software library is designed to help PCMDI users search for CMIP data. If the search works correctly, but points to a dataset that has problems (e.g., a corrupted or incomplete dataset), these issues should not be logged as an issue / bug in this repository.

 Contributions from users of this utility are critical (issue reports and contributed code to improve the package or address bugs).
@@ -47,7 +47,7 @@ Contributions from users of this utility are critical (issue reports and contrib

 The search warns that the returned paths include data that spans multiple realms and multiple tables. You can add optional facets to select one realm/table:

     dpaths = xs.findPaths('historical', 'tas', 'mon', realm='atmos', cmipTable='Amon')
-    print(dpaths)
+    print(dpaths.keys())

 > ['/p/css03/cmip5_css01/data/cmip5/output1/BCC/bcc-csm1-1-m/historical/mon/atmos/Amon/r1i1p1/v20120709/tas/',
 > '/p/css03/cmip5_css01/data/cmip5/output1/BCC/bcc-csm1-1-m/historical/mon/atmos/Amon/r2i1p1/v20120709/tas/',
@@ -60,35 +60,14 @@ The search warns that the returned paths include data that spans multiple realms

 > '/p/css03/cmip5_css01/data/cmip5/output1/CNRM-CERFACS/CNRM-CM5/historical/mon/atmos/Amon/r2i1p1/v20110901/tas/',
 > ...

-Other search facets include: `mip_era` (**CMIP5** or **CMIP6**), `activity` (e.g., **CMIP** or **ScenarioMIP**), `institute` (e.g., **E3SM-Project**), `model` (e.g., **E3SM-1-1**), `member` (e.g., **r1i1p1f1**), `grid` (e.g., **gn** or **gr**), or `gridLabel` (e.g., **glb-z1-gr**).
-
-Searches can also include a wildcard (`*`) in required search terms or for additional facets:
-
-    dpaths = xs.findPaths('ssp*', 'tas', 'mon', cmipTable='Amon', realm='atmos', activity='Scenario*', fullMetadata=True)
-    print(xs.getGroupValues(dpaths, 'experiment'))
-
-> ['ssp119', 'ssp370', 'ssp245', 'ssp126', 'ssp434', 'ssp534-over', 'ssp585', 'ssp460']
-
-The user can also specify a number of optional arguments:
-
-* `deduplicate` (default `True`): flag whether to de-duplicate so that a single dataset path is returned for each unique combination of model and member. If one of multiple possible datasets are chosen, the alternates are listed in `alternate_paths` (see `fullMetadata` option).
-* criteria: The ordering of criteria used to deduplicate. The default is `["version", "timepoints", "nc_creation_date", "esgf_publish", "gr"]`, which orders by a) newest version, number of timesteps, the creation_date embedded in netCDF file metadata, and a prioritization for data that has been published on ESGF and has the glabel `gr`.
-* `printDuplicates` (default `False`): If True, all potential paths will be printed and paths that are chosen will be denoted with an asterisk
-* `verbose` (default `True`): flag to print out information during the search
-* `lcpath` (default `False`): flag to adjust the data directories to reflect the LC (Livermore Computing/LLNL HPC) mountpoint
-* `filterRetired` (default `True`): flag to filter out datasets that are retired (datasets are retired if they are moved)
-* `filterRetracted` (default `True`): flag to filter out datasets that are retracted
-* `filterIgnored` (default `True`): flag to filter out datasets that have been marked as ignored
-* `jsonDir` (`str`): This defaults to the current location for the json metadata files, but the user can modify this
-
-By default, `xsearch` will simply return a list of paths. `xsearch` can also return the full metadata for selected paths by setting `fullMetadata=True`:
+Note that, by default, `dpaths` is a dictionary and that each key is a dataset path. Each dictionary entry contains dataset metadata:

-    dpaths = xs.findPaths('historical', 'tas', 'mon', realm='atmos', cmipTable='Amon', fullMetadata=True)
-    p1 = list(dpaths.keys())[0] # get first path
-    dpaths[p1]
-
-This option will instead return a dictionary for each path with a complete set of metadata for that dataset directory:
+    key = list(dpaths.keys())[0] # get the first key
+    print(key) # print dataset
+    print(dpaths[key]) # print metadata for this dataset

+> /p/css03/cmip5_css01/data/cmip5/output1/CNRM-CERFACS/CNRM-CM5/historical/mon/atmos/Amon/r10i1p1/v20110901/tas/
+>
 > {'keyid': 'CMIP5.CMIP.CNRM-CERFACS.CNRM-CM5.historical.r10i1p1.Amon.atmos.mon.tas.gu.glb-z1-gu.v20110901',
 > 'mip_era': 'CMIP5',
 > 'activity': 'CMIP',
@@ -127,6 +106,54 @@ This function supports wildcard searches. Another search function allows the use

 > 'GFDL-ESM2M',
 > ...

+Other search facets include: `mip_era` (**CMIP5** or **CMIP6**), `activity` (e.g., **CMIP** or **ScenarioMIP**), `institute` (e.g., **E3SM-Project**), `model` (e.g., **E3SM-1-1**), `member` (e.g., **r1i1p1f1**), `grid` (e.g., **gn** or **gr**), or `gridLabel` (e.g., **glb-z1-gr**).
+
+Searches can also include a wildcard (`*`) in required search terms or for additional facets:
+
+    dpaths = xs.findPaths('ssp*', 'tas', 'mon', cmipTable='Amon', realm='atmos', activity='Scenario*')
+    print(xs.getGroupValues(dpaths, 'experiment'))
+
+> ['ssp119', 'ssp370', 'ssp245', 'ssp126', 'ssp434', 'ssp534-over', 'ssp585', 'ssp460']
+
+The user can also specify a number of optional arguments:
+
+* `deduplicate` (default `True`): flag whether to de-duplicate so that a single dataset path is returned for each unique combination of model and member. If one of multiple possible datasets is chosen, the alternates are listed in `alternate_paths` (see the `fullMetadata` option). Make sure to read the note on de-duplication below.
+* `criteria`: The ordering of criteria used to de-duplicate. The default is `["version", "timepoints", "nc_creation_date", "esgf_publish", "gr"]`, which orders by a) newest version, b) number of timesteps, c) the creation_date embedded in the netCDF file metadata, and a prioritization for data that d) has been published on ESGF and e) has the grid label `gr`.
+* `printDuplicates` (default `False`): if `True`, all potential paths will be printed and the chosen paths will be denoted with an asterisk
+* `verbose` (default `True`): flag to print out information during the search
+* `lcpath` (default `False`): flag to adjust the data directories to reflect the LC (Livermore Computing/LLNL HPC) mountpoint
+* `filterRetired` (default `True`): flag to filter out datasets that are retired (datasets are retired if they are moved)
+* `filterRetracted` (default `True`): flag to filter out datasets that are retracted
+* `filterIgnored` (default `True`): flag to filter out datasets that have been marked as ignored
+* `jsonDir` (`str`): this defaults to the current location for the json metadata files, but the user can modify this
+* `fullMetadata` (default `True`): flag to specify the datatype returned. If `True`, `xsearch` returns a dictionary with complete metadata for the selected paths; `xsearch` can instead return a simple list of paths by setting `fullMetadata=False`.
+
+### A note on de-duplication
+
+Note that `xsearch` is configured to de-duplicate each set of paths related to a given model and model realization. For example, if you have several versions of data for **Model A** / **r1i1p1f1**, `xsearch` will select the most recent and complete version for you.
+
+But for a given model and realization, the data may differ in more than the dataset version. For example, if we search for all variables beginning with `ta` for `E3SM-1-0`:
+
+    dpaths = xs.findPaths('historical', 'ta*', 'mon', realm='atmos', cmipTable='Amon', member='r1i1p1f1', model='E3SM-1-0', printDuplicates=True)
+
+We get a message noting that we are spanning multiple variables:
+
+> Multiple values for variable. Consider filtering by variable.
+> Available values: tauu, tauv, tasmin, tasmax, ta, tas
+
+`xsearch` chooses one dataset (of the datasets spanning several possible variables) for a given member (`r1i1p1f1`):
+
+> \* /p/user_pub/work/CMIP6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/ta/gr/v20220108/
+>
+> duplicate: /p/user_pub/work/CMIP6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/tauu/gr/v20190913/
+> duplicate: /p/user_pub/work/CMIP6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/tauv/gr/v20190913/
+> duplicate: /p/user_pub/work/CMIP6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/ta/gr/v20191220/
+> duplicate: /p/css03/esgf_publish/CMIP6/CMIP/E3SM-Project/E3SM-1-0/historical/r1i1p1f1/Amon/tas/gr/v20190913/
+> ...
+
+This isn't typically a problem if you specify an exact experiment, variable, and frequency.

 ### Acknowledgements

 Much of the de-duplication logic and many of the ideas in this package come from the [durolib](https://github.com/durack1/durolib) library, created by Paul Durack [@durack1](https://github.com/durack1) to search for CMIP data. durolib handled [CDAT](https://cdat.llnl.gov/) xml files, which could be used to read in multi-file datasets. Logic to produce the xml files was refactored in [xagg](https://github.com/pochedls/xagg). xagg xmls are being phased out in favor of the current approach: rapidly searchable json files, which allow users to locate datasets and read them in with xarray-based tools. Stephen Po-Chedley [@pochedls](https://github.com/pochedls/) produced the initial version of `xsearch`.
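Taken together, the README changes above mean that `findPaths` now returns a metadata dictionary by default. A minimal usage sketch of the behavior described in this commit; the search facets are copied from the README examples, the `'mip_era'` key accessed below is assumed from the metadata record shown above, and actual results depend on the local PCMDI archive:

    import xsearch as xs

    # As of this commit, fullMetadata defaults to True, so findPaths returns a
    # dictionary keyed by dataset path rather than a plain list of paths.
    dpaths = xs.findPaths('historical', 'tas', 'mon', realm='atmos', cmipTable='Amon')

    first = list(dpaths.keys())[0]   # a dataset directory
    print(first)
    print(dpaths[first]['mip_era'])  # one field of that dataset's metadata record (assumed key)

    # Facet values across the whole search can still be summarized:
    print(xs.getGroupValues(dpaths, 'model'))

    # The pre-commit behavior (a plain list of paths) remains available:
    paths_only = xs.findPaths('historical', 'tas', 'mon', realm='atmos',
                              cmipTable='Amon', fullMetadata=False)
    print(paths_only[:3])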

__init__.py

File mode changed from 100644 to 100755. +1, -1 line.

@@ -12,7 +12,7 @@

 import os

 # version
-__version__ = "0.0.2"
+__version__ = "0.0.3"

 # since this software is installed centrally and may be updated
 # this section of code stores the xsearch version in a hidden
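The truncated comments above refer to the version-check behavior described in the README: `xsearch` stores its version in a hidden file in the user's home directory and warns on import when `__version__` has been incremented. A rough sketch of how such a check could work; the file name, message text, and exact logic here are illustrative assumptions, not the package's actual code:

    import os
    import warnings

    __version__ = "0.0.3"

    # Hypothetical home-directory version file; xsearch's real file name and
    # warning text may differ.
    _version_file = os.path.join(os.path.expanduser('~'), '.xsearch_version')

    if os.path.exists(_version_file):
        with open(_version_file, 'r') as f:
            last_seen = f.read().strip()
        if last_seen != __version__:
            warnings.warn('xsearch version changed from ' + last_seen +
                          ' to ' + __version__ + '; review recent changes.')

    # Record the currently imported version for future comparisons.
    with open(_version_file, 'w') as f:
        f.write(__version__)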

search.py

File mode changed from 100644 to 100755. +3, -3 lines.

@@ -143,7 +143,7 @@ def findPaths(experiment,
               filterRetired=True,
               filterRetracted=True,
               filterIgnored=True,
-              fullMetadata=False,
+              fullMetadata=True,
               **kwargs):
     """
     findPaths

@@ -240,14 +240,14 @@
     for dpath in rmpaths:
         data.pop(dpath)
     # if filter not specified, get all possible values
-    for key in ['activity', 'cmipTable', 'realm']:
+    for key in ['activity', 'cmipTable', 'realm', 'variable', 'experiment', 'frequency']:
         if key not in filterDict.keys():
             filterDict[key] = getGroupValues(data, key)
     # get model list
     models = getGroupValues(data, 'model')
     # warn user that some datasets have multiple attributes
     if verbose:
-        for key in ['activity', 'cmipTable', 'realm']:
+        for key in ['activity', 'cmipTable', 'realm', 'variable', 'experiment', 'frequency']:
             if ((len(filterDict[key]) > 1) & (key not in kwargs.keys())):
                 print('Multiple values for ' + key + '. Consider filtering by ' + key + '.')
                 print('Available values: ' + ', '.join(filterDict[key]))
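With `'variable'`, `'experiment'`, and `'frequency'` added to the warned facets above, broad searches now print the same advisory that previously covered only activity, table, and realm. A short usage sketch based on the README's de-duplication example; the model, member, and printed values depend on the local archive:

    import xsearch as xs

    # A wildcard variable search spans several variables, so the expanded
    # warning loop prints, e.g.:
    #   Multiple values for variable. Consider filtering by variable.
    #   Available values: tauu, tauv, tasmin, tasmax, ta, tas
    dpaths = xs.findPaths('historical', 'ta*', 'mon', realm='atmos',
                          cmipTable='Amon', member='r1i1p1f1',
                          model='E3SM-1-0', printDuplicates=True)

    # Narrowing the search to a single variable avoids the warning and keeps
    # de-duplication from choosing across unrelated variables.
    dpaths_tas = xs.findPaths('historical', 'tas', 'mon', realm='atmos',
                              cmipTable='Amon', member='r1i1p1f1',
                              model='E3SM-1-0')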
