ESGF_THREDDS_Harvesting
An ESGF Node is capable of harvesting and exposing metadata from a variety of sources, including THREDDS catalogs, OAI repositories, GCMD records, and others. The publishing framework is extensible: metadata from any source repository can be harvested by writing a handler that knows how to traverse the repository and how to parse its metadata.
The most common way of ingesting metadata into an ESGF Node is by parsing a hierarchy of THREDDS catalogs. Parsing can be initiated either by the ESGF Publisher application or by invoking the ESGF class _esg.search.publish.impl.PublishingServiceMain_ from the command line. The following considerations apply:
- Harvesting can be limited to a single catalog, or it can extend to the full hierarchy of catalogs with a single command.
- When parsing a catalog, a single record of type _"Dataset"_ will be created for the top-level dataset in the THREDDS catalog, while no records will be created for any of the enclosed (i.e. child) datasets. In other words, the catalog represents the unit of discovery for the system, so data providers should create catalogs with the same granularity that they want exposed to users when searching. Metadata associated with the top-level dataset will be harvested and ingested into the ESGF Solr system. Special metadata fields associated with the dataset need to be included in the catalog as (name, value) properties (see example), and can later be chosen as _search facets_ in the ESGF web portal interface. A snippet of a THREDDS catalog for a top-level dataset is shown below.
[THREDDS catalog snippet: the XML markup was lost. The surviving text lists the variables Specific Humidity Number of Observations, Specific Humidity Standard Error, Air Temperature Number of Observations, Air Temperature Standard Error, Specific Humidity, and Air Temperature, followed by the values GRID and NetCDF.]
- A record of type _"File"_ will be created for each file present in the catalog. All these _"File"_ records are associated with the single top-level dataset for that catalog, and are assigned the same fields possessed by the _Dataset_ (i.e. metadata is inherited top-to-bottom), unless those fields already exist at the File level. In other words, metadata is inherited from the Dataset by the Files, but File-level values are never overridden. This model allows searching files by any available facet, such as _experiment_ or _model_, and executing file-specific searches.
Note that because THREDDS does not distinguish between Datasets and Files, the ESGF software relies on the presence of the attribute file_id= to recognize a file.
- Records are ingested into Solr as a bulk operation that involves all records generated from the catalog (i.e. the top-level Dataset record and all File records). If anything fails during the processing of a catalog, no records from that catalog will be ingested.
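
The record model described in the list above can be illustrated with a minimal Python sketch. It is not the ESGF implementation: the sample catalog, the facet names and values, the record layout (`type`, `id`, `dataset_id`), and the functions `catalog_to_records` and `ingest` are all hypothetical, and the sketch assumes that `file_id` is carried as a THREDDS `<property>` element on each file. A plain Python list stands in for the Solr index.

```python
import xml.etree.ElementTree as ET

# THREDDS InvCatalog 1.0 namespace.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical catalog: one top-level dataset, two files, one grouping dataset.
CATALOG = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="example">
  <dataset name="Example dataset" ID="example.dataset.v1">
    <property name="experiment" value="exp1"/>
    <property name="model" value="modelA"/>
    <dataset name="a.nc" urlPath="example/a.nc">
      <property name="file_id" value="example.dataset.v1.a.nc"/>
    </dataset>
    <dataset name="b.nc" urlPath="example/b.nc">
      <property name="file_id" value="example.dataset.v1.b.nc"/>
      <property name="model" value="modelB"/>
    </dataset>
    <!-- No file_id property: treated as a grouping, so no record is created. -->
    <dataset name="grouping"/>
  </dataset>
</catalog>"""


def properties(elem):
    """Return the (name, value) properties declared directly on an element."""
    return {p.get("name"): p.get("value") for p in elem.findall("t:property", NS)}


def catalog_to_records(xml_text):
    """Build one Dataset record plus one File record per file in the catalog."""
    root = ET.fromstring(xml_text)
    top = root.find("t:dataset", NS)
    if top is None:
        raise ValueError("catalog has no top-level dataset")

    inherited = properties(top)
    records = [{**inherited, "type": "Dataset", "id": top.get("ID")}]

    for child in top.iter("{%s}dataset" % NS["t"]):
        if child is top:
            continue
        own = properties(child)
        if "file_id" not in own:
            continue  # child dataset that is not a file: no record
        # Dataset fields are inherited first; File-level fields are applied
        # afterwards, so values already set on the file are never overridden.
        records.append({
            **inherited,
            **own,
            "type": "File",
            "id": own["file_id"],
            "dataset_id": top.get("ID"),
        })
    return records


def ingest(xml_text, index):
    """All-or-nothing: the index is only touched once every record was built."""
    records = catalog_to_records(xml_text)  # any failure raises here
    index.extend(records)


if __name__ == "__main__":
    index = []
    ingest(CATALOG, index)
    for record in index:
        print(record["type"], record["id"], record["model"])
```

Building the complete list of records before touching the index is what gives the all-or-nothing behavior in this sketch; a real Solr client would presumably send the records in one update and commit only on success.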