ESGF_Search_API_Obsolete
This document describes the ESGF Search Services API (Application Programming Interface), i.e. all the information that a client application needs to send a search query to the service and to interpret the response sent back by the service.
Note that the document does NOT include requirements for a browser user interface, or a desktop tool, that acts as a client to the API (although the requirements of such clients do inform the definition of the API). Note also that the target of a search is always metadata (specifically, a text document that contains information about matching search results), not the resources themselves (datasets, files, etc.). In other words, an ESGF search is not meant to return a binary stream of data.
The ESGF search API is meant to be used by the following actors:
- Humans using a web browser interface, interacting with the interface in real time
- Humans using a rich desktop client, or a command line utility, interacting with the client in real time
- Batch Jobs executing data discovery at regular intervals
Currently, the API is not meant to support massive harvesting of records from one ESGF site to another, which is accomplished through different protocols and services.
The following are representative examples of the kinds of queries that an ESGF search service must be able to resolve. Unless otherwise specified, these queries are expected to be performed by humans using either a web browser or a desktop client. For each query, we report the syntax (described later) that can be used to execute it.
-
Find all datasets about "climate change"
- ?query=climate+change
-
Find all files for experiment=decadal2000, variable=Specific Humidity, frequency=monthly, model=CanCM4
- ?experiment=decadal2000&variable=Specific+Humidity&time_frequency=mon&model=CanCM4
-
Find a specific dataset by id, return all possible metadata
- ?id=my.test.dataset&fields=*
-
Find all files belonging to a given dataset
- ?type=File&dataset_id=my.test.dataset
-
Find a specific file by id, or tracking id, or filename, etc.
- ?type=File&tracking_id=abc123
-
Find all datasets (or files) that have changed since a given date
- ?type=Dataset&from=2010-01-01T00:00:00Z
-
Find all datasets for which the id is matching a wildcard expression
- ?id=my.test.* (or ?query=id:my.test.* ?)
-
Find all datasets within a latitude/longitude bounding box
- ?bbox=-111.032,42.943,-119.856,43.039
-
Find all datasets within a certain time extent
- ?start=2007-02-12T04:30:02Z&end=2007-03-11T02:28:00Z
-
Find all datasets for a certain named geographic region
- ?location=Arctic
-
Find all available facets names that can be used to query for results
- ?facets=*
-
Find all possible values of a given facet that apply to search results (whether or not it is a controlled vocabulary)
- ?facets=experiment
-
Find all replicas of a dataset that exist in the system
- ?
A request to an ESGF-compliant search engine is expressed as an HTTP GET/POST request, which is composed of a base URL and additional query parameters. The general syntax of a request (in the GET case) is as follows:
http://<base search url>/?[keyword parameters as (name, value) pairs][facet parameters as (name,value) pairs]
or more explicitly:
http://<base search url>/?[query=...][offset=...][limit=...][type=...][format=...][facets=...][fields=...][lat,lon,radius,polygon,location=...][start,end=...][from,to=...][facet1=value1][facet2=value2][...]
Note that the <base search url> is totally opaque as to the search semantics, i.e. the full search specification must be encoded as part of the query parameters (this is un-RESTful but self-consistent and more scalable). At the discretion of the site, the <base search url> may contain the version of the supported ESGF search API, for example: <base search url> = http://hostname/search/v1/
Note also that all parameters are optional, i.e. a simple request to <base search url> with no parameters at all should return a response document corresponding to the default values of all query parameters.
The value of all parameters must be URL-encoded, so that the complete search URL is well formed.
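The request syntax and the URL-encoding requirement above can be sketched in a few lines of Python. This is only an illustration: the base URL is the hypothetical one from the example above, and the helper name `build_search_url` is not part of the API.

```python
from urllib.parse import urlencode

# Hypothetical base search URL; each site publishes its own.
BASE_SEARCH_URL = "http://hostname/search/v1/"


def build_search_url(base_url, params):
    """Return a GET search URL with every parameter name and value URL-encoded."""
    # Parameters set to None are dropped, so the engine applies its defaults.
    supplied = {name: value for name, value in params.items() if value is not None}
    if not supplied:
        # All parameters are optional: a bare request is still valid.
        return base_url
    return base_url + "?" + urlencode(supplied)


url = build_search_url(
    BASE_SEARCH_URL,
    {"query": "climate change", "variable": "Specific Humidity", "limit": 10},
)
print(url)
```

Parameters are passed as a dictionary rather than as keyword arguments because some parameter names (for example `from`) are reserved words in Python.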
Keyword parameters are query parameters that have reserved names, and are interpreted by the search engine for special purposes. The following keyword parameters are recognized at this time:
- query (default: *): used to pass a free text constraint to the search engine, to match one or more fields
* The search engine is free to apply that constraint in any way that is appropriate to its back-end and holdings. For example, "query=cmip5" may be used by a search engine to match any of the metadata fields, or maybe just the _title_ and _description_ fields.
* Depending on what the search engine supports, the query value may include special characters such as "*", "?", "!" that are interpreted as query _modifiers_
* It is also highly recommended that the search engine parse the query value and reject requests that contain dangerous characters such as ">","<","$" etc.
-
type (default: all types): the type of the returned record. The value of _type_ must be chosen from the ESGF controlled vocabulary - currently the only allowed types are: Dataset, File and Aggregation.
-
format (default: site specific): the format of the returned response document, encoded as the document mime type
* Each search engine is free to return documents in its default format of choice, if a format is not explicitly requested
* If a search engine cannot support a requested format, a 501 HTTP response ("Not Implemented") should be returned
* Examples: format=application/atom+xml, format=application/solr+xml, format=application/esgf+json
-
offset (default: 0): the starting index for the returned results
-
limit (default: site specific): the maximum number of returned results. The search engine is also free to override this value with a maximum number of records it is willing to serve for each request.
-
lat, lon, bbox, location, radius, polygon (default: none): these parameters are used to perform a geo-spatial search according to the OpenSearch Geo extension specification.
-
start, end (default: none): these parameters are used to perform a temporal query according to the OpenSearch Time extension specification.
* The date and time values must be encoded in the format "YYYY-MM-DDTHH:mm:ssZ".
* The _start_, _end_ parameters refer to the temporal coverage of the data, not the last update time stamp of the metadata.
- from, to (default: none): used as the lower and upper limits of the last update time stamp of each record, for example to return only the newest records.
* The values must be encoded in the format "YYYY-MM-DDTHH:mm:ssZ".
- facets (default: site specific): comma-separated list of facets to be returned in the response. For each requested facet, the engine should return all the possible values and counts (if available) for that facet _across all the records matching the query_ (not just the records returned in the current response document).
* Example: "query=co2&facets=experiment&offset=0&limit=10" will instruct the search engine to include in the response document all the possible values and counts of the field "experiment" for _all_ records matching "co2" (not just the first 10 records).
* Example: "query=co2&facets=experiment&offset=10&limit=20" will return different records than the previous query, but the same values and counts for the "experiment" facet.
* Each engine is free to implement its default behavior for the case when the _facets_ parameter is not specified: return no facets, return all facets, or return a selected set of facets.
-
fields (default: site specific): used to specify which metadata fields should be included for each returned result, if available.
-
distrib (default: true): specifies if the search should be done in a distributed manner or only locally (if set to false).
-
replica (default: unset): specifies if the search should return replicas only (true) or originals only (false). Leave unset to return both.
-
latest (default: unset): specifies if the search should return the latest version only (true) or older versions only (false). Leave unset to return both.
In addition to the standard keyword parameters, each engine is free to implement and process additional keyword parameters.
- For example, an engine might be able to process the instruction "highlight=true" to highlight matching text in the search result.
- Site-specific keywords must not collide with the controlled vocabulary of ESGF facets.
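The interplay between _facets_, _offset_ and _limit_ described above (facet counts cover all matching records, not just the returned page) can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `search` function, the record layout, and the facet value "historical" are hypothetical and do not describe any actual ESGF back-end.

```python
from collections import Counter


def search(records, predicate, facets, offset=0, limit=10):
    """Return one page of results plus facet counts over ALL matching records."""
    matching = [record for record in records if predicate(record)]
    # Facet counts are computed before paging, so they do not depend on offset/limit.
    facet_counts = {
        name: dict(Counter(record[name] for record in matching if name in record))
        for name in facets
    }
    return {"results": matching[offset:offset + limit], "facets": facet_counts}


# Hypothetical holdings: 30 records, alternating between two experiments.
records = [
    {
        "id": f"rec{i}",
        "title": "co2 record",
        "experiment": "decadal2000" if i % 2 == 0 else "historical",
    }
    for i in range(30)
]


def matches_co2(record):
    return "co2" in record["title"]


first = search(records, matches_co2, ["experiment"], offset=0, limit=10)
second = search(records, matches_co2, ["experiment"], offset=10, limit=20)

# Different records on each page, identical facet values and counts.
print(len(first["results"]), len(second["results"]))
print(first["facets"] == second["facets"])
```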
Any parameter that is not a keyword parameter (i.e. it has a name that is not one of the special names listed above) is interpreted by the system as a facet parameter, and used to apply a facet constraint to the query. Multiple facet parameters can be specified as part of the same request to limit the results space to the intersection of records matching all constraints (in other words, facet parameters are combined with a logical _AND_).
Note that:
- At this time, for interoperability reasons (as well as security), the facet names must be chosen from a controlled vocabulary.
- The facet value is used as-is to match the returned results, i.e. no regular expression matching is applied.
- As for keyword parameter values, facet values must be properly URL-encoded.
Examples:
-
_&experiment=decadal2000_: will match all records that have a metadata field with name="experiment" and value="decadal2000"
-
_&cf_standard_name=Air+Temperature_: will match records that have a metadata field with name="cf_standard_name" and value="Air Temperature"
-
_&experiment=decadal*_: will most likely not match any record, since no record will have a metadata field "experiment" that is _exactly_ set to "decadal*"
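The facet rules above (any non-keyword parameter is a facet, constraints are combined with AND, values are matched exactly) can be sketched as follows. This is a minimal illustration, assuming records are plain dictionaries whose multi-valued fields are lists; the helper names and the sample record are hypothetical.

```python
# Reserved keyword parameter names, as listed earlier in this document.
KEYWORD_PARAMETERS = {
    "query", "type", "format", "offset", "limit",
    "lat", "lon", "bbox", "location", "radius", "polygon",
    "start", "end", "from", "to",
    "facets", "fields", "distrib", "replica", "latest",
}


def split_parameters(params):
    """Separate already-decoded request parameters into keywords and facet constraints."""
    keywords = {n: v for n, v in params.items() if n in KEYWORD_PARAMETERS}
    facets = {n: v for n, v in params.items() if n not in KEYWORD_PARAMETERS}
    return keywords, facets


def matches_facets(record, facet_constraints):
    """Return True only if the record satisfies every constraint (logical AND)."""
    for name, wanted in facet_constraints.items():
        values = record.get(name, [])
        if not isinstance(values, list):
            values = [values]
        # Exact comparison: no wildcard or regular expression matching.
        if wanted not in values:
            return False
    return True


# Hypothetical record used only for illustration.
record = {"experiment": "decadal2000", "cf_standard_name": ["Air Temperature"]}

keywords, facets = split_parameters(
    {"experiment": "decadal2000", "cf_standard_name": "Air Temperature", "limit": "10"}
)
print(matches_facets(record, facets))
print(matches_facets(record, {"experiment": "decadal*"}))
```

The second call prints `False`, mirroring the last example above: the value "decadal*" is compared literally.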
The result of a search request is a response document that is encoded in the format specified by the request parameter _format_. Independently of the format, the response document always contains the following logical sections:
-
Header: contains all of the parameters used in the request, so that the same response document can be reproduced
-
Results: contains zero or more records that match the search criteria, each with associated metadata.
-
Facets: contains the values and counts of the requested facets. May be empty or absent if no facets are requested.
Each result record contained in the response document is associated with a set of metadata fields. Each field has a name, and may be single-valued or multiple-valued. Some fields are meaningful for records of all types and have been assigned standardized names, while other fields are more type-specific and may have any name. The rules for including metadata fields in the response documents are as follows:
- The standard metadata fields must always be included for each record (if applicable and available)
- Other fields may be included for each record, if explicitly requested via the _fields_ parameter.
* Even if _fields=..._ is specified, the common metadata fields must always be included
* fields=* may be used to include all available metadata fields
- Each search engine is allowed to define its default behavior, provided it complies with the previous rules. For example, the default behavior of a search engine can be to return only the common metadata fields, or all of the available fields.
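The field inclusion rules above can be sketched as a small filter. This is an illustration under stated assumptions: the standard field names are those listed in this document's table of standard metadata fields, the _fields_ value is assumed to be a comma-separated list (the document only states this for _facets_), and the default shown (standard fields only) is just one of the behaviors a site may choose. The function name and sample record are hypothetical.

```python
# Standard field names, taken from the table of standard metadata fields.
STANDARD_FIELDS = (
    "id", "title", "description", "type", "timestamp",
    "url", "size", "dataset_id", "checksum", "checksum_type",
)


def select_fields(record, requested=None):
    """Return the metadata fields to include for one result record."""
    if requested == "*":
        # fields=* includes every available field.
        return dict(record)
    # Standard fields are always included when available.
    wanted = set(STANDARD_FIELDS)
    if requested:
        # Assumption: a comma-separated list of additional field names.
        wanted.update(name.strip() for name in requested.split(","))
    return {name: value for name, value in record.items() if name in wanted}


# Hypothetical record used only for illustration.
record = {
    "id": "my.test.dataset",
    "title": "Test dataset",
    "type": "Dataset",
    "experiment": "decadal2000",
    "model": "CanCM4",
}

print(select_fields(record))
print(select_fields(record, "experiment"))
print(select_fields(record, "*"))
```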
The following table lists the standard metadata fields, i.e. those fields that represent the minimum amount of metadata that is common to records of all types, and that must always be returned as part of each result record.

| Field Name | Description | Multi-Valued? | Mandatory? | Applicable Record Type |
|---|---|---|---|---|
| id | Globally unique record identifier | false | true | all |
| title | Human-readable short description of the record, usable in a summary display of results | false | true | all |
| description | A human-readable longer description of the record, suitable for being displayed under the record's title, possibly in a shortened form | true | false | all |
| type | The record's type, which should match the search requested type (if provided) | false | true | all |
| timestamp | The date and time when the record was last updated | false | true | all |
| url | A URL that can be used to access the record, must include a descriptive name and the content/mime type | true | true | all |
| size | The file size, or total dataset size (sum of all files) | false | true | Dataset, File |
| dataset_id | The identifier of the containing dataset | false | true | File |
| checksum | The file checksum, if available | true | true | File |
| checksum_type | The file checksum type, if available | true | true | File |
Notes:
-
Each result record must contain one or more URLs that the record can be hyperlinked to. The URL metadata must include the type of the application serving that URL, and a short descriptive name of the application itself. For example, the record representing a single file could contain a URL field for each of the possible ways to download the file:
-
url=http://myhostname/thredds/fileserver/file_aaa.nc, type=application/netcdf, name=HTTP Server
-
url=http://myhostname/thredds/dodsC/file_aaa.nc, type=application/opendap, name=OpenDAP Server
-
url=http://myhostname/gridftp/file_aaa.nc, type=application/gridftp, name=GridFTP Server
-
For simplicity, a search should return only one single version (the latest) of each matching record. If available, each record may contain an optional version field that is part of its descriptive metadata. The versioning schema may be completely arbitrary, being dependent on the record type, the publishing agent, etc., and should not be used for any other purposes than visual inspection of the record.
[incomplete]
The search engine responsible for processing a search request should encode any unusual circumstance in processing as an HTTP response with the appropriate HTTP status code. In particular, the following status codes can be returned:
- HTTP 400 ("Bad Request")
* if any of the HTTP parameter names or values contain illegal characters
* if the _facets=_ parameter contains an illegal facet
* if the _fields=_ parameter contains an illegal field
- HTTP 500 ("Internal Server Error")
* if a generic server error takes place
- HTTP 501 ("Not Implemented")
* in response to a valid value of _format=_ which is not yet supported by the search engine
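The status code rules above could be applied by an engine along the following lines. This is a sketch only: the sets of illegal characters, allowed facets, allowed fields and supported formats are hypothetical placeholders for site configuration, and the 200 returned for an acceptable request is the usual HTTP success code rather than something this document specifies. HTTP 500 is not modeled, since it covers unexpected server failures.

```python
# Hypothetical site configuration, for illustration only.
ILLEGAL_CHARACTERS = set("<>$")
ALLOWED_FACETS = {"experiment", "model", "variable"}
ALLOWED_FIELDS = {"id", "title", "description", "experiment"}
SUPPORTED_FORMATS = {"application/solr+xml"}


def contains_illegal_list_item(value, allowed):
    """Return True if a comma-separated list names anything outside `allowed`."""
    if value is None or value == "*":
        return False
    return any(item.strip() not in allowed for item in value.split(","))


def status_for_request(params):
    """Return the HTTP status code an engine could send for decoded parameters."""
    for name, value in params.items():
        if ILLEGAL_CHARACTERS & set(name) or ILLEGAL_CHARACTERS & set(str(value)):
            return 400  # Bad Request: illegal characters
    if contains_illegal_list_item(params.get("facets"), ALLOWED_FACETS):
        return 400  # Bad Request: illegal facet
    if contains_illegal_list_item(params.get("fields"), ALLOWED_FIELDS):
        return 400  # Bad Request: illegal field
    requested_format = params.get("format")
    # Simplification: every unsupported format is treated as valid but unsupported.
    if requested_format is not None and requested_format not in SUPPORTED_FORMATS:
        return 501  # Not Implemented
    return 200


print(status_for_request({"query": "co2", "facets": "experiment"}))
print(status_for_request({"query": "<script>"}))
print(status_for_request({"format": "application/atom+xml"}))
```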