Search
The scope of this document is to capture the specification of the ESGF Search back-end services, i.e. the API that an ESGF search engine exposes to its clients. The document does NOT include requirements for a browser user interface or a desktop tool that acts as a client to the API (although the requirements of such clients do inform the definition of the API). Note also that the target of a search is always metadata (specifically, a text document that contains information about matching search results), not the resources themselves (datasets, files, etc.). In other words, an ESGF search is not meant to return a binary stream of data.
The ESGF search API is meant to be used by the following actors:
- Humans using a web browser interface, interacting with the interface in real time
- Humans using a rich desktop client, or a command line utility, interacting with the client in real time
- Batch Jobs executing data discovery at regular intervals
Currently, the API is not meant to support massive harvesting of records from one ESGF site to another, which is accomplished through different protocols and services.
The following are representative examples of the kind of queries that the ESGF search service must be able to resolve. Unless otherwise specified, these queries are expected to be performed by humans using either a web browser or a desktop client.
- Find all datasets about "climate change"
- Find all files for experiment=decadal2001, variable=Specific Humidity, frequency=monthly, model=CanCM4
- Find a specific dataset by id, return all possible metadata
- Find all files belonging to a given dataset
- Find a specific file by id, or tracking id, or filename, etc.
- Find all datasets (or files) that have changed since a given date
- Find all datasets whose id matches a wildcard expression
- Find all datasets for a certain latitude/longitude bounding box, and/or a time extent, and/or a named geographic region
- Find all available facet names that can be used to query for results
- Find all possible values of a given facet that apply to search results (whether or not it is a controlled vocabulary)
- Find all replicas of a dataset that exist in the system
The following list includes desired requirements for the public API exposed by the search services, i.e. for the arguments passed in the search request, and the results returned in the search response. Note that not all requirements may necessarily or completely be satisfied at this time.
- Ability to specify constraints as text, possibly with some form of pattern matching or regular expression
- Ability to specify constraints as arbitrary (name,value) pairs ("faceted search")
- Ability to return records of different types: Datasets, Files, Models, etc. Should this ability be expressed through a well-defined API feature (as in the URL syntax .../dataset/..., or as a keyword type=Dataset), or simply as a facet constraint? (Is there a list of allowed types somewhere or are these ad hoc? Is the structure of the implied hierarchy documented somewhere? --tpb)
- Ability to encode the response in multiple formats: XML dialects, JSON, etc.
- Ability to return proper error messages/codes
- Support pagination of results (lower priority)
- Support ordering, possibly by several algorithms: alphabetical, score, etc. (lower priority)
The following list includes requirements about the system architecture and software implementing the ESGF search services. Note that not all requirements may necessarily or completely be satisfied at this time.
- Adaptability: it should be possible to use the search service API and user interface with a pluggable back-end engine (Solr, RDB, RDF triple store etc.).
- Configurability: it should be possible to configure, on a per node basis, several characteristics of the search including the facets in the user interface, the back-end metadata holdings, etc.
- Extensibility: it should be possible to use new search categories with minimal effort with respect to software changes, configuration, or deployment.
- Scalability: must be able to scale to approximately 100M records (?)
- Federation: must be able to return (consistent) results across multiple nodes
- Flexible Import: must be able to ingest metadata from repositories serving records through heterogeneous protocols (TDS, OAI, etc.) and formats (THREDDS, FGDC, DC, etc.)
- Flexible Export: ability to expose its functional API through multiple protocols: REST, SOAP, Hessian, OAI etc.
- Ability to support rich desktop clients and browsers
- Caching: ability to cache results for identical queries
- Test Framework: availability of an automated mechanism to test the system functionality
- Versioning: capability of the service to advertise its version
- Simplicity: minimizing the number of hoops developers have to go through in order to start consuming the search services
- Maintainability: capability of the code base to be maintained and evolved by the community as an open source project, according to the approved ESGF development process
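As an illustration of the caching requirement above, the following is a minimal sketch of one way "identical queries" could be recognized: the request parameters are sorted before being turned into a cache key, so the same constraints given in a different order map to the same entry. The names `cache_key`, `cached_search` and `run_query` are hypothetical and are not part of the ESGF code base.

```python
from urllib.parse import urlencode

def cache_key(endpoint, params):
    """Build a key that is independent of the order of the parameters."""
    return endpoint + "?" + urlencode(sorted(params.items()))

_cache = {}

def cached_search(endpoint, params, run_query):
    """Return a cached result if the same query was already executed."""
    key = cache_key(endpoint, params)
    if key not in _cache:
        _cache[key] = run_query(endpoint, params)
    return _cache[key]

# Hypothetical stand-in for the back-end engine; it only counts invocations.
calls = []
def run_query(endpoint, params):
    calls.append((endpoint, dict(params)))
    return ["result for " + cache_key(endpoint, params)]

first = cached_search("search", {"model": "CanCM4", "frequency": "monthly"}, run_query)
second = cached_search("search", {"frequency": "monthly", "model": "CanCM4"}, run_query)
print(len(calls))  # the back end was invoked only once
```

A real implementation would also need an invalidation policy, since the index changes whenever new records are published.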
(Depending on how the use cases above are rewritten, RESTful-style URLs might be desirable if the search needs to be interacted with programmatically. NCH)
Currently, the main public API exposed by the ESGF search service to clients takes the form of a RESTful HTTP web service. The REST API is composed of one generic HTTP request URL, which can be used to execute an arbitrary query for records of some type under free-text and facet constraints, and several more specific HTTP request URLs, which allow specific queries to be executed through a simplified syntax.
At some point, this API needs to be formulated more precisely; the Globus Online REST API is one example of such a formulation.
- http(s)://[hostname]:[port]/[context]/ws/rest/search/?text=[...]&[facet1]=[value1]&[facet2]=[value2] : to execute a generic search for documents matching some text and with one or more facet constraints.
* Any facet key can be specified as an HTTP request parameter, as long as it matches one of the facets harvested into the index at the time of publishing. If a facet key is not found in the index, no results will be returned. (Will an error be returned so the client can distinguish between an empty result set and an invalid facet key? --tpb) Note that facet values are evaluated as-is when finding matching records, i.e. the facet key must match the record field and the facet value must exactly match the record value. If the facet value contains an illegal character, an exception will be thrown. (What will the client see in this case? --tpb) On the other hand, the value of the "text" expression is always escaped, and is compared against the post-processed record metadata. Note that a few words cannot be used as facets since they are reserved keywords (see below), and that "text" is interpreted not as a facet but as a free-form query.
* See [examples](https://github.com/ESGF/esgf.github.io/wiki/SearchExamples) of this REST call.
(These URLs don't seem to be very RESTful, since they don't point at a specific resource and they contain a verb instead of a noun. Perhaps something more along the lines of ../resource.(extension of content desired)?(search parameters) would be a bit more RESTful?
http://localhost/resource.html?type=temporal&start_date= &end_date=
http://localhost/resource.html?facet_name=facet_value
Also, how is the content type going to be resolved? Will the client's content type be used, or extensions, or parameter passing? NCH)
- http(s)://[hostname]:[port]/[context]/ws/rest/searchById/?id=[...] : to execute a search for a document matching a specific identifier, or for a set of documents whose identifiers match a wildcard expression.
* Besides "id", the only other (optional) HTTP parameter accepted by this endpoint is "type", which narrows the result space when a wildcard expression is used. (What happens if a client offers some other HTTP parameter? Is it ignored or is an error returned? --tpb)
* See [examples](https://github.com/ESGF/esgf.github.io/wiki/SearchExamples) of this REST call.
- http(s)://[hostname]:[port]/[context]/ws/rest/searchByTimeStamp/?from=[YYYY-MM-DDThh:mm:ssZ]&to=[YYYY-MM-DDThh:mm:ssZ] : to execute a search for all documents that were last modified in a given time interval.
* The interval dates and times must be specified in ISO 8601 format, or as special-valued strings such as "NOW", "NOW-1YEAR", etc. The only other (optional) HTTP parameter that can be supplied is "type".
* See [examples](https://github.com/ESGF/esgf.github.io/wiki/SearchExamples) of this REST call.
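To make the three request URLs above more concrete, here is a minimal sketch of how a client might assemble them with the Python standard library. The host, port and context in `BASE_URL`, the helper names, and the identifier pattern are assumptions made for illustration; the facet names are taken from the example queries earlier in this document.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical deployment values; substitute those of a real node.
BASE_URL = "http://localhost:8080/context/ws/rest"

def build_url(endpoint, params):
    """Return the request URL for one of the REST endpoints."""
    # urlencode percent-encodes the values, so spaces, colons and
    # wildcard characters are transmitted safely.
    return f"{BASE_URL}/{endpoint}/?{urlencode(params)}"

def iso8601(moment):
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Generic search: free text plus facet constraints.
generic = build_url("search", {
    "text": "climate change",
    "experiment": "decadal2001",
    "frequency": "monthly",
})

# Search by identifier; the wildcard pattern is a made-up example.
by_id = build_url("searchById", {"id": "cmip5.output.*", "type": "Dataset"})

# Search by last-modified interval, mixing a date with the "NOW" keyword.
by_time = build_url("searchByTimeStamp", {
    "from": iso8601(datetime(2011, 1, 1, tzinfo=timezone.utc)),
    "to": "NOW",
})

for url in (generic, by_id, by_time):
    print(url)
```

Because facet values are matched as-is by the service, a client built this way still has to supply values that exactly match what is stored in the index; the encoding step only protects the URL syntax.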
Note that all HTTP request URLs accept the following optional HTTP parameters (which are reserved keywords):
- offset=... : offset into the returned results (default: 0)
- limit=... : maximum number of returned results (default: 10)
- type=... : type of returned results (no default, i.e. by default results of all types are returned)
- back=... : encoding type for response document (default: Solr XML, which is also the only currently supported return type)
(Is this the same for every content type, or do we need to define what is passed back for each content type? E.g. are JSON results different from XML or HTML results? NCH)
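The offset and limit keywords can be combined to walk through a large result set page by page. The sketch below assumes a hypothetical `fetch_page(offset, limit)` callable that performs one request and returns the list of records for that page; it stops when a page comes back shorter than the limit. Whether the response also advertises a total count is not specified here, so the loop does not rely on one.

```python
def fetch_all(fetch_page, limit=10):
    """Collect all records by requesting successive pages."""
    offset = 0
    records = []
    while True:
        page = fetch_page(offset=offset, limit=limit)
        records.extend(page)
        if len(page) < limit:
            # A short (or empty) page means there is nothing left to fetch.
            break
        offset += limit
    return records

# In-memory stand-in for the search service, used only for this demonstration.
FAKE_INDEX = [f"record-{n}" for n in range(25)]
requested_offsets = []

def fake_fetch_page(offset, limit):
    requested_offsets.append(offset)
    return FAKE_INDEX[offset:offset + limit]

all_records = fetch_all(fake_fetch_page, limit=10)
print(len(all_records), requested_offsets)  # 25 [0, 10, 20]
```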
Each response document is composed of zero, one or more matching records. All records, irrespective of their type, contain a few common fields:
- id: a globally unique identifier (single-valued, mandatory)
- url: the main URL the record result should be hyperlinked to (single-valued, mandatory)
- timestamp: the date and time the record was last updated (single-valued, mandatory)
- title: a human-readable short description of the record, usable in a summary display of results (single-valued, mandatory)
- description: a human-readable longer description of the record, suitable for being displayed under the record's title, possibly in a shortened form (multi-valued, optional)
- type: the record's type, which should match the search requested type (single-valued, mandatory)
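A client may want to verify that each returned record carries the mandatory common fields listed above. The following minimal sketch assumes the record has already been decoded into a Python dictionary keyed by field name; that representation, the helper name and the sample values are illustrative assumptions, not part of the specification.

```python
# Common fields that the list above marks as mandatory.
MANDATORY_FIELDS = ("id", "url", "timestamp", "title", "type")

def missing_fields(record):
    """Return the mandatory field names that are absent or empty."""
    return [name for name in MANDATORY_FIELDS if not record.get(name)]

# Made-up record used only to exercise the check.
record = {
    "id": "example.dataset.v1",
    "url": "http://localhost/catalog/example.dataset.v1.xml",
    "timestamp": "2011-01-01T00:00:00Z",
    "type": "Dataset",
}

print(missing_fields(record))  # ['title']
```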
Additionally, records may include type-specific information, specifically:
- Record type=Dataset:
  * service: any other available access point for the dataset (besides the main url), encoded as <service_name|service_type|endpoint> (multi-valued, optional)
  * size: the total size of the dataset, i.e. the sum of the sizes of its files (single-valued, optional)
  * The values of all the other metadata fields associated with the dataset, in the form of (name,value) pairs (multi-valued, optional)
  * Example of Dataset record encoded as Solr XML.
- Record type=File:
  * parent_id: the id of the enclosing dataset
  * service: any other available access point for the file (besides the main url), encoded as <service_name|service_type|endpoint> (multi-valued, optional)
  * size: the file size in bytes (single-valued, optional)
  * The values of all the other metadata fields associated with the file, in the form of (name,value) pairs (multi-valued, optional)
  * Example of File record encoded as Solr XML.
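The service field packs three pieces of information into one string. As a minimal sketch, assuming the value is a pipe-separated triple and that the angle brackets in the description above are notation that may or may not appear in the actual value, a client could split it as follows. The `Service` type, the function name and the sample value are hypothetical.

```python
from typing import NamedTuple

class Service(NamedTuple):
    name: str
    service_type: str
    endpoint: str

def parse_service(value):
    """Split a 'service_name|service_type|endpoint' string into its parts."""
    # Tolerate surrounding angle brackets, in case they are part of the value.
    parts = value.strip().strip("<>").split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 pipe-separated parts, got {len(parts)}: {value!r}")
    return Service(*parts)

# Made-up value used only for illustration.
service = parse_service("example-server|OPeNDAP|http://localhost/dods/example.nc")
print(service.service_type, service.endpoint)
```

Since the field is multi-valued, a record would typically yield a list of such tuples, one per access point.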
Define a roadmap for features of the search API to be included in each version.
- Do we need a Hessian API? (For now and the next few months, yes. -gavin)
- Should we return Solr XML as the default return type? (Perhaps we should also support the catalog's format and thus return mini catalogs. -gavin)
- How do we properly support replicas (of files and/or datasets)? (Replicas are of datasets only, not files. Files get "copied". -gavin)
- How do we support geo/temporal queries in the REST API, i.e. which URL syntax should we use?
- How do we express the fact that a search engine might not have all relevant information about the query, and might want to include references to additional query services that might have it?