# Annotating machine learning resources with Schema.org and its Bioschemas extension

**This website is under construction**

## Step 1. What is Schema.org & Bioschemas?

### [Schema.org](http://schema.org)

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. Schema.org vocabulary can be used with many different encodings, including RDFa, Microdata and JSON-LD. These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to mark up their web pages and email messages. Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.

### [Bioschemas](https://bioschemas.org/)

Bioschemas is a community initiative to improve the findability of life science data. Bioschemas mainly produces metadata profiles. These profiles are based on Schema.org, i.e., they reuse terms from Schema.org and extend them when necessary. Bioschemas profiles are defined by communities, which specify the metadata terms that are relevant for their domain and their level of importance: they agree on metadata terms that are required, recommended, and optional. Several profiles are available to describe, for instance, [datasets](https://bioschemas.org/profiles/Dataset/1.0-RELEASE), [software tools](https://bioschemas.org/profiles/ComputationalTool/1.0-RELEASE), [workflows](https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE), [genes](https://bioschemas.org/profiles/Gene/1.0-RELEASE), [proteins](https://bioschemas.org/profiles/Protein/0.11-RELEASE), or [training materials](https://bioschemas.org/profiles/TrainingMaterial/1.0-RELEASE).

## Step 2. Annotating an ML training dataset with metadata from the Bioschemas Dataset profile

Here, we consider the following dataset: [https://registry.dome-ml.org/review/dfyn1yvtz3#dataset](https://registry.dome-ml.org/review/dfyn1yvtz3#dataset). More details about this dataset can be found in [this GitHub repository](https://github.com/RyanCook94/inphared).

After browsing the [Bioschemas Dataset profile](https://bioschemas.org/profiles/Dataset/1.0-RELEASE), we see that we **must** provide the required metadata and that we **should** provide the recommended metadata. For brevity, we will omit the optional metadata.

The **required metadata** fields are `description`, `identifier`, `keywords`, `license`, `name`, and `url`.

A subset of the **recommended metadata** fields is `citation`, `creator`, `version`, `datePublished`, and `distribution`.

We can annotate the dataset using JSON-LD as follows:

```json
{
  "@context": "http://schema.org",
  "@type": "Dataset",

  "name": "inphared.pl",
  "description": "inphared.pl (INfrastructure for a PHAge REference Database) is a perl script which downloads and filters phage genomes from Genbank to provide the most complete phage genome database possible.",
  "identifier": "https://github.com/RyanCook94/inphared",
  "url": "https://github.com/RyanCook94/inphared",
  "keywords": "Data Retrieval, Data Analysis, Bioinformatics, Phage Genomes",
  "license": "https://opensource.org/license/agpl-v3",

  "citation": "Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman JL, Jones MA, Millard A. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. 2021. Available from: http://doi.org/10.1089/phage.2021.0007",
  "creator": {
    "@type": "Person",
    "name": "Ryan Cook"
  },
  "version": "v1.2",
  "datePublished": "2021-02-18",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://github.com/RyanCook94/inphared/archive/refs/tags/v1.2.zip"
  }
}
```
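An addition worth considering, suggested in Bioschemas guidance although not listed among the required fields above, is a `dct:conformsTo` statement pointing at the profile (and version) the markup follows. A minimal sketch, assuming the Dataset profile 1.0-RELEASE used in this tutorial and declaring the `dct` prefix explicitly in the context:

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "dct": "http://purl.org/dc/terms/"
  },
  "@type": "Dataset",
  "name": "inphared.pl",
  "dct:conformsTo": {
    "@id": "https://bioschemas.org/profiles/Dataset/1.0-RELEASE"
  }
}
```
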
For more details on JSON-LD encoding of Schema.org, please refer to the [Bioschemas training material](https://bioschemas.org/tutorials/howto/howto_add_markup).

The last step of the annotation process consists in making the metadata accessible. This can be done by embedding the metadata in the HTML code of the dataset webpage, as shown in the sketch below.

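As an illustration, here is a minimal sketch of what the dataset webpage could look like, assuming a simple static HTML page (the exact page structure does not matter; only the `<script type="application/ld+json">` element carrying the markup does):

```html
<!DOCTYPE html>
<html>
  <head>
    <title>inphared.pl - phage genome database</title>
    <!-- The full JSON-LD snippet from Step 2 goes inside this script element;
         it is shortened here for readability -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Dataset",
      "name": "inphared.pl",
      "url": "https://github.com/RyanCook94/inphared"
    }
    </script>
  </head>
  <body>
    <!-- Human-readable description of the dataset -->
  </body>
</html>
```

Crawlers (e.g., Google Dataset Search) and FAIR assessment tools read this embedded markup directly from the page, without any change to the visible content.
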
## Step 3. Annotating ML software by using the ComputationalTool profile

We have seen how to manually write Bioschemas metadata in JSON-LD to annotate a sample dataset. Now, we will see how we can use a software registry to lighten the annotation process.

[Bio.tools](https://bio.tools) is a registry of software tools for the life sciences. It allows users to submit new tools and to search for existing ones. During the submission process, users are asked to provide metadata about the tool. This metadata is then used to dynamically generate a JSON-LD representation on the web page describing the tool.

Let's take an example. Here, we consider the DeepVariant tool. This tool is registered in bio.tools at the following URL: [https://bio.tools/deepvariant](https://bio.tools/deepvariant):

If we check this web page with the [Schema.org validation tool](https://validator.schema.org/#url=https%3A%2F%2Fbio.tools%2Fdeepvariant), we can see that Bioschemas markup is present:

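For illustration only, the kind of markup that the validator extracts follows the [ComputationalTool profile](https://bioschemas.org/profiles/ComputationalTool/1.0-RELEASE) and looks roughly like the simplified sketch below (this is not the exact JSON-LD served by bio.tools, which is generated automatically from the registry entry):

```json
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "DeepVariant",
  "description": "Deep-learning-based variant caller for next-generation sequencing data.",
  "url": "https://bio.tools/deepvariant"
}
```

Because the registry generates and maintains this markup for every entry, tool providers get Bioschemas annotations without writing any JSON-LD themselves.
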
**The key message here is that choosing the right registry is already an important step towards FAIRer ML resources.**

## Step 4. Evaluating the global FAIRness of the annotated ML resources

Finally, we will briefly explore how to evaluate the FAIRness of the annotated resources. We will use the [FAIR-Checker](https://fair-checker.france-bioinformatique.fr) tool, which evaluates the FAIRness of a resource by checking for the presence of semantic metadata.

If we consider an [ML dataset](https://www.kaggle.com/datasets/ankushpanday1/heart-attack-in-youth-vs-adult-in-germany) registered in Kaggle, we can see that the dataset is reasonably FAIR:

The FAIR assessment report shows scores and recommendations for each of the FAIR principles:

![fair-checker-radar](/assets/img/fair-checker-radar.png)

<!-- Let's come back to our dataset described in the DOME registry. DOME allows to expose Bioschemas metadata for each of the hosted entries. This is key to increase the FAIRness of the hosted machine learning resource descriptions. By submitting the [dataset url](https://registry.dome-ml.org/review/dfyn1yvtz3#dataset) in the FAIRChecker tool, we can see that the dataset is FAIR enough:-->

## Takeaways

- Schema.org and Bioschemas are key to increasing the findability of machine learning resources
- You can embed JSON-LD metadata in the HTML code of your web pages
- Using registries (for software, datasets, etc.) is a good practice to lighten the annotation process
- You can evaluate the FAIRness of your digital resources with online tools such as FAIR-Checker