
Text and bib corrections
ivanmicetic committed Nov 21, 2022
1 parent de53b2f commit 2e47290
Showing 2 changed files with 43 additions and 39 deletions.
50 changes: 26 additions & 24 deletions paper.bib
@@ -17,31 +17,33 @@ @inproceedings{Kl_mek_2017
title = {{LinkedPipes} {ETL} in use},
booktitle = {Proceedings of the 19th International Conference on Information Integration and Web-based Applications's Services}
}
@article{Sickmeier_2007,
doi = {10.1093/nar/gkl893},
url = {https://doi.org/10.1093%2Fnar%2Fgkl893},
year = 2007,
month = {jan},
publisher = {Oxford University Press ({OUP})},
volume = {35},
number = {Database},
pages = {D786--D793},
author = {M. Sickmeier and J. A. Hamilton and T. LeGall and V. Vacic and M. S. Cortese and A. Tantos and B. Szabo and P. Tompa and J. Chen and V. N. Uversky and Z. Obradovic and A. K. Dunker},
title = {{DisProt}: the Database of Disordered Proteins},
journal = {Nucleic Acids Research}
@article{Quaglia_2021,
author = {Quaglia, Federica and Mészáros, Bálint and Salladini, Edoardo and Hatos, András and Pancsa, Rita and Chemes, Lucía B and Pajkos, Mátyás and Lazar, Tamas and Peña-Díaz, Samuel and Santos, Jaime and Ács, Veronika and Farahi, Nazanin and Fichó, Erzsébet and Aspromonte, Maria Cristina and Bassot, Claudio and Chasapi, Anastasia and Davey, Norman E and Davidović, Radoslav and Dobson, Laszlo and Elofsson, Arne and Erdős, Gábor and Gaudet, Pascale and Giglio, Michelle and Glavina, Juliana and Iserte, Javier and Iglesias, Valentín and Kálmán, Zsófia and Lambrughi, Matteo and Leonardi, Emanuela and Longhi, Sonia and Macedo-Ribeiro, Sandra and Maiani, Emiliano and Marchetti, Julia and Marino-Buslje, Cristina and Mészáros, Attila and Monzon, Alexander Miguel and Minervini, Giovanni and Nadendla, Suvarna and Nilsson, Juliet F and Novotný, Marian and Ouzounis, Christos A and Palopoli, Nicolás and Papaleo, Elena and Pereira, Pedro José Barbosa and Pozzati, Gabriele and Promponas, Vasilis J and Pujols, Jordi and Rocha, Alma Carolina Sanchez and Salas, Martin and Sawicki, Luciana Rodriguez and Schad, Eva and Shenoy, Aditi and Szaniszló, Tamás and Tsirigos, Konstantinos D and Veljkovic, Nevena and Parisi, Gustavo and Ventura, Salvador and Dosztányi, Zsuzsanna and Tompa, Peter and Tosatto, Silvio C E and Piovesan, Damiano},
title = "{DisProt in 2022: improved quality and accessibility of protein intrinsic disorder annotation}",
journal = {Nucleic Acids Research},
volume = {50},
number = {D1},
pages = {D480--D487},
year = {2021},
month = {11},
issn = {0305-1048},
doi = {10.1093/nar/gkab1082},
url = {https://doi.org/10.1093/nar/gkab1082},
eprint = {https://academic.oup.com/nar/article-pdf/50/D1/D480/42058329/gkab1082.pdf},
}
@article{Di_Domenico_2012,
doi = {10.1093/bioinformatics/bts327},
url = {https://doi.org/10.1093%2Fbioinformatics%2Fbts327},
year = 2012,
month = {jun},
publisher = {Oxford University Press ({OUP})},
volume = {28},
number = {15},
pages = {2080--2081},
author = {T. Di Domenico and I. Walsh and A. J. M. Martin and S. C. E. Tosatto},
title = {{MobiDB}: a comprehensive database of intrinsic protein disorder annotations},
journal = {Bioinformatics}
@article{Piovesan_2020,
author = {Piovesan, Damiano and Necci, Marco and Escobedo, Nahuel and Monzon, Alexander Miguel and Hatos, András and Mičetić, Ivan and Quaglia, Federica and Paladin, Lisanna and Ramasamy, Pathmanaban and Dosztányi, Zsuzsanna and Vranken, Wim F and Davey, Norman E and Parisi, Gustavo and Fuxreiter, Monika and Tosatto, Silvio C E},
title = "{MobiDB: intrinsically disordered proteins in 2021}",
journal = {Nucleic Acids Research},
volume = {49},
number = {D1},
pages = {D361--D367},
year = {2020},
month = {11},
issn = {0305-1048},
doi = {10.1093/nar/gkaa1058},
url = {https://doi.org/10.1093/nar/gkaa1058},
eprint = {https://academic.oup.com/nar/article-pdf/49/D1/D361/35363750/gkaa1058.pdf},
}
@article{Lazar_2020,
doi = {10.1093/nar/gkaa1021},
32 changes: 17 additions & 15 deletions paper.md
@@ -1,12 +1,14 @@
---
title: 'An ETL pipeline to construct IDPcentral Knowledge Graph using Bioschemas JSON-LD data feeds'
title_short: 'An ETL pipeline to construct IDPcentral Knowledge Graph using Bioschemas JSON-LD data feeds'
title: 'An ETL pipeline to construct IDPcentral Knowledge Graph using Bioschemas JSON-LD data dumps'
title_short: 'An ETL pipeline to construct IDPcentral Knowledge Graph using Bioschemas JSON-LD data dumps'
tags:
- pipeline
- workflow
- knowledge graph
- bioschemas
- ETL
- data harvesting
- markup
authors:
- name: Ammar Ammar
orcid: 0000-0002-8021-9162
@@ -20,7 +22,7 @@ authors:
affiliations:
- name: Department of Bioinformatics (BiGCaT), Maastricht University, The Netherlands
index: 1
- name: BioComputing UP Lab, University of Padua, Italy
- name: Department of Biomedical Sciences, University of Padua, Padova, Italy
index: 2
- name: Heriot-Watt University, Edinburgh, UK
index: 3
@@ -36,7 +38,7 @@ group: Project 23
git_url: https://github.com/ammar257ammar/biohackathon2022-bioschemas-consumer
# This is the short authors description that is used at the
# bottom of the generated paper (typically the first two authors):
authors_short: Ammar Ammar & Alasdair Gray \emph{et al.}
authors_short: Ammar Ammar, Alasdair Gray \emph{et al.}
---


@@ -56,9 +58,9 @@ pasting above link (or yours) in
# Introduction

As part of the one-week BioHackathon Europe 2022 in Paris, France, a group was formed to work on Project 23, titled: Publishing and Consuming Schema.org DataFeeds.
Schema.org and Bioschemas which is built on top of it [@8cbc7003c0f74e778be9bbd9e9354a15] are lightweight vocabularies that aim at making web pages contents machine-readable so that software agents can consume that content and understand it in an actionable way. Due to the time needed to process each page, extracting markup by visiting each page of a site is not practical for huge sites. This approach imposes processing requirements on the publisher and the consumer.
Schema.org and Bioschemas, which is built on top of it [@8cbc7003c0f74e778be9bbd9e9354a15], are lightweight vocabularies that aim to make web page content machine-readable so that software agents can consume that content and understand it in an actionable way. Because of the time needed to process each page, extracting markup by visiting every page of a large resource is not practical, and this approach imposes processing requirements on both the publisher and the consumer.
In February 2022, the Schema.org community proposed a method for exchanging markup from various pages as a DataFeed published at a recognized address. The feed could consist of a single file containing the entire information, or it could be divided into several files based on different aspects of the dataset, such as proteins and molecular entities, as in the case of ChEMBL. This would ease publisher and consumer processing requirements and accelerate data collection.
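
To make the shape of such a feed concrete, the sketch below (Python, standard library only) assembles a minimal DataFeed with a single Bioschemas-style Protein record. The feed address, record identifiers and property values are invented for illustration and are not taken from ChEMBL or any of the resources discussed here.

```
# A minimal, illustrative Schema.org DataFeed built with the Python standard
# library only. All identifiers, URLs and property values below are invented;
# a real feed would carry the markup of actual record pages.
import json

feed = {
    "@context": "https://schema.org/",
    "@type": "DataFeed",
    "@id": "https://example.org/data/feed.jsonld",  # hypothetical "recognized address"
    "dateModified": "2022-11-21",
    "dataFeedElement": [
        {
            "@type": "DataFeedItem",
            "item": {
                "@type": "Protein",
                "@id": "https://example.org/protein/P12345",  # made-up record page
                "identifier": "P12345",
                "name": "Example protein",
            },
        }
    ],
}

# Write the feed to a single file that a consumer could fetch and parse.
with open("feed.jsonld", "w") as fh:
    json.dump(feed, fh, indent=2)
```

A real feed would list one DataFeedItem per record page and would typically be regenerated whenever the underlying database is released.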
The aim of Project 23 is to explore the implementation of the Schema.org proposal from both a producer and consumer perspective, for a variety of resources implementing different Bioschemas profiles. This report focuses on the consumer part of the project proposal where we explored an ETL pipeline (Extract-Transform-Load) approach and implemented a consumption pipeline that enables data feeds to be ingested into knowledge graphs (KG).
The aim of the project is to explore the implementation of the Schema.org proposal from both a producer and a consumer perspective, for a variety of resources implementing different Bioschemas profiles. This report focuses on the consumer part of the project proposal, where we explored an Extract-Transform-Load (ETL) approach and implemented a consumption pipeline that enables data dumps to be ingested into knowledge graphs (KGs).


<!--
@@ -68,10 +70,10 @@ The aim of Project 23 is to explore the implementation of the Schema.org proposa
## The construction of IDPcentral Knowledge Graph as a use case

The example pipeline that we developed in this work is based on a [previous work](https://github.com/BioComputingUP/IDP-KG) developed during the ELIXIR-sponsored BioHackathon-Europe 2020 and reported in BioHackrXiv [@Gray_2021].
In that work, several notebooks were developed to generate the IDPcentral Knowledge Graph based on data harvested from three sources:
[DisProt](https://disprot.org/) [@Sickmeier_2007], [MobiDB](https://mobidb.org/) [@Di_Domenico_2012], and [ProteinEnsemble (PED)](https://proteinensemble.org/) [@Lazar_2020].
In that work, an RDFLib-based method was developed to generate the IDPcentral Knowledge Graph from data harvested from three sources:
[DisProt](https://disprot.org/) [@Quaglia_2021], [MobiDB](https://mobidb.org/) [@Piovesan_2020], and [Protein Ensemble Database](https://proteinensemble.org/) [@Lazar_2020].

More specifically, we aimed at reproducing [one of the notebooks](https://github.com/BioComputingUP/IDP-KG/blob/main/notebooks/ETLProcess.ipynb) that did the ETL processing in order to create the knowledge graph, but this time in the form of a pipeline.
More specifically, we aimed at reproducing the ETL processing from [@Gray_2021] in order to create the knowledge graph, but this time in the form of a pipeline.

The pipeline loads scraped JSON-LD from the three aforementioned sources, converts it to RDF, applies SPARQL CONSTRUCT queries to map the source RDF to a unified Bioschemas-based model, and stores the resulting KG as a Turtle (ttl) file.
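
As a rough, standalone illustration of these steps, the sketch below uses Python with rdflib (the library behind the original notebooks). The input file names are placeholders, rdflib ≥ 6 is assumed for built-in JSON-LD parsing, and the CONSTRUCT mapping is a simplified stand-in for the eight queries used by the actual pipeline.

```
# Sketch of the Extract-Transform-Load steps with rdflib; file names and the
# mapping query are illustrative placeholders, not the pipeline's real inputs.
from rdflib import Graph

MAPPING_QUERY = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?idpc a schema:Protein ;
          schema:name ?name ;
          schema:sameAs ?protein .
}
WHERE {
    ?protein a schema:Protein ;
             schema:identifier ?acc ;
             schema:name ?name .
    BIND(IRI(CONCAT("https://idpcentral.org/id/", STR(?acc))) AS ?idpc)
}
"""

kg = Graph()
for source_file in ["disprot.jsonld", "mobidb.jsonld", "ped.jsonld"]:  # hypothetical dumps
    source = Graph()
    source.parse(source_file, format="json-ld")   # Extract: scraped JSON-LD -> RDF
    kg += source.query(MAPPING_QUERY).graph       # Transform: map to the unified model

kg.serialize(destination="idpcentral.ttl", format="turtle")  # Load: write the KG as ttl
```

The LinkedPipes pipeline described below performs the same three phases, but as configurable, reusable components rather than ad hoc code.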

@@ -142,7 +144,7 @@ Fortunately, this configuration can be constructed from a list of URLs using a S

## The ETL pipeline explained

Figure \ref{fig-pipeline} shows the pipeline developed in this work which downloads JSON-LD files scraped from three sources and stored on GitHub, converts them to RDF, maps the RDF to a unified model and saves the resulting graph to a ttl file. The pipeline also calculates some statistics from the converted RDF and stores it in a CSV file next to the output ttl file. The following sections describe in detail the different stages of the KG construction pipeline.
Figure \ref{fig-pipeline} shows the pipeline developed in this work. The pipeline downloads JSON-LD files scraped from three sources, converts them to RDF, maps the RDF to a unified model and saves the resulting graph to a ttl file. The pipeline also calculates some statistics from the converted RDF and stores it in a CSV file next to the output ttl file. The following sections describe in detail the different stages of the KG construction pipeline.
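
The statistics step can be pictured with a small sketch of the same idea: count entities in the converted RDF and write the counts to a CSV file next to the Turtle output. The query below is a generic per-class count and the file names are placeholders; the pipeline's own statistics query (which reports a description and a count per metric) is more specific.

```
# Hedged sketch of the statistics step: count instances per class in the
# generated graph and store the result as CSV (file names are placeholders).
import csv
from rdflib import Graph

g = Graph()
g.parse("idpcentral.ttl", format="turtle")

STATS_QUERY = """
SELECT ?type (COUNT(?s) AS ?count)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?count)
"""

with open("statistics.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["type", "count"])
    for rdf_type, count in g.query(STATS_QUERY):
        writer.writerow([str(rdf_type), int(count)])
```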

![An overview of the developed pipeline using LinkedPipes \label{fig-pipeline}](./figures/the-pipeline.png)

@@ -183,7 +185,7 @@ Now, using the constructed config, the node "HTTP get list" downloads the JSON-L

![The Transform part of the developed pipeline \label{fig-transform}](./figures/transform.png)

In this stage, the download JSON-LD files go through a series of transformations in order to get the final RDF graph as show in Figure \ref{fig-transform}. First, the node of type "JSON to JSON-LD" is used to add a specified JSON-LD context (in this case: https://schema.org/) and additional provenance data to the input JSON files. The input JSON-LD files contain multiple entities in each file, and thus, this step is needed to get a proper JSON-LD for the next step in the pipeline. In case the input JSON-LD file contains a single entity, this step is not needed. Next, the node of type "JSON-LD to RDF" is applied to convert the JSON-LD to turtle RDF on which SPARQL queries can be executed. Next, eight SPARQL construct queries are applied on the RDF of the input JSON-LD files using nodes of type "SPARQL construct" to map them to a unified Bioschemas-based model. Below, we show an example of a construct query to create protein entities having IRIs that follow the IDPC accession URL pattern (ht<span>tps://</span>idpcentral.org/id/{UNIPROT_ID}). The query is an adapted version of the original one in the ETL notebook that this pipeline is aiming to reproduce.
In this stage, the downloaded JSON-LD files go through a series of transformations in order to get the final RDF graph, as shown in Figure \ref{fig-transform}. First, the node of type "JSON to JSON-LD" is used to add a specified JSON-LD context (in this case: https://schema.org/) and additional provenance data to the input JSON files. The input JSON-LD files contain multiple entities in each file, and thus this step is needed to get a proper JSON-LD for the next step in the pipeline. If an input JSON-LD file contains a single entity, this step is not needed. Next, the node of type "JSON-LD to RDF" is applied to convert the JSON-LD to Turtle RDF on which SPARQL queries can be executed. Then, eight SPARQL construct queries are applied to the RDF of the input JSON-LD files using nodes of type "SPARQL construct" to map them to a unified Bioschemas-based model. Below, we show an example of a construct query to create protein entities having IRIs that follow the IDPC accession URL pattern (ht<span>tps://</span>idpcentral.org/id/{UNIPROT_ID}). The query is an adapted version of the original one in the ETL notebook that this pipeline is aiming to reproduce.

```
PREFIX pav: <http://purl.org/pav/>
@@ -291,19 +293,19 @@ SELECT ?desc ?count WHERE {

### The "Load" phase

In this stage, as shown in Figure \ref{fig-load}, the resulting RDF graph is converted to a file and stored in the local file system to a path specified by the configuration of the node of type "Files to local". Similarly, the summary statistics are stored in a CSV file in the local file system. The pipeline can be adapted to utilize the output data more practically by loading it into a live SPARQL endpoint or sending the file over FTP to another location.
In this stage (Figure \ref{fig-load}), the resulting RDF graph is converted to a file and stored in the local file system at a path specified by the configuration of the node of type "Files to local". Similarly, the summary statistics are stored in a CSV file in the local file system. The pipeline can be adapted to utilize the output data more practically by loading it into a live SPARQL endpoint or sending the file over FTP to another location.
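
As one possible adaptation along those lines, the sketch below pushes the generated Turtle file to a triple store through the SPARQL 1.1 Graph Store HTTP Protocol. The endpoint URL and named graph IRI are placeholders, authentication is omitted, and the `requests` library is assumed to be installed.

```
# Hedged sketch: replace the content of a named graph in a remote store with
# the pipeline output, using the SPARQL 1.1 Graph Store HTTP Protocol.
import requests

STORE_ENDPOINT = "https://example.org/store"      # hypothetical graph store endpoint
GRAPH_IRI = "https://idpcentral.org/graph"        # hypothetical named graph

with open("idpcentral.ttl", "rb") as fh:
    response = requests.put(
        STORE_ENDPOINT,
        params={"graph": GRAPH_IRI},              # target graph to (re)load
        data=fh.read(),
        headers={"Content-Type": "text/turtle"},
    )
response.raise_for_status()
```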

![The Transform part of the developed pipeline \label{fig-load}](./figures/load.png)
![The Load part of the developed pipeline \label{fig-load}](./figures/load.png)

## LinkedPipes pipeline exporting, testing and FAIR compliance

The LinkedPipes user interface allows downloading individual pipelines in JSON-LD format or exporting all the pipelines as TriG files wrapped in a zip archive. The pipeline itself is therefore machine-readable, since it is represented in JSON-LD where components, connections and configurations are annotated with terms from the LinkedPipes ontology and other ontologies like SKOS [@8e3f54f09cd0481b9d826939a5d596a9]. Figure \ref{fig-viz} shows a simplified visualization of the pipeline JSON-LD structure. Adopting JSON-LD as a medium to import/export pipelines in LinkedPipes makes them natively machine-readable and compliant with several FAIR subprinciples under the interoperability (I) and reusability (R) principles. The pipeline produced in this work is made available on [GitHub](https://github.com/ammar257ammar/biohackathon2022-bioschemas-consumer) under a [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license along with the BioHackathon report, and it is also archived in Zenodo, giving it a DOI, a globally unique identifier. These steps add to the findability, accessibility and reusability of the work.
The pipeline can be tested using a local running instance of LinkedPipes or using the online [demo instance](https://demo.etl.linkedpipes.com/). By using the upload function under the "pipelines" tab, users can load the pipeline either as a file or as a URL, and then, it can be executed to get the output RDF. All the files needed for the pipeline to execute (the input URL list and the JSON-LD files) are hosted online in GitHub repositories. Therefore, the pipeline is portable and reproducible on any machine. Finally, the pipeline can be executed using a REST API call which makes it convenient for automation and scheduling via a cron job for example to run at a fixed interval and update the knowledge graph continuously.
The pipeline can be tested using a local running instance of LinkedPipes or using the online [demo instance](https://demo.etl.linkedpipes.com/). Using the upload function under the "pipelines" tab, users can load the pipeline either as a file or as a URL and execute it to get the output RDF. All the files needed for the pipeline execution (the input URL list and the JSON-LD files) are hosted online in GitHub repositories, making the pipeline portable and reproducible. Finally, the pipeline can be executed through a REST API call, which makes it convenient for automation and scheduling, for example via a cron job that runs at a fixed interval and continuously updates the knowledge graph.
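
As a hedged example of such automation, the snippet below starts a pipeline execution over HTTP from a script that a cron job could invoke at a fixed interval. The instance URL, pipeline IRI and execution endpoint are assumptions and should be checked against the LinkedPipes ETL API documentation for the instance in use.

```
# Hedged sketch of triggering a LinkedPipes execution via its REST API; the
# instance URL, pipeline IRI and endpoint path are assumptions to be verified
# against the LinkedPipes ETL documentation.
import requests

INSTANCE = "http://localhost:8080"  # assumed local LinkedPipes ETL instance
PIPELINE_IRI = f"{INSTANCE}/resources/pipelines/idpcentral-kg"  # hypothetical pipeline IRI

response = requests.post(
    f"{INSTANCE}/resources/executions",   # assumed execution endpoint
    params={"pipeline": PIPELINE_IRI},
)
response.raise_for_status()
print("Execution started:", response.text)
```

A crontab entry invoking this script (or an equivalent curl call) is then enough to rebuild the knowledge graph on a regular schedule.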

![Simplified visualization of the LinkedPipes pipeline JSON-LD markup. Three types of entities were kept in the graph for simplicity, namely Pipeline, ExecutionProfile and Component. The pipeline itself is stored and exported as a JSON-LD file where components, connections and configurations are annotated with terms from the LinkedPipes ontology and other ontologies like SKOS. The figure was generated using classyschema.org \label{fig-viz}](./figures/pipeline_visualization.png)

## Conclusion

In this work, an ETL consumption pipeline was developed to construct a knowledge graph from Bioschemas JSON-LD feeds. The LinkedPipes suite was explored and assessed as suitable for the purpose of this project. This approach proved that consuming JSON-LD data feeds through an ETL pipeline is a viable solution. Moreover, the ability to automate the execution of the pipeline via a scheduled REST API call (e.g. using a cron job or a Jenkins job) makes it convenient to continuously update the knowledge graph with new data feeds. Finally, the RDF-based nature of LinkedPipes in creating and communicating ETL pipelines largely contributes to the overall FAIRness of this approach and helps to make not only the data FAIR but also the software/pipeline used to create the data.
In this work, an ETL consumption pipeline was developed to construct a knowledge graph from Bioschemas JSON-LD dumps. The LinkedPipes suite was explored and assessed as suitable for the purpose of this project. This approach proved that consuming JSON-LD data dumps through an ETL pipeline is a viable solution. Moreover, the ability to automate the execution of the pipeline via a scheduled REST API call (e.g. using a cron job or a Jenkins job) makes it convenient to continuously update the knowledge graph with new data dumps. Finally, the RDF-based nature of LinkedPipes in creating and communicating ETL pipelines largely contributes to the overall FAIRness of this approach and helps to make not only the data FAIR but also the software/pipeline used to create the data.

## References
