
Commit 2a2424c

Paper (elixir-europe#4) improvements
1 parent 63a3a57 commit 2a2424c

File tree

3 files changed, +34 −4 lines


.gitignore

+1
@@ -1,2 +1,3 @@

.DS_Store
+.vscode/settings.json

/paper/paper.bib

+25
@@ -153,3 +153,28 @@ @article {Capella-Gutierrez:2017
  eprint = {https://www.biorxiv.org/content/early/2017/08/31/181677.full.pdf},
  journal = {bioRxiv}
}
+
+@article{IsonKJBUMMLPR13,
+  author    = {Jon C. Ison and
+               Mat{\'{u}}s Kalas and
+               Inge Jonassen and
+               Dan M. Bolser and
+               Mahmut Uludag and
+               Hamish McWilliam and
+               James Malone and
+               Rodrigo Lopez and
+               Steve Pettifer and
+               Peter M. Rice},
+  title     = {{EDAM:} an ontology of bioinformatics operations, types of data and
+               identifiers, topics and formats},
+  journal   = {Bioinform.},
+  volume    = {29},
+  number    = {10},
+  pages     = {1325--1332},
+  year      = {2013},
+  url       = {https://doi.org/10.1093/bioinformatics/btt113},
+  doi       = {10.1093/bioinformatics/btt113},
+  timestamp = {Mon, 02 Mar 2020 16:24:09 +0100},
+  biburl    = {https://dblp.org/rec/journals/bioinformatics/IsonKJBUMMLPR13.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}

/paper/paper.md

+8 −4
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ event: BioHackathon Europe 2023
Benchmarks - standardized tests comparing performance, accuracy, and efficiency - are key for evaluating individual tools and composite workflows. In a “Bake Off” setting, they allow for comparisons of candidate tools and workflows for a particular computational task in order to determine the best-performing one [@Lamprecht:2021]. In this BioHackathon Project, we worked toward a “Great Bake Off of Bioinformatics Workflows” to develop workflow-level benchmarks. We invited BioHackathon participants to share tools and workflows and collected their feedback and further ideas for benchmarks. Initially, the benchmarks were tested in the proteomics domain due to mature domain annotations and project lead expertise [@Palmblad:2019]. The participants' areas of expertise guided the exploration of additional domains, such as genomics. The same approach was recently applied in metabolomics [@Du:2023].


-In recent and ongoing work in ELIXIR Implementation Studies and spin-off projects, we have already developed several rudimentary workflow-level benchmarks for bioinformatics data analysis pipelines, including those automatically composed by the APE (Automatic Pipeline Explorer) framework [@Kasalica:2021]. Before deploying these benchmarks for production use, however, their definitions must be aligned with benchmarks at the tool-level and formalized. This process should prioritize benchmarks that are most relevant for users when selecting, comparing, and deploying workflows for daily use. During the BioHackathon Europe 2023 project, we attempted to consolidate these efforts by bringing together people with complementary expertise and bridging ongoing ELIXIR efforts. We aim to produce a minimum fit-for-purpose set of workflow-specific benchmarks by aggregating tool benchmarks (Task 1) and assess the feasibility of annotating and mapping tool- and workflow-level benchmarks to EDAM operations to explore the reusability of benchmarks across domains (Task 2).
+In recent and ongoing work in ELIXIR Implementation Studies and spin-off projects, we have already developed several rudimentary workflow-level benchmarks for bioinformatics data analysis pipelines, including those automatically composed by the APE (Automatic Pipeline Explorer) framework [@Kasalica:2021]. Before deploying these benchmarks for production use, however, their definitions must be aligned with benchmarks at the tool level and formalized. This process should prioritize benchmarks that are most relevant for users when selecting, comparing, and deploying workflows for daily use. During the BioHackathon Europe 2023 project, we attempted to consolidate these efforts by bringing together people with complementary expertise and bridging ongoing ELIXIR efforts. We aim to produce a minimum fit-for-purpose set of workflow-specific benchmarks by aggregating tool benchmarks (Task 1) and to assess the feasibility of annotating and mapping tool- and workflow-level benchmarks to EDAM [@IsonKJBUMMLPR13] operations to explore the reusability of benchmarks across domains (Task 2).


Short-term, the Project aimed to deliver a draft set of workflow-level benchmarks, each with examples and defined relationships to existing tool-level benchmarks and standards. We set out to systematically discuss the different types of workflow-level benchmarks, including both design-time (related to algorithmic complexity, licenses and workflow deployability) and run-time (performance metrics). Long-term, these benchmarks will be implemented in the Workflomics project [@Kasalica:2023a] and ongoing Proteomics Community ELIXIR Implementation Studies. These workflow-level benchmarks will be carefully documented, demonstrated and shared with the research community in relevant fora and beyond this preprint.
@@ -73,6 +73,10 @@ During the discussions, we considered three levels of run-time benchmarks:
Level 0 may require zero input files (returning a usage string, demonstrating the tool or tools could be accessed and executed), level 1 typically requires at least one input file to compute a benchmark that depends only on the input file and the EDAM operation, and level 2 typically requires at least two (the gold standard data and the expected output or correct answer). Note that these levels do not correspond to common levels of software testing, but are specifically defined for testing the functionality of individual operations performed by a workflow, where more than one component may be responsible for the output.


+### Level 0 benchmarks
+
+The level 0 benchmarks serve as basic checks of component or workflow functionality, assessing whether tools execute without errors. These tests may require no inputs and focus on validating correct installation. Often this means checking the command-line interface: confirming that a tool can be started and returns a usage string without processing any data, providing a foundational check of operational integrity.
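For concreteness, a level 0 check of this kind could be scripted as follows. This is an illustrative sketch added here, not part of the paper; `mytool` is a hypothetical command-line tool name, and the criterion (zero exit status plus a usage string) is one possible convention.

```python
import subprocess

def level0_benchmark(command: str) -> bool:
    """Level 0: the tool starts and prints a usage string, without any input data."""
    try:
        # Invoke the tool with --help only; no input files are needed at this level.
        result = subprocess.run(
            [command, "--help"], capture_output=True, text=True, timeout=60
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False  # not installed, or hangs: the level 0 benchmark fails
    # Pass if the tool exits cleanly and emits something that looks like usage text.
    return result.returncode == 0 and "usage" in (result.stdout + result.stderr).lower()

# Hypothetical usage:
# level0_benchmark("mytool")
```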
+
### Level 1 benchmarks

The level 1 benchmarks are usually straightforward, such as checking that Format detection [operation:3357] detects a format or that Aggregation [operation:3436] outputs a single file. In most cases, the operation itself immediately suggests at least one suitable benchmark that can be checked with a bash command or regular expression. Level 1 benchmarks are purely technical and have no meaningful scientific interpretation. They are similar to the kinds of tests typically performed in continuous integration and continuous delivery pipelines.
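As an illustration (ours, not the paper's), the two level 1 examples above can each be expressed as a short Python check; the list of format names and the output-file pattern are arbitrary placeholders.

```python
import re
from pathlib import Path

def detects_some_format(tool_output: str) -> bool:
    """Level 1 check for Format detection [operation:3357]: some format name is reported."""
    # The list of format names is an example, not an exhaustive one.
    return re.search(r"\b(FASTA|FASTQ|SAM|BAM|VCF|mzML)\b", tool_output, re.I) is not None

def produces_single_output_file(output_dir: str, pattern: str = "*") -> bool:
    """Level 1 check for Aggregation [operation:3436]: exactly one output file was written."""
    return len(list(Path(output_dir).glob(pattern))) == 1
```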
@@ -82,10 +86,10 @@ The level 1 benchmarks are usually straightforward, such as checking that Format

The level 2 benchmarks range from the straightforward, such as Format detection determining the correct format or Aggregation producing output identical to a file provided by the user, to the hard, such as Data anonymization [operation:3283], the benchmarking of which has itself been the topic of several recent publications [@Prasser:2014][@Pilan:2022].
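A minimal sketch of the straightforward end of this range, comparing a tool's output against a user-provided gold standard; this illustration is ours, uses only the Python standard library, and the file paths are hypothetical.

```python
import filecmp

def output_matches_gold_standard(output_file: str, gold_standard_file: str) -> bool:
    """Level 2 check: the produced file is identical to the user-provided reference file."""
    # shallow=False compares file contents, not just size and modification time.
    return filecmp.cmp(output_file, gold_standard_file, shallow=False)

def correct_format_detected(reported_format: str, expected_format: str) -> bool:
    """Level 2 check for Format detection: the *correct* format was reported."""
    return reported_format.strip().lower() == expected_format.strip().lower()
```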

-To inform discussions, all subclasses Spectral analysis [operation:2945] and Genetic variation analysis [operation:3197], both subclasses of Analysis [operation:2945], and Data handling [operation:2409], in total 28 specific operations in mass spectrometry/proteomics, genomics and general data handling. While some operations, such as Spectrum calculation [operation:3860] and Mass spectra calibration [operation:3627] have unique benchmarks (residual mass measurement error and spectral accuracy respectively), several benchmarks are shared across many operations. Any operation that is expected to output an identifier of a format, gene or protein sequence, or ontology class have the same generic benchmarks, namely whether the output contains an identifier of the correct type (level 1) or the correct identifier (level 2). Similarly, accuracy (fraction correct calls) is a generic benchmark for any operation identifying natural products or peptides from mass spectra, or any type of genomic variants from sequence reads. In situations where the positives and negatives are highly imbalanced, metrics such as the Matthew's correlation coefficient [@Matthews:1975], can be computed from the same information (true and false positives and negatives).
+To inform discussions, we considered all subclasses of Spectral analysis [operation:2945] and Genetic variation analysis [operation:3197], both subclasses of Analysis [operation:2945], as well as of Data handling [operation:2409], in total 28 specific EDAM operations in mass spectrometry/proteomics, genomics and general data handling. <!--- Vedran: The previous sentence was not clear to me. ---> While some operations, such as Spectrum calculation [operation:3860] and Mass spectra calibration [operation:3627], have unique benchmarks (residual mass measurement error and spectral accuracy, respectively), several benchmarks are shared across many operations. Operations that are expected to output an identifier of a format, gene or protein sequence, or ontology class have the same generic benchmarks, namely whether the output contains an identifier of the correct type (level 1) or the correct identifier (level 2). Similarly, accuracy (fraction of correct calls) is a generic benchmark for any operation identifying natural products or peptides from mass spectra, or any type of genomic variants from sequence reads. In situations where the positives and negatives are highly imbalanced, metrics such as the Matthews correlation coefficient [@Matthews:1975] can be computed from the same information (true and false positives and negatives).
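For concreteness, both of these metrics can be derived from the same four confusion-matrix counts; the short sketch below (added for illustration, not taken from the paper) shows accuracy and the Matthews correlation coefficient side by side.

```python
from math import sqrt

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct calls."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; informative when classes are highly imbalanced."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal count is zero
    return (tp * tn - fp * fn) / denom
```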

![Figure 2](./figures/edam-benchmarks.png)
-Figure 2. Mapping of level 2 benchmarks (pink ellipses) to EDAM operations in subsets of the proteomics (Spectral analysis, blue), genomics (Genetic variation analysis, green) and data wrangling (Data handling, orange) domains, suggesting a high degree of reusability of benchmarks across domains. This should be taken into account when computing and visualizing the benchmarks.
+Figure 2. Mapping of level 2 benchmarks (pink ellipses) to EDAM operations in subsets of the proteomics (Spectral analysis, blue rectangles), genomics (Genetic variation analysis, green rectangles) and data wrangling (Data handling, orange rectangles) domains, suggesting a high degree of reusability of benchmarks across domains. This should be taken into account when computing and visualizing the benchmarks.

In summary, **Task 2** results, though preliminary, allow us to hypothesize (Figure 2) that the number of generic benchmarks at these levels of the EDAM ontology is an order of magnitude smaller than the number of operations. For the Workflomics project, such mappings between EDAM operations and computable benchmarks are directly useful in the benchmarking of automatically generated workflows.
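One possible shape for such a mapping, sketched here purely for illustration: the EDAM operation IDs and benchmark names are those mentioned in the text, while the data structure itself is an assumption of ours rather than the Workflomics implementation.

```python
# Illustrative mapping from EDAM operation IDs to generic, computable benchmarks,
# in the spirit of Figure 2; entries and granularity are examples only.
EDAM_BENCHMARKS = {
    "operation:3357": ["output contains a format identifier (level 1)",
                       "correct format identified (level 2)"],              # Format detection
    "operation:3436": ["single output file produced (level 1)",
                       "output identical to user-provided file (level 2)"],  # Aggregation
    "operation:3627": ["spectral accuracy (level 2)"],                       # Mass spectra calibration
    "operation:3860": ["residual mass measurement error (level 2)"],         # Spectrum calculation
}
```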

@@ -119,7 +123,7 @@ Feel free to use numbered lists or bullet points as you need.

# Conclusion

-During the BioHackathon, we made significant progress on the retrieval and aggregation of tool-level benchmarks from OpenEBench, matured the visualization and user interaction with these aggregate benchmarks in the Workflomics user interface, and discussed the feasibility of defining and mapping benchmarks to EDAM operations in different domains. Less tangible but equally important, we strengthened the embedding in the ELIXIR tools ecosystem through extensive technical discussions with experts involved with OpenEBench, EDAM, bio.tools and CWL. The efforts now continue in this larger community of stakeholders and collaborators, and will be published separately and in greater detail elsewhere.
+During the BioHackathon 2023, we made significant progress on the retrieval and aggregation of tool-level benchmarks from OpenEBench, matured the visualization of and user interaction with these aggregate benchmarks in the Workflomics user interface, and discussed the feasibility of defining and mapping benchmarks to EDAM operations in different domains. Less tangible but equally important, we strengthened the embedding in the ELIXIR tools ecosystem through extensive technical discussions with experts involved with OpenEBench, EDAM, bio.tools and CWL. The efforts now continue in this larger community of stakeholders and collaborators, and will be published separately and in greater detail elsewhere.


# Future work
