diff --git a/R/tag-and-archive.R b/R/tag-and-archive.R
new file mode 100644
index 00000000..c76cee6f
--- /dev/null
+++ b/R/tag-and-archive.R
@@ -0,0 +1,35 @@
+# library(tidyverse)
+library(here)
+library(gert)
+
+# Generate HTML of application for archiving -------------------------------
+# 1. Comment out the d-title element in the theme.css
+# 2. Comment out the navbar in the _site.yml
+
+rmarkdown::render(
+  here("index.Rmd"),
+  output_format = distill::distill_article(self_contained = TRUE, toc = TRUE, toc_float = FALSE),
+  output_file = here("nnf-dif-application.html")
+)
+
+# Then add and commit to the Git repo.
+
+repo_version <- "v2021.05.04"
+version_tag <- git_tag_create(
+  name = repo_version,
+  message = "Grant application submitted to Novo Nordisk Foundation for the Data Science Infrastructure funding. This is the version submitted to NNF, even though the tag date is off by a year."
+)
+git_push()
+git_tag_push(repo_version)
+if (interactive()) browseURL("https://gitlab.com/rostools/r-cubed/-/releases/new")
+
+# This is for when we upload to Zenodo
+# tag_archive_file <- str_c("dif-project", repo_version, ".zip")
+# git_archive_zip(tag_archive_file)
+
+# zenodo <- ZenodoManager$new(
+#   # url = "https://sandbox.zenodo.org/api",
+#   url = "https://zenodo.org/api",
+#   logger = "INFO",
+#   token = askpass::askpass()
+# )

diff --git a/index.Rmd b/index.Rmd
index cf6adcfa..dd8a6baa 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -1,5 +1,9 @@
 ---
 title: "A framework for an open and scalable infrastructure for health data exemplified by the DD2 initiative"
+author:
+  - Luke W. Johnston
+  - Alisa Devedzic Kjærgaard
+  - Annelli Sandbæk
 site: distill::distill_website
 bibliography: "resources/references.bib"
 csl: "resources/vancouver.csl"

diff --git a/nnf-dif-application.html b/nnf-dif-application.html
new file mode 100644
index 00000000..108b278a
--- /dev/null
+++ b/nnf-dif-application.html
@@ -0,0 +1,3678 @@
Many initiatives in clinical research, including the Danish Centre for Strategic Research in Type 2 Diabetes (DD2) initiative, have a hard time getting funds for building and maintaining the software needed to make sure health science is the highest quality it can be. For this project, we aim to build software tools that make it easier to do better research, especially for managing and working with data. We'll first be building these tools to help support the DD2 initiative.
We will be sharing these tools widely and freely, so that as many research groups as possible can use them for their own projects. Not only will we build these tools to help researchers manage and share their data, we will also create them with beginner-friendly documentation and training material to make sure more researchers can use our tools, no matter their skill level. We believe that with these tools, science on health and well-being can become better, ultimately helping people with diabetes and society in general.
In clinical and health research, especially for small- to mid-sized research groups, funding for building modern, open source software infrastructures for managing and using data has been limited. This gap has naturally led to organizational challenges in managing existing and incoming data for many research initiatives, including the Danish Centre for Strategic Research in Type 2 Diabetes (DD2) initiative, a national research collaboration and database initiative established in 2010 with continual enrollment of persons with type 2 diabetes. The aim of our project is to close this gap by creating and implementing an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders with the data, documentation, and findings within the DD2 study. This will improve and extend the existing DD2 research infrastructure into an open, national, state-of-the-art research infrastructure that will provide easy and transparent access to this resource for researchers, clinicians, and stakeholders, thus enabling excellent data science driven research. Furthermore, we will create this framework in such a way that other research groups and companies, who are unable to adequately invest in building infrastructures of this type, can relatively easily implement it, and modify it as needed, for their own purposes. By building this framework, we have the potential to help research groups and companies across Denmark (and globally) quickly get up to date on modern, scalable, and efficient approaches to working with data. Within the DD2 setting, open, transparent, and easy access to this constantly growing resource has the potential to greatly improve the interest in, use of, and scientific impact of this resource, leading to substantial scientific and medical advancements, individualised treatment, and improved human health, not only for persons with type 2 diabetes but for the population overall.
In clinical research, software and data infrastructure development is undervalued and, aside from this funding call, underfunded, particularly for small- to mid-sized research organizations. Clinical and health researchers largely lack formal training, support, and awareness in research software engineering (RSE) and in building and managing data infrastructures. The result is that the overall software and computational ecosystem, as well as the technical capacity to maintain it, lags behind that of multiple other scientific domains (e.g., bioinformatics). Particularly with the recent rise of data science and the greater focus on analytical reproducibility, this issue has become increasingly apparent as data, and the skills required to work with it, become ever larger, more technical, and more complex. Indeed, investing in and implementing scalable and modern data infrastructures and RSE processes, built on open source software, has the potential to greatly improve the quality of science, produce more transparent and streamlined workflows, lead to reproducible research, and generally achieve better science in less time (1).
Funding for participant recruitment and data acquisition has historically been (and still is) easier to obtain than funding for building the open source software and infrastructure that support and enhance science, particularly for managing and using data. This imbalance has naturally led to organizational challenges in managing existing and incoming data for many research initiatives within the field of clinical research, including the Danish Centre for Strategic Research in Type 2 Diabetes (DD2) initiative (2,3).
DD2 is a national type 2 diabetes (T2D) research collaboration and database initiative that was established in 2010, with ongoing enrollment by hospital physicians and general practitioners (GPs). Although T2D is a single diagnosis, it comprises several phenotypes with diverse prognoses and risks for complications, which means treatments can be tailored to each phenotype. The overarching aim of DD2 is to improve and individualise the treatment of persons with T2D. Figure 1 shows the datasets within DD2 (4–7). DD2 has received extensive funding from the Danish Council for Strategic Research and the Novo Nordisk Foundation, as well as a Steno National Collaborative Grant for deep phenotyping. Continuously recruiting more participants, adding new data, and expanding data access to researchers throughout Denmark and abroad has the potential to further increase the value of DD2. However, this comes with higher costs and resource needs for maintaining, extending, and improving the existing DD2 research infrastructure.
Building modern data infrastructures has slowly been taking greater priority among funding and research agencies globally. For instance, the UK Biobank (8,9) is a large-scale biomedical database with highly detailed data on ~500,000 participants. It is regularly expanded with additional data, is globally accessible to approved researchers, and is a role model for building a modern research infrastructure.
While the UK Biobank is a source of inspiration for the state-of-the-art, its underlying infrastructure is not openly accessible and reusable. The same applies to a similar Danish initiative, the "Single path to access Danish health data" project (10), where the Danish government and individual regions are collaborating to map out all Danish health data. Another state-of-the-art initiative, led by the University of Chicago, USA, is Gen3 (11), which contains modular open source services that can form the basis of a data infrastructure (12,13) and powers several research platforms, including at the National Institutes of Health (14). However, we are unaware of any similar current national efforts that are open source, re-usable, and suitable for the Danish and EU legal context.
Our primary aim is to create and implement an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders with the data, documentation, and findings within the DD2 study. This will improve and extend the existing DD2 infrastructure into an open, national, state-of-the-art research infrastructure that will provide easy and transparent access to this resource for researchers, thus enabling excellent data science driven research. Our secondary aim is to create this framework in such a way that other research groups and companies, who are unable to adequately invest in building similar infrastructures, can relatively easily implement it and modify it as needed for their own purposes.
Our first step is to build the data infrastructure framework; the second is to implement it in DD2. We describe the framework itself first and then how we will apply it to DD2.
+For this project, the data infrastructure framework is +defined as 1) a set of software programs, 2) a defined and fixed set of +conventions on the structure and format of the filesystem and URL paths, +and 3) a defined structure to the data and associated documentation, all +of which are linked together as modular components. The framework will +serve as an open source starting template for setting up data +infrastructures that make use of modern tools and processes.
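To make the idea of fixed filesystem conventions concrete, below is a minimal sketch in R of what scaffolding such a structure could look like. The folder and file names here are illustrative placeholders, not the framework's final conventions.

```r
library(fs)

# Sketch only: scaffold a hypothetical backend folder convention.
data_infra_skeleton <- function(base_path) {
  # Raw data, processing scripts, the formal database, and the
  # documentation each get a fixed, predictable top-level folder.
  dir_create(path(base_path, c(
    "data-raw",        # plain-text raw data files, as submitted
    "data-processing", # cleaning and processing scripts
    "database",        # the formal database (e.g. SQL) files
    "docs"             # data dictionary, changelog, project listings
  )))
  # Machine-readable metadata lives at fixed paths so other software
  # (and the API) can rely on finding it there.
  file_create(path(base_path, "docs", c(
    "data-dictionary.json",
    "CHANGELOG.md"
  )))
  invisible(base_path)
}
```

Because the paths are fixed by convention, any tool built on top of the framework can locate the data dictionary or changelog without configuration.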
This framework encompasses four target users and three layers, with a complete schematic shown in Figure 2. The three layers are the web portal frontend, the database and documentation backend, and the API (Application Programming Interface) that interacts with both. The four users and their associated use cases are described in the sections below.
+Throughout this application, we’ll refer to these four users and +three layers as we expand on and describe the framework.
+To ensure the development of this framework is efficient and focused, +it will adhere to key principles that are supported by strong +philosophical and scientific rationale:
To maximise the potential for re-use and to minimise the technical debt and expertise needed to use, maintain, and modify the framework, the software and tools underlying the framework will fit these principles:
+Based on the above principles, we have chosen the following software +and conventions to form the framework’s foundation:
+This interface is what all users interact with and use, with +essentially three “permission” levels available:
All content would be rendered directly as plain HTML text to ease the use of existing webpage translation services (e.g., Google Translate), so that content written in another language (e.g., Danish) would still be readable by non-native speakers. This would also lower the amount of maintenance needed for the documentation.
Modern web and computational infrastructures are built on web APIs. Any modern online resource or interface makes use of an API, such as those from Google, Gen3, or the UK Biobank. An API is a mechanism by which different programs can communicate with one another. It forms a set of instructions or conventions that allow easy communication between a user and the computer. APIs are by their nature transparent and, if well documented, would ensure the linked data are FAIR, safely and securely.
In this case, the API would sit between the user and the web server that stores the underlying database and documentation. The API would be a combination of a predefined set of instructions that are sent to the web server to run certain commands, and a set of explicit conventions and rules on how files and folders are structured and named. Taken together, this API would allow other software, such as R packages, to be built to interact with the backend and automate tasks done by the users.
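As an illustration, the sketch below shows how an R function might query one hypothetical endpoint of such an API. The base URL and endpoint name are assumptions for demonstration, not the framework's actual interface.

```r
library(httr2)

# Sketch only: fetch the public data dictionary from a hypothetical
# API endpoint and parse the JSON response.
get_data_dictionary <- function(base_url = "https://example.org/api") {
  request(base_url) |>
    req_url_path_append("data-dictionary") |>
    req_headers(Accept = "application/json") |>
    req_perform() |>
    resp_body_json()
}
```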
Given the heterogeneity in the sources of data input, the backend will need to be composed of multiple components: raw data files as plain text, cleaning and processing scripts, a formal database structure (e.g., SQL), a version control (VC) system to track changes to the raw data and processing scripts, a data version numbering system, a changelog describing the changes, and a data dictionary linked to the variables contained in the database. Versioning the raw data and scripts supports recordkeeping, auditing, and transparency, and allows comparison of the data used in past and current projects.
A major challenge in building the backend is the heterogeneity of the data input. The key is to establish and enforce a standardized Common Data Model (CDM) for all incoming data at the point of entry. For the framework, the exact contents of the database are not important: as long as the contents follow the CDM, they can be programmatically merged into the final formal database. This flexibility is necessary because the database contents depend heavily on the research topic and aims of the study that will use this framework.
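A rough sketch of what enforcing a CDM at the point of entry could look like in R is shown below. The column names and types are hypothetical; a deployed infrastructure would define its own CDM specification.

```r
# Sketch only: a hypothetical CDM specification mapping column names
# to their required types.
cdm_spec <- list(
  person_id = "character",
  visit_date = "Date",
  hba1c_mmol_mol = "numeric"
)

# Reject incoming data that is missing CDM columns or has wrong types.
check_against_cdm <- function(data, spec = cdm_spec) {
  missing_cols <- setdiff(names(spec), names(data))
  if (length(missing_cols) > 0) {
    stop("Missing CDM columns: ", paste(missing_cols, collapse = ", "))
  }
  matches_type <- mapply(
    function(column, type) inherits(data[[column]], type),
    names(spec), spec
  )
  if (any(!matches_type)) {
    stop(
      "Columns not matching CDM types: ",
      paste(names(spec)[!matches_type], collapse = ", ")
    )
  }
  invisible(data)
}
```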
The backend documentation is either largely generated automatically or manually written. For instance, the list of projects and findings would be generated from the submitted projects and input from User 2 (researchers), while the changelog would be updated either by automated additions or, optionally, manually by User 4. The data dictionary would be stored as a JSON file, with the documentation text itself written as Markdown. This data dictionary would be publicly accessible and could be updated by anyone (with approval from User 4), potentially through a "Merge Request" mechanism, which automatically links any addition or correction back to the main documentation and requests that it be merged in.
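For illustration, a single entry of such a dictionary could look like the sketch below, built with the jsonlite package; the variable and field names are hypothetical.

```r
library(jsonlite)

# Sketch only: one hypothetical data dictionary entry, with the
# human-readable description stored as Markdown text.
dictionary_entry <- list(
  variable = "hba1c_mmol_mol",
  type = "numeric",
  units = "mmol/mol",
  description = "Glycated haemoglobin (HbA1c) measured at **enrollment**."
)

toJSON(dictionary_entry, auto_unbox = TRUE, pretty = TRUE)
```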
Depending on the source of data, there may already be established data input processes. Substantial amounts of biomedical data, especially in Denmark, come from already established, routinely collected clinical sources, such as outpatient clinics. For these sources, the data input pipeline would involve redirecting the data through the API and storage format so that it continues on to the backend.
Sources of data that do not have well-established data input processes, such as hospitals and medical laboratories, would use the data input portal. This portal would only accept data in a pre-defined format and would include documentation, and potentially automation scripts, on how to pre-process the data prior to uploading it.
Once the data is submitted through the portal, it would be sent in an encrypted, legally compliant format to the server and stored in the way defined by the API and CDM. Any new or updated data that is uploaded would trigger generic automated data cleaning, processing, and quality control checks. Any automated processing developed specifically for a project would need to adhere to the API's conventions. If any issues are found, or if the data is entirely new to the database, they are sent to a log and User 4 receives a notification to deal with the issue. If there are no issues, or once the issues have been dealt with, an automated script would take a snapshot of the data with the VC system, the version number of the data (based on Semantic Versioning) would be updated, an entry would be added to the changelog, and the formal database would be updated.
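The snapshot step could, for instance, build on the gert package (already used elsewhere in this repository). The sketch below assumes the data lives in a Git repository and that versions follow Semantic Versioning; the function and file names are illustrative.

```r
library(gert)

# Sketch only: record a changelog entry, commit the updated data, and
# tag the commit with the new data version.
snapshot_data <- function(repo, new_version, change_note) {
  # Append the change description to the changelog first, so it is
  # included in the snapshot itself.
  cat(
    sprintf("## %s\n\n%s\n\n", new_version, change_note),
    file = file.path(repo, "CHANGELOG.md"),
    append = TRUE
  )
  git_add(".", repo = repo)
  git_commit(sprintf("Update data to %s", new_version), repo = repo)
  git_tag_create(name = new_version, message = change_note, repo = repo)
}
```

Tagging each snapshot means any past project can state exactly which data version it used, which supports the auditing and comparison goals described above.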
Researchers and other users who want to request access to the data would first need their identity verified and would then be approved for authorized access. After approval, they would interact with the frontend through two routes:
When User 4 approves a data request project, this would trigger an API request that automatically extracts the requested subset of data, bundles and encrypts it, and sends it to the researcher's secure server. The framework will contain sufficiently generic methods for automating this data transfer process.
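As a sketch of what triggering such an extraction could look like through the API, the example below posts a hypothetical extraction request; the endpoint, project ID, and parameter names are all assumptions for illustration.

```r
library(httr2)

# Sketch only: ask the backend to extract an approved subset of data
# and send it to the researcher's secure server.
request("https://example.org/api") |>
  req_url_path_append("projects", "2021-042", "extract") |>
  req_body_json(list(
    variables = c("person_id", "visit_date", "hba1c_mmol_mol"),
    destination = "sftp://secure-server.example.org/incoming/"
  )) |>
  req_perform()
```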
+The framework assumes that this user would interact with the portal +through at least three routes:
These users would largely interact with the web portal to manage and oversee ongoing projects, approve access for new researchers, and approve projects requesting access to data. Approving a new researcher would grant them access permissions to the User 2 portal.
+The framework itself does not contain any personal data. When the +framework is deployed as an infrastructure for a database, only +aggregate statistics, and not individual-level personal data, would be +publicly accessible. Any personal data would be stored on a secure +server that is decided and controlled by User 4, who would be +responsible for complying with relevant legal requirements.
For transfers of personal data, whether from data collection centers, from data generated by researchers, or for approved projects, we would use well-established and compliant encrypted data transfer processes. Key authentication principles, such as two-factor authentication and OAuth (an open standard for access authentication), will be central to the framework to control who can update or transfer the data. The endpoint of the data transfer is dealt with by the legal teams of the relevant institutions.
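To give a flavour of how OAuth could be wired in from the R side, the sketch below uses the httr2 package's authorization code flow; the client ID and URLs are placeholders, not a real registration.

```r
library(httr2)

# Sketch only: authenticate an API call with the OAuth authorization
# code flow before performing it.
client <- oauth_client(
  id = "dd2-frontend",
  token_url = "https://example.org/oauth/token"
)

request("https://example.org/api/data") |>
  req_oauth_auth_code(client, auth_url = "https://example.org/oauth/authorize") |>
  req_perform()
```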
To align with the goals of openness, transparency, and the FAIR principles, the complete development of the framework will take place openly on GitHub. From there, we will link to and promote it through various outlets, including publications, conferences, and social media. The framework and all its components will be licensed under permissive copyright licenses, such as the MIT License for the software and the Creative Commons Attribution License for non-software content.
Integral to this framework is ensuring it is sustainable over the long term by:
Usage of this framework depends on the quality of its documentation and training material. A key concept we will use heavily is Documentation Driven Development, where the framework's development is guided and informed by the development of the documentation, making documentation a high priority. We will also create and run short workshops and tutorials that teach researchers how to use this framework.
+While the proposed framework is software-based, storing and deploying +the infrastructure requires server space and IT support. The framework +itself takes up little space and can optimise computational resources, +but the underlying DD2 data requires considerable server space.
+To have a meaningful impact on improving research infrastructure, the +minimum skills and knowledge necessary are:
The biggest potential challenge to applying the framework to DD2 is getting the database backend into the appropriate structure to fit within the framework. With the current state of the DD2 data, considerable time and effort are needed to organize it. Our initial steps will be to:
Currently, User 3 can request data by filling out a Word application form and emailing it to the chair of the advisory board, Kurt Højlund, and the programme leader, Jens Steen Nielsen. Applications are reviewed by the steering committee. Once approved, the data manager at the Department of Clinical Epidemiology (KEA) at Aarhus University Hospital manually extracts the requested data and transfers the data subset to the applicant's secure server, repeating this for each individual research project. If requested, KEA may also perform analyses on the data. Researchers must already have valid authorized access to secure servers, either an existing "forskermaskine" (Danish for "researcher machine") or, for large-scale data, an HPC facility such as Computerome 2 or GenomeDK.
The costs of storing the original data are covered by DD2, while applicants cover the costs related to storing the transferred data. We will not charge for data access. As per legal requirements, researchers can only use the data for the intended purposes listed in the application. After project completion, the researchers must delete or close access to the data and inform DD2, as legally required. Any newly generated data must be returned to DD2 by uploading it via the User 1 portal.
Because the framework will be built with modularity in mind, where each component can be used alone or together with the others, nearly all the components could be deliverables (each user crossed with each layer). Each deliverable would be to prototype a minimum viable product (MVP) in order to begin testing it, identifying bugs, getting feedback, and establishing maintenance procedures. See Figure 3 for the Gantt chart.
The framework will be developed at SDCA with Professor Annelli Sandbæk (applicant) as the lead PI, responsible for reaching the overall goals of the project and the defined milestones, together with two postdocs, Luke Johnston, MSc, PhD, and Alisa Kjærgaard, MD, PhD. A project group headed by the PI will be established that includes central persons from SDCA and DD2. The deliverables will be planned and carried out in close collaboration with the project manager of DD2 and the current data manager. Completing the proposed project requires hiring data engineering and research software engineering personnel, which will be the first step in the project process. The DD2 advisory group will also act as the advisory group for this project. This group is chaired by Kurt Højlund, MD, PhD, head of research at Steno Diabetes Center Odense, and contains representatives from affiliated research projects and other DD2 stakeholders.
We are at a key point in time within clinical research where it is increasingly being recognized that open source software and computational infrastructure are critical and necessary components of ensuring science is high-quality, reproducible, rigorous, and transparent. Funding agencies and research institutions globally are putting greater effort into modernizing many of their infrastructures using the many software technologies that have arisen in the last decade. By building this framework, we have the potential to help propel research groups and companies across Denmark (and globally) toward quickly adopting modern, scalable, and efficient approaches to working with data.
Within the DD2 setting, open, transparent, and easy access to this constantly growing resource has the potential to greatly improve the interest in, use of, and scientific impact of this resource. Incorporating new data generated from the DD2 resource back into DD2 will enable other researchers to test or use this data to advance their own work. This would lead to substantial scientific and medical advancements, which will ultimately lead to individualised treatment and improved human health in individuals with type 2 diabetes, and very likely the population overall.
+
If you see mistakes or want to suggest changes, please create an issue on the source repository.
+Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/steno-aarhus/dif-project, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
+