Source and cluster submission code for transdiagnostic patient-control classification across 17 ICD-10 diagnostic groups in the UK Biobank dataset.
This repository contains the code for, and records the directory tree structure of, the project published (in GigaScience; bioRxiv preprint) as:
- T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, “Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank,” bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.
The directories of this repository are roughly subdivided into three functional categories:
- data selection and pre-processing
- specification and deployment of classification models
- visualization and post-hoc statistical analysis of classification results
A detailed, directory-wise overview of each function category is given below.
The data generation pipeline splits into two main components: developing patient/control lists based on inclusion/matching criteria and generating neuroimaging features.
Note that no subject data of any kind is included in this public repository! Instead, the following directories contain the extraction/computation/processing code used to create the different types of feature sets within each ICD-10 diagnostic group. Wherever relevant, group-level analyses were computed anew for each group.
- gradient_data: code to compute diffusion-network based gradient representations from connectivity data at both the subject and group level
- ICA_data: sub-repository to execute the ICA dual-regression (ICA-DR) procedure within each diagnostic group; MELODIC group ICA and dual regression, both implemented in FSL, and additional code for extracting secondary features (FC network matrices, partial FC matrices, and amplitudes)
- PROFUMO_data: sub-repository for the computation of the PROFUMO parcellation for each diagnostic group, and additional code for extracting secondary features (FC network matrices, spatial correlation matrices, and amplitudes)
- Schaefer_data: code for extracting parcellation-level timeseries from data and computing FC network matrices, partial FC matrices, and amplitude features from Schaefer-parcellated data
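The secondary features named above (full FC network matrices, partial FC matrices, and amplitudes) can be sketched in NumPy as follows. This is an illustrative sketch only, not the pipeline's actual extraction code; the function name `secondary_features` is hypothetical, and the pipeline's own implementation may use FSL tooling instead:

```python
import numpy as np

def secondary_features(timeseries):
    """Compute illustrative secondary features from a parcellated timeseries.

    timeseries : array of shape (n_timepoints, n_parcels)
    Returns the full FC matrix, partial FC matrix, and per-parcel amplitudes.
    """
    # Full functional connectivity: Pearson correlation between parcels
    fc = np.corrcoef(timeseries, rowvar=False)

    # Partial correlations via the inverse covariance (precision) matrix
    prec = np.linalg.inv(np.cov(timeseries, rowvar=False))
    d = np.sqrt(np.diag(prec))
    partial = -prec / np.outer(d, d)
    np.fill_diagonal(partial, 1.0)

    # Amplitude: temporal standard deviation of each parcel's signal
    amplitude = timeseries.std(axis=0)
    return fc, partial, amplitude
```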
- T1_data: storage location for Freesurfer-extracted structural volume and cortical surface features from T1-weighted structural MRI scans
- sociodemographic: contains data cleaning and feature extraction code to pull sociodemographic features from the UK Biobank for all subjects.
Sub-repository containing lists of electronic ID numbers (eIDs) of patients in each diagnostic group, code for selecting corresponding matched healthy controls, and matched patient/control lists. Also contains some code and eID lists used to troubleshoot problems encountered in the UKB with incomplete or corrupted imaging, diagnostic, or sociodemographic data.
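A minimal sketch of what matched control selection can look like, assuming exact matching on sex and greedy nearest-neighbour matching on age without replacement; the function name, dictionary keys, and matching variables here are illustrative stand-ins, not the repository's actual matching criteria:

```python
def match_controls(patients, controls):
    """Greedy 1:1 matching: exact on sex, nearest-neighbour on age.

    patients, controls : lists of dicts with keys 'eid', 'age', 'sex'
    Returns a list of (patient_eid, control_eid) pairs.
    """
    available = list(controls)
    pairs = []
    for p in patients:
        # Restrict to same-sex controls that are still unmatched
        candidates = [c for c in available if c['sex'] == p['sex']]
        if not candidates:
            continue  # no eligible control left for this patient
        best = min(candidates, key=lambda c: abs(c['age'] - p['age']))
        pairs.append((p['eid'], best['eid']))
        available.remove(best)  # matching without replacement
    return pairs
```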
Subject classification code specifies, parameterizes, and propagates the classification model. This pipeline assumes that classification is deployed massively in parallel on a distributed system (e.g., a high-performance computing cluster) operating under a SLURM queue manager.
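One way to picture the parallel deployment is a small helper that builds one SLURM array job per diagnostic group and feature set. This is a hedged sketch: the resource values, log paths, and the `--group`/`--features` script flags are invented for illustration and do not reflect the repository's actual submission scripts:

```python
import subprocess

def build_sbatch_command(group, feature_set, n_folds=5,
                         script="classify_patients.py", submit=False):
    """Build (and optionally submit) one SLURM array job for a
    diagnostic group and feature set. All flag values are illustrative."""
    job_name = f"predict_{group}_{feature_set}"
    cmd = [
        "sbatch",
        f"--job-name={job_name}",
        f"--array=0-{n_folds - 1}",          # one array task per CV fold
        "--mem=16G", "--time=04:00:00",
        f"--output=logs/{job_name}_%a.out",
        "--wrap", f"python {script} --group {group} --features {feature_set}",
    ]
    if submit:
        subprocess.run(cmd, check=True)      # requires a SLURM cluster
    return cmd
```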
This directory contains the two most important pieces of code in the repository: `classify_patients.py` and `model_specification.py`. These Python mini-modules are the central workhorse of the classification project.
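In broad strokes, the per-group workhorse is a cross-validated binary classifier. The sketch below is purely illustrative (the actual models are defined in `model_specification.py`); the nearest-centroid classifier and the function name are stand-ins:

```python
import numpy as np

def classify_group(features, labels, n_folds=5, seed=0):
    """Illustrative cross-validated patient/control classification.

    features : (n_subjects, n_features) array
    labels   : binary int array (1 = patient, 0 = matched control)
    Returns mean accuracy across folds of a nearest-centroid classifier.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Class centroids from the training split only
        mu1 = features[train][labels[train] == 1].mean(axis=0)
        mu0 = features[train][labels[train] == 0].mean(axis=0)
        # Predict whichever class centroid is closer
        d1 = np.linalg.norm(features[test] - mu1, axis=1)
        d0 = np.linalg.norm(features[test] - mu0, axis=1)
        pred = (d1 < d0).astype(int)
        accs.append((pred == labels[test]).mean())
    return float(np.mean(accs))
```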
Adapts the binary prediction engine to a multiclass setting in `multiclass.py`.
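The one-vs-rest idea behind such an adaptation can be sketched as follows; this is an assumption about the general approach, not the contents of `multiclass.py`, and the function name is hypothetical:

```python
import numpy as np

def one_vs_rest_predict(score_fns, features):
    """Adapt per-class binary scoring functions to a multiclass setting.

    score_fns : dict mapping class label -> callable that returns a
                'this class vs. rest' score for each subject
    Each subject is assigned the class with the highest score.
    """
    labels = list(score_fns)
    scores = np.column_stack([score_fns[c](features) for c in labels])
    return [labels[i] for i in scores.argmax(axis=1)]
```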
Central hub for infrastructural bash scripts to assign, distribute, and organize the submission of compute jobs to the job manager. Split into three classes of jobs:
- cross-prediction: multiclassification jobs
- extraction: jobs extracting neuroimaging predictive features from raw scan data
- prediction: jobs classifying patients vs. controls within diagnostic groups
General-purpose bash code performing basic bookkeeping functions while navigating large collections of UK Biobank data on the compute cluster.
Both the `figure` (results visualization) and `stat_testing` (post-hoc statistical analysis of results) directories contain code referencing directories not present in this repository (i.e., `prediction_outputs` and `cross-prediction_outputs`), whose outputs would need to be created to test the replicability of our findings.
Visualization code producing summary swarm plots of prediction outputs under varying experimental conditions.
Code to compute statistical significance tests with family-wise error correction, and to statistically summarize the multiclass prediction's confusion matrix.
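As one concrete example of family-wise error correction over a family of group-level tests, the standard Holm-Bonferroni step-down procedure can be sketched as below; this illustrates the statistical idea, not the repository's specific testing code:

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction for family-wise error.

    Returns a boolean array: True where the null hypothesis is
    rejected after correction across the whole family of tests.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return reject
```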
As much as possible, we confined our architecture to commonly used and publicly available code and packages. However, because of the large-scale nature of the problem at hand, our code reflects our use of a high-performance computing cluster (managed, in our case, with SLURM).
The analyses in this repository have several dependencies; they are listed below according to their functional role.
Neuroimaging Data Pre-processing:
Classification and Processing:
Figures:
As stated above, the pipelines in this repository were designed for use on a high-performance cluster with SLURM job management.
This code is freely available and fully adaptable for individual customization. If you use our methods, please cite:
T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, “Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank,” bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.