Source and cluster submission code for transdiagnostic patient-control classification across 17 ICD-10 diagnostic groups in the UK Biobank dataset.
This repository contains the code for, and records the directory tree structure of, the project published (in GigaScience; bioRxiv preprint) as:
- T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, “Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank,” bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.
The directories of this repository are roughly subdivided into three functional categories:
- data selection and pre-processing
- specification and deployment of classification models
- visualization and post-hoc statistical analysis of classification results
A detailed, directory-wise overview of each function category is given below.
The data generation pipeline splits into two main components: developing patient/control lists based on inclusion/matching criteria and generating neuroimaging features.
Note that no subject data of any kind is included in this public repository! Instead, the following directories contain the extraction/computation/processing code used to create the different types of feature sets within each ICD-10 diagnostic group. Wherever relevant, group-level analyses were computed anew for each group.
- gradient_data: code to compute diffusion-network based gradient representations from connectivity data at both the subject and group level
- ICA_data: sub-repository to execute the ICA dual-regression (ICA-DR) procedure within each diagnostic group; MELODIC group ICA and dual regression, both implemented in FSL, and additional code for extracting secondary features (FC network matrices, partial FC matrices, and amplitudes)
- PROFUMO_data: sub-repository for the computation of the PROFUMO parcellation for each diagnostic group, and additional code for extracting secondary features (FC network matrices, spatial correlation matrices, and amplitudes)
- Schaefer_data: code for extracting parcellation-level timeseries from data and computing FC network matrices, partial FC matrices, and amplitude features from Schaefer-parcellated data
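The secondary features named above (full FC network matrices, partial FC matrices, and amplitudes) can be sketched in NumPy as follows. This is an illustrative sketch only, not the pipeline's actual extraction code; the function name `secondary_features` is hypothetical, and the pipeline's own implementation may use FSL tooling instead:

```python
import numpy as np

def secondary_features(timeseries):
    """Compute illustrative secondary features from a parcellated timeseries.

    timeseries : array of shape (n_timepoints, n_parcels)
    Returns the full FC matrix, partial FC matrix, and per-parcel amplitudes.
    """
    # Full functional connectivity: Pearson correlation between parcels
    fc = np.corrcoef(timeseries, rowvar=False)

    # Partial correlations via the inverse covariance (precision) matrix
    prec = np.linalg.inv(np.cov(timeseries, rowvar=False))
    d = np.sqrt(np.diag(prec))
    partial = -prec / np.outer(d, d)
    np.fill_diagonal(partial, 1.0)

    # Amplitude: temporal standard deviation of each parcel's signal
    amplitude = timeseries.std(axis=0)
    return fc, partial, amplitude
```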
- T1_data: storage location for Freesurfer-extracted structural volume and cortical surface features from T1-weighted structural MRI scans
- sociodemographic: contains data cleaning and feature extraction code to pull sociodemographic features from the UK Biobank for all subjects.
Sub-repository containing lists of electronic ID numbers (eIDs) of patients in each diagnostic group, code for selecting corresponding matched healthy controls, and matched patient/control lists. Also contains some code and eID lists used to troubleshoot problems encountered in the UKB with incomplete or corrupted imaging, diagnostic, or sociodemographic data.
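A minimal sketch of what matched control selection can look like, assuming exact matching on sex and greedy nearest-neighbour matching on age without replacement; the function name, dictionary keys, and matching variables here are illustrative stand-ins, not the repository's actual matching criteria:

```python
def match_controls(patients, controls):
    """Greedy 1:1 matching: exact on sex, nearest-neighbour on age.

    patients, controls : lists of dicts with keys 'eid', 'age', 'sex'
    Returns a list of (patient_eid, control_eid) pairs.
    """
    available = list(controls)
    pairs = []
    for p in patients:
        # Restrict to same-sex controls that are still unmatched
        candidates = [c for c in available if c['sex'] == p['sex']]
        if not candidates:
            continue  # no eligible control left for this patient
        best = min(candidates, key=lambda c: abs(c['age'] - p['age']))
        pairs.append((p['eid'], best['eid']))
        available.remove(best)  # matching without replacement
    return pairs
```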
Subject classification code specifies, parameterizes, and propagates the classification model. This pipeline assumes that classification is deployed massively in parallel on a distributed system (e.g., a high-performance computing cluster) operating under a SLURM queue manager.
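One way to picture the parallel deployment is a small helper that builds one SLURM array job per diagnostic group and feature set. This is a hedged sketch: the resource values, log paths, and the `--group`/`--features` script flags are invented for illustration and do not reflect the repository's actual submission scripts:

```python
import subprocess

def build_sbatch_command(group, feature_set, n_folds=5,
                         script="classify_patients.py", submit=False):
    """Build (and optionally submit) one SLURM array job for a
    diagnostic group and feature set. All flag values are illustrative."""
    job_name = f"predict_{group}_{feature_set}"
    cmd = [
        "sbatch",
        f"--job-name={job_name}",
        f"--array=0-{n_folds - 1}",          # one array task per CV fold
        "--mem=16G", "--time=04:00:00",
        f"--output=logs/{job_name}_%a.out",
        "--wrap", f"python {script} --group {group} --features {feature_set}",
    ]
    if submit:
        subprocess.run(cmd, check=True)      # requires a SLURM cluster
    return cmd
```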
This directory contains the two most important pieces of code in the repository: `classify_patients.py` and `model_specification.py`. These Python mini-modules are the central workhorse of the classification project.
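In broad strokes, the per-group workhorse is a cross-validated binary classifier. The sketch below is purely illustrative (the actual models are defined in `model_specification.py`); the nearest-centroid classifier and the function name are stand-ins:

```python
import numpy as np

def classify_group(features, labels, n_folds=5, seed=0):
    """Illustrative cross-validated patient/control classification.

    features : (n_subjects, n_features) array
    labels   : binary int array (1 = patient, 0 = matched control)
    Returns mean accuracy across folds of a nearest-centroid classifier.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Class centroids from the training split only
        mu1 = features[train][labels[train] == 1].mean(axis=0)
        mu0 = features[train][labels[train] == 0].mean(axis=0)
        # Predict whichever class centroid is closer
        d1 = np.linalg.norm(features[test] - mu1, axis=1)
        d0 = np.linalg.norm(features[test] - mu0, axis=1)
        pred = (d1 < d0).astype(int)
        accs.append((pred == labels[test]).mean())
    return float(np.mean(accs))
```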
Adapts the binary prediction engine to a multiclass setting in `multiclass.py`.
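The one-vs-rest idea behind such an adaptation can be sketched as follows; this is an assumption about the general approach, not the contents of `multiclass.py`, and the function name is hypothetical:

```python
import numpy as np

def one_vs_rest_predict(score_fns, features):
    """Adapt per-class binary scoring functions to a multiclass setting.

    score_fns : dict mapping class label -> callable that returns a
                'this class vs. rest' score for each subject
    Each subject is assigned the class with the highest score.
    """
    labels = list(score_fns)
    scores = np.column_stack([score_fns[c](features) for c in labels])
    return [labels[i] for i in scores.argmax(axis=1)]
```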
Central hub for infrastructural bash scripts to assign, distribute, and organize the submission of compute jobs to the job manager. Split into three classes of jobs:
- cross-prediction: multiclassification jobs
- extraction: jobs extracting neuroimaging predictive features from raw scan data
- prediction: jobs classifying patients vs. controls within diagnostic groups
General-purpose bash code performing basic bookkeeping functions while navigating large collections of UK Biobank data on the compute cluster.
Both the `figure` (results visualization) and `stat_testing` (post-hoc statistical analysis of results) directories contain code referencing directories not present in this repository (i.e., `prediction_outputs` and `cross-prediction_outputs`), whose outputs would need to be created to test the replicability of our findings.
Visualization code producing summary swarm plots of prediction outputs under varying experimental conditions.
Code to compute statistical significance tests with family-wise error correction, and to statistically summarize the multiclass prediction's confusion matrix.
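As one concrete example of family-wise error correction over a family of group-level tests, the standard Holm-Bonferroni step-down procedure can be sketched as below; this illustrates the statistical idea, not the repository's specific testing code:

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction for family-wise error.

    Returns a boolean array: True where the null hypothesis is
    rejected after correction across the whole family of tests.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return reject
```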
As much as possible, we confined our architecture to commonly used and publicly available code and packages. However, because of the large-scale nature of the problem at hand, our code reflects our use of a high-performance computing cluster (managed, in our case, with SLURM).
The analyses in this repository have several dependencies; they are listed below according to their functional role.
Neuroimaging Data Pre-processing:
Classification and Processing:
Figures:
As stated above, the pipelines in this repository were designed for use on a high-performance cluster with SLURM job management.
This code is freely available and fully adaptable for individual customization. If you use our methods, please cite:
T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, “Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank,” bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.