Skip to content
/ WAPIAW3 Public

Public repository for the Personomics Lab's third annual "write a paper in a week" festival!

License

Notifications You must be signed in to change notification settings

tyo8/WAPIAW3

Repository files navigation

Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank

Source and cluster submission code for transdiagnostic patient-control classification in 17 ICD-10 diagnostic groups in the UK BioBank dataset.

Rough Overview

This repository contains code for and records the directory tree structure of the project published (in GigaScience; bioarXiv preprint) as

  • T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, “Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank,” bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.

The diretories of this repository are roughly subdivided into three categories of function:

  1. data selection and pre-processing
  2. specification and deployment of classification models
  3. visualization and post-hoc statistical analysis of classification results

Structure of the repository

The directory-wise contents of each function category are a given detailed overview below.

Data Selection and Pre-processing Pipeline

The data generation pipeline splits into two main components: developing patient/control lists based on inclusion/matching criteria and generating neuroimaging features.

brainrep_data

Note that no subject data of any kind is included in this public repository! Instead, the following directories contain the extraction/computation/processing code used to create differentiate types of feature sets within each ICD-10 diagnostic group. Wherever relevant, group-level analyses were computed anew for each group.

  • gradient_data: code to compute diffusion-network based gradient representations from connectivity data at both the subject and group level
  • ICA_data: sub-repository to execute ICA dual-regression (ICA-DR) procedure within each diagnostic group; melodic group regression and ICA dual regression, both implemented in FSL, and addtional code for extracting secondary features (FC network matrices, partial FC matrices, and amplitudes)
  • PROFUMO_data: sub-repository for the computation of the PROFUMO parcellation for each diagnostic group, and additional code for extracting secondary features (FC network matrices, spatial correlation matrices, and amplitudes)
  • Schaefer_data: code for extracting parcellation-level timeseries from data and computing FC network matrices, partial FC matrices, and amplitude features from Schaefer-parcellated data
  • T1_data: storage location for Freesurfer-extracted structural volume and cortical surface features from T1-weighted structural MRI scans
  • sociodemographic: contains data cleaning and feature extraction code to pull sociodemographic features from the UKB biobank for all subjects.

subject_lists

Repository containing lists of electronic ID numbers (eIDs) of patients in each diagnostic group, code for selecting corresponding matched healthy controls, and matched patient/control lists. Also contains some code and eID lists used to troubleshoot problems encountered in the UKB with incomplete or corrupted imaging, diagnostic, or sociodemographic data.

Subject Classification Pipeline

Subject classification code specifies, parameterizes, and propogates the classification model. This pipeline assumes that classification is deployed massively in parallel on a distributed system (e.g., high-performance computing cluster) operating under a SLURM queue manager.

classification_model

This directory contains the two most important pieces of code in the repository: classify_patients.py and model_specification.py. These python mini-modules are the central workhorse of the classification project.

multiclass

Adapts the binary prediction engine to a multiclass setting in multiclass.py.

job_submission_portal

Central hub for infrastructural bash scripts to assign, distribute, and organize the submission of compute jobs to the job manager. Split into three classes of jobs:

  • cross-prediction: multiclassification jobs
  • extraction: jobs extracting neuorimaging predictive features from raw scan data
  • prediction: jobs classifying patients vs. controls within diagnostic groups

utils

General-purpose bash code to perform basic bookkeeping functions while navigating large collections of UK BioBank data on the compute cluster.

Statistical Testing and Visualization Pipeline

Both the figure (results visualization) and stat_testing (post-hoc statistical analysis of results) directories contain code referencing directories not present in this repository (i.e., prediction_outputs and cross-prediction_outputs), whose outputs would need to be created to test the replicability of our findings.

figures

Visualization code producing summary swarm plots of prediction outputs under varying experimental conditions.

stat_testing

Code to compute statsitcal signficance testing with family-wise error correction and statistical summarization of the multiclass prediction's confusion matrix.

Preparing the Computing Environment

As much as possible, we confined our architecture to commonly used and publicly available code and packages. However, because of the large-scale nature of the problem at hand, our code reflects our use of a high-performance computing cluster (managed, in our case, with SLURM).

Software Dependecies

The analyses in this have several dependencies; they are listed below according to their functional role.

Neuroimaging Data Pre-processing:

Classification and Processing:

Figures:

Queue Management and Cluster Computing

As stated elsewhere above, the pipelines in this repository were designed for use on a high-performance cluster with SLURM job management.

Academic use

This code is available and is fully adaptable for individual user customization. If you use the our methods, please cite as the following:

T. Easley, X. Luo, K. Hannon, P. Lenzini, and J. Bijsterbosch, “Opaque Ontology: Neuroimaging Classification of ICD-10 Diagnostic Groups in the UK Biobank,” bioRxiv, p. 2024.04.15.589555, Apr. 2024, doi: 10.1101/2024.04.15.589555.

About

Public repository for the Personomics Lab's third annual "write a paper in a week" festival!

Resources

License

Stars

Watchers

Forks

Packages

No packages published