Multimodal head and neck cancer dataset

This repository contains code for exploring the multimodal head and neck cancer dataset HANCOCK and for training Machine Learning models to predict outcomes and treatment choices. We also provide strategies for multimodal data handling, feature extraction, and generating train/test dataset splits.

Setup

To set up the environment, first clone this repository to your local machine and create a directory for storing the results:

git clone https://github.com/ankilab/HANCOCK_MultimodalDataset.git
cd HANCOCK_MultimodalDataset
mkdir results

Next, set up an Anaconda environment and install the required Python packages:

conda create -n hancock_multimodal python=3.12
conda activate hancock_multimodal
pip install -r requirements.txt

Our code was tested on Ubuntu 24.04 with an NVIDIA RTX 4060, except for the additional section Adjuvant treatment prediction using histology images, which was tested on Windows. The code described in that section uses TensorFlow 2.16 (see requirements.txt). Furthermore, QuPath needs to be installed for the analysis of histology data.

Dataset

The dataset can be explored and downloaded at our project website: https://hancock.research.fau.eu/

The dataset is distributed as ZIP archives. Once all archives are downloaded and unzipped, the dataset is structured as follows:

Hancock_Dataset
|
├── StructuredData
|   ├── blood_data.json
|   ├── blood_data_reference_ranges.json
|   ├── clinical_data.json
|   └── pathological_data.json
|
├── TextData
|   ├── histories
|   ├── histories_english
|   ├── icd_codes
|   ├── ops_codes
|   ├── reports
|   ├── reports_english
|   ├── surgery_descriptions
|   └── surgery_descriptions_english
|
├── DataSplits_DataDictionaries
|   ├── DataDictionary_blood.csv
|   ├── DataDictionary_clinical.csv
|   ├── DataDictionary_pathological.csv
|   ├── dataset_split_in.json
|   ├── dataset_split_out.json
|   ├── dataset_split_Oropharynx.json
|   └── dataset_split_treatment_outcome.json
|
├── TMA_CellDensityMeasurements
|   └── TMA_celldensity_measurements.csv
|
├── TMA_InvasionFront
├── TMA_TumorCenter
├── TMA_Maps
├── WSI_LymphNode
├── WSI_PrimaryTumor_Annotations
|
└── WSI_PrimaryTumor
    └── WSI_PrimaryTumor_[Site]

However, it is sufficient to download the following folders for reproducing most results from our paper: StructuredData, TextData, DataSplits_DataDictionaries, TMA_CellDensityMeasurements.
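As a rough illustration of how the structured JSON files might be read once StructuredData is unpacked, the sketch below loads one file and groups its records by patient. It assumes that each file holds a JSON array of records and that the identifier key is called patient_id. Both are assumptions, so check the data dictionaries in DataSplits_DataDictionaries for the actual field names. The helper name load_records is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_records(json_path, id_key="patient_id"):
    """Read a JSON array of records and group the records by patient.

    Assumption: the file contains a list of dicts, each carrying `id_key`.
    Grouping into lists keeps files with several rows per patient intact.
    """
    with Path(json_path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    grouped = defaultdict(list)
    for record in records:
        grouped[str(record[id_key])].append(record)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical location; adjust it to where you unpacked the dataset.
    clinical_file = Path("../Hancock_Dataset/StructuredData/clinical_data.json")
    if clinical_file.exists():
        clinical = load_records(clinical_file)
        print(f"Loaded records for {len(clinical)} patients")
```

Grouping into lists rather than single records is a deliberate choice here, since some tables (for example blood values) may contain more than one row per patient.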

If you only want to reproduce the final outcome predictions, without repeating the feature extraction and data splitting, it is enough to use the features in the features directory of this repository together with the DataSplits_DataDictionaries folder downloaded from the dataset.

Disclaimer

We expect the user to follow the structure presented in the Dataset section and to place this repository in the same parent directory as the Hancock_Dataset directory:

Parent_Directory
├── Hancock_Dataset
|   ├── ...
|
├── HANCOCK_MultimodalDataset
|   ├── ...

This makes it easier to run the scripts without specifying the paths to the data directory. If you do not want to follow this structure, you can either change the default paths in the file ./defaults/__init__.py or set them manually each time the scripts are called. Note that you do not have to set every argument, just the ones that differ from the recommended structure.

To check which options are available, you can run e.g.

python3 ./data_exploration/plot_available_data.py --help

The following assumes that the recommended structure is used.
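The idea of default paths that can be overridden per call can be sketched as follows. This is not the content of ./defaults/__init__.py: the function name build_parser, the argument names --data_dir and --results_dir, and the use of argparse are all assumptions made for illustration. Run a script with --help to see its real options.

```python
import argparse
from pathlib import Path


def build_parser(repo_dir):
    """Create a parser whose defaults follow the recommended directory layout."""
    repo_dir = Path(repo_dir)
    # The dataset is expected next to the repository, in the same parent directory.
    dataset_dir = repo_dir.parent / "Hancock_Dataset"
    parser = argparse.ArgumentParser(description="Path handling sketch")
    # Argument names are hypothetical, not the repository's actual options.
    parser.add_argument("--data_dir", type=Path,
                        default=dataset_dir / "StructuredData")
    parser.add_argument("--results_dir", type=Path,
                        default=repo_dir / "results")
    return parser


# Only the argument that differs from the recommended layout is passed.
example_parser = build_parser("/work/Parent_Directory/HANCOCK_MultimodalDataset")
example_args = example_parser.parse_args(["--results_dir", "/tmp/my_results"])
print(example_args.data_dir)
print(example_args.results_dir)
```

Deriving every default from one base directory means that a user with the recommended layout passes no arguments at all, while anyone else overrides only the paths that differ.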

Data exploration

We provide a Jupyter notebook exploring_tabular_data.ipynb for visualizing the structured (clinical, pathological, and blood) data. The Jupyter notebook survival_analysis.ipynb can be used to reproduce Kaplan-Meier curves. In both notebooks, you might need to adjust the path data_dir, which should point to the directory that contains the structured data (JSON files).
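For readers unfamiliar with Kaplan-Meier curves: at each observed event time, the estimator multiplies the running survival probability by the fraction of at-risk patients who survive that time point. The sketch below is a minimal pure-Python version for illustration only. It is not the code in survival_analysis.ipynb, and the function name and input format are assumptions.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient.
    events: 1 if the event was observed, 0 if the patient was censored.
    """
    observations = sorted(zip(durations, events))
    at_risk = len(observations)
    survival = 1.0
    curve = []
    index = 0
    while index < len(observations):
        time = observations[index][0]
        deaths = 0
        leaving = 0
        # Group all patients that share this time point.
        while index < len(observations) and observations[index][0] == time:
            deaths += observations[index][1]
            leaving += 1
            index += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((time, survival))
        # Patients with an event and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


print(kaplan_meier([5, 5, 8, 10], [1, 0, 1, 0]))  # [(5, 0.75), (8, 0.375)]
```

Censored patients do not lower the curve themselves, but they shrink the at-risk group, which is why the second event above halves the survival probability.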

To visualize which modalities are available for how many out of the 763 patients, run the following script:

cd data_exploration
python3 plot_available_data.py

Multimodal feature extraction

This step is optional, as we already provide the extracted features in the features directory.

To better understand the multimodal data, we extracted features from different modalities and concatenated them to vectors, termed multimodal patient vectors. These vectors were used for the following:

For visualizing the data in 2D
For generating train/test data splits
For training Machine Learning models

Features are extracted from demographical, pathological, and blood data (structured data), ICD codes (text data), and intratumoral density of CD3- and CD8-positive cells that was computed from TMAs (image data).
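Concatenating per-modality features into one vector per patient can be sketched as below. This is a simplified illustration, not the logic of create_multimodal_patient_vectors.py: the function name, the input format, and the choice to fill missing modalities with a constant are assumptions, and every patient within one modality is assumed to have the same number of features.

```python
def build_patient_vectors(modalities, fill_value=0.0):
    """Concatenate per-modality features into one vector per patient.

    modalities: dict mapping a modality name to {patient_id: [feature values]}.
    Patients missing a modality receive `fill_value` for its features
    (an assumption made for this sketch).
    """
    patient_ids = sorted({pid for table in modalities.values() for pid in table})
    vectors = {pid: [] for pid in patient_ids}
    for name in sorted(modalities):  # fixed modality order keeps columns aligned
        table = modalities[name]
        width = max((len(values) for values in table.values()), default=0)
        for pid in patient_ids:
            values = table.get(pid, [fill_value] * width)
            vectors[pid].extend(values)
    return vectors


example = build_patient_vectors({
    "blood": {"p1": [1.0, 2.0], "p2": [3.0, 4.0]},
    "tma": {"p1": [0.5]},
})
print(example)  # {'p1': [1.0, 2.0, 0.5], 'p2': [3.0, 4.0, 0.0]}
```

Iterating over the modalities in a fixed order ensures that the same column always refers to the same feature, which matters once the vectors are stacked into a matrix.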

Run create_multimodal_patient_vectors.py to extract features and create multimodal patient vectors:

cd feature_extraction

python3 create_multimodal_patient_vectors.py

After running this script, a 2D representation of the multimodal patient vectors can be visualized using the Jupyter notebook umap_visualization.ipynb in the data_exploration folder.

Generating data splits

Performing this step is optional, as we provide the data splits on the download page of the HANCOCK dataset.

We implemented a genetic algorithm to find different data splits, where the data is split into a training and a test set. You can directly use the data splits provided in our dataset, in "DataSplits_DataDictionaries".
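To give a rough idea of how a genetic algorithm can search for a data split, the toy sketch below evolves candidate test sets through selection, crossover, and mutation. The fitness used here, matching the test-set mean of a single feature to the overall mean, is an assumption standing in for an "in-distribution" criterion. The actual objective and operators of genetic_algorithm.py are not reproduced, and all names are hypothetical.

```python
import random


def genetic_split(values, test_size, generations=200, population_size=30, seed=0):
    """Search for test indices whose mean matches the mean of all values."""
    rng = random.Random(seed)
    n = len(values)
    overall_mean = sum(values) / n

    def cost(test_indices):
        test_mean = sum(values[i] for i in test_indices) / test_size
        return abs(test_mean - overall_mean)

    def crossover(parent_a, parent_b):
        # Draw the child's test set from the union of both parents.
        pool = sorted(parent_a | parent_b)
        return frozenset(rng.sample(pool, test_size))

    def mutate(test_indices):
        # Swap one test case for one case that is currently outside the test set.
        outside = [i for i in range(n) if i not in test_indices]
        child = set(test_indices)
        child.remove(rng.choice(sorted(child)))
        child.add(rng.choice(outside))
        return frozenset(child)

    population = [frozenset(rng.sample(range(n), test_size))
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: population_size // 2]  # selection (elitist)
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors)))
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children

    best = min(population, key=cost)
    test = sorted(best)
    train = [i for i in range(n) if i not in best]
    return train, test


train_idx, test_idx = genetic_split([float(i) for i in range(20)], test_size=5)
print(len(train_idx), len(test_idx))  # 15 5
```

An out-of-distribution variant could, under the same assumptions, simply maximize this cost instead of minimizing it.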

Alternatively, if you would like to reproduce these splits, use the code in the folder data_splitting:

Run genetic_algorithm.py with --in or --out to generate a split where the test dataset contains in-distribution or out-of-distribution data, respectively.
Run split_by_tumor_site.py to generate a dataset split by primary tumor site. All cases with the specified site are assigned to the test dataset and the remaining cases are assigned to the training dataset.
Run split_by_treatment_outcome.py to assign cases to the test dataset where no adjuvant treatment was used but an event occurred, including recurrence, metastasis, progress, or death. The remaining cases are assigned to the training dataset.

cd data_splitting

python3 genetic_algorithm.py ../features ../results in_distribution_test_dataset --in
python3 genetic_algorithm.py ../features ../results out_of_distribution_test_dataset --out
python3 split_by_tumor_site.py path/to/Hancock_Dataset/StructuredData ../results -s Oropharynx
python3 split_by_treatment_outcome.py ./../../Hancock_Dataset/StructuredData ../results
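The two rule-based splits described above can be sketched in a few lines. The field names (patient_id, primary_tumor_site, adjuvant_treatment, and the event flags) are hypothetical placeholders. The actual scripts read the structured JSON data and may use different keys and encodings.

```python
def split_by_tumor_site(cases, site, site_key="primary_tumor_site"):
    """Assign all cases with the given site to test, the rest to training."""
    test = [case["patient_id"] for case in cases if case[site_key] == site]
    train = [case["patient_id"] for case in cases if case[site_key] != site]
    return train, test


def split_by_treatment_outcome(cases):
    """Test set: no adjuvant treatment was used, yet an event occurred."""
    event_keys = ("recurrence", "metastasis", "progress", "death")  # assumed flags
    train, test = [], []
    for case in cases:
        had_event = any(case.get(key, False) for key in event_keys)
        if not case["adjuvant_treatment"] and had_event:
            test.append(case["patient_id"])
        else:
            train.append(case["patient_id"])
    return train, test


example_cases = [
    {"patient_id": "001", "primary_tumor_site": "Oropharynx",
     "adjuvant_treatment": False, "recurrence": True},
    {"patient_id": "002", "primary_tumor_site": "Larynx",
     "adjuvant_treatment": True, "recurrence": True},
    {"patient_id": "003", "primary_tumor_site": "Larynx",
     "adjuvant_treatment": False},
]
print(split_by_tumor_site(example_cases, "Oropharynx"))  # (['002', '003'], ['001'])
print(split_by_treatment_outcome(example_cases))         # (['002', '003'], ['001'])
```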

Outcome prediction

Run outcome_prediction.py to reproduce results of training a Machine Learning classifier on the multimodal patient vectors to predict recurrence and survival status. The classifier is trained five times on the different data splits. Plots of the data splits (2D representation) and of Receiver-Operating Characteristic (ROC) curves are saved to the results directory.

cd multimodal_machine_learning/execution
python3 outcome_prediction.py ./../../../Hancock_Dataset/DataSplits_DataDictionaries ../../features ../../results recurrence
python3 outcome_prediction.py ./../../../Hancock_Dataset/DataSplits_DataDictionaries ../../features ../../results survival_status
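The ROC curves saved by the script are usually summarized by the area under the curve (AUC). As a reminder of what that number means, the following pure-Python sketch computes the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It is an illustration only, not the evaluation code of outcome_prediction.py, and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for cases with the event (e.g. recurrence), 0 otherwise.
    scores: predicted probability or score per case.
    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute the AUC.")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a few hundred patients. Library implementations use a sort-based approach instead.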

Additional Information Tissue Micro Arrays (TMAs)

Note

This section describes the optional process of using the TMA files and the TMA maps provided in the HANCOCK dataset to extract TMA cores and perform immune cell counting.

We used the open-source histology software QuPath for analyzing TMAs. The folder qupath_scripts contains code that can be executed in QuPath's script editor. You can run the following scripts for all TMAs at once by selecting Run>Run for project. For more information about scripting in QuPath, check the documentation.

Step 1: Creating QuPath projects

Create one empty directory for each immunohistochemistry marker, named "TMA_CD3", "TMA_CD8", "TMA_CD56", and so on:

QuPathProjectsDirectory
├── TMA_CD3
├── TMA_CD8
├── TMA_CD56
├── TMA_CD68
├── TMA_CD163
├── TMA_HE
├── TMA_MHC1
└── TMA_PDL1

Next, create a QuPath project from each of these folders and import the corresponding TMAs. For example, click Create Project, select the directory "TMA_CD3" and import all SVS files from the folder "Hancock_Dataset/TMA_TumorCenter/CD3".

Important

Set "Rotate Image" to 180 degrees for all TMAs in QuPath's import dialog.

Step 2: TMA dearraying

Next, open import_tma_map.groovy in the script editor and click Run for project to import the TMA maps. When prompted, select the folder "TMA_Maps" provided in our dataset. The patient ID can then be found as "Case ID" in QuPath.

Note

Each TMA contains tissue cores of several patients. In QuPath, TMA maps must be imported to assign patient IDs to tissue cores. You can run our scripts (see qupath_scripts) which automate the required steps for dearraying and importing TMA maps.

However, if you would like to manually analyze a TMA, you can perform the following steps:

With a TMA opened in QuPath, run the TMA dearrayer (TMA>TMA dearrayer) with a core diameter of 1.9 mm, column labels 1-6 and row labels 1-12
A grid has been created and might need manual adjustments
A TMA map e.g. "TMA_Map_block1.csv" contains the core coordinates of the 12 x 6 grid and patient IDs (Case IDs)
Click File>TMA data>Import TMA map>Import data and select the TMA map
Each TMA core is now associated with the correct patient ID

For learning more about TMA dearraying, we recommend reading this guide.

Step 3: Extracting tiles

Next, run export_centertiles_from_tma_cores.groovy to extract one tile from the center of each TMA core. The images are saved as PNG files to the directory path/to/your_qupath_project/tiles. Each tile's filename is built as follows: <patient_id>_core<core_index>_tile.png
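When working with the exported tiles in Python, the patient ID and core index can be recovered from the filename pattern given above. The sketch below is a minimal example. It assumes that the core index is an integer and keeps the patient ID as a string, and the function name is hypothetical.

```python
import re

# Pattern for '<patient_id>_core<core_index>_tile.png' (core index assumed numeric).
TILE_PATTERN = re.compile(r"^(?P<patient_id>.+)_core(?P<core_index>\d+)_tile\.png$")


def parse_tile_name(filename):
    """Split a tile filename into (patient_id, core_index)."""
    match = TILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected tile filename: {filename}")
    return match.group("patient_id"), int(match.group("core_index"))


print(parse_tile_name("123_core2_tile.png"))  # ('123', 2)
```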

Step 4: Extracting image features

To extract features from images (TMA core tiles), run extract_tma_image_features.py. We use deeptexture for feature extraction. This package requires Python version <= 3.8.15. Therefore, we recommend running this script in a separate environment.

conda create -n deeptexture_env python=3.8.15
conda activate deeptexture_env
pip install deeptexture tqdm opencv-python

cd feature_extraction
python extract_tma_image_features.py path/to/QuPathProjectsDirectory ../features

Optional: Counting immune cells

The intratumoral density of CD3- and CD8-positive cells was already computed and is provided in the dataset folder "TMA_CellDensityMeasurements". However, if you would like to reproduce these measurements, you can perform the following steps:

Copy the pixel classifier from this repository to your project:

cd path/to/your_qupath_project/classifiers
mkdir pixel_classifiers
cp path/to/HANCOCK_MultimodalDataset/qupath_scripts/tissueDetection.json pixel_classifiers

Next, run detect_tissue_in_tma_cores.groovy.

To improve the subsequent counting of positive cells, you can manually remove possible artifacts from the resulting detection objects, for example using the brush tool while holding down the ALT key. However, this is optional.

Run tma_measure_positive_cells.groovy. This script first makes sure that the grid labels and object hierarchy are correct, in case the objects were manually adjusted (e.g. for artifact removal). Next, it runs QuPath's plugin for positive cell detection and imports TMA maps to match cores to patient IDs. Finally, the cell counts and other measurements are exported as CSV files to the directory path/to/your_qupath_project/tma_measurements.

Hint: The QuPath script will prompt you to select the directory containing the TMA maps. To avoid the prompt showing for every single TMA, you can set the variable tma_map_dir in the script.

Next, run summarize_tma_measurements.py to create a single file by merging all TMA measurement files exported in the previous step.
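Merging several measurement files into one table can be sketched as follows. This is not the implementation of summarize_tma_measurements.py. The function name, the added source_file column, and the assumption of comma-separated files with a header row are all made up for illustration.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir, output_file):
    """Concatenate all CSV files in input_dir into one file.

    A `source_file` column records which file each row came from.
    Returns the number of merged rows.
    """
    rows = []
    fieldnames = ["source_file"]
    for csv_path in sorted(Path(input_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = csv_path.name
                rows.append(row)
                # Collect the union of column names across all files.
                for name in row:
                    if name not in fieldnames:
                        fieldnames.append(name)
    with Path(output_file).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Write the merged file to a directory other than input_dir. Otherwise a second run would pick up the previous output as one of its inputs.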

Reference

Dörrich, Marion, et al. "A multimodal dataset for precision oncology in head and neck cancer." medRxiv (2024): 2024-05. doi: https://doi.org/10.1101/2024.05.29.24308141

@article{doerrich2024multimodal,
  title={A multimodal dataset for precision oncology in head and neck cancer},
  author={D{\"o}rrich, Marion and Balk, Matthias and Heusinger, Tatjana and Beyer, Sandra and Kanso, Hassan and Matek, Christian and Hartmann, Arndt and Iro, Heinrich and Eckstein, Markus and Gostian, Antoniu-Oreste and others},
  journal={medRxiv},
  pages={2024--05},
  year={2024},
  publisher={Cold Spring Harbor Laboratory Press}
}
