Pseudocode and Explanation for CancerMine

This file gives an overview for the following core files. Most of the functionality is managed by the Kindred Python package which is described in the documentation and associated paper.

buildModels.py : Use training data to build Kindred relation classifier models
wordlistLoader.py : Preprepare parsed wordlists using gene names, cancer types and more
parseAndFindEntities.py : Parse documents and find sentences that mention cancer types and gene names with additional filtering
applyModelsToSentences.py : Apply Kindred relation classifiers to find mentions of drivers, oncogenes and tumor suppressors
filterAndCollate.py : Filter for mentions with higher certainty and collate them for counts

buildModels.py

Take in wordlists for genes, cancers, drugs (to identify ambigiuity) and conflicting terms
Get Kindred to parse them and prepare a data structure ready for matching
Save it to a file as a Python pickle

Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py
Read in a BioC corpus file (of abstracts or articles) in chunks:
- Filter the corpus by removing documents that don't contain keywords (in filterTerms.txt)
- Parse the documents
- Annotate them with the EntityRecognizer (for cancer types, genes, etc)
- For each sentence in each document
  - Ignore if it doesn't contain any of the keywords (in filterTerms.txt)
  - Check if a gene and cancer are mentioned in the sentence and add to output with metadata if so
Dump all matching sentences to output JSON with metadata of the source of the sentence

Load all the models created by buildModels.py
Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py
Open the JSON file with sentences
Parse them and annotated with EntityRecognizer (for cancer types, genes, etc)
Apply the Kindred relation classifier models to this corpus
Iterate over every relation extracted
- Normalize gene names and cancer names where possible
- Output the relation with all metadata and normalized terms

Define thresholds for relations:
- Driver = 0.80
- Oncogene = 0.76
- Tumor_Suppressor = 0.92
Iterate over the combined outputs of all runs of applyModelsToSentences.py
- Check the probability of the relation (as the output for the model) and see if it is above the required thresholds
- Get the core relation info of relation type, cancer type and gene name
- Create a matching ID key that can link back this core relation info
- Add the PubMed ID to the number of citations for this core relation info
Output all the core relations with the number of citations
Output all the sentences with an additional field of the sentence with simple HTML formatting for the location of entities