This file gives an overview for the following core files. Most of the functionality is managed by the Kindred Python package which is described in the documentation and associated paper.
- buildModels.py : Use training data to build Kindred relation classifier models
- wordlistLoader.py : Preprepare parsed wordlists using gene names, cancer types and more
- parseAndFindEntities.py : Parse documents and find sentences that mention cancer types and gene names with additional filtering
- applyModelsToSentences.py : Apply Kindred relation classifiers to find mentions of drivers, oncogenes and tumor suppressors
- filterAndCollate.py : Filter for mentions with higher certainty and collate them for counts
- For each relation type (Driver, Oncogene, Tumor_Suppressor)
- Load the Kindred corpus (1500 annotated sentences)
- Strip all relations that do not match the relation type of interest
- Create a Kindred classifier with a logistic regression model and threshold of 0.5
- Train it on the filtered corpus
- Save the classifier to a file
- Take in wordlists for genes, cancers, drugs (to identify ambigiuity) and conflicting terms
- Get Kindred to parse them and prepare a data structure ready for matching
- Save it to a file as a Python pickle
- Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py
- Read in a BioC corpus file (of abstracts or articles) in chunks:
- Filter the corpus by removing documents that don't contain keywords (in filterTerms.txt)
- Parse the documents
- Annotate them with the EntityRecognizer (for cancer types, genes, etc)
- For each sentence in each document
- Ignore if it doesn't contain any of the keywords (in filterTerms.txt)
- Check if a gene and cancer are mentioned in the sentence and add to output with metadata if so
- Dump all matching sentences to output JSON with metadata of the source of the sentence
- Load all the models created by buildModels.py
- Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py
- Open the JSON file with sentences
- Parse them and annotated with EntityRecognizer (for cancer types, genes, etc)
- Apply the Kindred relation classifier models to this corpus
- Iterate over every relation extracted
- Normalize gene names and cancer names where possible
- Output the relation with all metadata and normalized terms
- Define thresholds for relations:
- Driver = 0.80
- Oncogene = 0.76
- Tumor_Suppressor = 0.92
- Iterate over the combined outputs of all runs of applyModelsToSentences.py
- Check the probability of the relation (as the output for the model) and see if it is above the required thresholds
- Get the core relation info of relation type, cancer type and gene name
- Create a matching ID key that can link back this core relation info
- Add the PubMed ID to the number of citations for this core relation info
- Output all the core relations with the number of citations
- Output all the sentences with an additional field of the sentence with simple HTML formatting for the location of entities