
# FigTag: Find the Papers with the Data you want!

Motivation

Often when searching for papers in PubMed, we are specifically looking for papers whose figures contain data related to our query. Open-i is awesome, but 1) integration with PubMed would be ideal, since we are looking for papers and that's where we normally do those searches; 2) tagging figures with keywords, similar to MeSH terms for whole papers, would better support finding papers with figures related to our keywords; and finally, 3) splitting up multipanel figures would likely support more fine-grained results.

1) is beyond the scope of the codeathon, but would crucially depend on 2), which can leverage work already done for Open-i and PMC to pull out figures and their legends. 3) may be pursued in parallel to 2), depending on team members' interest and skill sets.

Background

Towards an Ontology of Data Figures

What high level categories of metadata might we tag figures with?

  1. Source Publication
  2. Figure number, out of the total number of figures in the paper
  3. Subfigure identifier, if applicable
  4. Chart/Graph/Figure Type
  5. Grey-scale or color
  6. Experiment Type from which the data was derived (the methods employed)
  7. Number of replicates represented
  8. The categories in which and/or axes on which data is presented
  9. The materials used to perform the experiment
  10. The statistical test(s) used to indicate significance, if applicable

Given a figure, the uid of its source manuscript, and the associated figure legend, we may be able to find all of these. However, it seems unlikely that we will find all of them in any one manuscript. Depending on the team's skill set, it may be more fruitful to pursue some subset of these. The sketch below shows what a record with these fields could look like.
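
As a concrete illustration, a tagged-figure record covering these categories might look like the following. The field names and values here are hypothetical examples, not a schema the project defines:

```python
# Hypothetical example of a tagged-figure record covering the categories above;
# field names and values are illustrative, not a schema the project defines.
figure_record = {
    "source_uid": "PMC1234567",                   # 1. source publication
    "figure_number": 2,                           # 2. figure number ...
    "total_figures": 5,                           #    ... out of how many
    "subfigure": "B",                             # 3. subfigure identifier
    "figure_type": "scatter plot",                # 4. chart/graph/figure type
    "color": False,                               # 5. grey-scale or color
    "experiment_type": "flow cytometry",          # 6. experiment type / methods
    "replicates": 3,                              # 7. number of replicates
    "axes": ["PSGL-1 expression", "cell count"],  # 8. categories / axes
    "materials": ["anti-PSGL-1 antibody"],        # 9. materials used
    "statistics": "two-tailed t-test",            # 10. statistical test
}
```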

Test Data

PSGL-1 is pretty cool, and there is not TOO much data from PMC indexed in Open-i for that keyword, so this will serve as our test data set.

Information about the Open-i API can be found here.

Requirements

Please see the requirements file here.

Projects

  1. Alex Kotliarov - Variational autoencoder based clustering of images
  2. David Shao - Multipanel Figure Splitting
  3. Ricardo V. - setting up pipeline
  4. Ryan Connor, Meng Cheng, & Marie Gallagher - MeSH Indexing of Figure Legends

Notes on MeSH Indexing of Figure Legends

The MeSH indexing script may not work outside of the NLM network at the moment.

Using a Variational Autoencoder (VAE) to cluster images

Objective: learn the categories of images present in publications accessible via the Open-i service.

Approach: given a collection of images, come up with a set of clusters such that each cluster represents an image category. This is therefore an unsupervised learning task, in which we need to:

  • extract relevant features that represent our samples (images);
  • use these features to compute similarity between samples;
  • cluster samples based on a similarity metric.

Training a model to extract image features

How do we extract features from the samples (images), and what are those features? We let a neural network find them for us: we train a Variational Autoencoder on a collection of images to learn a latent Gaussian model that represents the images in the training set.

A Variational Autoencoder consists of an encoder, a decoder, and a loss function; a minimal sketch follows the list below.

  • The encoder is a neural network that outputs a latent representation of an image - the features of the image, which describe a point in the D-dimensional feature space. The encoder serves as the inference model.
  • The decoder is a neural network that learns to reconstruct the data - the input image - given its latent representation (latent variables).
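
A minimal PyTorch sketch of such a model, assuming 100 x 100 greyscale images flattened to vectors and the 256-dimensional feature space reported below; the hidden layer size and other details are illustrative, and the actual model stored in models/vae-model.256d.pt may differ:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: a fully connected encoder/decoder pair."""

    def __init__(self, input_dim=100 * 100, latent_dim=256, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)  # sample z ~ N(mu, sigma^2)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```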

Example Decoder Image Reconstruction

To train the model, we

  • choose the dimension of the feature space;
  • fit the model to the input images to learn the Gaussian distribution parameters - mu and sigma - for each feature, given the data.

Upon completion of training we persist the learned model to a file for later use; a rough sketch of this step follows.
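
Continuing the sketch above, the training step might look roughly like this. The loss combines reconstruction error with a KL divergence term; the placeholder random data, epoch count, and learning rate are assumptions rather than the project's actual settings:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the unit Gaussian prior.
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

# Placeholder data; in the real pipeline these are figure images from Open-i.
images = torch.rand(1024, 100 * 100)
train_loader = DataLoader(TensorDataset(images), batch_size=64, shuffle=True)

model = VAE()  # the class from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):  # epoch count is illustrative
    for (batch,) in train_loader:
        optimizer.zero_grad()
        recon, mu, logvar = model(batch)
        loss = vae_loss(recon, batch, mu, logvar)
        loss.backward()
        optimizer.step()

# Persist the learned model for later use.
torch.save(model.state_dict(), "models/vae-model.256d.pt")
```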

Clustering model

After training the VAE, we use its encoder to map an input image to its latent representation - the features.

We produce a feature vector for each image in the training set, use these features to fit a KMeans clustering model, and decide on the number of clusters using the "elbow" heuristic.

We persist the learned KMeans clustering model to a file for later use. A rough sketch of this stage follows.
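
Continuing the sketches above, the clustering stage could look like the following. Whether the actual pipeline uses scikit-learn, how it scans candidate values of k, and how the model file is serialized are all assumptions here:

```python
import torch
import joblib
from sklearn.cluster import KMeans

# Encode every training image to its latent mean vector (its feature vector).
model.eval()
with torch.no_grad():
    features = torch.cat(
        [model.encode(batch)[0] for (batch,) in train_loader]
    ).numpy()

# "Elbow" heuristic: fit KMeans for a range of k and see where inertia flattens.
inertias = {k: KMeans(n_clusters=k, random_state=0).fit(features).inertia_
            for k in range(2, 13)}
print(inertias)  # in our runs the curve flattened around k = 8

# Fit and persist the final clustering model for later use.
kmeans = KMeans(n_clusters=8, random_state=0).fit(features)
joblib.dump(kmeans, "models/kmeans_model.256d.8.pt")
```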

Assigning an image to a cluster

Given an image, we (see the sketch below):

  • encode the image to its feature vector using the pre-trained VAE encoder;
  • use the pre-trained KMeans model to predict the cluster id;
  • output the cluster id.
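
A minimal sketch of this lookup, reusing the models from the sketches above:

```python
import torch

def assign_cluster(image_tensor, vae, kmeans):
    # Map one flattened image to a cluster id using the pre-trained models.
    vae.eval()
    with torch.no_grad():
        mu, _ = vae.encode(image_tensor.unsqueeze(0))  # 1 x 256 feature vector
    return int(kmeans.predict(mu.numpy())[0])

# Example: a random 100 x 100 image stands in for a figure crop.
print(assign_cluster(torch.rand(100 * 100), model, kmeans))
```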

Data sets

| Data set   | Size           |
|------------|----------------|
| Training   | ~62,000 images |
| Validation | ~4,500 images  |
| Test       | ~5,000 images  |

Information

| Parameter                  | Value        |
|----------------------------|--------------|
| Image size                 | 100 x 100 px |
| Dimension of feature space | 256          |
| Number of clusters         | 8            |

Indexing images

Setup

After checking out a local copy of the repository, please run this command:

source dev_env.sh

It will create a virtual environment and install all the packages that our applications need.

Building an index

The following command generates a SQLite DB containing an index of images, the clusters they belong to according to our models, the unique identifiers (uids) of the papers in which the images appeared, and the MeSH terms associated with them. It uses the models (Variational Autoencoder and KMeans) we generated, stored under the models directory in this repository, and the sample parsed MeSH terms we produced, stored in the file mesh_out.txt at the root of the repository.

# Remove `-file-limit 5` to use all the files
# Remove `FIGTAG_LOGLEVEL=INFO` to not see log messages
FIGTAG_LOGLEVEL=INFO bin/figtag run \
   -query 'https://openi.nlm.nih.gov/api/search?coll=pmc&it=x%2Cu%2Cph%2Cp%2Cmc%2Cm%2Cg%2Cc&m=1&n=100&query=psgl-1%20OR%20sleplg' \
   -vae-model-file models/vae-model.256d.pt -kmeans-model-file models/kmeans_model.256d.8.pt \
   -o /tmp/`whoami`/test -file-limit 5 -mesh-terms-file mesh_out.txt

Searching for figures

We provide a command-line utility for searching an index generated with the command from the previous section by MeSH term. In the example below, we search for the term Cercopithecus using the sample index we generated before (the file samples/ImageIndex.sqlite in this repository):

bin/figtag figure-search -query "Cercopithecus" -index samples/ImageIndex.sqlite
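
Under the hood this amounts to a lookup against the SQLite index. The sketch below shows the idea using hypothetical table and column names - the actual schema of ImageIndex.sqlite may differ, and the figtag figure-search command above is the supported interface:

```python
import sqlite3

# Hypothetical schema: the actual tables and columns in ImageIndex.sqlite may differ.
conn = sqlite3.connect("samples/ImageIndex.sqlite")
rows = conn.execute(
    "SELECT image_path, paper_uid, cluster_id FROM images "
    "WHERE mesh_terms LIKE ?",
    ("%Cercopithecus%",),
).fetchall()
for image_path, paper_uid, cluster_id in rows:
    print(image_path, paper_uid, cluster_id)
conn.close()
```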

Limitations and further improvements

  1. Due to time limitations, the integration of image processing and text mining was not completed.
  2. The Medical Text Indexer (MTI) tool is supposed to be open to the public without requiring credentials. However, in our tests it only responded to requests sent from the NIH network; access may be limited for users outside NIH.
  3. More test cases are needed, and further evaluation comparing the MeSH terms from FigTag's text mining with standard MeSH indexing would be valuable.
  4. A web-based MTI API would greatly improve the efficiency of the FigTag pipeline.
  5. The image splitter could benefit from more robust testing and tuning.
  6. As could the image classifier.
