6 changes: 5 additions & 1 deletion .gitignore
@@ -1,3 +1,7 @@
.idea/
env/
condaenv/
.DS_Store

chapter_3/MNIST_DATA/
chapter_3/mnist/
146 changes: 144 additions & 2 deletions README.md
@@ -1,6 +1,148 @@
# deep-learning-life-sciences-code
Code Samples for the book: Deep Learning for the Life Sciences
Code Samples for the book: [Deep Learning for the Life Sciences](https://amzn.to/3audBIt)

## Installation

Create the conda environment:

- If using Linux: `conda create --name condaenv --file requirements.conda.linux64.txt`
- If using Mac OSX: `conda create --name condaenv --file requirements.conda.osx64.txt`
- If using Windows: 🤷🏿‍♂️

Then activate it with `conda activate condaenv` (or `conda activate ./condaenv`, or `source start.sh`).

To run the interpreter from the conda environment:
`condaenv/bin/python3.7`

# Book Notes

# Deep Learning for the Life Sciences

I think that in the near future the biggest technological improvements in society will come from advances in artificial intelligence and biology, so I've been trying to learn more about the intersection of the two. I decided to read the book Deep Learning for the Life Sciences because it covers both topics.

If you want to become very rich or help a lot of people in the future, I recommend learning data science and biology, more specifically deep learning and genetics.


## Introduction: Life Science is Data Science
- The book begins by talking about how modern life science is increasingly driven by data science and algorithms
- As I have mentioned earlier, it's not so much that the algorithms we have are more sophisticated; it's that we now have access to smarter computers
- Put more bluntly, the computers have gotten way smarter; we, on the other hand, have gotten marginally smarter at best.

## Introduction to Deep Learning
- deep learning at its most basic level is simply a function f() that transforms an input x into an output y: y = f(x) [7]
- The simplest models are linear models of the form y = Mx + b, which is essentially just a straight line [8]
- These are very limited by the fact that they can't fit most datasets. For example, the distribution of heights in a population would likely not fit a linear model

- A solution to this is the multilayer perceptron, which is essentially just putting one linear function inside another:
y = M_2 B(M_1 x + b_1) + b_2
- B is called an activation function and is what makes the result non-linear; without it, nesting linear functions would just produce another linear function (see the numpy sketch below)

- As you put one of these functions inside another one you create what is called a multilayer perceptron
- An interesting blog post which explains the first Perceptron/Neural net, Rosenblatt's Perceptron ([blog post](https://towardsdatascience.com/rosenblatts-perceptron-the-very-first-neural-network-37a3ec09038a) (tk find paper not behind Medium paywall), [paper](tk add paper link))
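
To make the formula concrete, here is a minimal numpy sketch of a two-layer perceptron (my own illustration, not code from the book; the layer widths are arbitrary and ReLU is just one common choice for B):

```python
import numpy as np

def relu(z):
    # one common choice for the activation function B
    return np.maximum(0, z)

def mlp(x, M1, b1, M2, b2):
    # y = M_2 B(M_1 x + b_1) + b_2
    return M2 @ relu(M1 @ x + b1) + b2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                 # input with 4 features
M1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # hidden layer of width 8
M2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)   # single output
print(mlp(x, M1, b1, M2, b2))
```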

### Training Models (13)
- To train the algorithm we need a loss function, L(y, y'), where y is the actual output and y' is the target value that we expected to get
- The loss function takes the two values and gives us a measure of how wrong we are
- Usually we use the Euclidean distance
- Also, does anyone know a good way of writing mathematical notation on GitHub? Maybe I should use LaTeX for this review?
- For probability distributions you should use cross entropy (really didn't understand this part)
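
Here is a minimal numpy sketch of both losses (my own illustration, not from the book):

```python
import numpy as np

def euclidean_loss(y, y_target):
    # L(y, y') = ||y - y'||_2; the usual choice for regression
    return np.linalg.norm(y - y_target)

def cross_entropy(p_target, p_pred, eps=1e-12):
    # for comparing probability distributions: -sum_i p'_i * log(p_i)
    return -np.sum(p_target * np.log(p_pred + eps))

print(euclidean_loss(np.array([1.0, 2.0]), np.array([1.5, 1.0])))        # ~1.118
print(cross_entropy(np.array([0., 1., 0.]), np.array([0.1, 0.8, 0.1])))  # ~0.223
```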


## Chapter 4: Machine Learning For Molecules

- random search is often used for designing interesting molecules
- How can we find more efficient ways of designing molecules?
- The first step is to transform molecules into vectors of numbers, called molecular featurization
- This includes things like chemical descriptor vectors, 2D graph representations,
3D electrostatic grid representations, orbital basis function representations and more


### What is a molecule?
- How do you know which molecules are present in a given sample?
- use a mass spectrometer to fire a bunch of electrons at the sample
- the sample becomes ionized, or charged, and gets propelled by an electric field
- the different fragments go into different buckets based on their mass-to-charge ratio (m/q, ion mass / ion charge)
- the spread of the different charged fragments is called the spectrum
- you can then use the ratio of the different fragments in each bucket to figure out which molecule you have

- molecules are dynamic, quantum entities: the atoms in a molecule are always moving (dynamic)
- a given molecule can be described in multiple ways (quantum)

### What are Molecular bonds?

- Covalent bonds are strong bonds formed when two atoms share electrons
- it takes a lot of energy to break them
- this is what actually defines a molecule, a group of atoms joined by covalent bonds

- non-covalent bonds are not as strong as covalent bonds,
- constantly breaking and reforming, they have a huge effect on determining shape and interaction of molecules
- some examples include hydrogen bonds, pi-stacking, salt bridges etc.
- but non-covalent bonds are important because most drugs interact with
biological molecules in the human body through non-covalent interactions

- For example, in water, H2O, two hydrogen atoms are strongly attached to an oxygen atom by covalent bonds, and
that is what forms a water molecule

- then different water molecules are attached to other water molecules using a hydrogen bond
- this is what makes water the universal solvent

### Chirality of Molecules
- some molecules come in two forms that are mirror images of each other
- a right-handed "R" form and a left-handed "S" form
- Many physical properties are identical for both forms, and they have identical molecular graphs
- Important because it is possible for the two forms to bind to different proteins in the body and produce different effects
- For example, in the 1950s thalidomide was prescribed as a sedative for nausea and morning sickness in pregnant women,
but only the R form is a sedative; the S form is a teratogen that has been shown to cause severe birth defects


### Featurize Molecules
- SMILES strings are a way of describing molecules using text strings
- Extended-Connectivity Fingerprints (ECFPs) are a way of converting molecules of arbitrary size into fixed-length vectors
```python
import deepchem as dc
from rdkit import Chem
smiles = ['C1CCCCC1', 'O1CCOCC1'] # cyclohexane and dioxane
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
feat = dc.feat.CircularFingerprint(size=1024)
arr = feat.featurize(mols)
# arr is a 2-by-1024 array containing the fingerprints of the two molecules
```
- Chemical fingerprints are vectors of 1s and 0s, indicating the presence or absence of a molecular feature
- the algorithm starts by looking at each atom individually, then works outward through larger and larger neighborhoods
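
For comparison, here is a sketch of the same featurization done with RDKit directly; GetMorganFingerprintAsBitVect is RDKit's ECFP implementation, and Tanimoto similarity (the fraction of shared bits) is a common way to compare the resulting vectors (my own illustration, not code from the book):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol_a = Chem.MolFromSmiles('C1CCCCC1')  # cyclohexane
mol_b = Chem.MolFromSmiles('O1CCOCC1')  # dioxane
# radius 2 corresponds to ECFP4-style circular fingerprints
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=1024)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=1024)
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))  # fraction of shared "on" bits
```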

- another line of thinking says to use the physics of the molecule's structure to describe it
- this will typically work best for problems that rely on generic properties of the molecules
and not the detailed arrangement of atoms
```python
import deepchem as dc
from rdkit import Chem
smiles = ['C1CCCCC1', 'O1CCOCC1'] # cyclohexane and dioxane
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
feat = dc.feat.RDKitDescriptors()
arr = feat.featurize(mols)
# arr is a 2-by-111 array containing the properties of the two molecules
```

### Graph Convolutions

- the previous examples all required a human to think of an algorithm that could represent the molecules
in a way that a computer could understand
- what if there was a way to feed the graph representation of a molecule into a deep learning architecture and have the
model figure out the features of the molecule itself?
- similar to how a deep learning model can learn the properties of an image without hand-engineered features
- the limitation is that the calculation is based on the molecular graph, so it doesn't know anything about
the molecule's conformation
- so it works best for small, rigid molecules; Chapter 5 looks at methods for large, flexible molecules
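
A sketch of what this looks like in DeepChem: `load_tox21` can featurize with graph convolutions and `GraphConvModel` learns from the resulting graphs (treat this as an outline rather than the book's exact code; names and signatures vary across DeepChem versions):

```python
import numpy as np
import deepchem as dc

# load Tox21 with graph featurization instead of fingerprints
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

# the model learns its own per-atom features from the molecular graph
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode='classification')
model.fit(train_dataset, nb_epoch=10)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(valid_dataset, [metric], transformers))
```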

### SMARTS Strings
- [MoleculeNet](http://moleculenet.ai) is a collection of datasets for molecular machine learning
- Ranging from low-level quantum mechanical interactions between atoms
- to high-level interactions in the human body, like toxicity and side effects
- SMARTS strings are useful if you want to see if atoms in a molecule match a pattern:
- searching a molecular database to see if a particular substructure exists
- aligning a set of molecules on a common substructure to improve visualization
- a SMARTS string is to molecules what a wildcard or regular expression pattern is to text
- So "foo*.bar" will match "foo.bar" and "foo3.bar"
- Similarly, "CCC" will match sequences of three adjacent aliphatic carbon atoms
(in SMARTS, aliphatic means non-aromatic: uppercase "C" matches carbons in chains and non-aromatic rings, while lowercase "c" matches aromatic carbons)
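
A quick sketch of SMARTS matching with RDKit (standard RDKit usage; my own example rather than the book's):

```python
from rdkit import Chem

pattern = Chem.MolFromSmarts('CCC')      # three adjacent aliphatic carbons
mol = Chem.MolFromSmiles('CCCO')         # 1-propanol
print(mol.HasSubstructMatch(pattern))    # True
print(mol.GetSubstructMatches(pattern))  # atom indices of each match, e.g. ((0, 1, 2),)
```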
16 changes: 0 additions & 16 deletions chapter3.py

This file was deleted.

10 changes: 10 additions & 0 deletions chapter_3/get_mnist_data.sh
@@ -0,0 +1,10 @@
#!/usr/bin/env bash

mkdir MNIST_DATA

cd MNIST_DATA

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
97 changes: 97 additions & 0 deletions chapter_3/mnist_digit_recognition.py
@@ -0,0 +1,97 @@
"""
In molecule_toxicity.py we used a premade model class, now we will be creating an architecture from scratch
The reason we might want to do this is if we are working on a dataset where no predefined architecture exists.
# todo what is the difference between an architecture and a model?

This works by creating two convolutional layers, each of which applies a small square filter over a subset of the image.
Then it uses two fully connected layers to predict the digit from those local features.

# todo what does that actually mean? [tk include a diagram]
"""
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

import deepchem as dc
import deepchem.models.tensorgraph.layers as layers


def create_model():
"""
Create our own MNIST model from scratch
:return:
:rtype:
"""
mnist = input_data.read_data_sets("MNIST_DATA/", one_hot=True)

# the layers from deepchem are the building blocks of what we will use to make our deep learning architecture

# now we wrap our dataset into a NumpyDataset

train_dataset = dc.data.NumpyDataset(mnist.train.images, mnist.train.labels)
test_dataset = dc.data.NumpyDataset(mnist.test.images, mnist.test.labels)

# we will create a model that will take an input, add multiple layers, where each layer takes input from the
# previous layers.

model = dc.models.TensorGraph(model_dir='mnist')

# 784 corresponds to an image of size 28 X 28
# 10 corresponds to the fact that there are 10 possible digits (0-9)
# the None indicates that we can accept a batch of any size (e.g. an empty array or 500 items, each with 784 features)
# our labels are also categorical, so we must one-hot encode them: set a single array element to 1 and the rest to 0
feature = layers.Feature(shape=(None, 784))
labels = layers.Label(shape=(None, 10))

# in order to apply convolutional layers to our input, we convert the flat vector of 784 to 28 x 28
# in_layers means it takes our feature layer as an input
make_image = layers.Reshape(shape=(None, 28, 28), in_layers=feature)

# now that we have reshaped the input, we pass to convolution layers

conv2d_1 = layers.Conv2D(num_outputs=32, activation_fn=tf.nn.relu, in_layers=make_image)

conv2d_2 = layers.Conv2D(num_outputs=64, activation_fn=tf.nn.relu, in_layers=conv2d_1)

# we want to end by applying fully connected (Dense) layers to the outputs of our convolutional layer
# but first, we must flatten the layer from a 2d matrix to a 1d vector

flatten = layers.Flatten(in_layers=conv2d_2)
dense1 = layers.Dense(out_channels=1024,activation_fn=tf.nn.relu, in_layers=flatten)

# note that this is the final layer, so out_channels=10 represents the 10 possible digits and there is no activation_fn
dense2 = layers.Dense(out_channels=10,activation_fn=None, in_layers=dense1)

# next we want to connect this output to a loss function, so we can train the output

# compute the value of loss function for every sample then average of all samples to get final loss (ReduceMean)
smce = layers.SoftMaxCrossEntropy(in_layers=[labels, dense2])
loss = layers.ReduceMean(in_layers=smce)
model.set_loss(loss)

# for MNIST we want the probability that a given sample represents one of the 10 digits
# we can achieve this by applying a softmax function to the final layer's outputs (the cross-entropy loss above is used only for training)

output = layers.SoftMax(in_layers=dense2)
model.add_output(output)

# if our model takes too long to train, reduce nb_epoch to 1
model.fit(train_dataset,nb_epoch=1)

# our metric is accuracy, the fraction of labels that are accurately predicted
metric = dc.metrics.Metric(dc.metrics.accuracy_score)

train_scores = model.evaluate(train_dataset, [metric])
test_scores = model.evaluate(test_dataset,[metric])

print('train_scores', train_scores)
print('test_scores', test_scores)

if __name__ == '__main__':
create_model()

85 changes: 85 additions & 0 deletions chapter_3/molecule_toxicity.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
import deepchem as dc
import numpy as np


def run():
"""
tox21_tasks is a list of chemical assays, and our dataset
contains training data that will tell us whether a certain molecule binds
to one of the 12 biological targets in those assays
:return:
:rtype:
"""
# first we must load the Toxicity 21 (Tox21) datasets from molnet (MoleculeNet) onto our local machine
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21()


# tox21_tasks represents 12 assays, or biological targets, that we want to see if our molecule binds to
print(tox21_tasks)


# train_dataset contains 6264 molecules, each with a feature vector of length 1024

# it also has a label array y with an entry for each of the 12 assays
train_dataset, valid_dataset, test_dataset = tox21_datasets

# the w represents the weights and a weight of zero means that no experiment was run
# to see if the molecule binds to that assay
np.count_nonzero(train_dataset.w == 0)

# this is a BalancingTransformer because most of the molecules do not bind to most targets,
# so most of the labels are zero and a model that always predicts zero could score well (but it would be useless!)
# BalancingTransformer adjusts the weights of individual data points so all classes have the same total weight,
# so the loss function won't have a systematic preference for one class
print(transformers)

train_model(train_dataset, test_dataset, transformers)

def train_model(train_dataset, test_dataset, transformers):
"""
Train the model using a multitask classifier because there are multiple outputs for each sample
and evaluate model using the mean ROC AUC.
:param train_dataset:
:type train_dataset:
:param transformers:
:type transformers:
:return:
:rtype:
"""

# this model builds a fully connected network (an MLP)
# since we have 12 assays we're testing for, being able to map to multiple outputs is ideal
# layer_sizes means that we have one hidden layer which has a width of 1,000
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])

# nb_epoch=10 means we make 10 passes over the data; each pass divides the data into batches
# and does one step of gradient descent per batch
model.fit(train_dataset, nb_epoch=10)

# how do we know how accurate our model is? we will find the mean ROC AUC score across all tasks

# What is an ROC AUC score? We are trying to predict the toxicity of the molecules,
# Receiver Operating Characteristic, Area Under Curve
# If there exists any threshold value where the true positive rate is 1 and the false positive rate is 0, then the score is 1
# so we pick a threshold of what is considered a toxic molecule
# if we pick a threshold value that's too low, we will say too many safe molecules are toxic (high false positive rate)
# alternatively, if we pick one too high, we will say that toxic molecules are safe (high false negative rate)
# note on understanding false positive terminology:
# Imagine a molecule that is actually toxic. "Is this molecule toxic?" "No." We gave a negative response
# the answer is relative to what we are testing for; in this case, we are testing if a molecule is toxic
# so we are making a tradeoff between high false positives vs high false negatives, and we visualize it with
# an ROC curve, which graphs the tradeoff between the false positive rate and the true positive rate;
# the AUC is the area under that curve
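
# a tiny illustration of the metric itself on a toy example (assumption: scikit-learn is
# available, since DeepChem depends on it; this is my own sketch, not code from the book):
#   from sklearn.metrics import roc_auc_score
#   roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75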

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)

# evaluate the performance of this model on the train_dataset using the ROC AUC metric

train_scores = model.evaluate(train_dataset, [metric], transformers)
test_scores = model.evaluate(test_dataset, [metric], transformers)

# the train scores are higher than our test scores, which shows us that our model has overfit
print(f'train_scores: {train_scores}')
print(f'test_scores: {test_scores}')

if __name__ == '__main__':
run()