6 changes: 5 additions & 1 deletion .gitignore
@@ -1,3 +1,7 @@
.idea/
env/
condaenv/
.DS_Store

chapter_3/MNIST_DATA/
chapter_3/mnist/
146 changes: 144 additions & 2 deletions README.md
@@ -1,6 +1,148 @@
# deep-learning-life-sciences-code
Code Samples for the book: Deep Learning for the Life Sciences
Code Samples for the book: [Deep Learning for the Life Sciences](https://amzn.to/3audBIt)

## Installation

Create the conda environment:

- If using Linux: `conda create --name condaenv --file requirements.conda.linux64.txt`
- If using Mac OSX: `conda create --name condaenv --file requirements.conda.osx64.txt`
- If using Windows: 🤷🏿‍♂️

Then activate it with `conda activate condaenv` (or `conda activate ./condaenv`, or `source start.sh`).

To run the interpreter from the conda environment:
`condaenv/bin/python3.7`

# Book Notes

# Deep Learning for the Life Sciences

I think that in the near future the biggest technological improvements in society will come from advances in artificial intelligence and biology, so I've been trying to learn more about the intersection of the two. I decided to read the book Deep Learning for the Life Sciences because it covers both topics.

If you want to become very rich or help a lot of people in the future, I recommend learning data science and biology, more specifically deep learning and genetics.


## Introduction: Life Science is Data Science
- The book begins by talking about how modern life science is increasingly driven by data science and algorithms
- As I have mentioned earlier, it's not so much that the algorithms we have are more sophisticated; it's that we now have access to smarter computers
- Put more bluntly, the computers have gotten way smarter; we, on the other hand, have gotten marginally smarter at best.

## Introduction to Deep Learning
- deep learning at its most basic level is simply a function f() that transforms an input x into an output y: y = f(x) [7]
- The simplest models are linear models of the form y = Mx + b, which is essentially just a straight line [8]
- These are very limited by the fact that they can't fit most datasets. For example, the distribution of heights in a population would likely not fit a linear model

- A solution to this is the multilayer perceptron, which is essentially just putting one linear function inside another:
y = M_2 B(M_1 x + b_1) + b_2
- B is called an activation function and is what makes the result non-linear; without it, nesting linear functions would just produce another linear function (see the numpy sketch below)

- As you put one of these functions inside another one you create what is called a multilayer perceptron
- An interesting blog post which explains the first Perceptron/Neural net, Rosenblatt's Perceptron ([blog post](https://towardsdatascience.com/rosenblatts-perceptron-the-very-first-neural-network-37a3ec09038a) (tk find paper not behind Medium paywall), [paper](tk add paper link))
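
To make the formula concrete, here is a minimal numpy sketch of a two-layer perceptron (my own illustration, not code from the book; the layer widths are arbitrary and ReLU is just one common choice for B):

```python
import numpy as np

def relu(z):
    # one common choice for the activation function B
    return np.maximum(0, z)

def mlp(x, M1, b1, M2, b2):
    # y = M_2 B(M_1 x + b_1) + b_2
    return M2 @ relu(M1 @ x + b1) + b2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                 # input with 4 features
M1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # hidden layer of width 8
M2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)   # single output
print(mlp(x, M1, b1, M2, b2))
```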

### Training Models (13)
- To train the algorithm we need a loss function, L(y, y'), where y is the actual output and y' is the target value that we expected to get
- The loss function takes the two values and gives us a measure of how wrong we are
- Usually we use the Euclidean distance
- Also, does anyone know a good way of writing mathematical notation on GitHub? Maybe I should use LaTeX for this review?
- For probability distributions you should use cross entropy (really didn't understand this part)
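
Here is a minimal numpy sketch of both losses (my own illustration, not from the book):

```python
import numpy as np

def euclidean_loss(y, y_target):
    # L(y, y') = ||y - y'||_2; the usual choice for regression
    return np.linalg.norm(y - y_target)

def cross_entropy(p_target, p_pred, eps=1e-12):
    # for comparing probability distributions: -sum_i p'_i * log(p_i)
    return -np.sum(p_target * np.log(p_pred + eps))

print(euclidean_loss(np.array([1.0, 2.0]), np.array([1.5, 1.0])))        # ~1.118
print(cross_entropy(np.array([0., 1., 0.]), np.array([0.1, 0.8, 0.1])))  # ~0.223
```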


## Chapter 4: Machine Learning For Molecules

- random search is often used for designing interesting molecules
- How can we find more efficient ways of designing molecules?
- The first step is to transform molecules into vectors of numbers, called molecular featurization
- This includes things like chemical descriptor vectors, 2D graph representations,
3D electrostatic grid representations, orbital basis function representations and more


### What is a molecule?
- How do you know which molecules are present in a given sample?
- use a mass spectrometer to fire a bunch of electrons at the sample
- the sample becomes ionized, or charged, and gets propelled by an electric field
- the different fragments go into different buckets based on their mass-to-charge ratio (m/q, ion mass / ion charge)
- the spread of the different charged fragments is called the spectrum
- you can then use the ratio of the different fragments in each bucket to figure out which molecule you have

- molecules are dynamic, quantum entities: the atoms in a molecule are always moving (dynamic)
- a given molecule can be described in multiple ways (quantum)

### What are Molecular bonds?

- Covalent bonds are strong bonds formed when two atoms share electrons
- it takes a lot of energy to break them
- this is what actually defines a molecule, a group of atoms joined by covalent bonds

- non-covalent bonds are not as strong as covalent bonds,
- constantly breaking and reforming, they have a huge effect on determining shape and interaction of molecules
- some examples include hydrogen bonds, pi-stacking, salt bridges etc.
- but non-covalent bonds are important because most drugs interact with
biological molecules in the human body through non-covalent interactions

- For example, in water, H2O, two hydrogen atoms are strongly attached to an oxygen atom by covalent bonds, and
that is what forms a water molecule

- then different water molecules are attached to other water molecules using a hydrogen bond
- this is what makes water the universal solvent

### Chirality of Molecules
- some molecules come in two forms that are mirror images of each other
- a right-handed "R" form and a left-handed "S" form
- Many physical properties are identical for both forms, and they have identical molecular graphs
- Important because it is possible for the two forms to bind to different proteins in the body and produce different effects
- For example, in the 1950s thalidomide was prescribed as a sedative for nausea and morning sickness in pregnant women,
but only the R form is a sedative; the S form is a teratogen that has been shown to cause severe birth defects


### Featurize Molecules
- SMILES strings are a way of describing molecules using text strings
- Extended-Connectivity Fingerprints (ECFPs) are a way of converting molecules of arbitrary size into fixed-length vectors
```python
import deepchem as dc
from rdkit import Chem
smiles = ['C1CCCCC1', 'O1CCOCC1'] # cyclohexane and dioxane
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
feat = dc.feat.CircularFingerprint(size=1024)
arr = feat.featurize(mols)
# arr is a 2-by-1024 array containing the fingerprints of the two molecules
```
- Chemical fingerprints are vectors of 1s and 0s, indicating the presence or absence of a molecular feature
- the algorithm starts by looking at each atom individually, then works outward through larger and larger neighborhoods
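
For comparison, here is a sketch of the same featurization done with RDKit directly; GetMorganFingerprintAsBitVect is RDKit's ECFP implementation, and Tanimoto similarity (the fraction of shared bits) is a common way to compare the resulting vectors (my own illustration, not code from the book):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol_a = Chem.MolFromSmiles('C1CCCCC1')  # cyclohexane
mol_b = Chem.MolFromSmiles('O1CCOCC1')  # dioxane
# radius 2 corresponds to ECFP4-style circular fingerprints
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=1024)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=1024)
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))  # fraction of shared "on" bits
```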

- another line of thinking says to use the physics of the molecule's structure to describe it
- this will typically work best for problems that rely on generic properties of the molecules
and not the detailed arrangement of atoms
```python
import deepchem as dc
from rdkit import Chem
smiles = ['C1CCCCC1', 'O1CCOCC1'] # cyclohexane and dioxane
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
feat = dc.feat.RDKitDescriptors()
arr = feat.featurize(mols)
# arr is a 2-by-111 array containing the properties of the two molecules
```

### Graph Convolutions

- the previous examples all required a human to think of an algorithm that could represent the molecules
in a way that a computer could understand
- what if there was a way to feed the graph representation of a molecule into a deep learning architecture and have the
model figure out the features of the molecule itself?
- similar to how a deep learning model can learn the properties of an image without hand-engineered features
- the limitation is that the calculation is based on the molecular graph, so it doesn't know anything about
the molecule's conformation
- so it works best for small, rigid molecules; Chapter 5 looks at methods for large, flexible molecules
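
A sketch of what this looks like in DeepChem: `load_tox21` can featurize with graph convolutions and `GraphConvModel` learns from the resulting graphs (treat this as an outline rather than the book's exact code; names and signatures vary across DeepChem versions):

```python
import numpy as np
import deepchem as dc

# load Tox21 with graph featurization instead of fingerprints
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

# the model learns its own per-atom features from the molecular graph
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode='classification')
model.fit(train_dataset, nb_epoch=10)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(valid_dataset, [metric], transformers))
```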

### SMARTS Strings
- [MoleculeNet](http://moleculenet.ai) is a collection of datasets for molecular machine learning
- Ranging from low-level quantum mechanical interactions between atoms
- to high-level interactions in the human body, like toxicity and side effects
- SMARTS strings are useful if you want to see if atoms in a molecule match a pattern:
- searching a molecular database to see if a particular substructure exists
- aligning a set of molecules on a common substructure to improve visualization
- a SMARTS string is to molecules what a wildcard or regular expression pattern is to text
- So "foo*.bar" will match "foo.bar" and "foo3.bar"
- Similarly, "CCC" will match sequences of three adjacent aliphatic carbon atoms
(in SMARTS, aliphatic means non-aromatic: uppercase "C" matches carbons in chains and non-aromatic rings, while lowercase "c" matches aromatic carbons)
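
A quick sketch of SMARTS matching with RDKit (standard RDKit usage; my own example rather than the book's):

```python
from rdkit import Chem

pattern = Chem.MolFromSmarts('CCC')      # three adjacent aliphatic carbons
mol = Chem.MolFromSmiles('CCCO')         # 1-propanol
print(mol.HasSubstructMatch(pattern))    # True
print(mol.GetSubstructMatches(pattern))  # atom indices of each match, e.g. ((0, 1, 2),)
```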
16 changes: 0 additions & 16 deletions chapter3.py

This file was deleted.

10 changes: 10 additions & 0 deletions chapter_3/get_mnist_data.sh
@@ -0,0 +1,10 @@
#!/usr/bin/env bash

mkdir MNIST_DATA

cd MNIST_DATA

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
97 changes: 97 additions & 0 deletions chapter_3/mnist_digit_recognition.py
@@ -0,0 +1,97 @@
"""
In molecule_toxicity.py we used a premade model class, now we will be creating an architecture from scratch
The reason we might want to do this is if we are working on a dataset where no predefined architecture exists.
# todo what is the difference between an architecture and a model?

This works by creating two convolutional layers, each of which applies a small square filter over a subset of the image.
Then it uses two fully connected layers to predict the digit from those local features.

# todo what does that actually mean? [tk include a diagram]
"""
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

import deepchem as dc
import deepchem.models.tensorgraph.layers as layers


def create_model():
"""
Create our own MNIST model from scratch
:return:
:rtype:
"""
mnist = input_data.read_data_sets("MNIST_DATA/", one_hot=True)

# the layers from deepchem are the building blocks of what we will use to make our deep learning architecture

# now we wrap our dataset into a NumpyDataset

train_dataset = dc.data.NumpyDataset(mnist.train.images, mnist.train.labels)
test_dataset = dc.data.NumpyDataset(mnist.test.images, mnist.test.labels)

# we will create a model that will take an input, add multiple layers, where each layer takes input from the
# previous layers.

model = dc.models.TensorGraph(model_dir='mnist')

# 784 corresponds to an image of size 28 X 28
# 10 corresponds to the fact that there are 10 possible digits (0-9)
# the None indicates that we can accept a batch of any size (e.g. an empty array or 500 items, each with 784 features)
# our labels are also categorical, so we must one-hot encode them: set a single array element to 1 and the rest to 0
feature = layers.Feature(shape=(None, 784))
labels = layers.Label(shape=(None, 10))

# in order to apply convolutional layers to our input, we convert the flat vector of 784 to 28 x 28
# in_layers means it takes our feature layer as an input
make_image = layers.Reshape(shape=(None, 28, 28), in_layers=feature)

# now that we have reshaped the input, we pass to convolution layers

conv2d_1 = layers.Conv2D(num_outputs=32, activation_fn=tf.nn.relu, in_layers=make_image)

conv2d_2 = layers.Conv2D(num_outputs=64, activation_fn=tf.nn.relu, in_layers=conv2d_1)

# we want to end by applying fully connected (Dense) layers to the outputs of our convolutional layer
# but first, we must flatten the layer from a 2d matrix to a 1d vector

flatten = layers.Flatten(in_layers=conv2d_2)
dense1 = layers.Dense(out_channels=1024,activation_fn=tf.nn.relu, in_layers=flatten)

# note that this is the final layer, so out_channels=10 represents the 10 possible digits and there is no activation_fn
dense2 = layers.Dense(out_channels=10,activation_fn=None, in_layers=dense1)

# next we want to connect this output to a loss function, so we can train the output

# compute the value of loss function for every sample then average of all samples to get final loss (ReduceMean)
smce = layers.SoftMaxCrossEntropy(in_layers=[labels, dense2])
loss = layers.ReduceMean(in_layers=smce)
model.set_loss(loss)

# for MNIST we want the probability that a given sample represents one of the 10 digits
# we can achieve this by applying a softmax function to the final layer's outputs (the cross-entropy loss above is used only for training)

output = layers.SoftMax(in_layers=dense2)
model.add_output(output)

# if our model takes too long to train, reduce nb_epoch to 1
model.fit(train_dataset,nb_epoch=1)

# our metric is accuracy, the fraction of labels that are accurately predicted
metric = dc.metrics.Metric(dc.metrics.accuracy_score)

train_scores = model.evaluate(train_dataset, [metric])
test_scores = model.evaluate(test_dataset,[metric])

print('train_scores', train_scores)
print('test_scores', test_scores)

if __name__ == '__main__':
create_model()

85 changes: 85 additions & 0 deletions chapter_3/molecule_toxicity.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
import deepchem as dc
import numpy as np


def run():
"""
tox21_tasks is a list of chemical assays, and our dataset
contains training data that will tell us whether a certain molecule binds
to one of the 12 biological targets in those assays
:return:
:rtype:
"""
# first we must load the Toxicity 21 (Tox21) datasets from molnet (MoleculeNet) onto our local machine
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21()


# tox21_tasks represents 12 assays, or biological targets, that we want to see if our molecule binds to
print(tox21_tasks)


# train_dataset contains 6264 molecules, each with a feature vector of length 1024

# it also has a label array y with an entry for each of the 12 assays
train_dataset, valid_dataset, test_dataset = tox21_datasets

# the w represents the weights and a weight of zero means that no experiment was run
# to see if the molecule binds to that assay
np.count_nonzero(train_dataset.w == 0)

# this is a BalancingTransformer because most of the molecules do not bind to most targets,
# so most of the labels are zero and a model that always predicts zero could score well (but it would be useless!)
# BalancingTransformer adjusts the weights of individual data points so all classes have the same total weight,
# so the loss function won't have a systematic preference for one class
print(transformers)

train_model(train_dataset, test_dataset, transformers)

def train_model(train_dataset, test_dataset, transformers):
"""
Train the model using a multitask classifier because there are multiple outputs for each sample
and evaluate model using the mean ROC AUC.
:param train_dataset:
:type train_dataset:
:param transformers:
:type transformers:
:return:
:rtype:
"""

# this model builds a fully connected network (an MLP)
# since we have 12 assays we're testing for, being able to map to multiple outputs is ideal
# layer_sizes means that we have one hidden layer which has a width of 1,000
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])

# nb_epoch=10 means we make 10 passes over the data; each pass divides the data into batches
# and does one step of gradient descent per batch
model.fit(train_dataset, nb_epoch=10)

# how do we know how accurate our model is? we will find the mean ROC AUC score across all tasks

# What is an ROC AUC score? We are trying to predict the toxicity of the molecules,
# Receiver Operating Characteristic, Area Under Curve
# If there exists any threshold value where the true positive rate is 1 and the false positive rate is 0, then the score is 1
# so we pick a threshold of what is considered a toxic molecule
# if we pick a threshold value that's too low, we will say too many safe molecules are toxic (high false positive rate)
# alternatively, if we pick one too high, we will say that toxic molecules are safe (high false negative rate)
# note on understanding false positive terminology:
# Imagine a molecule that is actually toxic. "Is this molecule toxic?" "No." We gave a negative response
# the answer is relative to what we are testing for; in this case, we are testing if a molecule is toxic
# so we are making a tradeoff between high false positives vs high false negatives, and we visualize it with
# an ROC curve, which graphs the tradeoff between the false positive rate and the true positive rate;
# the AUC is the area under that curve
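
# a tiny illustration of the metric itself on a toy example (assumption: scikit-learn is
# available, since DeepChem depends on it; this is my own sketch, not code from the book):
#   from sklearn.metrics import roc_auc_score
#   roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75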

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)

# evaluate the performance of this model on the train_dataset using the ROC AUC metric

train_scores = model.evaluate(train_dataset, [metric], transformers)
test_scores = model.evaluate(test_dataset, [metric], transformers)

# the train scores are higher than our test scores, which shows us that our model has overfit
print(f'train_scores: {train_scores}')
print(f'test_scores: {test_scores}')

if __name__ == '__main__':
run()