an example implementation of a cnn for predicting bait performance. by yfarjoun · Pull Request #9 · broadinstitute/dsde-deep-learning

yfarjoun · 2017-08-18T18:21:35Z

It doesn't work very well on the ICE sample I tried it with, probably because the baits are not equimolar, and predicting the molarity is quite difficult. Still, it might be a useful template for someone.


mbabadi · 2017-08-18T21:20:38Z

Do we know for a fact that the baits are not equimolar, or is it just a hunch? If it is a fact, then predicting bait efficiency is obviously impossible, in particular if the variance in molarity exceeds the context-dependent variance. The relative spatial capture efficiency of each bait, however, is not affected by this confounding factor and may be predictable.

lucidtronix

Cool work! Some nitpicks and suggestions...

lucidtronix · 2017-08-18T20:59:39Z

+
+import os
+import math
+# import h5py


remove unused

lucidtronix · 2017-08-18T21:03:37Z

+    my_metrics = [metrics.mean_squared_error, rmse_log]
+
+    model.compile(loss='mean_squared_error', optimizer=sgd, metrics=my_metrics)
+    print('model summary:\n', model.summary())


My bug, but this should just be
model.summary()
That function already prints the summary and returns None, so wrapping it in print() adds a stray "None".

lucidtronix · 2017-08-18T21:04:06Z

+    # the Input layer and three Dense layers
+    model = Model(input=[input_baits, input_annotations], output=predictions)
+    model.compile(loss=gme, optimizer=sgd, metrics=my_metrics)
+    print('model summary:\n', model.summary())


model.summary()

lucidtronix · 2017-08-18T21:14:23Z

+                     activation="relu",
+                     init='normal'))
+
+    model.add(MaxPooling1D(pool_length=3, stride=3))


I try to avoid maxpooling this early in the model for genetic data. For images we have a strong smoothness prior which we don't have on DNA sequences.
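A toy illustration of this concern, written in plain Python rather than Keras: max pooling keeps only whether a base occurs inside each window, so two different sequences can pool to identical features. The helper names `one_hot_channel` and `max_pool_1d` are hypothetical, and the pooling is applied directly to one-hot input here, which is a simplification of the model in this PR (where pooling follows a convolution).

```python
def one_hot_channel(sequence, base):
    """Return the 0/1 indicator track for a single base."""
    return [1 if b == base else 0 for b in sequence]


def max_pool_1d(values, pool_size=3, stride=3):
    """Max over each window; a trailing partial window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, stride)]


def pooled_features(sequence):
    """Pool every base channel of a one-hot encoded sequence."""
    return {base: max_pool_1d(one_hot_channel(sequence, base))
            for base in "ACGT"}


seq_a = "ACGTTG"
seq_b = "CAGTGT"  # same bases per window of 3, different order

pooled_a = pooled_features(seq_a)
pooled_b = pooled_features(seq_b)
print(pooled_a == pooled_b)  # the within-window order is no longer visible
```

Whether this loss matters in practice depends on the filters learned before the pooling layer; the sketch only shows that positional detail inside a window cannot be recovered afterwards.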

lucidtronix · 2017-08-18T21:14:56Z

+
+    x = Dropout(0.2)(x)
+
+    x = MaxPooling1D(strides=3, pool_size=3)(x)


Max pooling could be risky this early; see above.

lucidtronix · 2017-08-18T21:57:40Z

+    bait_shape = (args.window_size, len(args.inputs),)
+    annotation_shape = (len(args.annotations),)
+
+    print(bait_shape)


add label or remove

lucidtronix · 2017-08-18T21:58:24Z

+
+    predictions = Dense(units=1, init=RandomNormal(mean=1.0, stddev=0.5, seed=None), activation=None)(xy)
+
+    sgd = SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=0.5)


As we discussed maybe try the Adam optimizer.

lucidtronix · 2017-08-18T22:00:20Z

+    count = 0
+    while count < args.samples:
+        contig_key, start, end, coverage, gc = sample_from_bed(baits_and_coverages)
+        mid = (start + end) / 2


change to // if you import future division
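A small sketch of the difference, using made-up coordinates. Under Python 3 (or Python 2 with `from __future__ import division`) the `/` operator always returns a float, while `//` floors to an integer, which is what an index or genomic position needs.

```python
start, end = 100, 221  # hypothetical bait coordinates

mid_true = (start + end) / 2    # true division: a float
mid_floor = (start + end) // 2  # floor division: an int, usable as an index

print(mid_true, mid_floor)
```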

lucidtronix · 2017-08-18T22:04:51Z

+grep -v '^@' whole_exome_illumina_coding_v1.Homo_sapiens_assembly19.targets.interval_list | awk 'BEGIN{FS="\t";OFS="\t"}{print $1, $2-1,$3}' > whole_exome_illumina_coding_v1.Homo_sapiens_assembly19.targets.bed
+grep -v '^@' whole_exome_illumina_coding_v1.Homo_sapiens_assembly19.baits.interval_list | awk 'BEGIN{FS="\t";OFS="\t"}{print $1, $2-1,$3}' > whole_exome_illumina_coding_v1.Homo_sapiens_assembly19.baits.bed
+
+bedtools slop -i  whole_exome_illumina_coding_v1.Homo_sapiens_assembly19.baits.bed -b 250 -g hg19.genome > sloppy.baits.bed


love this tool
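For readers unfamiliar with the two steps in that diff, here is a rough Python rendering of what they do to a single record. The awk command shifts the 1-based, closed interval_list start to the 0-based, half-open BED convention, and `bedtools slop -b` pads both ends while clamping to the contig bounds. The function names and the example contig size are hypothetical.

```python
def interval_to_bed(chrom, start, end):
    """1-based closed interval -> 0-based half-open BED record."""
    return chrom, start - 1, end


def slop(chrom, start, end, flank, chrom_sizes):
    """Pad a BED record by `flank` on both sides, clamped to the contig."""
    return chrom, max(0, start - flank), min(chrom_sizes[chrom], end + flank)


chrom_sizes = {"1": 1000}  # stand-in for the hg19.genome file
bed_record = interval_to_bed("1", 100, 200)
padded = slop(*bed_record, flank=250, chrom_sizes=chrom_sizes)
print(bed_record, padded)
```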

lucidtronix · 2017-08-18T22:16:34Z

+    return c_idx, p_idx
+
+
+# TODO: make more random (this gives too much power to the small contigs)



yfarjoun · 2017-08-23T12:42:47Z

pushed a few more commits...care to comment?

lucidtronix

Some small changes and comment cleanup. But the bigger question is: why isn't the model learning more? Judging by scatter_bait_performance_cnn_model.jpg, it looks like underfitting, not overfitting: training and test performance are similar, and training error is far from 0. Is it because non-sequence (and non-bait-position) factors are responsible for bait coverage? Is the equimolarity assumption wrong? Are the bait positions normalized between 0 and 1? They probably should be. I would be curious how a pure sequence model performs. Another debug idea is to cast this as classification rather than regression; see the model.compile comment on line 210.
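A minimal sketch of the normalization being asked about, assuming the position annotation is a plain list of numbers. The helper name `min_max_normalize` is hypothetical and not part of the PR.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


positions = [100, 150, 200]  # made-up bait positions
normalized = min_max_normalize(positions)
print(normalized)
```

In a real training setup the minimum and maximum would be computed on the training set only and reused for the test set.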

lucidtronix · 2017-08-25T05:10:23Z

+    contig_sizes = {key: len(bed_dict[key]) for key in bed_dict.keys()}
+    total_size = sum(contig_sizes.values())

+    contig_key = np.random.choice(bed_dict.keys(), 1, p=[x / total_size for x in contig_sizes.values()])[0]
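A standard-library sketch of size-weighted contig sampling, swapping `random.choices` in for `np.random.choice`. It assumes, based on the `bed_dict[contig][0]` / `bed_dict[contig][1]` usage elsewhere in this PR, that each entry is a `(lows, ups)` pair, so the interval count is `len(lows)` rather than `len(bed_dict[key])`; that reading is a guess. Note also that on Python 3 the `bed_dict.keys()` view would likely need wrapping in `list()` before being passed to NumPy.

```python
import random


def sample_contig(bed_dict, rng=random):
    """Pick a contig with probability proportional to its interval count."""
    contigs = list(bed_dict.keys())
    # Assumption: bed_dict[contig] == (lows, ups), two equal-length sequences.
    weights = [len(bed_dict[c][0]) for c in contigs]
    return rng.choices(contigs, weights=weights, k=1)[0]


toy_bed = {
    "1": ([0, 10, 20], [5, 15, 25]),  # three intervals
    "2": ([], []),                    # no intervals, so never sampled
}
rng = random.Random(0)
draws = [sample_contig(toy_bed, rng) for _ in range(50)]
print(set(draws))
```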


lucidtronix · 2017-08-25T05:16:28Z

+    lows = bed_dict[contig][0]
+    ups = bed_dict[contig][1]
+
+    return np.any((lows <= pos) & (pos <= ups))


Again my bug, which got fixed in vqsr but was never cherry-picked back here. I believe this should be:
return np.any((lows <= pos) & (pos < ups))
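The reason for the strict upper bound is that BED intervals are 0-based and half-open, so the end coordinate is not part of the interval. A plain-Python sketch of the same check (the function name `in_bed` is hypothetical and lists stand in for the NumPy arrays):

```python
def in_bed(lows, ups, pos):
    """True if pos falls inside any half-open interval [low, up)."""
    return any(low <= pos < up for low, up in zip(lows, ups))


lows, ups = [10, 50], [20, 60]
print(in_bed(lows, ups, 10), in_bed(lows, ups, 19), in_bed(lows, ups, 20))
```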

lucidtronix · 2017-08-25T05:20:11Z

+
+def gme(y_true, y_pred):
+    """calculates the root (geometric) mean squared error of the values."""
+    return K.exp(K.mean(K.log(K.abs(np.divide(y_true + .001, y_pred + .001) - 1))))


You could use K.epsilon() instead of the hardcoded .001; this value is initialized from your ~/.keras/keras.json file and can be set via K.set_epsilon(1e-05). Also, I think the division operator / is appropriately overloaded to handle this without np.divide().
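To make the arithmetic of `gme` concrete, here is a plain-Python rendering that uses `math` instead of the Keras backend, with the fuzz factor passed as a parameter. The default of 1e-7 mirrors the stock Keras epsilon; everything else follows the expression in the diff. This is only a sketch for checking values by hand, not a drop-in loss function.

```python
import math

EPSILON = 1e-7  # default Keras fuzz factor; configurable in keras.json


def gme(y_true, y_pred, eps=EPSILON):
    """Geometric mean of |y_true / y_pred - 1|, with eps added to both terms."""
    logs = [math.log(abs((t + eps) / (p + eps) - 1))
            for t, p in zip(y_true, y_pred)]
    return math.exp(sum(logs) / len(logs))


# Each prediction is half the true value, so every relative error is about 1.
value = gme([2.0, 4.0], [1.0, 2.0])
print(value)
```

One caveat: an exact prediction makes the relative error zero, and `math.log(0)` raises a ValueError here, whereas a tensor backend would return negative infinity and drive the whole geometric mean to zero.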

lucidtronix · 2017-08-25T05:21:46Z

+    contig_sizes = {key: len(bed_dict[key]) for key in bed_dict.keys()}
+    total_size = sum(contig_sizes.values())
+
+    contig_key = np.random.choice(bed_dict.keys(), 1, p=[x / total_size for x in contig_sizes.values()])[0]


lucidtronix · 2017-08-25T05:24:37Z

+# from math import sqrt
+#
+#
+# def put_kernels_on_grid(kernel, pad=1):


Can this be uncommented? Seems like a helpful fxn...

it might be, but I failed in getting it to work...adding it here in a comment so that folks have something to work with...

lucidtronix · 2017-08-25T05:26:05Z

+    model = Model(inputs=[input_baits, input_annotations], outputs=predictions)
+
+    # # add some TensorBoard annotations
+    # conv1d_1 = filter(lambda y: y.name == "conv1d_1", model.layers)[0]


Why is this commented out?

because I couldn't get it to work, but I wanted to show it to you to see if you could help!

Ok, are you trying to visualize the weights in tensorboard or something more complicated?

lucidtronix · 2017-08-25T05:26:37Z

+    # filters=put_kernels_on_grid(reshaped, 2)
+    #
+    # merged = tf.summary.merge_all()
+    # train_writer = tf.summary.FileWriter("./log/" + '/train')


Does this conflict with the TensorBoard callback?

lucidtronix · 2017-08-25T05:40:01Z

+    # train_writer = tf.summary.FileWriter("./log/" + '/train')
+    #
+    adamo = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, clipnorm=1.)
+    model.compile(loss=metrics.mean_squared_error, optimizer=adamo, metrics=my_metrics)


I'm wondering if we should try this as a classification problem by binning the coverage first and then trying to predict (high, med, low) or (too high, high, medium, low, too low) or something. Then we could use categorical crossentropy as our loss, which tends to have more well-behaved learning dynamics. We could also use categorical accuracy as a metric to get a quick idea of how the model compares to chance...

and then what? modify the loss so that it tries to maximize xentropy + MSE?

I was thinking to just try categorical crossentropy, though we could do it as a multi-task problem where the model tries to do both. It is easy to do with the functional API. Something like:

regression = Dense(units=1, kernel_initializer=RandomNormal(mean=1.0, stddev=0.5, seed=None), activation="relu")(xy)
classification = Dense(units=5, activation='softmax')(xy)
model = Model(inputs=[input_baits, input_annotations], outputs=[regression, classification])
# losses are matched to outputs by position: MSE for regression, crossentropy for classification
model.compile(loss=[metrics.mean_squared_error, 'categorical_crossentropy'], optimizer=adamo, metrics=my_metrics)

Assuming 5 bins for the quantized coverage...
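The binning step itself could look roughly like the sketch below, which cuts coverage at empirical quantiles so the five classes are about equally populated and then one-hot encodes the class index for a categorical crossentropy target. All helper names and the toy coverage values are hypothetical; equal-width bins or hand-picked thresholds would be equally valid choices.

```python
import bisect


def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points taken at evenly spaced ranks."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def to_bin(value, edges):
    """Index of the bin a value falls into (0 .. len(edges))."""
    return bisect.bisect_right(edges, value)


def one_hot(index, n_bins):
    """One-hot vector suitable as a categorical target."""
    return [1 if i == index else 0 for i in range(n_bins)]


coverage = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up normalized coverages
edges = quantile_edges(coverage, 5)
labels = [to_bin(c, edges) for c in coverage]
targets = [one_hot(label, 5) for label in labels]
print(edges, labels)
```

With balanced bins like these, chance-level categorical accuracy is 1/5, which gives the quick baseline mentioned above.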

yfarjoun · 2017-08-28T11:37:23Z

I've chatted with Tim about this, he's pretty sure that the baits are NOT equi-molar...so unless I can get the molarity mixture, I'm surprised that this worked at all....

lucidtronix · 2017-08-28T20:56:19Z

I'm not so surprised that there is some predictive value in the sequence alone.. we've now seen that on a few different tasks: variant filtering, indel modeling, bait performance. Anyway, it seems like it may be difficult to track down the molarity mixture. Should we leave this PR open while you search or do you want to merge?

yfarjoun · 2017-08-29T10:21:18Z

I'll fix it up and merge...who knows how long it will take to get the molarity or spike-in list...

an example implementation of a cnn for predicting bait performance. It doesn't work very well on the ICE sample I tried it with, probably because the baits are not equimolar, and predicting the molarity is quite difficult

512971a

yfarjoun requested a review from lucidtronix August 18, 2017 18:21

lucidtronix self-assigned this Aug 18, 2017

lucidtronix requested changes Aug 18, 2017


yfarjoun added 4 commits August 19, 2017 10:08

changed the optimizer and the loss function. performance is somewhat improved

8ae847d

responding to review:
- removed unused imports
- changed dropout from .2 to .3
- removed first MaxPooling layer
- added informative labels to print commands
- made division floating point

1457e12

adding TensorBoard capability, tried to visualize the filters, but failed miserably...anyone care to see what I did wrong?

51c7a33

added performance plot. network not really working, but learning a little...

3da4c6c

lucidtronix reviewed Aug 25, 2017
