@@ -19,9 +19,14 @@ variant calls. At the highest level, a user needs to provide three inputs:
The output of DeepVariant is a list of all variant calls in
[VCF](https://samtools.github.io/hts-specs/VCFv4.3.pdf) format.

- DeepVariant is composed of three programs: `make_examples`, `call_variants`, and
- `postprocess_variants`. More details about each program are described in detail
- in the [Inputs and outputs](#inputs-and-outputs) section.
+ DeepVariant can be run in a variety of ways. The simplest method is to use
+ `run_deepvariant`, which will configure DeepVariant for a given model type
+ (`--model_type`) and run all subprograms.
+
+ However, under the hood, DeepVariant is composed of three programs:
+ `make_examples`, `call_variants`, and `postprocess_variants`. Each program is
+ described in detail in the [Inputs and outputs](#inputs-and-outputs) section.
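As a sketch of the simple path, a `run_deepvariant` invocation via Docker looks like the following. The mounted directories and file names are hypothetical placeholders; the flags shown (`--model_type`, `--ref`, `--reads`, `--output_vcf`, `--num_shards`) are the standard ones from the quick start.

```shell
# Hypothetical paths; adjust BIN_VERSION and the mounted directories for your setup.
BIN_VERSION="1.8.0"
docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/input/reference.fasta \
  --reads=/input/sample.bam \
  --output_vcf=/output/sample.vcf.gz \
  --num_shards=4
```

Here `--model_type` selects sensible flag defaults for each subprogram, and `--num_shards` controls how many parallel `make_examples` tasks are run.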
## Inputs and outputs
[Using TFRecords and tf.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord)
Colab.
- `make_examples` is a single-threaded program using 1-2 GB of RAM. Since the
- process of generating examples is embarrassingly parallel across the genome,
- `make_examples` supports sharding of its input and output via the `--task`
- argument with a sharded output specification. For example, if the output is
- specified as `--examples [email protected]` and `--task 0`, the input to
- the program will be 10% of the regions and the output will be written to
- `examples.tfrecord-00000-of-00010.gz`.
+ `make_examples` is a single-threaded program. Since the process of generating
+ examples is embarrassingly parallel across the genome, `make_examples` supports
+ sharding of its input and output via the `--task` argument with a sharded output
+ specification. For example, if the output is specified as
+ `--examples [email protected]` and `--task 0`, the input to the program
+ will be 10% of the regions and the output will be written to
+ `examples.tfrecord-00000-of-00010.gz`. Memory usage per instance is detailed
+ below. These values do not consider the small-model or fast-pipeline modes of
+ `make_examples`, and should be considered upper limits on memory usage.
63
+
64
+ ** Memory Usage**
65
+
66
+ These values were calculated by looking at memory usage for 1/25th of chr20.
67
+
68
+ | Model Type | version | mean_mem_mb | median_mem_mb | max_mem_mb |
69
+ | ------------| ---------| -------------| ---------------| ------------|
70
+ | PacBio | 1.6.1 | 464.6 | 426.4 | 590.7 |
71
+ | PacBio | 1.8.0 | 487.0 | 457.1 | 630.3 |
72
+
73
+ | Model Type | version | mean_mem_mb | median_mem_mb | max_mem_mb |
74
+ | ----------------| ---------| -------------| ---------------| ------------|
75
+ | Illumina (WGS) | 1.6.1 | 404.7 | 403.9 | 418.3 |
76
+ | Illumina (WGS) | 1.8.0 | 435.1 | 434.7 | 449.3 |
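The sharded-output naming convention described above can be sketched with plain `printf` (no DeepVariant required; the spec and task numbers are just the ones from the example):

```shell
# A spec like "[email protected]" plus "--task T" yields shard T of N,
# with both numbers zero-padded to five digits.
num_shards=10
for task in 0 1 9; do
  printf 'examples.tfrecord-%05d-of-%05d.gz\n' "$task" "$num_shards"
done
# prints:
#   examples.tfrecord-00000-of-00010.gz
#   examples.tfrecord-00001-of-00010.gz
#   examples.tfrecord-00009-of-00010.gz
```

In practice, one `make_examples` process is launched per task index (e.g. with GNU `parallel` or a workflow engine), and each process writes exactly one shard.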
#### Input assumptions
@@ -74,10 +95,11 @@ present in the BAM but not in the reference.
The BAM file must also be sorted and indexed. It must exist on disk, so you
cannot pipe it into DeepVariant. Duplicate marking may be performed; in our
analyses there is almost no difference in accuracy except at lower (<20x)
- coverages. Finally, we recommend that you do not perform BQSR. Running BQSR has
- a small decrease on accuracy. It is not necessary to do any form of indel
- realignment, though there is not a difference in DeepVariant accuracy either
- way.
+ coverages. Finally, we recommend that you do not perform
+ [BQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR).
+ Running BQSR results in a small decrease in accuracy. It is not necessary to do
+ any form of indel realignment, though there is not a difference in DeepVariant
+ accuracy either way.
Third, if you are providing `--regions` or other similar arguments, these should
refer to contigs present in the reference genome. These arguments accept
@@ -97,9 +119,9 @@ the one provided with the `--ref` argument.
### call_variants
- `call_variants` consumes TFRecord file(s) of tf.Examples protos created
- by `make_examples` and a deep learning model checkpoint and evaluates the model
- on each example in the input TFRecord. The output here is a TFRecord of
+ `call_variants` consumes TFRecord file(s) of tf.Examples protos created by
+ `make_examples` and a deep learning model checkpoint and evaluates the model on
+ each example in the input TFRecord. The output here is a TFRecord of
CallVariantsOutput protos. `call_variants` doesn't directly support sharding its
outputs, but accepts a glob or shard-pattern for its inputs.
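That input behavior can be illustrated with ordinary shell globbing (placeholder empty files standing in for real `make_examples` shards):

```shell
# Create placeholder shard files, then show that a single glob pattern
# (as could be passed to call_variants for its inputs) matches them all.
tmp="$(mktemp -d)"
for task in 0 1 2; do
  touch "$tmp/$(printf 'examples.tfrecord-%05d-of-00003.gz' "$task")"
done
ls "$tmp"/examples.tfrecord-*-of-00003.gz | wc -l  # 3 shards matched
rm -rf "$tmp"
```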
@@ -141,8 +163,9 @@ Key changes and improvements include:
We have made a number of improvements to the methodology as well. The biggest
change was to move away from RGB-encoded (3-channel) pileup images and instead
- represent the aligned read data using a multi-channel tensor data layout. We
- currently represent the data as a 6-channel raw tensor in which we encode:
+ represent the aligned read data using a multi-channel tensor data layout.
+ Channels represent sequencing features. All of our models currently have a set
+ of 6 'base channels':
* The read base (A, C, G, T)
* The base's quality score
@@ -151,8 +174,13 @@ currently represent the data as a 6-channel raw tensor in which we encode:
* Does the read support the allele being evaluated?
* Does the base match the reference genome at this position?
- These are all readily derived from the information found in the BAM file
- encoding of each read.
+ ![base channels](images/base_channels.png)
+
+ We can add additional channels to this base channel set to tailor the model
+ input to a particular sequencing platform or technology to maximize accuracy.
+ For example, for our Illumina models (`wgs`, `exome`) we add an additional
+ `insert_size` channel. Because long-read data can be phased, we add a
+ `haplotype` channel to our `pacbio` and `ont` models.
Additional modeling changes were to move to the inception-v3 architecture and to
train on many more independent sequencing replicates of the ground truth
@@ -228,8 +256,9 @@ machines.
## Starting from v1.2.0, we include `samtools` and `bcftools`.

- Based on user feedback ([GitHub issue #414](https://github.com/google/deepvariant/issues/414)),
- we added samtools and bcftools in our Docker image:
+ Based on user feedback
+ ([GitHub issue #414](https://github.com/google/deepvariant/issues/414)), we
+ added samtools and bcftools in our Docker image:
``` bash
docker run google/deepvariant:"${BIN_VERSION}" samtools
```
@@ -283,8 +312,8 @@ gcloud compute instances create "${USER}-gpu" \
--min-cpu-platform "Intel Skylake"
```
- NOTE: Having an instance up and running could cost you. Remember to delete the
- instances you're not using. You can find the instances at:
+ NOTE: Be sure to manage instances efficiently. Remember to delete the instances
+ you're not using. You can find the instances at:
https://console.cloud.google.com/compute/instances?project=YOUR_PROJECT
[exome case study]: deepvariant-exome-case-study.md