Commit 66c74e6

danielecook authored and copybara-github committed
Update README.md to mention DeepSomatic.
PiperOrigin-RevId: 700771628
1 parent af9ed9a commit 66c74e6

3 files changed: +60 -28 lines changed


README.md

+6 -3
@@ -33,11 +33,14 @@ DeepVariant supports germline variant-calling in diploid organisms.
 [T7 case study](docs/deepvariant-complete-t7-case-study.md);
 [G400 case study](docs/deepvariant-complete-g400-case-study.md)

+We have also adapted DeepVariant for somatic calling. See the
+[github.com/google/deepsomatic](DeepSomatic) repo for details.
+
 Please also note:

-* For somatic data or any other samples where the genotypes go beyond two
-copies of DNA, DeepVariant will not work out of the box because the only
-genotypes supported are hom-alt, het, and hom-ref.
+* DeepVariant currently supports variant calling on organisms where the
+ploidy/copy-number is two. This is because the genotypes supported are
+hom-alt, het, and hom-ref.
 * The models included with DeepVariant are only trained on human data. For
 other organisms, see the
 [blog post on non-human variant-calling](https://google.github.io/deepvariant/posts/2018-12-05-improved-non-human-variant-calling-using-species-specific-deepvariant-models/)

docs/deepvariant-details.md

+54 -25
@@ -19,9 +19,14 @@ variant calls. At the highest level, a user needs to provide three inputs:
 The output of DeepVariant is a list of all variant calls in
 [VCF](https://samtools.github.io/hts-specs/VCFv4.3.pdf) format.

-DeepVariant is composed of three programs: `make_examples`, `call_variants`, and
-`postprocess_variants`. More details about each program are described in detail
-in the [Inputs and outputs](#inputs-and-outputs) section.
+DeepVariant can be run in a variety of ways. The simplest method is to use
+`run_deepvariant`, which will configure DeepVariant for a given model type
+(`--model_type`), and run all subprograms.
+
+However, under the hood, DeepVariant is composed of three programs:
+`make_examples`, `call_variants`, and `postprocess_variants`. More details about
+each program are described in detail in the
+[Inputs and outputs](#inputs-and-outputs) section.

 ## Inputs and outputs

@@ -46,13 +51,29 @@ the
 [Using TFRecords and tf.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord)
 Colab.

-`make_examples` is a single-threaded program using 1-2 GB of RAM. Since the
-process of generating examples is embarrassingly parallel across the genome,
-`make_examples` supports sharding of its input and output via the `--task`
-argument with a sharded output specification. For example, if the output is
-specified as `--examples examples.tfrecord@10.gz` and `--task 0`, the input to
-the program will be 10% of the regions and the output will be written to
-`examples.tfrecord-00000-of-00010.gz`.
+`make_examples` is a single-threaded program. Since the process of generating
+examples is embarrassingly parallel across the genome, `make_examples` supports
+sharding of its input and output via the `--task` argument with a sharded output
+specification. For example, if the output is specified as `--examples
+examples.tfrecord@10.gz` and `--task 0`, the input to the program will be 10% of
+the regions and the output will be written to
+`examples.tfrecord-00000-of-00010.gz`. Memory usage per instance is detailed
+below. These values do not consider the small model, or the fast-pipeline modes
+of make examples, and should be considered as upper limits of memory usage.
+
+**Memory Usage**
+
+These values were calculated by looking at memory usage for 1/25th of chr20.
+
+| Model Type | version | mean_mem_mb | median_mem_mb | max_mem_mb |
+|------------|---------|-------------|---------------|------------|
+| PacBio | 1.6.1 | 464.6 | 426.4 | 590.7 |
+| PacBio | 1.8.0 | 487.0 | 457.1 | 630.3 |
+
+| Model Type | version | mean_mem_mb | median_mem_mb | max_mem_mb |
+|----------------|---------|-------------|---------------|------------|
+| Illumina (WGS) | 1.6.1 | 404.7 | 403.9 | 418.3 |
+| Illumina (WGS) | 1.8.0 | 435.1 | 434.7 | 449.3 |
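
Reviewer note: the sharding convention and the memory table in this hunk can be sketched in a few lines of Python. The helper names `shard_filename` and `estimate_peak_memory_mb` are hypothetical and not part of DeepVariant. The memory figures are the 1.8.0 `max_mem_mb` values from the table, which were measured on 1/25th of chr20, so the product is only a rough upper bound for instances running concurrently.

```python
import re

# Hypothetical helpers for illustration only; not part of DeepVariant.

# max_mem_mb values for version 1.8.0, copied from the table above.
MAX_MEM_MB = {"PacBio": 630.3, "Illumina (WGS)": 449.3}


def shard_filename(sharded_spec: str, task: int) -> str:
    """Expand a spec such as 'examples.tfrecord@10.gz' for a single task."""
    match = re.fullmatch(r"(?P<prefix>.+)@(?P<n>\d+)(?P<suffix>\..+)?", sharded_spec)
    if match is None:
        raise ValueError(f"not a sharded spec: {sharded_spec!r}")
    num_shards = int(match.group("n"))
    if not 0 <= task < num_shards:
        raise ValueError(f"task {task} is out of range for {num_shards} shards")
    suffix = match.group("suffix") or ""
    return f"{match.group('prefix')}-{task:05d}-of-{num_shards:05d}{suffix}"


def estimate_peak_memory_mb(model_type: str, num_instances: int) -> float:
    """Upper-bound memory estimate for instances that run at the same time."""
    return MAX_MEM_MB[model_type] * num_instances


print(shard_filename("examples.tfrecord@10.gz", 0))  # examples.tfrecord-00000-of-00010.gz
print(estimate_peak_memory_mb("PacBio", 10))
```

Each of the ten tasks would be launched with its own `--task` value and would write its own shard file.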

 #### Input assumptions

@@ -74,10 +95,11 @@ present in the BAM but not in the reference.
 The BAM file must be also sorted and indexed. It must exist on disk, so you
 cannot pipe it into DeepVariant. Duplicate marking may be performed, in our
 analyses there is almost no difference in accuracy except at lower (<20x)
-coverages. Finally, we recommend that you do not perform BQSR. Running BQSR has
-a small decrease on accuracy. It is not necessary to do any form of indel
-realignment, though there is not a difference in DeepVariant accuracy either
-way.
+coverages. Finally, we recommend that you do not perform
+[BQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR).
+Running BQSR results in a small decrease in accuracy. It is not necessary to do
+any form of indel realignment, though there is not a difference in DeepVariant
+accuracy either way.

 Third, if you are providing `--regions` or other similar arguments these should
 refer to contigs present in the reference genome. These arguments accept
@@ -97,9 +119,9 @@ the one provided with the `--ref` argument.

 ### call_variants

-`call_variants` consumes TFRecord file(s) of tf.Examples protos created
-by `make_examples` and a deep learning model checkpoint and evaluates the model
-on each example in the input TFRecord. The output here is a TFRecord of
+`call_variants` consumes TFRecord file(s) of tf.Examples protos created by
+`make_examples` and a deep learning model checkpoint and evaluates the model on
+each example in the input TFRecord. The output here is a TFRecord of
 CallVariantsOutput protos. `call_variants` doesn't directly support sharding its
 outputs, but accepts a glob or shard-pattern for its inputs.
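
Reviewer note: to make the two input conventions concrete, the sketch below lists the filenames implied by a ten-way shard pattern and checks that a glob selects the same set. It uses only Python's standard library. `expand_shard_pattern` is a hypothetical helper, the extra filenames are made up for the demonstration, and how `call_variants` resolves these patterns internally is not shown in this diff.

```python
import fnmatch


def expand_shard_pattern(prefix: str, num_shards: int, suffix: str = "") -> list[str]:
    """List every filename implied by a pattern such as 'examples.tfrecord@10.gz'."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}{suffix}" for i in range(num_shards)]


# The ten shard files that a ten-way sharded run would produce.
shards = expand_shard_pattern("examples.tfrecord", 10, ".gz")

# A made-up directory listing containing the shards plus unrelated files.
directory_listing = shards + ["notes.txt", "calls.vcf"]

# Each '?' matches exactly one character, so only the shard files are selected.
matched = fnmatch.filter(directory_listing, "examples.tfrecord-?????-of-00010.gz")

print(len(matched))
print(matched == shards)
```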

@@ -141,8 +163,9 @@ Key changes and improvements include:

 We have made a number of improvements to the methodology as well. The biggest
 change was to move away from RGB-encoded (3-channel) pileup images and instead
-represent the aligned read data using a multi-channel tensor data layout. We
-currently represent the data as a 6-channel raw tensor in which we encode:
+represent the aligned read data using a multi-channel tensor data layout.
+Channels represent sequencing features. All of our models currently have a set
+of 6 'base channels':

 * The read base (A, C, G, T)
 * The base's quality score
@@ -151,8 +174,13 @@ currently represent the data as a 6-channel raw tensor in which we encode:
 * Does the read support the allele being evaluated?
 * Does the base match the reference genome at this position?

-These are all readily derived from the information found in the BAM file
-encoding of each read.
+![base channels](images/base_channels.png)
+
+We can add additional channels to this base channel set to tailor the model
+input to a particular sequencing platform or technology to maximize accuracy.
+For example, for our Illumina models (`wgs`, `exome`) we add an additional
+`insert_size` channel. Because long-read data can be phased, we add a
+`haplotype` channel to our `pacbio` and `ont` models.
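
Reviewer note: the channel layout described in the added paragraph can be sketched as follows. This is a minimal illustration that assumes each model uses exactly the six base channels plus the one extra channel named in the text. Whether released models carry further channels is not stated in this diff. The names `EXTRA_CHANNELS`, `channel_count`, and `empty_pileup` are hypothetical, and the tensor dimensions in the usage lines are arbitrary.

```python
BASE_CHANNEL_COUNT = 6  # the six 'base channels' shared by all models

# Extra channels named in the text; other model types are not covered here.
EXTRA_CHANNELS = {
    "wgs": ["insert_size"],
    "exome": ["insert_size"],
    "pacbio": ["haplotype"],
    "ont": ["haplotype"],
}


def channel_count(model_type: str) -> int:
    """Number of channels for a model type under the assumption stated above."""
    if model_type not in EXTRA_CHANNELS:
        raise KeyError(f"model type not described in the text: {model_type!r}")
    return BASE_CHANNEL_COUNT + len(EXTRA_CHANNELS[model_type])


def empty_pileup(model_type: str, height: int, width: int) -> list:
    """Zero-filled nested lists shaped [height][width][channels]."""
    channels = channel_count(model_type)
    return [[[0] * channels for _ in range(width)] for _ in range(height)]


# Arbitrary, tiny dimensions chosen only for the demonstration.
tensor = empty_pileup("pacbio", height=2, width=3)
print(channel_count("wgs"), len(tensor[0][0]))
```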

 Additional modeling changes were to move to the inception-v3 architecture and to
 train on many more independent sequencing replicates of the ground truth
@@ -228,8 +256,9 @@ machines.

 ## Starting from v1.2.0, we include `samtools` and `bcftools`.

-Based on user feedback ([GitHub issue #414](https://github.com/google/deepvariant/issues/414)),
-we added samtools and bcftools in our Docker image:
+Based on user feedback
+([GitHub issue #414](https://github.com/google/deepvariant/issues/414)), we
+added samtools and bcftools in our Docker image:

 ```bash
 docker run google/deepvariant:"${BIN_VERSION}" samtools
@@ -283,8 +312,8 @@ gcloud compute instances create "${USER}-gpu" \
 --min-cpu-platform "Intel Skylake"
 ```

-NOTE: Having an instance up and running could cost you. Remember to delete the
-instances you're not using. You can find the instances at:
+NOTE: Be sure to manage instances efficiently. Remember to delete the instances
+you're not using. You can find the instances at:
 https://console.cloud.google.com/compute/instances?project=YOUR_PROJECT

 [exome case study]: deepvariant-exome-case-study.md

docs/images/base_channels.png

126 KB