This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Merge pull request #32 from data-lessons/tr-cosmetic-changes
cosmetic changes
taylorreiter authored Feb 1, 2019
2 parents 7167f56 + ebfeafd commit 4fe0b77
Showing 5 changed files with 26 additions and 22 deletions.
8 changes: 4 additions & 4 deletions _episodes/01-background.md
@@ -12,10 +12,10 @@ objectives:

# Background

We are going to use a long-term sequencing dataset from a population of *Escherichia coli* (designated *Ara-3*).
We are going to use a long-term sequencing dataset from a population of *Escherichia coli*.

- **What is *E. coli*?**
- *E. coli* are rod-shaped bacteria can survive under a wide variety of conditions, including variable temperatures, nutrient availability, and oxygen levels. Most strains are harmless, but some are associated with food poisoning.
- *E. coli* are rod-shaped bacteria that can survive under a wide variety of conditions, including variable temperatures, nutrient availability, and oxygen levels. Most strains are harmless, but some are associated with food poisoning.

![ [Wikimedia](https://species.wikimedia.org/wiki/Escherichia_coli#/media/File:EscherichiaColi_NIAID.jpg) ](../img/172px-EscherichiaColi_NIAID.jpg)

@@ -28,14 +28,14 @@ We are going to use a long-term sequencing dataset from a population of *Escheri

- The data we are going to use is part of a long-term evolution experiment led by [Richard Lenski](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment).

- The experiment was designed to assess adaptation in *E. coli*. A population (designated **Ara-3**) were propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for *E. coli*, providing faster growth than other sugars). This medium was supplemented with citrate, which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that a spontaneous citrate-using variant (**Cit+**) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but can also be selected against in well-adapted populations.
- The experiment was designed to assess adaptation in *E. coli*. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for *E. coli*, providing faster growth than other sugars). This medium was supplemented with citrate, which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that a spontaneous citrate-using variant (**Cit+**) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but can also be selected against in well-adapted populations.

- To see a timeline of the experiment to date, check out this [figure](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment#/media/File:LTEE_Timeline_as_of_May_28,_2016.png), and this paper [Blount et al. 2008: Historical contingency and the evolution of a key innovation in an experimental population of *Escherichia coli*](http://www.pnas.org/content/105/23/7899).


## View the Metadata

We will be working with three sample events from the **Ara-3** strain of this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how (the evolution of a **Cit+** mutant and **hypermutability**) with our variant calling workflow. The metadata file required for this lesson can be [downloaded directly here](https://raw.githubusercontent.com/data-lessons/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.csv) or [viewed in GitHub](https://github.com/data-lessons/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite.csv). If you would like to know details of how the file was created, you can look at [some notes and sources here](https://github.com/data-lessons/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite_README.md).
We will be working with three sample events from the **Ara-3** strain of this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how (the evolution of a **Cit+** mutant and **hypermutability**) with our variant calling workflow. The metadata file associated with this lesson can be [downloaded directly here](https://raw.githubusercontent.com/data-lessons/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.csv) or [viewed in GitHub](https://github.com/data-lessons/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite.csv). If you would like to know details of how the file was created, you can look at [some notes and sources here](https://github.com/data-lessons/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite_README.md).



22 changes: 11 additions & 11 deletions _episodes/02-quality-control.md
@@ -62,7 +62,7 @@ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fa
The data comes in a compressed format, which is why there is a `.gz` at the end of the file names. This makes it faster to transfer, and allows it to take up less space on our computer. Let's unzip one of the files so that we can look at the fastq format.

~~~
gunzip SRR2584863_1.fastq.gz
$ gunzip SRR2584863_1.fastq.gz
~~~
{: .bash}
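
If we only want a quick look at a compressed file, we do not have to unpack it on disk first. The sketch below is one way to do that with `gunzip -c`, which writes the decompressed text to standard output and leaves the `.gz` file in place. The file `demo.fastq.gz` and its two reads are made up for illustration and are not part of the lesson data.

```bash
# Build a tiny, hypothetical gzipped FASTQ file in a scratch directory.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII#I\n' | gzip > "$tmpdir/demo.fastq.gz"

# Stream the decompressed text and show the first record (four lines).
gunzip -c "$tmpdir/demo.fastq.gz" | head -n 4

# Each FASTQ record spans four lines, so reads = lines / 4.
lines=$(gunzip -c "$tmpdir/demo.fastq.gz" | wc -l)
reads=$((lines / 4))
echo "reads: $reads"
```

The same two commands should work on a real file such as `SRR2584863_1.fastq.gz`, as long as it is still compressed.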

@@ -275,7 +275,7 @@ Approx 85% complete for SRR2589044_2.fastq.gz
Approx 90% complete for SRR2589044_2.fastq.gz
Approx 95% complete for SRR2589044_2.fastq.gz
Analysis complete for SRR2589044_2.fastq.gz
(variants) dcuser@ip-172-31-49-42:~/dc_workshop/data/untrimmed_fastq$
$
~~~
{: .output}
@@ -418,15 +418,15 @@ tabs in a single window or six separate browser windows.
## Decoding the other FastQC outputs
We've now looked at quite a few "Per base sequence quality" FastQC graphs, but there are nine other graphs that we haven't talked about! Below we have provided a brief overview of interpretations for each of these plots. It's important to keep in mind
+ Per tile sequence quality: the machines that perform sequencing are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots can also occur in the middle if an air bubble was introduced at some point during the run.
+ Per sequence quality scores: a density plot of quality for all reads at all positions. This plot shows what quality scores are most common.
+ Per base sequence content: plots the proportion of each base at each position across all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content.
+ Per sequence GC content: a density plot of average GC content in each of the reads.
+ Per base N content: the percent of times that 'N' occurs at a position in all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing.
+ Sequence Length Distribution: the distribution of sequence lengths of all reads in the file. If the data is raw, there is often one sharp peak; however, if the reads have been trimmed, there may be a distribution of shorter lengths.
+ Sequence Duplication Levels: A distribution of duplicated sequences. In sequencing, we expect most reads to only occur once. If some sequences are occurring more than once, it might indicate enrichment bias (e.g. from PCR). If the samples are high coverage (or RNA-seq or amplicon), this might not be true.
+ Overrepresented sequences: A list of sequences that occur more frequently than would be expected by chance.
+ Adapter Content: a graph indicating where adapter sequences occur in the reads.
+ **Per tile sequence quality**: the machines that perform sequencing are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots can also occur in the middle if an air bubble was introduced at some point during the run.
+ **Per sequence quality scores**: a density plot of quality for all reads at all positions. This plot shows what quality scores are most common.
+ **Per base sequence content**: plots the proportion of each base at each position across all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content.
+ **Per sequence GC content**: a density plot of average GC content in each of the reads.
+ **Per base N content**: the percent of times that 'N' occurs at a position in all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing.
+ **Sequence Length Distribution**: the distribution of sequence lengths of all reads in the file. If the data is raw, there is often one sharp peak; however, if the reads have been trimmed, there may be a distribution of shorter lengths.
+ **Sequence Duplication Levels**: A distribution of duplicated sequences. In sequencing, we expect most reads to only occur once. If some sequences are occurring more than once, it might indicate enrichment bias (e.g. from PCR). If the samples are high coverage (or RNA-seq or amplicon), this might not be true.
+ **Overrepresented sequences**: A list of sequences that occur more frequently than would be expected by chance.
+ **Adapter Content**: a graph indicating where adapter sequences occur in the reads.
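
Each of the modules listed above also gets a one-line verdict in the `summary.txt` file that FastQC writes for every sample: a status (`PASS`, `WARN`, or `FAIL`), the module name, and the input file name, separated by tabs. As a rough sketch, we could list only the modules that need a closer look. The three-line `summary.txt` below is a made-up stand-in, not real FastQC output for our samples.

```bash
# Write a small, hypothetical summary.txt in a scratch directory.
tmpdir=$(mktemp -d)
printf 'PASS\tBasic Statistics\tdemo.fastq\nFAIL\tPer base sequence quality\tdemo.fastq\nWARN\tAdapter Content\tdemo.fastq\n' > "$tmpdir/summary.txt"

# Print status and module name for every module that did not pass.
flagged=$(awk -F '\t' '$1 != "PASS" {print $1 ": " $2}' "$tmpdir/summary.txt")
echo "$flagged"

# Count the lines that do not start with PASS.
n_flagged=$(grep -vc '^PASS' "$tmpdir/summary.txt")
echo "modules to review: $n_flagged"
```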
## Working with the FastQC text output
@@ -122,7 +122,7 @@ parameters here, your use case might require a change of parameters. *NOTE: Alwa
and make sure the options you use are appropriate for your data.*
We're going to start by aligning the reads from just one of the
samples in our dataset (`SRRXXXXXXX.fastq`). Later, we'll be
samples in our dataset (`SRR2584866`). Later, we'll be
iterating this whole process on all of our sample files.
~~~
7 changes: 6 additions & 1 deletion _episodes/04-automation.md → _episodes/05-automation.md
@@ -238,7 +238,12 @@ replace SRR2584866_fastqc/Icons/fastqc_icon.png? [y]es, [n]o, [A]ll, [N]one, [r]

We can extend these principles to the entire variant calling workflow. To do this, we will take all of the individual commands that we wrote before, put them into a single file, and add variables so that the script knows to iterate through our input files and write to the appropriate output files. This is very similar to what we did with our `read_qc.sh` script, but will be a bit more complex.
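
To make the idea of "adding variables" concrete, here is a minimal sketch of the pattern, not the contents of the actual script. It loops over input files, recovers a sample name with `basename`, and builds every other path from that one variable. The directory layout, the `sampleA`/`sampleB` file names, and the `.aligned.sam` suffix are assumptions made up for this illustration, and the loop only prints what it would do.

```bash
# Hypothetical scratch layout with two empty input files.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/trimmed_fastq" "$tmpdir/results/sam"
touch "$tmpdir/trimmed_fastq/sampleA_1.trim.fastq" "$tmpdir/trimmed_fastq/sampleB_1.trim.fastq"

samples=""
for fq1 in "$tmpdir"/trimmed_fastq/*_1.trim.fastq
do
    # Strip the directory and the suffix to recover the sample name.
    base=$(basename "$fq1" _1.trim.fastq)

    # Derive the paired input and the output path from that one variable.
    fq2="$tmpdir/trimmed_fastq/${base}_2.trim.fastq"
    sam="$tmpdir/results/sam/${base}.aligned.sam"

    echo "would align $fq1 and $fq2 -> $sam"

    # Keep a space-separated record of the samples we visited.
    samples="${samples:+$samples }$base"
done
echo "$samples"
```

Because each output name is built from `base`, adding a new sample only requires dropping its files into the input directory.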

Download the script from [here](https://github.com/data-lessons/wrangling-genomics/blob/gh-pages/files/run_variant_calling.sh) (download to ~/dc_workshop/scripts).
Download the script from [here](https://raw.githubusercontent.com/data-lessons/wrangling-genomics/gh-pages/files/run_variant_calling.sh) (download to ~/dc_workshop/scripts).

~~~
curl -O https://raw.githubusercontent.com/data-lessons/wrangling-genomics/gh-pages/files/run_variant_calling.sh
~~~
{: .bash}

Our variant calling workflow has the following steps:

9 changes: 4 additions & 5 deletions index.md
@@ -4,11 +4,10 @@ root: .
---

A lot of genomics analysis is done using command-line tools for three reasons:
1) you will often be working with a large number of files,
and working through the command-line rather than through a graphical user interface (GUI) allows you to automate repetitive tasks,
2) you
will often need more compute power than is available on your personal computer, and connecting to and interacting with remote computers
requires a command-line interface, and
1) you will often be working with a large number of files, and working through the command-line rather than
through a graphical user interface (GUI) allows you to automate repetitive tasks,
2) you will often need more compute power than is available on your personal computer, and
connecting to and interacting with remote computers requires a command-line interface, and
3) you will often need to customize your analyses and command-line tools often enable more
customization than the corresponding GUI tools (if in fact a GUI tool even exists).
