This repository was archived by the owner on Feb 16, 2019. It is now read-only.

Commit f456cf3

reorganizing
1 parent cc408fc commit f456cf3

73 files changed

+18619
-12
lines changed

00-readQC.md

Lines changed: 321 additions & 0 deletions
Large diffs are not rendered by default.

01-automating_a_workflow.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# Lesson

Shell scripts
===================

Learning Objectives:
-------------------
#### What's the goal for this lesson?

* Understand what a shell script is
* Learn how to automate an analytical workflow

## What is a shell script?
A shell script is a text file that contains a list of commands
that are executed sequentially. The commands in a shell script are the same
as those you would use on the command line.

Once you have worked out the details and tested your commands in the shell, you can save them into a file so that, next time, you can automate the process with
a script.

The basic anatomy of a shell script is a file with a list of commands.
That is also the definition of pretty much any computer program.

```bash
#!/bin/bash

cd ~/dc_sample_data

for file in untrimmed_fastq/*.fastq
do
    echo "My file name is $file"
done
```

This looks a lot like the for loops we saw earlier. In fact, it is no different, apart from the indentation and the lack of the '>' prompts; it's just saved in a text file. The line at the top ('#!/bin/bash') is commonly called the shebang line, a special kind of comment that tells the system which program to use as the 'interpreter' that executes the code.

In this case, the interpreter is bash, which is the shell environment we are working in. The same approach is also used for other scripting languages such as perl and python. The shebang line is actually optional unless you want to
make the script executable like a 'real' program.

## How to run a shell script
There are two ways to run a shell script. The first way is to specify the
interpreter (bash) and the name of the script. By convention, shell scripts
use the .sh extension, though this is not enforced.

```bash
$ bash myscript.sh
My file name is untrimmed_fastq/SRR097977.fastq
My file name is untrimmed_fastq/SRR098026.fastq
```

The second way is a little more complicated to set up and requires the shebang line we talked about earlier.

The first step, which only needs to be done once, is to modify the 'permissions' of the text file so that the shell knows the file is executable.

```bash
$ chmod +x myscript.sh
```

After that, you can run the script as a regular program.

```bash
$ ./myscript.sh
My file name is untrimmed_fastq/SRR097977.fastq
My file name is untrimmed_fastq/SRR098026.fastq
```

The thing about running programs on the command line is that the shell may not know the location of your executables unless they are in the 'path' of known locations for programs. So, you need to tell the shell the path to your script, which is './' if it is in the current directory.
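
To make the permission and path steps concrete, here is a minimal sketch that can be run anywhere. It assumes a temporary directory that allows programs to be executed, and the script name `hello.sh` is made up for this example (it is not part of the lesson data).

```bash
#!/bin/bash
# Hypothetical example: hello.sh and the temporary directory are made up.
workdir=$(mktemp -d)
cd "$workdir"

# Write a two-line script whose first line is the shebang.
printf '#!/bin/bash\necho "Hello from bash"\n' > hello.sh

# A newly created text file is normally not executable.
if [ -x hello.sh ]; then before="executable"; else before="not executable"; fi
echo "Before chmod: $before"

# Add the execute permission and check again.
chmod +x hello.sh
if [ -x hello.sh ]; then after="executable"; else after="not executable"; fi
echo "After chmod: $after"

# The current directory is usually not in PATH, so give the path explicitly.
output=$(./hello.sh)
echo "$output"

# The list of directories the shell searches for programs.
echo "$PATH"

# Clean up.
cd - > /dev/null
rm -r "$workdir"
```

Typing `hello.sh` without the leading `./` would normally fail with "command not found", because the shell only searches the directories printed by `echo "$PATH"`.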

****
**Exercise**
1) Use nano to save the code above to a script called myscript.sh
2) Run the script
****

## A real shell script

Now, let's do something real. First, recall the code from our fastqc
workflow from this morning, with a few extra "echo" statements.

```bash
cd ~/dc_workshop/data/untrimmed_fastq/

echo "Running fastqc..."
~/FastQC/fastqc *.fastq
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads

echo "saving..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/

cd ~/dc_workshop/results/fastqc_untrimmed_reads/

echo "Unzipping..."
for zip in *.zip
do
    unzip "$zip"
done

echo "saving..."
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
```


****
**Exercise**

1) Use nano to create a shell script with the code above (you can copy/paste),
named read_qc.sh

2) Run the script

3) Bonus points: Use something you learned yesterday to save the output
of the script to a file while it is running.
****

02-variant-calling-workflow.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
# Lesson

Automating a workflow
===================

Learning Objectives:
-------------------
#### What's the goal for this lesson?

* Use a series of command line tools to perform a variant calling workflow
* Use a for loop from the previous lesson to help automate repetitive tasks
* Group a series of sequential commands into a script to automate a workflow

To get started with this lesson, we will need to grab some data from an outside
server using `wget` on the command line.

Make sure you are in the dc_workshop directory first.

```bash
$ cd ~/dc_workshop
$ wget http://reactomerelease.oicr.on.ca/download/archive/variant_calling.tar.gz
```

The file 'variant_calling.tar.gz' is what is commonly called a "tarball", which is
a compressed archive similar to the .zip files we have seen before. We can decompress
this archive using the command below.

```bash
$ tar -zxvf variant_calling.tar.gz
```
This will create a directory tree that contains some input data (reference genome and fastq files)
and a shell script that details the series of commands used to run the variant calling workflow.

<pre>
variant_calling
├── ref_genome
│   └── ecoli_rel606.fasta
├── run_variant_calling.sh
└── trimmed_fastq
    ├── SRR097977.fastq
    ├── SRR098026.fastq
    ├── SRR098027.fastq
    ├── SRR098028.fastq
    ├── SRR098281.fastq
    └── SRR098283.fastq
</pre>
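
If you would like to check what a tarball contains before unpacking it, `tar` can list the contents without extracting them. The sketch below builds a tiny archive of its own so it can be tried without downloading anything; `demo_dir`, `reads.txt`, and `demo.tar.gz` are made-up names, not part of the workshop data.

```bash
#!/bin/bash
# Hypothetical example: demo_dir and demo.tar.gz are made-up names.
workdir=$(mktemp -d)
cd "$workdir"
mkdir demo_dir
echo "ACGT" > demo_dir/reads.txt

# c = create, z = compress with gzip, f = write to the named file
tar -czf demo.tar.gz demo_dir
rm -r demo_dir

# t = list the contents without extracting anything
listing=$(tar -tzf demo.tar.gz)
echo "$listing"

# x = extract; adding v (as in the lesson) would also print each file name
tar -xzf demo.tar.gz
contents=$(cat demo_dir/reads.txt)
echo "$contents"

# Clean up.
cd - > /dev/null
rm -r "$workdir"
```

The same `-t` listing should work on `variant_calling.tar.gz`, which is one way to compare the archive against the tree shown above.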

Without getting into the details yet, the variant calling workflow will do the following steps:

1. Index the reference genome for use by bwa and samtools
2. Align reads to the reference genome
3. Convert the format of the alignment to sorted BAM, with some intermediate steps
4. Calculate the read coverage of positions in the genome
5. Detect the single nucleotide polymorphisms (SNPs)
6. Filter and report the SNP variants in VCF (variant call format)

Let's walk through the commands in the workflow.

The first command is to change to our working directory
so the script can find all the files it expects.

```bash
$ cd ~/dc_workshop/variant_calling
```

Assign the name/location of our reference genome
to a variable ($genome).

```bash
$ genome=data/ref_genome/ecoli_rel606.fasta
```

We need to index the reference genome for bwa and samtools. bwa
and samtools are programs that are pre-installed on our server.

```bash
$ bwa index $genome
$ samtools faidx $genome
```

Create output paths for various intermediate and result files. The -p option means mkdir will create the whole path if it does not exist (no error or message will be given if it does exist).

```bash
$ mkdir -p results/sai
$ mkdir -p results/sam
$ mkdir -p results/bam
$ mkdir -p results/bcf
$ mkdir -p results/vcf
```
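
The behaviour of `-p` described above can be checked with a short sketch. It works in a throwaway temporary directory (an assumption of this example, so that nothing in the workshop folders is touched).

```bash
#!/bin/bash
# Work in a throwaway directory so no real results folders are affected.
workdir=$(mktemp -d)
cd "$workdir"

# Without -p, mkdir fails because the parent directory "results" is missing.
if mkdir results/sai 2> /dev/null; then plain="created"; else plain="failed"; fi
echo "mkdir without -p: $plain"

# With -p, the missing parent directory is created as well.
mkdir -p results/sai

# Running it a second time prints nothing and still succeeds.
mkdir -p results/sai
second_status=$?
echo "second mkdir -p exit status: $second_status"

# Clean up.
cd - > /dev/null
rm -r "$workdir"
```

This is why `mkdir -p` is convenient in a script that may be run more than once: the second run does not stop on "File exists" errors.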

We will now use a loop to run the variant calling workflow on each of our fastq files, so the list of commands below will be executed once for each fastq file.

We would start the loop like this, so the name of each fastq file will be assigned to $fq.

```bash
$ for fq in data/trimmed_fastq/*.fastq
> do
> # etc...
```

In the script, it is a good idea to use echo for debugging/reporting to the screen.

```bash
$ echo "working with file $fq"
```

This command will extract the base name of the file
(without the path and .fastq extension) and assign it
to the $base variable.

```bash
$ base=$(basename $fq .fastq)
$ echo "base name is $base"
```
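
As a small illustration of what `basename` does with and without the suffix argument, the sketch below uses one of the file names from the lesson. The path is only used as a string here, so the file does not need to exist.

```bash
#!/bin/bash
# basename only manipulates the text of the path; the file need not exist.
fq=data/trimmed_fastq/SRR097977.fastq

# With one argument, basename strips the directory part.
name=$(basename "$fq")
echo "$name"    # SRR097977.fastq

# With a second argument, it also strips that suffix.
base=$(basename "$fq" .fastq)
echo "$base"    # SRR097977
```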

We will assign various file names to variables, both
for convenience and to make it easier to see what
is going on in the commands below.
```bash
$ fq=data/trimmed_fastq/$base\.fastq
$ sai=results/sai/$base\_aligned.sai
$ sam=results/sam/$base\_aligned.sam
$ bam=results/bam/$base\_aligned.bam
$ sorted_bam=results/bam/$base\_aligned_sorted.bam
$ raw_bcf=results/bcf/$base\_raw.bcf
$ variants=results/bcf/$base\_variants.bcf
$ final_variants=results/vcf/$base\_final_variants.vcf
```
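
The backslashes in these assignments are there to separate the variable name from the text that follows it. The sketch below, which assumes a sample value for `$base`, shows what happens without a separator and an equivalent way to write the same thing with braces.

```bash
#!/bin/bash
# Sample value, assumed for illustration.
base=SRR097977

# Without a separator the shell looks for a variable named "base_aligned",
# which is not set, so only ".sai" is left after the directory.
wrong=results/sai/$base_aligned.sai
echo "$wrong"      # results/sai/.sai

# Backslash form, as used in the lesson.
escaped=results/sai/$base\_aligned.sai
echo "$escaped"    # results/sai/SRR097977_aligned.sai

# Brace form, an equivalent alternative.
braced=results/sai/${base}_aligned.sai
echo "$braced"     # results/sai/SRR097977_aligned.sai
```

Either form works; the brace form is the one more commonly seen in shell scripts.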

Our data are now staged. The series of commands below will run the steps of the analytical workflow.

Align the reads to the reference genome

```bash
$ bwa aln $genome $fq > $sai
```

Convert the output to the SAM format

```bash
$ bwa samse $genome $sai $fq > $sam
```

Convert the SAM file to BAM format

```bash
$ samtools view -S -b $sam > $bam
```

Sort the BAM file

```bash
$ samtools sort -f $bam $sorted_bam
```

Index the BAM file for display purposes

```bash
$ samtools index $sorted_bam
```

Do the first pass on variant calling by counting
read coverage

```bash
$ samtools mpileup -g -f $genome $sorted_bam > $raw_bcf
```

Do the SNP calling with bcftools

```bash
$ bcftools view -bvcg $raw_bcf > $variants
```

Filter the SNPs for the final output

```bash
$ bcftools view $variants | /usr/share/samtools/vcfutils.pl varFilter - > $final_variants
```
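
The steps above are meant to sit inside the for loop, so that they run once per fastq file. The sketch below is a dry run of that idea: it prints the first two commands for each sample with `echo` rather than executing them, so it runs without bwa, samtools, or the data being present. It is an illustration only, not the contents of `run_variant_calling.sh`, which may differ; the two file names are taken from the directory tree earlier in the lesson and are listed explicitly so that no files are needed.

```bash
#!/bin/bash
# Dry run: commands are printed with echo instead of being executed.
genome=data/ref_genome/ecoli_rel606.fasta

# Two file names from the lesson, listed explicitly so no files are needed.
for fq in data/trimmed_fastq/SRR097977.fastq data/trimmed_fastq/SRR098026.fastq
do
    echo "working with file $fq"

    # Derive the per-sample output names from the input name.
    base=$(basename "$fq" .fastq)
    sai=results/sai/${base}_aligned.sai
    sam=results/sam/${base}_aligned.sam

    # Print the alignment commands that would be run for this sample.
    echo "bwa aln $genome $fq > $sai"
    echo "bwa samse $genome $sai $fq > $sam"
    # The remaining steps would follow the same pattern.
done
```

Printing the commands first is a cheap way to check that the variable names expand to the paths you expect before starting a long analysis.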


****
**Exercise**
Run the script https://github.com/JasonJWilliamsNY/wrangling-genomics/blob/gh-pages/lessons/run_variant_calling.sh
****

Gemfile

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
source 'http://rubygems.org'
gem 'github-pages'
