|
| 1 | +# Lesson |
| 2 | + |
| 3 | +Automating a workflow |
| 4 | +=================== |
| 5 | + |
| 6 | +Learning Objectives: |
| 7 | +------------------- |
| 8 | +#### What's the goal for this lesson? |
| 9 | + |
| 10 | +* Use a series of command line tools to perform a variant calling workflow |
| 11 | +* Use a For loop from the previous lesson to help automate repetitive tasks |
| 12 | +* Group a series of sequential commands into a script to automate a workflow |
| 13 | + |
| 14 | +To get started with this lesson, we will need to grab some data from an outside |
| 15 | +server using `wget` on the command line. |
| 16 | + |
| 17 | +Make sure you are in the dc_workshop directory first |
| 18 | + |
| 19 | +```bash |
| 20 | +$ cd ~/dc_workshop |
| 21 | +$ wget http://reactomerelease.oicr.on.ca/download/archive/variant_calling.tar.gz |
| 22 | +``` |
| 23 | + |
| 24 | +The file 'variant_calling.tar.gz' is what is commonly called a "tarball", which is |
| 25 | +a compressed archive similar to the .zip files we have seen before. We can decompress |
| 26 | +this archive using the command below. |
| 27 | + |
| 28 | +```bash |
| 29 | +$ tar -zxvf variant_calling.tar.gz |
| 30 | +``` |
| 31 | +This will create a directory tree that contains some input data (reference genome and fastq files) |
| 32 | +and a shell script that details the series of commands used to run the variant calling workflow. |
| 33 | + |
| 34 | +<pre> |
| 35 | +variant_calling |
| 36 | +├── ref_genome |
| 37 | +│ └── ecoli_rel606.fasta |
| 38 | +├── run_variant_calling.sh |
| 39 | +└── trimmed_fastq |
| 40 | + ├── SRR097977.fastq |
| 41 | + ├── SRR098026.fastq |
| 42 | + ├── SRR098027.fastq |
| 43 | + ├── SRR098028.fastq |
| 44 | + ├── SRR098281.fastq |
| 45 | + └── SRR098283.fastq |
| 46 | +</pre> |
| 47 | + |
| 48 | +Without getting into the details yet, the variant calling workflow will do the following steps |
| 49 | + |
| 50 | +1. Index the reference genome for use by bwa and samtools |
| 51 | +2. Align reads to reference genome |
| 52 | +3. Convert the format of the alignment to sorted BAM, with some intermediate steps. |
| 53 | +4. Calculate the read coverage of positions in the genome |
| 54 | +5. Detect the single nucleotide polymorphisms (SNPs) |
| 55 | +6. Filter and report the SNP variants in VCF (variant calling format) |
| 56 | + |
| 57 | +Let's walk through the commands in the workflow |
| 58 | + |
| 59 | +The first command is to change to our working directory |
| 60 | +so the script can find all the files it expects |
| 61 | + |
| 62 | +```bash |
| 63 | +$ cd ~/dc_workshop/variant_calling |
| 64 | +``` |
| 65 | + |
| 66 | +Assign the name/location of our reference genome |
| 67 | +to a variable ($genome) |
| 68 | + |
| 69 | +```bash |
| 70 | +$ genome=data/ref_genome/ecoli_rel606.fasta |
| 71 | +``` |
| 72 | + |
| 73 | +We need to index the reference genome for bwa and samtools. bwa |
| 74 | +and samtools are programs that are pre-installed on our server. |
| 75 | + |
| 76 | +```bash |
| 77 | +bwa index $genome |
| 78 | +samtools faidx $genome |
| 79 | +``` |
| 80 | + |
| 81 | +Create output paths for various intermediate and result files The -p option means mkdir will create the whole path if it does not exist (no error or message will give given if it does exist) |
| 82 | + |
| 83 | +```bash |
| 84 | +$ mkdir -p results/sai |
| 85 | +$ mkdir -p results/sam |
| 86 | +$ mkdir -p results/bam |
| 87 | +$ mkdir -p results/bcf |
| 88 | +$ mkdir -p results/vcf |
| 89 | +``` |
| 90 | + |
| 91 | +We will now use a loop to run the variant calling work flow of each of our fastq files, so the list of command below will be execute once for each fastq files. |
| 92 | + |
| 93 | +We would start the loop like this, so the name of each fastq file will by assigned to $fq |
| 94 | + |
| 95 | +```bash |
| 96 | +$ for fq in data/trimmed_fastq/*.fastq |
| 97 | +> do |
| 98 | +> # etc... |
| 99 | +``` |
| 100 | +
|
| 101 | +In the script, it is a good idea to use echo for debugging/reporting to the screen |
| 102 | +
|
| 103 | +```bash |
| 104 | +$ echo "working with file $fq" |
| 105 | +``` |
| 106 | +
|
| 107 | +This command will extract the base name of the file |
| 108 | +(without the path and .fastq extension) and assign it |
| 109 | +to the $base variable |
| 110 | +
|
| 111 | +```bash |
| 112 | +$ base=$(basename $fq .fastq) |
| 113 | +$ echo "base name is $base" |
| 114 | +``` |
| 115 | +
|
| 116 | +We will assign various file names to variables both |
| 117 | +for convenience but also to make it easier to see what |
| 118 | +is going on in the commands below. |
| 119 | +```bash |
| 120 | +$ fq=data/trimmed_fastq/$base\.fastq |
| 121 | +$ sai=results/sai/$base\_aligned.sai |
| 122 | +$ sam=results/sam/$base\_aligned.sam |
| 123 | +$ bam=results/bam/$base\_aligned.bam |
| 124 | +$ sorted_bam=results/bam/$base\_aligned_sorted.bam |
| 125 | +$ raw_bcf=results/bcf/$base\_raw.bcf |
| 126 | +$ variants=results/bcf/$base\_variants.bcf |
| 127 | +$ final_variants=results/vcf/$base\_final_variants.vcf |
| 128 | +``` |
| 129 | +
|
| 130 | +Our data are now staged. The series of command below will run the steps of the analytical workflow |
| 131 | +
|
| 132 | +Align the reads to the reference genome |
| 133 | +
|
| 134 | +```bash |
| 135 | +$ bwa aln $genome $fq > $sai |
| 136 | +``` |
| 137 | +
|
| 138 | +Convert the output to the SAM format |
| 139 | +
|
| 140 | +```bash |
| 141 | +$ bwa samse $genome $sai $fq > $sam |
| 142 | +``` |
| 143 | +
|
| 144 | +Convert the SAM file to BAM format |
| 145 | +
|
| 146 | +```bash |
| 147 | +$ samtools view -S -b $sam > $bam |
| 148 | +``` |
| 149 | +Sort the BAM file |
| 150 | +
|
| 151 | +```bash |
| 152 | +$ samtools sort -f $bam $sorted_bam |
| 153 | +``` |
| 154 | +Index the BAM file for display purposes |
| 155 | +
|
| 156 | +```bash |
| 157 | +$ samtools index $sorted_bam |
| 158 | +``` |
| 159 | +
|
| 160 | +Do the first pass on variant calling by counting |
| 161 | +read coverage |
| 162 | +
|
| 163 | +```bash |
| 164 | +$ samtools mpileup -g -f $genome $sorted_bam > $raw_bcf |
| 165 | +``` |
| 166 | +Do the SNP calling with bcftools |
| 167 | +
|
| 168 | +```bash |
| 169 | +$ bcftools view -bvcg $raw_bcf > $variants |
| 170 | +``` |
| 171 | +Filter the SNPs for the final output |
| 172 | +
|
| 173 | +```bash |
| 174 | +$ bcftools view $variants | /usr/share/samtools/vcfutils.pl varFilter - > $final_variants |
| 175 | +``` |
| 176 | + |
| 177 | + |
| 178 | +**** |
| 179 | +**Exercise** |
| 180 | +Run the script https://github.com/JasonJWilliamsNY/wrangling-genomics/blob/gh-pages/lessons/run_variant_calling.sh |
| 181 | +**** |
| 182 | +
|
| 183 | +
|
| 184 | +
|
| 185 | +
|
0 commit comments