Home
Hagfish is a tool for the analysis of Next Generation Sequencing (NGS) data. It builds on the concept of coverage plots and aims to assist, amongst others, in quality control of de novo genome assemblies and in identifying structural variation in genome re-sequencing experiments.
Hagfish requires a reference sequence and a paired-end re-sequencing data set. The larger the insert size of the paired-end library, the more power Hagfish has.
Quick links: Installation, Operation, Read mappers, Hagfish scripts, Hagfish plots
As input, Hagfish requires a BAM file with read pairs that have been mapped back to a reference genome. The first step Hagfish takes is to determine the insert size of the paired-end library. With this information, Hagfish assigns each mapped read pair to one of three categories (see Figure 1).
- Category ok : the read pairs align at the expected distance (for that sequencing run)
- Category high : the read pairs align, but too far apart from each other.
- Category low : the read pairs align, but too close together.
Figure 1, schematic view of read pairs mapped to a reference sequence. Each block represents a single mapped read pair, with the darker colored ends representing the actual reads. Read pairs in the ok category are green, high is red and low is green.
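The categorization described above can be sketched in a few lines of Python. This is a minimal illustration, not Hagfish's actual code: the function name `categorize_pair` and the two cutoff parameters are hypothetical, and the source does not state how Hagfish derives the acceptable distance range from the observed insert size.

```python
def categorize_pair(distance, low_cutoff, high_cutoff):
    """Assign a mapped read pair to 'ok', 'high' or 'low'.

    distance    -- observed distance between the two mapped reads
    low_cutoff  -- smallest distance still considered normal (assumed input)
    high_cutoff -- largest distance still considered normal (assumed input)
    """
    if distance < low_cutoff:
        return "low"    # reads align too close together
    if distance > high_cutoff:
        return "high"   # reads align too far apart
    return "ok"         # reads align at the expected distance


# Hypothetical library with an expected distance between 400 and 600 nt.
categories = [categorize_pair(d, 400, 600) for d in (120, 510, 2300)]
```

The cutoffs are passed in explicitly so that the sketch makes no claim about how the expected range is estimated.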
The next step is to collapse the read pairs of each of the three categories into two different coverage scores (see Figure 2).
- ECP : A regular (exclusive) coverage score. Each read that covers a nucleotide increases the coverage score for that nucleotide by one.
- ICP : As the ECP, but nucleotides in between the two reads of a mapped read pair also receive a coverage score increment.
Figure 2, example of three read pairs mapped to a reference sequence. These read pairs collapse into two different coverage scores. In the case of the ECP, only the reads count towards the coverage score. In the case of the ICP, the region in between the two reads of each pair is included as well.
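The difference between the two scores can be illustrated with a short NumPy sketch. It is not taken from the Hagfish source: the function name `coverage_scores`, the tuple layout of a read pair, and the use of 0-based half-open coordinates are assumptions made for this example, and the two reads of a pair are assumed not to overlap.

```python
import numpy as np


def coverage_scores(pairs, ref_length):
    """Collapse read pairs into an ECP and an ICP array.

    pairs      -- iterable of (start1, end1, start2, end2) tuples, using
                  0-based half-open coordinates with end1 <= start2
                  (a layout assumed for this sketch)
    ref_length -- length of the reference sequence in nucleotides
    """
    ecp = np.zeros(ref_length, dtype=np.int64)
    icp = np.zeros(ref_length, dtype=np.int64)
    for start1, end1, start2, end2 in pairs:
        # ECP: only the nucleotides covered by the reads themselves count.
        ecp[start1:end1] += 1
        ecp[start2:end2] += 1
        # ICP: the reads and the region in between them count.
        icp[start1:end2] += 1
    return ecp, icp


# Two hypothetical read pairs on a 12 nt reference.
ecp, icp = coverage_scores([(0, 3, 7, 10), (2, 5, 9, 12)], ref_length=12)
print(ecp.tolist())  # [1, 1, 2, 1, 1, 0, 0, 1, 1, 2, 1, 1]
print(icp.tolist())  # [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
```

Running this once per category (ok, high, low) would yield the six arrays. In the example, positions 5 and 6 have no ECP coverage because no read touches them, but they are still counted in the ICP because they lie between the reads of both pairs.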
The six different coverage plots (an ICP and an ECP for each of the three categories) are saved as NumPy arrays and used for subsequent plotting.
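As a rough illustration of storing six such arrays, the sketch below writes them to a single `.npz` archive with `numpy.savez` and reads them back. The file layout, the key names (for example `ok_ecp`) and the helper names `save_plots` and `load_plots` are hypothetical; the on-disk format that Hagfish itself uses is not described here.

```python
import os
import tempfile

import numpy as np

CATEGORIES = ("ok", "high", "low")
KINDS = ("ecp", "icp")


def save_plots(plots, path):
    """Write six coverage arrays to one .npz archive.

    plots -- dict mapping (category, kind) to a 1-D array
    path  -- output file name, ending in .npz
    """
    arrays = {f"{c}_{k}": plots[(c, k)] for c in CATEGORIES for k in KINDS}
    np.savez(path, **arrays)


def load_plots(path):
    """Read the six coverage arrays back into a dict."""
    with np.load(path) as data:
        return {(c, k): data[f"{c}_{k}"] for c in CATEGORIES for k in KINDS}


# Round trip with all-zero placeholder arrays for a 10 nt reference.
plots = {(c, k): np.zeros(10, dtype=np.int64) for c in CATEGORIES for k in KINDS}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "coverage.npz")
    save_plots(plots, path)
    restored = load_plots(path)
```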
Hagfish is released under the GPLv3 - see COPYING for more details.