Home

Tailor’s Analysis Workflow

The Tailor RNA-Seq analysis pipeline is an end-to-end tool suite that produces a set of biomarkers from comparative differential expression between two or more conditions. Tailor extends and simplifies the Tophat-Cufflinks tool suite in order to perform comparative gene expression analyses of RNA-Seq libraries and visualize significant differences in the gene expression profiles between the conditions. The pipeline provides an automated workflow that handles resource requests on a high-performance computing cluster (HPCC) through multi-step, parallel computational “jobs” submitted to either the LSF or SGE job scheduler. Each step is initiated with a simple two-word command (“tailor fastqc,” “tailor trimgalore”) from a command-line interface. The output from each step is provided as input to the next step.

A user-provided sample sheet maps the names of the input (bcl) files to the human-readable sample names from each group or condition. To map the data to a genome, Tailor requires an index of the genome of the organism being studied as well as the genome fasta file from which that index was created. The gene-model annotations associated with the indexed genome are required to quantify data against known genes. Tailor includes several pre-built indexes, along with secondary tools for genome indexing and gene-model building in case your analysis uses a non-model organism. Tailor’s workflow can be customized by modifying the settings configuration file. To tell Tailor which configuration file to use, set the TAILOR_CONFIG environment variable:

> export TAILOR_CONFIG=./foo.config-file
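
A mistyped path in TAILOR_CONFIG would only surface once a step is submitted, so it can help to check the file before exporting the variable. The sketch below is one way to do that; the function name `use_tailor_config` is hypothetical and not part of Tailor, and only the `TAILOR_CONFIG` variable and the `tailor fastqc` / `tailor trimgalore` commands come from this page.

```shell
# Hypothetical helper (not shipped with Tailor): refuse to export
# TAILOR_CONFIG unless the settings file exists and is readable.
use_tailor_config() {
    config_path="${1:-}"
    if [ -z "$config_path" ] || [ ! -r "$config_path" ]; then
        echo "settings file not readable: ${config_path:-<none given>}" >&2
        return 1
    fi
    export TAILOR_CONFIG="$config_path"
}

# Example use (commented out because it needs Tailor installed):
# use_tailor_config ./foo.config-file && tailor fastqc && tailor trimgalore
```

Chaining the steps with `&&` means a later step is only started when the previous command returns successfully.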

Because the settings file controls every step, the user can initiate the workflow from any step of the pipeline using most standard RNA-Seq file formats (bcl, fasta, fastq, sam, bam, gtf, gff, cxb, diff, etc.). Review the contents of the settings file and make changes where necessary. Several pre-configured settings files are included with Tailor to provide different levels of feature discovery. The default workflow does not perform any novel feature (isoform, gene) discovery, while the genome-guided workflow uses gene-model annotations to guide the analysis and discover potentially novel isoforms of known genes.

The first step in the pipeline is file format conversion and de-multiplexing of Illumina's proprietary binary bcl data to industry-standard fastq, using a tool designed by the sequencer manufacturer. FastQC then performs a quality control inspection of the sequence data to check for over-represented sequences, anomalous features, and Phred quality scores. Trim Galore parses the paired-end reads in order to identify and trim the adapter sequences that were ligated to the fragments so that they could bind to the flow cell. After adapter trimming, Tophat, a splice-aware mapper, maps the sequence libraries to UCSC’s hg38 reference genome assembly.

After read mapping, Cufflinks assembles the transcripts from an overlap graph of the mapped fragments. It determines each gene’s isoform structure by using bipartite matching to find a minimum path cover of that graph, that is, the smallest set of transcripts that accounts for all of the reads arising from a particular gene. After the transcripts are assembled, a gene transfer format (GTF) file is produced for each sample, and Cuffmerge merges these files and compares them to the reference annotation. The new gene models and the mapped bam files from the Tophat step are input to Cuffquant for gene expression quantification. The gene models and the gene expression files are then input to Cuffdiff for comparative differential expression analysis.

Cuffdiff models gene expression with an overdispersed Poisson model in which the variance is calculated as a function of the mean gene expression. It estimates dispersion from the variance present in a group beyond what is expected from a simple Poisson model. Tailor calls Cuffdiff to test, for each gene, whether mean expression differs across the conditions. Cuffdiff outputs gene expression values (FPKM), log-fold-change expression values, p-values, FDR-corrected p-values (q-values), and the associated test statistic from these tests. Biomarker candidates are the genes with q-values below α = 0.05. Cuffdiff output serves as input for CummeRbund, a Bioconductor package that produces visually digestible summaries of the salient features of RNA-Seq data.
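
The quantities named above can be written out as follows. This is a sketch in standard textbook notation, not Cuffdiff's exact implementation: every symbol is an assumption introduced here, the negative binomial form is just one common overdispersed Poisson parameterization, and the Benjamini-Hochberg formula is the usual FDR correction, assumed rather than confirmed by this page.

```latex
\begin{gather}
% FPKM for gene g: C_g fragments mapped to g, N total mapped fragments,
% L_g length of g in bases.
\mathrm{FPKM}_g = \frac{10^{9}\, C_g}{N \, L_g} \\
% Variance as a function of the mean mu_g; phi_g = 0 recovers a simple Poisson.
\operatorname{Var}(X_g) = \mu_g + \phi_g\, \mu_g^{2}, \qquad \phi_g \ge 0 \\
% log2 fold change of gene g between conditions A and B.
\log_2 \mathrm{FC}_g = \log_2 \frac{\mathrm{FPKM}_{g,B}}{\mathrm{FPKM}_{g,A}} \\
% Benjamini-Hochberg q-value for the i-th smallest of m p-values.
q_{(i)} = \min_{j \ge i} \left( \frac{m}{j}\, p_{(j)} \right),
\qquad p_{(1)} \le \dots \le p_{(m)}
\end{gather}
```

Under this notation, a gene is reported as a biomarker candidate when its q-value falls below 0.05, which caps the expected proportion of false discoveries among the reported genes at 5%.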

Works Cited

Andrews, S. FastQC A Quality Control tool for High Throughput Sequence Data. 2014 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
Stubbs T, Bonder M, Stark A, Krueger F. Multi-tissue DNA methylation age predictor in mouse. Genome Biology. 2017; 18:68. https://doi.org/10.1186/s13059-017-1203-5
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009; 10:R25. https://doi.org/10.1186/gb-2009-10-3-r25
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013; 14(4):R36. doi: 10.1186/gb-2013-14-4-r36.
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010; 28:511-515. http://dx.doi.org/10.1038/nbt.1621
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. 2011; 12:R22. doi: 10.1186/gb-2011-12-3-r22.
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2012;31(1):46-53.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley D, Pimentel H, Salzberg S, Rinn J, Pachter L. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks" Nature Protocols. 2012; volume 7, pages 562–578; https://doi.org/10.1038/nprot.2012.016
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2014; http://www.R-project.org/.
L. Goff, C. Trapnell and D. Kelley. "CummeRbund: Analysis, Exploration, Manipulation, and Visualization of Cufflinks High-Throughput Sequencing Data." 2013; R package version 2.20.0.
Robinson, M. D., D. J. McCarthy, and G. K. Smyth. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics. 2009; 26 (1). Oxford University Press:139–40. https://doi.org/10.1093/bioinformatics/btp616.
Lun, ATL, Chen, Y, and Smyth, GK. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. Methods in Molecular Biology. 2016; 1418, 391–416
Smyth, G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004; 3, No. 1, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf
Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 2015; 43 (7):e47.
Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology. 2014; 15 (2). BioMed Central Ltd:R29+. https://doi.org/10.1186/gb-2014-15-2-r29.
