An RNASeq pipeline designed to accomodate both standard use-cases (ie single organism) as well as the case where both a host and partner (symbiont, parasite, commensal) are of interest.
graph TD;
fastq-->fastp;
fastp-->RSEM;
fasta_host-->fasta_ref;
fasta_other-->fasta_ref;
fasta_ref-->bowtie2_build;
bowtie2_build-->RSEM;
First install and configure nextflow. See here for instructions specific to JCU machines (zodiac, genomics1, genomics2)
Run a test to make sure everything is working. This test should work on a system with singularity installed.
nextflow run marine-omics/morp -profile singularity,test_pe -r main
As a minimum, morp
requires a set of raw read data (fastq files) and a reference transcriptome. Assuming you have raw data paths in a file named samples.csv
and a host reference transcriptome in the file host.fasta
you would run an analysis with;
nextflow run marine-omics/morp -profile zodiac -r main --samples samples.csv --refa host.fasta --outdir myout
Note the profile here is zodiac
which will load predefined settings for the JCU HPC. Other alternatives include genomics
or a custom profile that you create yourself with -c custom.config
.
If you would like to map reads against a dual transcriptome (eg host and symbiont) you can provide a second reference transcriptome like this
nextflow run marine-omics/morp -profile zodiac -r main --samples samples.csv --refa host.fasta --refb symbiont.fasta --outdir myout
In this case morp
will start by combining transcripts from refa
and refb
into a single file. The names of all transcripts in refa
will be prepended with the prefix "a_" and those in refb
will be prepended with "b_". This will allow you to separate them out when you perform your statistical analysis.
The paths to fastq files must be provided in csv
format as in the example below;
sample,fastq_1,fastq_2
1,sample1_r1.fastq.gz,sample1_r2.fastq.gz
2,sample2_r1.fastq.gz,sample2_r2.fastq.gz
See here for more detail on the samples.csv format.
Reference sequences should be provided in fasta
format. The first reference (refa
) is always required, however the second refb
need only be provided for dual organism studies and can be omitted in single-organism analysis.
Note that both fasta files must be provided in uncompressed format (ie no .gz
). Also note that all references provided should be transcripts not genomic sequences. This pipeline is not suitable for gapped alignment to a genome.
If you are working with coral sequences you might be unsure of the correct reference to use for refb
. A good place to start is to run the moqc pipeline which should give you an idea of the symbiont genus that is most dominant. In the most common case this will be Cladocopium in which case a good choice for the reference sequence is the transcriptome available from reefgenomics.
The default analysis in morp
assumes that every transcript is its own gene. To account for the fact that multiple isoforms can originate from the same gene you will need to provide a transcript-to-gene-map
file for each reference.
This file should conform to the requirements of rsem-prepare-reference
. Each line should be of the form:
gene_id transcript_id
with the two fields separated by a tab character.
Note that if you are performing a dual analysis (ie both refa
and refb
specified) and you provide a mapping file for one ref you must also provide it for the other. Mapping files are provided with the --refa_map
and --refb_map
options.
Successful completion of the pipeline will produce outputs in the <outdir>
you provided including;
- *multiqc* : Read mapping statistics . These statistics are produced by rsem based on Bowtie2 alignments.
- *rsem* : Full rsem outputs. This includes the aligned reads as `.bam` files as well as read counts in `isoforms.results` and `genes.results` files.
The tximport R package is recommended for importing rsem
results into R for later analysis.
The rsem vignette provides specific advice on importing results at either the gene or isoform level.
The example below shows importing results at the gene level. Unless you have an excellent isoform-level reference for your organism this is the recommended approach.
library(tximport)
genes_results_files <- list.files("morp_outputs/rsem/","*.genes.results",full.names = TRUE)
sample_names <- basename(genes_results_files) %>% str_extract("([^\\.])*")
names(genes_results_files) <- sample_names
txi.rsem <- tximport(genes_results_files, type = "rsem", txIn = FALSE, txOut = FALSE)
The resulting tximport object can then be used for analysis with DESeq2.
When working with dual organism data the simplest approach is to keep data from both organisms together through the DESeq2 analysis. Results can then be partitioned based on the gene id prefix a_
or b_
.