Pan-Genome Analysis Pipeline 2
The input directory contains all the genome and annotation files.
PGAP2 supports multiple input formats: GFF files in the format produced by Prokka, GFF files with their corresponding genome FASTA files stored separately, GenBank flat files (GBFF), or genome FASTA files alone (in which case --reannot is required).
Different formats of input files can be mixed in one input directory. PGAP2 will recognize and process them based on their prefixes and suffixes.
pgap2 main -i inputdir/ -o outputdir/
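Before running the command above on a mixed directory, it can help to see which kinds of files it contains. The following is a minimal sketch, not part of PGAP2: the function name summarise_inputs and the suffix list are assumptions chosen for illustration, and PGAP2's own recognition rules may differ.

```shell
#!/usr/bin/env bash
# Count the files in a directory by suffix.
# NOTE: the suffix list is an illustrative assumption, not PGAP2's actual rule set.
summarise_inputs() {
    local dir="$1"
    local suffix count
    for suffix in gff fa fasta fna gbff gbk; do
        # Only look at regular files directly inside the directory.
        count=$(find "$dir" -maxdepth 1 -type f -name "*.${suffix}" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$suffix" "$count"
    done
}
```

Calling `summarise_inputs inputdir/` prints one tab-separated line per suffix, which makes it easy to spot a GFF file that is missing its FASTA partner.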
PGAP2 performs quality checks and visualization during the preprocessing step. It generates an interactive HTML file and corresponding vector figures to help users understand their input data. The input data and pre-alignment results are stored as a pickle file so that the same calculation step can be restarted quickly.
pgap2 prep -i inputdir/ -o outputdir/
PGAP2 also provides a postprocessing pipeline. The postprocessing module integrates several submodules, such as statistical analysis, single-copy tree building, population clustering, and Tajima's D test. Whichever submodule you want to use, you can always run it as follows:
pgap2 post [submodule] [options] -i inputdir/ -o outputdir/
Here, inputdir is the output directory of the main module.
PGAP2 also supports statistical analysis on a PAV file independently of the main module:
pgap2 post profile --pav your_pav_file -o outputdir/
The best way to install the full version of the PGAP2 package is with conda:
conda create -n pgap2 -c conda-forge -c bioconda -c defaults pgap2
Alternatively, it is often faster to use the mamba solver:
conda create -n pgap2 -c conda-forge mamba
conda activate pgap2
mamba install -c conda-forge -c bioconda -c defaults pgap2
If you only need a specific function, such as partitioning, and do not want to install all the extra software required by the full version of PGAP2, you can install PGAP2 alone:
pip install pgap2
Or from source:
git clone https://github.com/bucongfan/PGAP2
Then install, by yourself, only the extra software that the function you need requires.
The dependencies of PGAP2 are listed below. PGAP2 checks whether each one is available in your environment path or in the pgap2/dependencies folder.
- One of clustering software
- mcl
- One of alignment software
- Using --retrieve to retrieve missing gene loci
- Using --reannot to re-annotate your genome
- One of MSA software
- ClipKIT
- One of phylogenetic tree construction software
- ClonalFrameML
- maskrc-svg
- fastbaps
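To see in advance which of these tools are already reachable, the environment-path half of that check can be approximated by hand. This is a hedged sketch and not PGAP2's own check: the function name check_tools is hypothetical, it does not look inside the pgap2/dependencies folder, and the executable names you pass in are your responsibility.

```shell
#!/usr/bin/env bash
# Report which of the given executables can be found on PATH.
# Returns 0 when all are found and 1 when at least one is missing.
# NOTE: hypothetical helper; it does not inspect pgap2/dependencies.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'missing  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools mcl clipkit` reports on two of the tools listed above, assuming they are installed under those executable names.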
PGAP2 calls the Rscript found in your environment path. The following R packages must be installed in its library:
- ggpubr
- ggrepel
- dplyr
- tidyr
- patchwork
- optparse
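One way to confirm that these packages are visible to Rscript is to ask R which of them are not installed. The helper below is an illustrative sketch (the name r_missing_expr is hypothetical and not part of PGAP2): it only builds the R expression, so it can be inspected before being passed to Rscript.

```shell
#!/usr/bin/env bash
# Print an R expression that lists which of the given packages are NOT installed.
# NOTE: hypothetical helper for illustration; PGAP2 does not ship it.
r_missing_expr() {
    local quoted=""
    local pkg
    for pkg in "$@"; do
        # Build a comma-separated list of double-quoted package names.
        quoted="${quoted:+$quoted, }\"$pkg\""
    done
    printf 'pkgs <- c(%s); cat(setdiff(pkgs, rownames(installed.packages())), sep = "\\n")\n' "$quoted"
}
```

Running `Rscript -e "$(r_missing_expr ggpubr ggrepel dplyr tidyr patchwork optparse)"` then prints the name of each missing package, one per line, and prints nothing when all six are installed.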
For more details, please refer to the documentation on the wiki.