A short automatic metagenomic Assembly aNd gene Annotation pipeline. It assembles raw metagenomic sequencing data with MEGAHIT and then annotates the assembly with Prodigal.

Software information

Make sure to have the following software installed before running the pipeline:


  1. Clone this repository to your local system using the following command: git clone

  2. Create the AnA conda environment directly from env.yml:
    conda env create -f env.yml
    Then activate the environment:
    conda activate AnA

Alternatively, create a conda environment manually and install the following packages:
conda install -c bioconda megahit
conda install -c bioconda prodigal
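
Once the environment is active, a quick check that both tools are reachable can save a failed run later. The following is a minimal sketch, not part of the pipeline itself; the helper name `missing_tools` is hypothetical, and it assumes the bioconda packages expose executables named `megahit` and `prodigal` on the PATH.

```python
import shutil

# Executable names assumed to be provided by the bioconda packages above.
REQUIRED_TOOLS = ("megahit", "prodigal")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If anything is reported missing, re-activating the AnA environment is usually the first thing to try.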


  1. Place your metagenomic sequencing files in FASTQ format in a single directory (for now, only merged read files are supported). Pass the name of this directory as <input_dir>. The directory must be inside AnA/.

  2. Customize MEGAHIT and Prodigal options (if needed) in the script:

    • MEGAHIT options:

      • min_contig_length: The minimum contig length to be reported. Default value is 1000.
      • k_min: The minimum value of k (k-mer size) for the final assembly. Default value is 21.
      • k_step: The increment of k-mer size at each iteration (<= 28); must be an even number. Default value is 20.
      • cpu: Number of threads. Default value is 20.
    • Prodigal options:

      • m: Enable metagenomic mode for gene prediction. Default value is true.
  3. Execute the pipeline by running the script:

python <input_dir> <cpu> <min_contig_len> <kstep> <kmin>

For example, if the merged reads are in a directory called 'data', run the following command:

python data

Make sure that the scripts are executable. If not, run: chmod +x
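
To illustrate how the positional arguments and their defaults fit together, here is a minimal sketch of an argument parser. It is an assumption about how such a script could handle its inputs, not the pipeline's actual code: the function names `build_parser` and `parse_args` are hypothetical, the defaults are the ones listed above, and the only validation applied is the documented constraint that kstep be even and at most 28.

```python
import argparse


def build_parser():
    """Positional arguments in the documented order; all but input_dir are optional."""
    parser = argparse.ArgumentParser(description="Assembly and gene annotation sketch")
    parser.add_argument("input_dir", help="directory containing the merged FASTQ files")
    parser.add_argument("cpu", nargs="?", type=int, default=20)
    parser.add_argument("min_contig_len", nargs="?", type=int, default=1000)
    parser.add_argument("kstep", nargs="?", type=int, default=20)
    parser.add_argument("kmin", nargs="?", type=int, default=21)
    return parser


def parse_args(argv):
    """Parse argv and enforce the documented k_step constraint."""
    args = build_parser().parse_args(argv)
    if args.kstep % 2 != 0 or args.kstep > 28:
        raise ValueError("kstep must be an even number <= 28")
    return args


if __name__ == "__main__":
    # Equivalent to passing only the input directory on the command line.
    print(parse_args(["data"]))
```

With this layout, passing only the directory name uses every default, while additional values override them from left to right.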

The pipeline will perform the following steps:

  • Assembly with MEGAHIT: The metagenomic reads are assembled into contigs in the assembly/{seq_name}_megahit/ directory.
  • Annotation with Prodigal: Genes are predicted from the assembly and written to the .faa (protein sequences) and .ffn (nucleotide sequences) files in the annotations directory.

The final results will be available in the assembly and annotations directories.
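
The two steps can be sketched as command lines built in Python. This is an illustration under stated assumptions rather than the pipeline's actual implementation: the function names are hypothetical, the reads are assumed to be passed to MEGAHIT as single-end input (`-r`) because they are merged, the contigs are assumed to be in MEGAHIT's usual `final.contigs.fa` output file, and the output file names `{seq_name}.faa` and `{seq_name}.ffn` are chosen for the example.

```python
from pathlib import Path


def megahit_command(reads, seq_name, cpu=20, min_contig_len=1000, kmin=21, kstep=20):
    """Build the MEGAHIT call for one merged FASTQ file."""
    out_dir = Path("assembly") / f"{seq_name}_megahit"
    return [
        "megahit",
        "-r", str(reads),                 # merged reads treated as single-end input (assumption)
        "-o", str(out_dir),
        "-t", str(cpu),
        "--min-contig-len", str(min_contig_len),
        "--k-min", str(kmin),
        "--k-step", str(kstep),
    ]


def prodigal_command(seq_name, meta=True):
    """Build the Prodigal call on the contigs produced for seq_name."""
    contigs = Path("assembly") / f"{seq_name}_megahit" / "final.contigs.fa"
    out_dir = Path("annotations")
    cmd = [
        "prodigal",
        "-i", str(contigs),
        "-a", str(out_dir / f"{seq_name}.faa"),  # protein translations (file name is an assumption)
        "-d", str(out_dir / f"{seq_name}.ffn"),  # nucleotide sequences (file name is an assumption)
    ]
    if meta:
        cmd += ["-p", "meta"]  # metagenomic mode
    return cmd


if __name__ == "__main__":
    print(" ".join(megahit_command("data/sample1.fastq", "sample1")))
    print(" ".join(prodigal_command("sample1")))
```

Each list can be handed to `subprocess.run(cmd, check=True)` to execute the step; building the commands as lists avoids shell quoting issues with file names.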


This code is shared under the GNU General Public License, Version 3.


If you have any questions or comments about the pipeline (or need to add more options), feel free to contact me at [email protected].