This repository contains a `snakemake` workflow for processing Illumina sequencing data, optimized and validated for metagenomics. The end-to-end workflow, `up_illumina_wf.snakefile`, takes raw reads through to taxonomic annotation: quality control, human read filtering, de novo assembly, annotation, and result summarization. The Snakefile is designed to be resource-aware, modular, and easy to configure, with outputs dynamically organized based on the current date.
- **Dynamic output folder & raw data linking**
  - Automatically creates an output folder named after the current date (e.g., `processed_ddmmyy`); the dynamically generated folder keeps data management simple.
  - `hardlink_raw`: hard-links (or copies) raw FASTQ files from `raw_data/` into the output folder. A sketch of this logic follows.
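  A minimal sketch of how the date-stamped folder and the linking rule might look (the `OUTPUT_FOLDER` default and the fallback-to-copy logic are assumptions, not the repository's exact code; only R1 is shown, R2 is handled the same way):

  ```snakemake
  from datetime import datetime

  # Default output folder, overridable via --config OUTPUT_FOLDER=...
  OUTPUT_FOLDER = config.get(
      "OUTPUT_FOLDER", "processed_" + datetime.now().strftime("%d%m%y")
  )

  rule hardlink_raw:
      input:
          "raw_data/{run}/{sample}_R1_001.fastq.gz",
      output:
          OUTPUT_FOLDER + "/{run}/{sample}/raw/{sample}_R1_001.fastq.gz",
      run:
          import os, shutil
          try:
              # hard-link when source and target share a filesystem
              os.link(input[0], output[0])
          except OSError:
              # otherwise fall back to a full copy
              shutil.copy2(input[0], output[0])
  ```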
- **Quality control & deduplication**
  - Performs adapter detection, quality trimming, and read deduplication in a single step using `fastp`.
  - `QC_after_dedup`: generates cleaned, deduplicated reads and provides an HTML/JSON report detailing QC metrics (sketch below).
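  A hedged sketch of how this rule might invoke `fastp` (the exact flags and file layout are assumptions):

  ```snakemake
  rule QC_after_dedup:
      input:
          r1="raw/{sample}_R1_001.fastq.gz",
          r2="raw/{sample}_R2_001.fastq.gz",
      output:
          r1="dedup_qc/{sample}_R1.fastq.gz",
          r2="dedup_qc/{sample}_R2.fastq.gz",
          html="dedup_qc/{sample}_fastp.html",
          json="dedup_qc/{sample}_fastp.json",
      threads: 4
      # adapter detection, quality trimming, and deduplication in one pass
      shell:
          "fastp --in1 {input.r1} --in2 {input.r2} "
          "--out1 {output.r1} --out2 {output.r2} "
          "--dedup --detect_adapter_for_pe "
          "--html {output.html} --json {output.json} "
          "--thread {threads}"
  ```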
- **Human read filtering**
  - Removes human (host) reads from the QC'ed data before assembly; the filtered reads are stored in `filtered/`.
- **De novo assembly**
  - Assembles filtered reads using metaSPAdes, optimized for metagenomic data.
  - `assemble_filtered`: generates assembled contigs and post-processes them, e.g., renaming contigs to include run and barcode information for clarity (sketch below).
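  A minimal sketch of the assembly step (paths, thread counts, and the contig-renaming command are assumptions):

  ```snakemake
  rule assemble_filtered:
      input:
          r1="filtered/{sample}_R1.fastq.gz",
          r2="filtered/{sample}_R2.fastq.gz",
      output:
          contigs="assembly/{run}_{sample}/contigs.fasta",
      threads: 8
      # metaSPAdes writes contigs.fasta into the -o directory; the sed call
      # then tags each contig header with run and sample for traceability
      shell:
          "spades.py --meta -1 {input.r1} -2 {input.r2} "
          "-o assembly/{wildcards.run}_{wildcards.sample} -t {threads} && "
          "sed -i 's/^>/>{wildcards.run}_{wildcards.sample}_/' {output.contigs}"
  ```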
- **Annotation**
  - Runs high-speed homology searches with DIAMOND BLASTX (e-value cutoff of 1e-5) to annotate contigs against a protein database.
  - `blastx_assembled`: produces a tab-delimited output with taxonomic and functional information for each contig (sketch below).
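  A hedged sketch of the DIAMOND call (the database path and the output columns are assumptions; taxonomic fields such as `staxids` require a `.dmnd` database built with taxonomy mapping files):

  ```snakemake
  rule blastx_assembled:
      priority: 2   # see "Rule prioritization" below
      input:
          contigs="assembly/{run}_{sample}/contigs.fasta",
      output:
          tsv="annotation/{run}_{sample}_blastx.tsv",
      threads: 16
      shell:
          "diamond blastx --query {input.contigs} "
          "--db /path/to/protein_db.dmnd "
          "--evalue 1e-5 --threads {threads} "
          "--outfmt 6 qseqid sseqid pident length evalue bitscore staxids "
          "--out {output.tsv}"
  ```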
- **Mapping & statistics**
  - Maps reads back to the assembled contigs (with `bwa-mem2` + `samtools`) and uses `seqkit` for read-count statistics.
  - `map_reads_to_contigs`: generates BAM files, coverage files, and mapping statistics (sketch below).
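  A minimal sketch of the mapping step (the file layout and the exact `samtools` invocations are assumptions):

  ```snakemake
  rule map_reads_to_contigs:
      input:
          contigs="assembly/{run}_{sample}/contigs.fasta",
          r1="filtered/{sample}_R1.fastq.gz",
          r2="filtered/{sample}_R2.fastq.gz",
      output:
          bam="mappings/{run}_{sample}_mappings.bam",
          cov="{run}_{sample}_coverage.txt",
      threads: 8
      # index the contigs, map the filtered reads, sort into a BAM,
      # then report per-base depth for every contig position
      shell:
          "bwa-mem2 index {input.contigs} && "
          "bwa-mem2 mem -t {threads} {input.contigs} {input.r1} {input.r2} "
          "| samtools sort -@ {threads} -o {output.bam} - && "
          "samtools index {output.bam} && "
          "samtools depth -a {output.bam} > {output.cov}"
  ```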
- **Merge results and organization**
  - `merge_results`: merges coverage and annotation results into summary tables.
  - `store_completed_annotation_files`: creates renamed, centrally linked annotation files for easier downstream analysis. It renames `completed_{sample}_annotation.tsv` files, links them into a central `annotations/` folder, and removes temporary files (sketch below).
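  A hedged sketch of the central-linking step (the input naming and the hard-link approach are assumptions):

  ```snakemake
  rule store_completed_annotation_files:
      input:
          tsv="summary/{sample}_merged_annotation.tsv",
      output:
          "annotations/completed_{sample}_annotation.tsv",
      run:
          import os
          # Snakemake pre-creates the output directory; hard-link the
          # finished table into the central annotations/ folder
          os.link(input.tsv, output[0])
  ```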
- **Rule prioritization**
  - Certain rules, such as `blastx_assembled` and `assemble_filtered`, have assigned priorities to optimize scheduling and execution (see "Adjusting priorities" below).
- Clone the repository:

  ```bash
  git clone https://github.com/divprasad/updated_illumina_workflow.git
  cd updated_illumina_workflow
  ```
- Set up a `conda` environment:

  ```bash
  conda env create -f environment.yaml
  conda activate illumina_wf_env
  ```
- Set up the working directory: place raw Illumina paired-end FASTQ files in the following directory structure:

  ```
  raw_data/{run}/{sample}_R1_001.fastq.gz
  raw_data/{run}/{sample}_R2_001.fastq.gz
  ```

  This structure is critical for the workflow to recognize the `run` and `sample` wildcards correctly (see the sketch below); ensure the file names follow this format.
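  A sketch of how these wildcards can be discovered from the layout (`glob_wildcards` and `expand` are standard Snakemake; the variable names and the target pattern here are assumptions about the Snakefile's internals):

  ```snakemake
  RUNS, SAMPLES = glob_wildcards("raw_data/{run}/{sample}_R1_001.fastq.gz")

  rule all:
      input:
          expand(
              "processed_ddmmyy/{run}/{sample}/completed_{sample}_annotation.tsv",
              zip, run=RUNS, sample=SAMPLES,
          )
  ```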
- Run the workflow wrapper: launch `execute-and-log.sh`, which provides step-by-step guidance on executing `snakemake` and manages execution logging.
- Alternatively, launch the workflow with minimal settings:

  ```bash
  snakemake -s up_illumina_wf.snakefile --cores 8
  ```

  The pipeline will run on up to 8 cores and save the results in a directory named `processed_ddmmyy` (default naming based on the current date).
Example project layout after cloning the repo and executing the workflow:
```text
updated_illumina_workflow/
├── raw_data/                        # SET-UP BY USER [IMPORTANT]
│   ├── run123/                      # Each run has its OWN FOLDER: runXYZ
│   │   ├── sampleA_R1_001.fastq.gz  # paired-end reads for each sample
│   │   ├── sampleA_R2_001.fastq.gz  # strictly follow naming convention
│   │   ├── sampleB_R1_001.fastq.gz  # (can have) several samples per run
│   │   └── ...
│   └── run456/                      # run names MUST be UNIQUE
│       ├── sampleC_R1_001.fastq.gz  # sample names don't have to be unique
│       └── ...                      # across the runs
├── up_illumina_wf.snakefile         # snakefile
├── environment.yaml                 # dependencies for conda installation
├── execute-and-log.sh               # easy-to-use wrapper
├── README.md
└── processed_ddmmyy/                # RESULTS: AUTO generated by snakemake
    ├── run123/
    │   └── sampleA/
    │       ├── completed_sampleA_annotation.tsv
    │       ├── sampleA_coverage.txt
    │       ├── raw/
    │       ├── dedup_qc/
    │       ├── filtered/
    │       ├── assembly/
    │       ├── mappings/
    │       ├── summary/
    │       └── ...
    ├── run456/
    │   └── sampleC/
    │       └── ...
    └── ...
```
- `raw_data/`: contains the raw FASTQ files to be processed.
- `processed_ddmmyy/`: generated by the workflow; stores the processed outputs. Files are organized into subfolders for each `{run}` and `{sample}`, following the directory structure `{OUTPUT_FOLDER}/{run}/{sample}/{rule_output}/`.
- `completed_{sample}_annotation.tsv`: stores the merged DIAMOND BLASTX annotation results for downstream analysis.
- `dedup_qc/`: contains the QC'ed and deduplicated reads, along with the `fastp` reports.
- `filtered/`: holds the human-filtered reads (after removing host contamination).
- `assembly/contigs.fasta`: contains the contigs generated by the metaSPAdes assembly.
- `{sample}_coverage.txt`: stores the depth at each position of each contig.
- `mappings/`: contains `{sample}_mappings.bam` (a binary file with the filtered reads mapped back to contigs) and `{sample}_readmap.txt` (a human-readable file showing the reads that map to each contig).
- `summary/`: `{run}_{sample}_read_processing_stats.tsv` stores read statistics across all stages of read processing; `{sample}_bam_stat.tsv` stores the total number of alignments (including multiple alignments of the same read).
The workflow can be launched with an explicit number of cores, a memory budget, and other scheduling parameters:

```bash
snakemake -s up_illumina_wf.snakefile \
    --resources mem_gb=192 \
    --cores 24 \
    --rerun-triggers mtime \
    --rerun-incomplete \
    --latency-wait 30 \
    --max-jobs-per-second 2 \
    --max-status-checks-per-second 4
```
- Default output: if no custom folder is set via the `--config` flag, the default output directory is `processed_ddmmyy`.
- `--cores`: use up to 24 CPU cores.
- `--resources mem_gb=192`: allocate 192 GB of memory (roughly 8× the number of cores).
- `--rerun-triggers mtime`: force a rerun if file modification times indicate that inputs have changed.
- `--rerun-incomplete`: rerun incomplete jobs from previous executions.
- `--latency-wait 30`: wait up to 30 seconds for input files to appear (useful on network filesystems).
- Job submission and status-check rates: limit new job submissions (`--max-jobs-per-second`) to 2 per second and job status checks (`--max-status-checks-per-second`) to 4 per second.
- Configuration: pass the `--config` flag to set a custom output folder:

  ```bash
  snakemake -s up_illumina_wf.snakefile \
      --config OUTPUT_FOLDER="processed_mydate" \
      --cores 16
  ```
- Threads and memory: each rule in the Snakefile can dynamically allocate threads and memory.
- Adjusting resources: modify the `threads:` or `resources:` directives inside each rule for finer control.
- Adjusting priorities: some rules carry priority settings to optimize execution order, so that computationally intensive steps start earlier, preventing bottlenecks and minimizing total runtime. A higher priority means the rule is scheduled first. Example priority assignments:
  - `blastx_assembled` has priority 2 → the most time-consuming step, so it is scheduled first.
  - `assemble_filtered` has priority 1 → runs before all other jobs except `blastx_assembled`.
Note: by default, all rules have priority 0. Adjust priority levels as needed to optimize execution; a minimal sketch of these directives follows.
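A minimal sketch (with a hypothetical rule name, `heavy_step`, and illustrative values, not the repository's actual settings) showing how `threads:`, `resources:`, and `priority:` are set per rule:

```snakemake
rule heavy_step:
    priority: 2          # scheduled ahead of ready lower-priority jobs
    threads: 16          # capped by --cores at runtime
    resources:
        mem_gb=128,      # drawn from the --resources mem_gb pool
    input:
        "example/input.txt",
    output:
        "example/output.txt",
    shell:
        "cp {input} {output}"
```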
Adapted by: Div Prasad (Jul'24–Feb'25)
Original Workflow by: Nathalie Worp & David Nieuwenhuijse
This `snakemake` pipeline is an adaptation of the original work by Nathalie Worp and David Nieuwenhuijse, incorporating updates, enhancements, and additional features. Special thanks to both of them for their contributions; their work laid the foundation for this repository.