Illumina Workflow for Metagenomic Data Processing

This repository contains a Snakemake workflow for processing Illumina sequencing data, optimized and validated for metagenomics. The end-to-end workflow, up_illumina_wf.snakefile, covers every step from raw reads to taxonomic annotation: quality control, human read filtering, de novo assembly, annotation, and result summarization. The Snakefile is designed to be resource-aware, modular, and easy to configure, with outputs dynamically organized based on the current date.

Workflow breakdown: from raw reads to annotation

  1. Dynamic output folder & raw data linking

    • Automatically creates an output folder named after the current date (e.g., processed_ddmmyy).
    • hardlink_raw: Hard-links (or copies) raw FASTQ files from raw_data/ into the date-stamped folder, keeping each processing run's outputs cleanly separated.
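
    A minimal sketch of how such a date-stamped folder can be derived inside a Snakefile; the variable name is illustrative, but the format matches the processed_ddmmyy convention (the actual Snakefile may differ):

      # Resolve the output folder once, at parse time: use the value from
      # --config OUTPUT_FOLDER=... if given, otherwise fall back to today's
      # date. `config` is Snakemake's built-in configuration dict.
      from datetime import datetime

      OUTPUT_FOLDER = config.get(
          "OUTPUT_FOLDER", datetime.now().strftime("processed_%d%m%y")
      )
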
  2. Quality control & deduplication

    • Performs adapter detection, quality trimming, and read deduplication in a single step using fastp.
    • QC_after_dedup: Generates cleaned, deduplicated reads, and provides an HTML/JSON report detailing QC metrics.
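
    Conceptually, the rule wraps a single fastp call. The sketch below uses abbreviated paths and an illustrative thread count; the real rule nests its outputs under {OUTPUT_FOLDER}/{run}/{sample}/:

      rule QC_after_dedup:
          input:
              r1="raw/{sample}_R1_001.fastq.gz",
              r2="raw/{sample}_R2_001.fastq.gz",
          output:
              r1="dedup_qc/{sample}_R1.fastq.gz",
              r2="dedup_qc/{sample}_R2.fastq.gz",
              html="dedup_qc/{sample}_fastp.html",
              json="dedup_qc/{sample}_fastp.json",
          threads: 4
          # --dedup removes duplicate read pairs; --detect_adapter_for_pe
          # improves adapter detection for paired-end data
          shell:
              "fastp -i {input.r1} -I {input.r2} "
              "-o {output.r1} -O {output.r2} "
              "--dedup --detect_adapter_for_pe "
              "-h {output.html} -j {output.json} -w {threads}"
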
  3. Human read filtering

    • Removes contaminating human (host) reads by aligning against the human reference genome with bwa-mem2, then uses samtools to retain only unmapped reads.
    • filter_human: Generates filtered reads free of host contamination to improve downstream assembly and annotation accuracy.
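
    A sketch of the filtering logic, assuming a hypothetical human reference path and abbreviated sample paths; the samtools flags keep only pairs where both mates failed to map:

      rule filter_human:
          input:
              r1="dedup_qc/{sample}_R1.fastq.gz",
              r2="dedup_qc/{sample}_R2.fastq.gz",
              ref="refs/GRCh38.fna",  # hypothetical reference location
          output:
              r1="filtered/{sample}_R1.fastq.gz",
              r2="filtered/{sample}_R2.fastq.gz",
          threads: 8
          # -f 12 keeps read pairs with both mates unmapped;
          # -F 256 drops secondary alignments
          shell:
              "bwa-mem2 mem -t {threads} {input.ref} {input.r1} {input.r2} "
              "| samtools view -b -f 12 -F 256 - "
              "| samtools fastq -1 {output.r1} -2 {output.r2} -"
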
  4. De novo assembly

    • Assembles filtered reads using metaSPAdes, optimized for metagenomic data.
    • assemble_filtered: Generates assembled contigs; includes post-processing such as renaming contigs to include run and barcode information for clarity.
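
    The assembly step amounts to a metaSPAdes call plus a contig-header rewrite. Paths, thread count, and the memory cap below are illustrative:

      rule assemble_filtered:
          input:
              r1="filtered/{sample}_R1.fastq.gz",
              r2="filtered/{sample}_R2.fastq.gz",
          output:
              "assembly/{sample}/contigs.fasta",
          threads: 16
          # --meta selects the metaSPAdes pipeline; the trailing sed call
          # prepends the sample name to contig headers (the real rule also
          # adds run/barcode information)
          shell:
              "spades.py --meta -1 {input.r1} -2 {input.r2} "
              "-o assembly/{wildcards.sample} -t {threads} -m 128 "
              "&& sed -i 's/^>/>{wildcards.sample}_/' {output}"
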
  5. Annotation

    • Runs high-speed homology searches with DIAMOND BLASTX (E-value cutoff of 1e-5) to annotate contigs against a protein database.
    • blastx_assembled: Produces a tab-delimited output with taxonomic and functional information for each contig.
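
    A sketch of the DIAMOND call; the database path is hypothetical, and taxonomic columns in the output assume the database was built with taxonomy mapping:

      rule blastx_assembled:
          input:
              contigs="assembly/{sample}/contigs.fasta",
              db="db/protein_db.dmnd",  # hypothetical DIAMOND database
          output:
              "{sample}_blastx.tsv",
          threads: 16
          # --outfmt 6 produces tab-delimited output; extra fields such as
          # staxids can be requested if the database carries taxonomy
          shell:
              "diamond blastx --query {input.contigs} --db {input.db} "
              "--evalue 1e-5 --outfmt 6 "
              "--threads {threads} --out {output}"
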
  6. Mapping & statistics

    • Maps reads back to the assembled contigs (with bwa-mem2 + samtools) and uses seqkit for read-count statistics.
    • map_reads_to_contigs: Generates BAM files, per-contig coverage files, and mapping statistics.
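
    A sketch of the mapping-and-statistics step with abbreviated paths; the real rule additionally derives read-count tables with seqkit:

      rule map_reads_to_contigs:
          input:
              contigs="assembly/{sample}/contigs.fasta",
              r1="filtered/{sample}_R1.fastq.gz",
              r2="filtered/{sample}_R2.fastq.gz",
          output:
              bam="mappings/{sample}_mappings.bam",
              cov="{sample}_coverage.txt",
          threads: 8
          # samtools depth -a reports depth at every position of every contig
          shell:
              "bwa-mem2 index {input.contigs} && "
              "bwa-mem2 mem -t {threads} {input.contigs} {input.r1} {input.r2} "
              "| samtools sort -@ {threads} -o {output.bam} - && "
              "samtools index {output.bam} && "
              "samtools depth -a {output.bam} > {output.cov}"
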
  7. Result merging and organization

    • merge_results: Merges coverage and annotation results into summary tables.
    • store_completed_annotation_files: Renames completed_{sample}_annotation.tsv files, links them in a central annotations/ folder for easier downstream access, and removes temporary files.
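
    The merge can be pictured as a join of per-contig coverage onto the BLASTX table. The sketch below uses pandas with illustrative file and column names, not the workflow's actual ones:

      # Summarize per-contig depth and attach it to the annotation table.
      import pandas as pd

      # samtools depth output: contig, position, depth
      cov = pd.read_csv("sampleA_coverage.txt", sep="\t",
                        names=["contig", "pos", "depth"])
      mean_cov = (cov.groupby("contig", as_index=False)["depth"].mean()
                     .rename(columns={"depth": "mean_depth"}))

      ann = pd.read_csv("sampleA_blastx.tsv", sep="\t", header=None)
      ann = ann.rename(columns={0: "contig"})  # outfmt 6: first column = query id

      merged = ann.merge(mean_cov, on="contig", how="left")
      merged.to_csv("completed_sampleA_annotation.tsv", sep="\t", index=False)
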
  8. Rule prioritization

    • Certain rules, such as blastx_assembled and assemble_filtered, are assigned priorities to optimize scheduling and execution (see Resource Allocation and Priorities below).

Installation and quick start

  1. Clone the repository:

    git clone https://github.com/divprasad/updated_illumina_workflow.git
    cd updated_illumina_workflow
    
  2. Set up a conda environment:

    conda env create -f environment.yaml
    conda activate illumina_wf_env
    
  3. Stage the input data: Place raw Illumina paired-end FASTQ files in the following directory structure:

    raw_data/{run}/{sample}_R1_001.fastq.gz
    raw_data/{run}/{sample}_R2_001.fastq.gz
    

    This structure is critical for the workflow to recognize the run and sample wildcards correctly, as the sketch below illustrates. Ensure the file names follow this format.
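
    A sketch of why the layout matters: a Snakefile can discover both wildcards from the raw data in one pass, but only for files that match the pattern exactly:

      # glob_wildcards (a Snakemake built-in) extracts {run} and {sample}
      # from every path matching the pattern; files that deviate from the
      # naming convention are silently skipped.
      RUNS, SAMPLES = glob_wildcards("raw_data/{run}/{sample}_R1_001.fastq.gz")
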

  4. Run the workflow wrapper:

    Launch execute-and-log.sh, which provides step-by-step guidance on executing Snakemake and manages execution logging.

  5. Alternatively, launch the workflow with minimal settings:

    snakemake -s up_illumina_wf.snakefile --cores 8
    

    The pipeline will run on 8 cores and save the results in a directory named processed_ddmmyy (the default name, based on the current date).

Project structure & key outputs

Example project layout after cloning the repo and executing the workflow:

updated_illumina_workflow/
├── raw_data/                            # SET-UP BY USER [IMPORTANT]
│   ├── run123/                          # Each run has its OWN FOLDER: runXYZ
│   │    ├── sampleA_R1_001.fastq.gz     # paired-end reads for each sample
│   │    ├── sampleA_R2_001.fastq.gz     # strictly follow naming convention
│   │    ├── sampleB_R1_001.fastq.gz     # (can have) several samples per run
│   │    └── ...
│   └── run456/                          # run names MUST be UNIQUE
│        ├── sampleC_R1_001.fastq.gz     # sample names don't have to be unique
│        └── ...                         # across the runs
├── up_illumina_wf.snakefile             # snakefile
├── environment.yaml                     # dependencies for conda installation
├── execute-and-log.sh                   # easy-to-use wrapper
├── README.md                          
└── processed_ddmmyy/                    # RESULTS: AUTO generated by snakemake
    ├── run123/
    │   └── sampleA/
    │       ├── completed_sampleA_annotation.tsv
    │       ├── sampleA_coverage.txt
    │       ├── raw/
    │       ├── dedup_qc/
    │       ├── filtered/
    │       ├── assembly/
    │       ├── mappings/
    │       ├── summary/
    │       └── ...
    ├── run456/
    │   └── sampleC/
    │       ├── ...
    └── ...
  • raw_data/ - Contains the raw FASTQ files to be processed.
  • processed_ddmmyy/ - Generated by the workflow, this directory stores the processed outputs, organized into subfolders for each {run} and {sample} following the structure {OUTPUT_FOLDER}/{run}/{sample}/{rule_output}/.

Output directories & files

  1. completed_{sample}_annotation.tsv - Stores merged DIAMOND BLASTX annotation results for downstream analysis.

  2. dedup_qc/ - Contains the QC'ed and deduplicated reads, along with fastp reports.

  3. filtered/ - Holds the human-filtered reads (after removing host contamination).

  4. assembly/contigs.fasta - Contains the contigs generated from metaSPAdes assembly.

  5. {sample}_coverage.txt - Stores the depth at each position of each contig.

  6. mappings/ - Contains {sample}_mappings.bam (a binary file with filtered-reads mapped back to contigs) and {sample}_readmap.txt (a human-readable file showing reads that map to each contig).

  7. summary/ - {run}_{sample}_read_processing_stats.tsv stores read statistics across all stages of read processing; {sample}_bam_stat.tsv stores the total number of alignments (including multiple alignments of the same read).

Advanced usage and configuration

The workflow can be launched with explicit control over cores, memory, and scheduler behavior, for example:

snakemake -s up_illumina_wf.snakefile \
    --resources mem_gb=192 \
    --cores 24 \
    --rerun-triggers mtime \
    --rerun-incomplete \
    --latency-wait 30 \
    --max-jobs-per-second 2 \
    --max-status-checks-per-second 4
  • Default output: If no custom folder is specified via the --config flag, the default output directory is processed_ddmmyy.

  • cores: Use up to 24 CPU cores.

  • resources mem_gb: Allocate up to 192 GB of memory (roughly 8 GB per core at 24 cores).

  • rerun-triggers mtime: Force a rerun if file modification times indicate that inputs have changed.

  • rerun-incomplete: Rerun incomplete jobs from previous executions.

  • latency-wait: Wait up to 30 seconds for input files to appear (useful on network filesystems).

  • Job submission and status-check rates: Limit new job submissions (max-jobs-per-second) to 2 per second and job status checks (max-status-checks-per-second) to 4 per second.

  • Configuration: Pass the --config flag to set a custom output folder:

    snakemake -s up_illumina_wf.snakefile \
        --config OUTPUT_FOLDER="processed_mydate" \
        --cores 16
    

Resource Allocation and Priorities

  • Threads and Memory: Each rule in the Snakefile can dynamically allocate threads and memory.

  • Adjusting Resources: Modify the threads: or resources: directives inside each rule for finer control (see the sketch after this list).

  • Adjusting Priorities: Some rules carry priority settings to optimize execution order, ensuring that computationally intensive steps start early, which prevents bottlenecks and minimizes total runtime. Among jobs whose inputs are ready, higher-priority rules are scheduled first. Example priority assignments:

    • blastx_assembled has priority 2 → The most time-consuming step, so it runs first.
    • assemble_filtered has priority 1 → Runs before all other jobs except blastx_assembled.

    Note: By default, all rules are priority 0. Adjust priority levels as needed to optimize execution.
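
A minimal sketch of how these directives can sit together on one rule. The numbers are illustrative, and the lambda on mem_gb shows a common Snakemake idiom (scaling memory with the retry attempt) rather than this Snakefile's exact settings:

    rule assemble_filtered:
        priority: 1    # among ready jobs, runs before all priority-0 rules
        threads: 16    # an upper bound; Snakemake caps it at --cores
        resources:
            # grows on retries; counted against the pool passed
            # via --resources mem_gb=N
            mem_gb=lambda wildcards, attempt: 128 * attempt,
        input:
            r1="filtered/{sample}_R1.fastq.gz",
            r2="filtered/{sample}_R2.fastq.gz",
        output:
            "assembly/{sample}/contigs.fasta",
        shell:
            "spades.py --meta -1 {input.r1} -2 {input.r2} "
            "-o assembly/{wildcards.sample} -t {threads} -m {resources.mem_gb}"
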

Acknowledgements

Adapted by: Div Prasad (Jul'24–Feb'25)
Original Workflow by: Nathalie Worp & David Nieuwenhuijse

This Snakemake pipeline is an adaptation of the original work by Nathalie Worp and David Nieuwenhuijse, incorporating updates, enhancements, and additional features. Special thanks to both for their contributions, which laid the foundation for this repository.
