Pipeline to analyze cheese samples starting from compressed FASTQ files (.fq.gz).
The program runs the main steps of a metagenomic analysis and saves results in numbered folders, one for each stage of the process.
The pipeline automatically executes these steps, in order:
1. Extraction and adapter trimming
   Tool: AdapterRemoval
   Extracts FASTQ files from archives (if present), then removes adapter and low-quality sequences, producing "clean" reads.
2. Host genome filtering
   Tool: Bowtie2
   Aligns reads to the host species genome (e.g., Bos taurus) and discards mapped reads, keeping only microbial sequences.
3. Taxonomic profiling
   Tool: MetaPhlAn
   Estimates the sample's taxonomic composition, reporting detected species and relative abundance.
4. Functional profiling
   Tool: HUMAnN
   Analyzes the biological/metabolic functions potentially present in the microbiome.
5. Metagenome assembly
   Tool: SPAdes
   Reconstructs contiguous sequences (contigs) by assembling reads.
6. Contig filtering
   Tool: custom scripts
   Selects contigs above length/coverage thresholds.
7. Index creation for binning
   Tool: Bowtie2
   Builds indices for contig binning.
8. Read mapping and coverage calculation
   Tool: Bowtie2 + custom scripts
   Maps reads back and computes coverage for each contig.
9. Metagenomic binning
   Tool: MetaBAT2
   Groups contigs into MAGs (Metagenome-Assembled Genomes).
10. MAG quality assessment
    Tool: CheckM, CheckM2
    Evaluates completeness and contamination of MAGs.
11. Filtering and annotation of high-quality MAGs
    Tool: custom scripts + TORMES
    Selects the best MAGs and annotates them (genes, pathways, resistance, etc.).
12. Final analysis and reporting
    Tool: custom scripts, R
    Produces summary tables and plots.
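Conceptually, the master script walks through these stages in order and gives each one its own numbered folder inside the run directory. The sketch below only illustrates that flow under assumed names (the step list is truncated, the run code and the per-step script paths are hypothetical); the real logic lives in scripts/run_pipeline.sh.

# Conceptual sketch only, not the actual run_pipeline.sh
run_dir="output/$(date +%Y%m%d)_PDMIDR"        # output/<date>_<code>; the code is an example
mkdir -p "$run_dir"
for step in 01_AdapterRemoval 03_bowtie2_output 04_metaphlan_output; do   # ...through step 15
  mkdir -p "$run_dir/$step"
  bash "scripts/pipeline/${step}.sh" "$run_dir/$step"   # hypothetical per-step script naming
done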
metacheese/
├── config/                    # Global configuration files
│   └── Config.yml             # Main pipeline parameters
│
├── data/                      # Support data and references
│   ├── calculate_diversity.R  # Diversity analysis in R
│   └── gene/                  # Reference genomes (e.g., Bos taurus)
│
├── docs/                      # Technical docs and figures
│
├── input/                     # Input data organized by sample
│   ├── campione_prova1/       # Example/test
│   └── PDO/                   # Real dataset
│       ├── D_PR_01A_L2_1.fq.gz
│       └── ...
│
├── output/                    # Output organized by run (date + code)
│   ├── 20250728_PDMIDR/
│   │   ├── 01_AdapterRemoval/
│   │   ├── 03_bowtie2_output/
│   │   └── config.yml
│   └── ...
│
├── scripts/                   # Pipeline scripts and templates
│   ├── run_pipeline.sh        # Master script
│   ├── pipeline/              # Individual step scripts
│   ├── templates/             # Script templates
│   └── utils/                 # Utilities (e.g., build_bowtie2_index.sh, delete.sh)
│
├── Dockerfile                 # Reproducible Docker environment
├── docker-compose.yml         # Advanced setup (volumes, container)
└── README.md                  # This file
- Docker (recommended version ≥ 20.10)
- docker-compose (for container and volume management)
- Recommended resources: ≥ 64 GB RAM and multi-core CPU (some steps are heavy).
No extra tools required: everything is already included in the container.
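If you are unsure whether the host meets these requirements, standard Linux commands give a quick check:

free -h     # total and available RAM
nproc       # number of CPU cores
df -h .     # free disk space in the current directory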
On Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y docker.io docker-compose
sudo systemctl enable docker --now
On macOS/Windows: download Docker Desktop and follow the official instructions.
Verify the installation:
docker --version
docker compose version
docker run hello-world
If you see "Hello from Docker!", the installation is OK.
- Go to the project folder (cloned or copied locally):
  cd metacheese
- Build the Docker image:
  docker build -t metacheese .
- Check the image exists:
  docker images
- Recommended start with docker-compose (volumes already mapped):
  docker compose up -d
  docker ps                                   # Running containers
  docker exec -it metacheese-container bash
  - Local folders remain synced with the container.
  - You can monitor progress directly under output/.
- (Alternative) Manual container start:
  docker run -it --rm \
    -v $PWD/scripts:/main/scripts \
    -v $PWD/config:/main/config \
    -v $PWD/data:/main/data \
    -v $PWD/input:/main/input \
    -v $PWD/output:/main/output \
    metacheese /bin/bash
Quick management
- Stop the container:
  docker stop metacheese-container
- List containers/images and remove:
  docker ps -a
  docker images
  docker rm metacheese-container
  docker rmi metacheese:latest
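If you mostly work through docker-compose, the two documented commands can be chained into one line (this assumes the compose service creates the container name used above):

docker compose up -d && docker exec -it metacheese-container bash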
For host filtering (e.g., Bos taurus) you need a Bowtie2 index.
- Download the genome (genomic FASTA, .fna/.fa/.fasta) from NCBI/Ensembl to your machine.
- Create the species folder and move the FASTA inside it (run these in the project folder "metacheese"):
  mkdir -p data/gene/<Name_host_genome>
  mv /path/to/your/file/<Name_file_host_genome>.fna data/gene/<Name_host_genome>/
  Optionally rename it to genome.fna:
  mv data/gene/<Name_host_genome>/<Name_file_host_genome>.fna data/gene/<Name_host_genome>/genome.fna
  Any filename is fine as long as it is inside data/gene/<Name_host_genome>/ and ends in .fa/.fna/.fasta.
- Build the index (run inside the container; a manual equivalent is sketched after the expected layout below):
  bash scripts/build_bowtie2_index.sh
  When prompted, enter:
  <Name_host_genome>
  The script creates index files with the prefix data/gene/<Name_host_genome>/<Name_host_genome>, as expected by Config.yml.
  Expected layout:
data/gene/<Name_host_genome>/
├── genome.fna                    # (or your .fna/.fa/.fasta)
├── <Name_host_genome>.1.bt2
├── <Name_host_genome>.2.bt2
├── <Name_host_genome>.3.bt2
├── <Name_host_genome>.4.bt2
├── <Name_host_genome>.rev.1.bt2
└── <Name_host_genome>.rev.2.bt2
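If you ever need to (re)build the index by hand, the standard Bowtie2 command below produces the same prefix and file layout; the exact options used internally by scripts/build_bowtie2_index.sh may differ, and the genome.fna name assumes you renamed the FASTA as suggested above.

# Manual alternative to the helper script
bowtie2-build data/gene/<Name_host_genome>/genome.fna data/gene/<Name_host_genome>/<Name_host_genome>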
- Prepare input data
  Create a folder under input/ (e.g., input/PDO/) and place the *.fq.gz files to analyze there.
- (Optional) Configure resources
  Edit config/Config.yml to set threads/RAM and other step parameters.
  FORMAT OF STEP DEFINITIONS:
  [stepXX]="<TEMPLATE_FILE> <OUTPUT_FILE> <ph1=YAMLkey1> <ph2=YAMLkey2> ..."
  - TEMPLATE_FILE = name of the template script (inside scripts/templates/)
  - OUTPUT_FILE = path where the generated script will be written
  - PREFIX = prefix used in placeholders (usually the step number, e.g. 03 or 00-01)
  - ph=YAMLkey = mapping between:
    - ph = placeholder name used in the template (@<PREFIX>_<ph>@)
    - YAMLkey = key name inside config.yml (under section stepXX)
  HOW TO CHANGE NAME OR VALUE:
  - If you only want to change the value → edit config.yml
  - If you want to rename the placeholder in the template → also change the left part (ph) in STEPS
  - If you want to rename the key in config.yml → also change the right part (YAMLkey) in STEPS
  Example of renaming (see the placeholder-substitution sketch after this list):
  Template:   @03_vartemplate@
  STEPS:      vartemplate=newvarconfig
  Config.yml: step03.newvarconfig: /new/path
  → Result: @03_vartemplate@ becomes /new/path
- Run the pipeline
  bash scripts/run_pipeline.sh
  - Choose 1 (new run)
  - Enter the input folder name (e.g., PDO)
  - Enter a descriptive code (e.g., PDMIDR)
  - This will create output/<date>_<code> (e.g., output/20250728_PDMIDR).
- During the run
  Steps are executed in order and results are saved in subfolders under output/<date>_<code>/.
- Resume an existing run
  bash scripts/run_pipeline.sh
  - Choose 2 (continue run)
  - Enter the full output folder name (e.g., 20250728_PDMIDR)
  - Specify the step to restart from (e.g., 05 or 04b-last).
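Placeholder-substitution sketch (referenced from the configuration step above). This only illustrates the mechanism behind the renaming example; the template/output file names and the exact yq/sed invocation are assumptions, not the actual code of run_pipeline.sh.

# Illustrative only: how the @03_vartemplate@ example could be rendered
value=$(yq '.step03.newvarconfig' config/Config.yml)                      # reads /new/path from Config.yml
sed "s|@03_vartemplate@|${value}|g" \
    scripts/templates/step03_template.sh > scripts/pipeline/step03.sh     # hypothetical file names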
After each run, you'll find a new folder under output/ named <date>_<code> (e.g., 20250728_PDMIDR).
Each phase has its own subfolder containing the files produced by that step.
Typical structure:
output/20250728_PDMIDR/
├── 01_AdapterRemoval/        # Adapter-trimmed/cleaned reads
├── 03_bowtie2_output/        # Microbial reads + mapping summary
├── 04_metaphlan_output/      # Taxonomic abundance tables
├── 04b_humann_output/        # Functional profiling (HUMAnN)
├── 05_spades_output/         # Assembled contigs (.fasta)
├── 06_contig_filter/         # Contigs filtered by quality/length
├── 07_Bowtie_Index/          # Bowtie2 indices for binning
├── 08_mapping_coverage/      # Contig coverage
├── 10_metabat_depth/         # Depth for binning
├── 11_metabat_MAG/           # MAGs (FASTA)
├── 12_checkm/                # Quality reports (CheckM)
├── 13_checkm2/               # Quality reports (CheckM2)
├── 14_MAGs_high_quality/     # Selected MAGs + metadata
├── 15_tormes_MAGs/           # Final annotations (TORMES)
└── config.yml                # Parameters used for the run
Tip: ensure each folder contains up-to-date files; empty folders may indicate errors in previous steps.
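As a quick way to act on this tip, the snippet below lists step folders that are still empty for a given run (the run name is just an example):

# List empty step folders in a run; empty folders may point to a failed step
run=output/20250728_PDMIDR
find "$run" -mindepth 1 -maxdepth 1 -type d -empty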
To free space or rerun specific steps without deleting everything, use:
bash scripts/clean_output_folders.sh
What it does:
asks which output/<date>_<code> folder to clean and, for each defined subfolder, deletes its contents while preserving any files/subfolders listed.
- TO_DELETE: list of subfolders to process (relative to output/<date>_<code>/). Comment out those you don't want to touch.
- PRESERVE: map (associative array) of exceptions to keep for each folder. Comma-separated values; wildcards and subfolders are supported. Preconfigured examples:
  PRESERVE["04_metaphlan_output"]="diversity,merged_abundance_table.txt,*.txt"
  PRESERVE["05_spades_output"]="contigs/filtered"
  If the value is empty (""), nothing is preserved and the entire folder is removed.
- Run the script:
  bash scripts/clean_output_folders.sh
- Choose the output folder (e.g., 20250728_PDMIDR) and confirm.
  The script will delete contents of the folders listed in TO_DELETE, keeping whatever is defined in PRESERVE.
Warning: deletions are permanent. Back up results you need to keep.
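One simple way to keep a copy before cleaning is to archive the whole run first (the run name below is just an example):

# Archive a run before cleaning it
tar -czf 20250728_PDMIDR_backup.tar.gz -C output 20250728_PDMIDR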
- Docker won't start / permissions
  Add your user to the docker group and restart the session:
  sudo usermod -aG docker $USER
- yq not found
  It's already installed inside the container. If you run run_pipeline.sh outside the container, you may not have it in your PATH.
- Insufficient resources (RAM/CPU)
  Reduce dataset size or increase resources. Some steps (SPAdes, MetaBAT) are particularly heavy.
- Missing Bowtie2 index
  Run first:
  bash scripts/build_bowtie2_index.sh
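A quick way to confirm the index files are actually in place (replace <Name_host_genome> as in the index-building section):

ls data/gene/<Name_host_genome>/*.bt2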
AdapterRemoval v2: Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9(1):88. http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2
MetaPhlAn: Aitor Blanco-Míguez, Francesco Beghini, Fabio Cumbo, Lauren J. McIver, Kelsey N. Thompson, Moreno Zolfo, Paolo Manghi, Leonard Dubois, Kun D. Huang, Andrew Maltez Thomas, Gianmarco Piccinno, Elisa Piperni, Michal Punčochář, Mireia Valles-Colomer, Adrian Tett, Francesca Giordano, Richard Davies, Jonathan Wolf, Sarah E. Berry, Tim D. Spector, Eric A. Franzosa, Edoardo Pasolli, Francesco Asnicar, Curtis Huttenhower, Nicola Segata. Nature Biotechnology (2023). https://doi.org/10.1038/s41587-023-01688-w
HUMAnN: Francesco Beghini, Lauren J. McIver, Aitor Blanco-Míguez, Leonard Dubois, Francesco Asnicar, Sagun Maharjan, Ana Mailyan, Andrew Maltez Thomas, Paolo Manghi, Mireia Valles-Colomer, George Weingart, Yancong Zhang, Moreno Zolfo, Curtis Huttenhower, Eric A. Franzosa, Nicola Segata. eLife (2021). https://doi.org/10.7554/eLife.65088
SPAdes: Andrey D. Prjibelski, Dmitry Antipov, Dmitry Meleshko, Alla Lapidus, Anton Korobeynikov (2020). Using SPAdes De Novo Assembler. Current Protocols in Bioinformatics, 70(1), e102. https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102
SAMtools: Petr Danecek, James K. Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O. Pollard, Andrew Whitwham, Thomas Keane, Shane A. McCarthy, Robert M. Davies, Heng Li (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
TORMES: Narciso M. Quijada, David Rodríguez-Lázaro, Jose María Eiros and Marta Hernández (2019). TORMES: an automated pipeline for whole bacterial genome analysis. Bioinformatics, 35(21), 4207–4212. https://doi.org/10.1093/bioinformatics/btz220
Author(s): Dorin / Davide
