nf-core · HaidYi · Jul 1, 2025
@@ -102,6 +102,10 @@
 
   > Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., Kwong, M., Liang, Z. C., Niu, K. C., Shan, P., Yang, J. Y. J., Gray, K. L., Hoad, G. R., Jia, B., Bhando, T., Carfrae, L. A., Farha, M. A., French, S., Gordzevich, R., Rachwalski, K., Tu, M. M., Bordeleau, E., Dooley, D., Griffiths, E., Zubyk, H. L., Brown, E. D., Maguire, F., Beiko, R. G., Hsiao, W. W. L., Brinkman F. S. L., Van Domselaar, G., McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic acids research, 51(D1):D690-D699. [DOI: 10.1093/nar/gkac920](https://doi.org/10.1093/nar/gkac920)
 
+- [dbCAN](https://doi.org/10.1093/nar/gkad328)
+
+  > Zheng, J., Ge, Q., Yan, Y., Zhang, X., Huang, L., & Yin, Y. (2023). dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Research, 51(W1), W115–W121. [DOI: 10.1093/nar/gkad328](https://doi.org/10.1093/nar/gkad328)
+
 - [SeqKit](https://bioinf.shenwei.me/seqkit/)
 
   > Shen, W., Sipos, B., & Zhao, L. (2024). SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta, e191. [https://doi.org/10.1002/imt2.191](https://doi.org/10.1002/imt2.191)

@@ -40,8 +40,9 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s
 5. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
 6. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg). [`argNorm`](https://github.com/BigDataBiology/argNorm) is used to map the outputs of `DeepARG`, `AMRFinderPlus`, and `ABRicate` to the [`Antibiotic Resistance Ontology`](https://www.ebi.ac.uk/ols4/ontologies/aro) for consistent ARG classification terms.
 7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
-8. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
-9. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)
+8. Screening contigs for carbohydrate-active enzymes (CAZymes), CAZyme gene clusters, and substrates with [`run_dbcan`](https://github.com/bcb-unl/run_dbcan)
+9. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
+10. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)
 
 ![funcscan metro workflow](docs/images/funcscan_metro_workflow.png)
 

@@ -1,4 +1,4 @@
-sample,fasta,protein,gbk
+sample,fasta,protein,gbk,gff
 sample_1,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_1.fasta.gz,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_prokka_1.faa,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_prokka_1.gbk
 sample_2,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_2.fasta.gz,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_prokka_2.faa.gz,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs_prokka_2.gbk.gz
 sample_3,https://raw.githubusercontent.com/nf-core/test-datasets/funcscan/wastewater_metagenome_contigs.fasta
@@ -33,12 +33,26 @@
                 "exists": true,
                 "pattern": "^\\S+\\.(gbk|gbff)(\\.gz)?$",
                 "errorMessage": "Input file for feature annotations has incorrect file format. File must end in `.gbk`, `.gbk.gz` or `.gbff`, or `.gbff.gz`"
+            },
+            "gff": {
+                "type": "string",
+                "format": "file-path",
+                "exists": true,
+                "pattern": "^\\S+\\.(gff|gff3)(\\.gz)?$",
+                "errorMessage": "Input file for feature annotations has incorrect file format. File must end in `.gff`, `.gff.gz`, `.gff3`, or `.gff3.gz`"
+            },
+            "gff_type": {
+                "type": "string",
+                "enum": ["NCBI_prok", "prodigal", "NCBI_euk", "JGI"],
+                "errorMessage": "GFF type must be one of: NCBI_prok, prodigal, NCBI_euk, or JGI",
+                "meta": ["gff_type"]
             }
         },
         "required": ["sample", "fasta"],
         "dependentRequired": {
             "protein": ["gbk"],
-            "gbk": ["protein"]
+            "gbk": ["protein"],
+            "gff": ["protein"]
         }
     },
     "uniqueItems": true

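To make the new samplesheet rules easier to review, here is a minimal Python sketch of what the `gff`, `gff_type`, and `dependentRequired` entries above express. The regex, enum values, and dependency map are copied from the hunk; the function name `check_row`, the dict-based row format, and the assumption that empty cells count as absent are illustrative only and are not how the pipeline's validator is implemented.

```python
import re

# Copied from the schema hunk above.
GFF_PATTERN = re.compile(r"^\S+\.(gff|gff3)(\.gz)?$")
GFF_TYPES = {"NCBI_prok", "prodigal", "NCBI_euk", "JGI"}
DEPENDENT_REQUIRED = {"protein": ["gbk"], "gbk": ["protein"], "gff": ["protein"]}


def check_row(row):
    """Return a list of problems for one samplesheet row (hypothetical helper).

    `row` maps column names to values; empty or missing values are treated
    as absent, which is an assumption of this sketch.
    """
    problems = []
    present = {column for column, value in row.items() if value}

    # "required": ["sample", "fasta"]
    for column in ("sample", "fasta"):
        if column not in present:
            problems.append(f"missing required column: {column}")

    # "dependentRequired": a present column pulls in the listed columns
    for column, needs in DEPENDENT_REQUIRED.items():
        if column in present:
            for needed in needs:
                if needed not in present:
                    problems.append(f"{column} requires {needed}")

    gff = row.get("gff")
    if gff and not GFF_PATTERN.match(gff):
        problems.append("gff must end in .gff, .gff.gz, .gff3, or .gff3.gz")

    gff_type = row.get("gff_type")
    if gff_type and gff_type not in GFF_TYPES:
        problems.append("gff_type must be one of: NCBI_prok, prodigal, NCBI_euk, or JGI")

    return problems
```

Read literally, the `dependentRequired` block means a row that supplies `gff` must also supply `protein`, and `protein` in turn requires `gbk`, so the sketch reports both problems for a row that has only `fasta` and `gff` plus `protein`.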
@@ -740,4 +740,39 @@ process {
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
         ]
     }
+
+    withName: RUNDBCAN_DATABASE {
+        publishDir = [
+            path: { "${params.outdir}/databases/dbcan/" },
+            mode: params.publish_dir_mode,
+            enabled: params.save_db,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+
+    withName: RUNDBCAN_CAZYMEANNOTATION {
+        publishDir = [
+            path: { "${params.outdir}/cazyme/dbcan/cazyme_annotation/${meta.id}" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+
+    withName: RUNDBCAN_EASYCGC {
+        publishDir = [
+            path: { "${params.outdir}/cazyme/dbcan/cgc/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: "*_{cgc.gff,cgc_standard_out.tsv,diamond.out.tc,TF_hmm_results.tsv,STP_hmm_results.tsv}",
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+
+    withName: RUNDBCAN_EASYSUBSTRATE {
+        publishDir = [
+            path: { "${params.outdir}/cazyme/dbcan/substrate/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: "*_{total_cgc_info.tsv,substrate_prediction.tsv,synteny_pdf}",
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
 }
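The `pattern` entries for `RUNDBCAN_EASYCGC` and `RUNDBCAN_EASYSUBSTRATE` are brace globs, so it may help reviewers to see which file names they select. The Python sketch below approximates the matching by expanding one non-nested brace group and applying `fnmatch`; Nextflow's own glob matching may differ in edge cases, and the helper names and sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the publishDir blocks above.
CGC_PATTERN = "*_{cgc.gff,cgc_standard_out.tsv,diamond.out.tc,TF_hmm_results.tsv,STP_hmm_results.tsv}"
SUBSTRATE_PATTERN = "*_{total_cgc_info.tsv,substrate_prediction.tsv,synteny_pdf}"


def expand_braces(pattern):
    """Expand a single, non-nested {a,b,c} group into plain globs."""
    start = pattern.find("{")
    end = pattern.find("}", start)
    if start == -1 or end == -1:
        return [pattern]
    head, tail = pattern[:start], pattern[end + 1:]
    return [head + option + tail for option in pattern[start + 1:end].split(",")]


def is_published(filename, pattern):
    """True if the file name matches any alternative of the brace glob."""
    return any(fnmatchcase(filename, glob) for glob in expand_braces(pattern))


if __name__ == "__main__":
    # Hypothetical file names for a sample called "sample_1".
    for name in ("sample_1_cgc.gff", "sample_1_diamond.out", "versions.yml"):
        print(name, is_published(name, CGC_PATTERN))
```

Under this approximation, `versions.yml` matches neither pattern, so for these two processes the `saveAs` guard appears to be a second safeguard rather than the only filter.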
@@ -33,4 +33,6 @@ params {
     run_amp_screening          = true
     amp_run_hmmsearch          = true
     amp_hmmsearch_models       = params.pipelines_testdata_base_path + 'funcscan/hmms/mybacteriocin.hmm'
+
+    run_cazyme_screening       = true
 }
@@ -0,0 +1,34 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Nextflow config file for running minimal tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Defines input files and everything required to run a fast and simple pipeline test.
+
+    Use as follows:
+        nextflow run nf-core/funcscan -profile test_dbcan_pyrodigal,<docker/singularity> --outdir <OUTDIR>
+
+----------------------------------------------------------------------------------------
+*/
+
+process {
+    resourceLimits = [
+        cpus: 4,
+        memory: '15.GB',
+        time: '1.h'
+    ]
+}
+
+params {
+    config_profile_name        = 'CAZyme Pyrodigal test profile'
+    config_profile_description = 'Minimal test dataset to check CAZyme workflow function'
+
+    // Input data
+    input                      = params.pipelines_testdata_base_path + 'funcscan/samplesheet_reduced.csv'
+
+    annotation_tool            = 'pyrodigal'
+
+    run_arg_screening          = false
+    run_amp_screening          = false
+    run_bgc_screening          = false
+    run_cazyme_screening       = true
+}
@@ -0,0 +1,37 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Nextflow config file for running minimal tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Defines input files and everything required to run a fast and simple pipeline test.
+
+    Use as follows:
+        nextflow run nf-core/funcscan -profile test_preannotated_dbcan,<docker/singularity> --outdir <OUTDIR>
+
+----------------------------------------------------------------------------------------
+*/
+
+process {
+    resourceLimits = [
+        cpus: 4,
+        memory: '15.GB',
+        time: '1.h'
+    ]
+}
+
+params {
+    config_profile_name        = 'CAZyme test profile - preannotated input'
+    config_profile_description = 'Minimal test dataset to check CAZyme workflow function'
+
+    // Input data
+    input                      = params.pipelines_testdata_base_path + 'funcscan/samplesheet_preannotated.csv'
+
+    annotation_tool            = 'pyrodigal'
+
+    run_arg_screening          = false
+    run_amp_screening          = false
+    run_bgc_screening          = false
+    run_cazyme_screening       = true
+
+    dbcan_skip_cgc             = true   // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
+    dbcan_skip_substrate       = true   // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet
+}
@@ -7,10 +7,11 @@ The output of nf-core/funcscan provides reports for each of the functional group
 - **antibiotic resistance genes** (tools: [ABRicate](https://github.com/tseemann/abricate), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [fARGene](https://github.com/fannyhb/fargene), [RGI](https://card.mcmaster.ca/analyze/rgi) – summarised by [hAMRonization](https://github.com/pha4ge/hAMRonization). Results from ABRicate, AMRFinderPlus, and DeepARG are normalised to [ARO](https://obofoundry.org/ontology/aro.html) by [argNorm](https://github.com/BigDataBiology/argNorm).)
 - **antimicrobial peptides** (tools: [Macrel](https://github.com/BigDataBiology/macrel), [AMPlify](https://github.com/bcgsc/AMPlify), [ampir](https://ampir.marine-omics.net), [hmmsearch](http://hmmer.org) – summarised by [AMPcombi](https://github.com/Darcy220606/AMPcombi))
 - **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc))
+- **carbohydrate-active enzymes (CAZymes)**, CAZyme gene clusters and substrates (tools: [run_dbcan](https://github.com/bcb-unl/run_dbcan))
 
 As a general workflow, we recommend to first look at the summary reports ([ARGs](#hamronization), [AMPs](#ampcombi), [BGCs](#combgc)), to get a general overview of what hits have been found across all the tools of each functional group. After which, you can explore the specific output directories of each tool to get more detailed information about each result. The tool-specific output directories also includes the output from the functional annotation steps of either [prokka](https://github.com/tseemann/prokka), [pyrodigal](https://github.com/althonos/pyrodigal), [prodigal](https://github.com/hyattpd/Prodigal), or [Bakta](https://github.com/oschwengers/bakta) if the `--save_annotations` flag was set. Additionally, taxonomic classifications from [MMseqs2](https://github.com/soedinglab/MMseqs2) are saved if the `--taxa_classification_mmseqs_db_savetmp` and `--taxa_classification_mmseqs_taxonomy_savetmp` flags are set.
 
-Similarly, all downloaded databases are saved (i.e. from [MMseqs2](https://github.com/soedinglab/MMseqs2), [antiSMASH](https://docs.antismash.secondarymetabolites.org), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [Bakta](https://github.com/oschwengers/bakta), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [RGI](https://github.com/arpcard/rgi), and/or [AMPcombi](https://github.com/Darcy220606/AMPcombi)) into the output directory `<outdir>/databases/` if the `--save_db` flag was set.
+Similarly, all downloaded databases are saved (i.e. from [MMseqs2](https://github.com/soedinglab/MMseqs2), [antiSMASH](https://docs.antismash.secondarymetabolites.org), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [Bakta](https://github.com/oschwengers/bakta), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [RGI](https://github.com/arpcard/rgi), [AMPcombi](https://github.com/Darcy220606/AMPcombi), and/or [run_dbcan](https://github.com/bcb-unl/run_dbcan)) into the output directory `<outdir>/databases/` if the `--save_db` flag was set.
 
 Furthermore, for reproducibility, versions of all software used in the run is presented in a [MultiQC](http://multiqc.info) report.
 
@@ -41,6 +42,8 @@ results/
 |   ├── deepbgc/
 |   ├── gecco/
 |   └── hmmsearch/
+├── cazyme/
+|   └── dbcan/
 ├── databases/
 ├── multiqc/
 ├── pipeline_info/
@@ -63,11 +66,11 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes p
 
 Input contig QC with:
 
-- [SeqKit](https://bioinf.shenwei.me/seqkit/) (default) - for separating into long- and short- categories
+- [SeqKit](https://bioinf.shenwei.me/seqkit/) (default) – for separating into long- and short- categories
 
 Taxonomy classification of nucleotide sequences with:
 
-- [MMseqs2](https://github.com/soedinglab/MMseqs2) (default) - for contig taxonomic classification using 2bLCA.
+- [MMseqs2](https://github.com/soedinglab/MMseqs2) (default) – for contig taxonomic classification using 2bLCA.
 
 ORF prediction and annotation with any of:
 
@@ -98,18 +101,22 @@ Antimicrobial Peptides (AMPs):
 Biosynthetic Gene Clusters (BGCs):
 
 - [antiSMASH](#antismash) – biosynthetic gene cluster detection.
-- [deepBGC](#deepbgc) - biosynthetic gene cluster detection, using a deep learning model.
+- [deepBGC](#deepbgc) – biosynthetic gene cluster detection, using a deep learning model.
 - [GECCO](#gecco) – biosynthetic gene cluster detection, using Conditional Random Fields (CRFs).
 - [hmmsearch](#hmmsearch) – biosynthetic gene cluster detection, based on hidden Markov models.
 
+Carbohydrate-Active Enzymes (CAZymes):
+
+- [run_dbcan](https://github.com/bcb-unl/run_dbcan) – carbohydrate-active enzyme (CAZyme), CAZyme gene cluster, and substrate detection.
+
 Output Summaries:
 
-- [AMPcombi](#ampcombi) – summary report of antimicrobial peptide gene output from various detection tools.
-- [hAMRonization](#hamronization) – summary of antimicrobial resistance gene output from various detection tools.
-- [argNorm](#argNorm) - Normalize ARG annotations from [ABRicate](#abricate), [AMRFinderPlus](#amrfinderplus), and [DeepARG](#deeparg) to the ARO
-- [comBGC](#combgc) – summary of biosynthetic gene cluster output from various detection tools.
-- [MultiQC](#multiqc) – report of all software and versions used in the pipeline.
-- [Pipeline information](#pipeline-information) – report metrics generated during the workflow execution.
+- [AMPcombi](#ampcombi) – summary report of antimicrobial peptide gene output from various detection tools
+- [hAMRonization](#hamronization) – summary of antimicrobial resistance gene output from various detection tools
+- [argNorm](#argnorm) – normalises ARG annotations from [ABRicate](#abricate), [AMRFinderPlus](#amrfinderplus), and [DeepARG](#deeparg) to the ARO
+- [comBGC](#combgc) – summary of biosynthetic gene cluster output from various detection tools
+- [MultiQC](#multiqc) – report of all software and versions used in the pipeline
+- [Pipeline information](#pipeline-information) – report metrics generated during the workflow execution
 
 ## Tool details
 
@@ -466,6 +473,35 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation
 
 [GECCO](https://gecco.embl.de) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).
 
+### CAZyme annotation tools
+
+#### run_dbcan
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `cazyme/`
+  - `dbcan/`
+    - `cazyme_annotation/`
+      - `<sample.id>_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
+      - `<sample.id>_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation
+      - `<sample.id>_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation
+      - `<sample.id>_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation
+    - `cgc/`
+      - `<sample.id>_cgc.gff`: GFF file containing the CAZyme gene clusters (CGC) identified by dbCAN. This file is generated from the dbCAN annotation and contains the locations of CAZyme gene clusters in the genome
+      - `<sample.id>_cgc_standard_out.tsv`: Standard output file from dbCAN for CAZyme gene clusters (CGC) in a tabular format. This file summarizes the CAZyme gene clusters identified in the genome
+      - `<sample.id>_diamond.out.tc`: TSV file containing the diamond output for transporter annotation
+      - `<sample.id>_TF_hmm_results.tsv`: TSV file containing the results of transcription factor screening
+      - `<sample.id>_STP_hmm_results.tsv`: TSV file containing the results of signal transduction protein (STP) annotation
+    - `substrate/`
+      - `<sample.id>_total_cgc_info.tsv`: TSV file summarizing the total additional genes in the genome
+      - `<sample.id>_substrate_prediction.tsv`: TSV file containing the substrate predictions based on the CGC annotations from dbCAN
+      - `<sample.id>_synteny_pdf/`: Directory containing one or more PDF files showing the syntenic regions of the CGCs identified by dbCAN
+
+</details>
+
+[run_dbcan](https://github.com/bcb-unl/run_dbcan) is an automated tool for carbohydrate-active enzyme (CAZyme), CAZyme gene cluster and substrate annotation.
+
 ### Summary tools
 
 [AMPcombi](#ampcombi), [hAMRonization](#hamronization), [comBGC](#combgc), [MultiQC](#multiqc), [pipeline information](#pipeline-information), [argNorm](#argnorm).
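For anyone scripting against the new run_dbcan outputs, the sketch below builds the expected result paths for one sample. The file suffixes are transcribed from the output listing in this hunk. The per-sample sub-directory follows the `publishDir` paths in the config hunk (`.../cazyme_annotation/${meta.id}` and so on), which the listing itself does not show, so treat that level as an assumption. The function name and the `results` and `sample_1` values are hypothetical.

```python
from pathlib import PurePosixPath

# Sub-directory -> file-name suffixes, transcribed from the run_dbcan output listing.
DBCAN_OUTPUTS = {
    "cazyme_annotation": [
        "overview.tsv",
        "dbCAN_hmm_results.tsv",
        "dbCANsub_hmm_results.tsv",
        "diamond.out",
    ],
    "cgc": [
        "cgc.gff",
        "cgc_standard_out.tsv",
        "diamond.out.tc",
        "TF_hmm_results.tsv",
        "STP_hmm_results.tsv",
    ],
    "substrate": [
        "total_cgc_info.tsv",
        "substrate_prediction.tsv",
        "synteny_pdf",  # a directory of PDFs, not a single file
    ],
}


def expected_dbcan_outputs(outdir, sample_id):
    """List the paths run_dbcan results are expected at for one sample.

    Assumes the layout <outdir>/cazyme/dbcan/<step>/<sample_id>/<sample_id>_<suffix>.
    """
    base = PurePosixPath(outdir) / "cazyme" / "dbcan"
    return [
        str(base / step / sample_id / f"{sample_id}_{suffix}")
        for step, suffixes in DBCAN_OUTPUTS.items()
        for suffix in suffixes
    ]


if __name__ == "__main__":
    for path in expected_dbcan_outputs("results", "sample_1"):
        print(path)
```

When `dbcan_skip_cgc` or `dbcan_skip_substrate` is set, as in the preannotated test profile, the `cgc` and `substrate` entries would presumably not be produced, so a caller should check for existence rather than assume every path is present.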