From 50eb4c912c570b1bf97cdc02f129c72c77a1fab6 Mon Sep 17 00:00:00 2001
From: giorgiagandolfi <giorgia.gandolfi4@studio.unibo.it>
Date: Thu, 23 Jan 2025 11:19:05 +0100
Subject: [PATCH] finalize output.md

---
 docs/output.md | 427 +++++++++++++++++++++++++------------------------
 1 file changed, 222 insertions(+), 205 deletions(-)
diff --git a/docs/output.md b/docs/output.md
index 92ff31f..9b40db3 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -2,37 +2,72 @@
 
 ## Introduction
 
-The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
-
 This document describes the output produced by the pipeline. All plots generated in each step are summarised into the final report.
+The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
 
 ## Pipeline overview
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [Variant Annotation](#variant-annotation) - annotation of variants and cohort summary visualization
-- [Formatter](#formatter) - coversion of files to different formats (unpubblished)
-- [Lifter](#lifter) - pileup of private mutations of the other samples in multi-sample setting
-- [Driver Annotation](#driver-annotation) - _add description_ (unpubblished)
-- [QC](#qc) - quality control of copy-number and somatic mutation calling and creation of multi-CNAqc object
-- [Subclonal Deconvolution](#subclonal-deconvolution) -
-- [Signature Deconvolution](#signature-deconvolution) -
-
-Intermediate steps of the pipeline will output unpublished results which will be available for the user in the working directory of the pipeline. -->
-
-The pipeline is built using [Nextflow](https://www.nextflow.io/) and consists in five main subworkflows:
-
 - [Variant Annotation](#variant-annotation)
-- [Driver Annotation](#driver-annotation)
+- [Formatter](#formatter)
+- [Lifter](#lifter)
+- [Catalogue Driver Annotation](#driver-annotation)
 - [QC](#qc)
-- [Signature Deconvolution](#signature-deconvolution)
 - [Subclonal Deconvolution](#subclonal-deconvolution)
-<!-- * [Genome Interpreter](#genome-interpreter) -->
+- [Signature Deconvolution](#signature-deconvolution)
 
-Intermediate steps connetting the main subworkflows will output [unpublished results](#unpublished-results) which will be available in the working directory of the pipeline. These steps consist in:
-
-- [Formatter](#formatter)
-- [Lifter](#lifter)
+## Directory Structure
+
+The default directory structure is as follows:
+
+```
+{outdir}
+├── variant_annotation
+│   └── vep
+│       └── <sample>
+├── driver_annotation
+│   └── annotate_driver
+│       └── <sample>
+├── pipeline_info
+├── formatter
+│   ├── cna2cnaqc
+│   │   └── <sample>
+│   ├── cnaqc2tsv
+│   │   └── <patient>
+│   └── vcf2cnaqc
+│       └── <sample>
+├── lifter
+│   ├── mpileup
+│   │   └── <patient>
+│   └── positions
+│       └── <sample>
+├── QC
+│   ├── tinc
+│   │   └── <sample>
+│   ├── CNAqc
+│   │   └── <sample>
+│   └── join_CNAqc
+│       └── <patient>
+├── signature_deconvolution
+│   ├── SigProfiler
+│   │   └── <dataset>
+│   └── SparseSignatures
+│       └── <dataset>
+└── subclonal_deconvolution
+    ├── mobster
+    │   └── <sample>
+    ├── viber
+    │   └── <patient>
+    ├── ctree
+    │   └── <patient>,<sample>
+    └── pyclonevi
+        └── <patient>
+work/
+.nextflow.log
+```
+
+<!--Intermediate steps connecting the main subworkflows will output [unpublished results](#unpublished-results) which will be available in the working directory of the pipeline. These steps consist of-->
 
 ## Variant Annotation
 
@@ -44,13 +79,13 @@ This directory contains results from the variant annotation subworkflow. At the
 This step starts from VCF files.
 
 <details markdown="1">
-<summary>Output files for all samples </summary>
-<strong>Output directory: <code>{publish_dir}/variant_annotation/VEP/dataset/patient/sample/</code></strong>
+<summary>Output files for all samples</summary>
+
+**Output directory: `{outdir}/variant_annotation/vep/<dataset>/<patient>/<sample>/`**
+
+- `<dataset>_<patient>_<sample>.vcf.gz` and `<dataset>_<patient>_<sample>.vcf.gz.tbi`
+  - VCF file and tabix index with called mutations
 
-<ul>
-<li> <code><dataset>_<patient>_<sample>.vcf.gz</code> and <code><dataset>_<patient>_<sample>.vcf.gz.tbi</code>
-<li> annotated VCF file with tabix index
-</ul>
 </details>
 
 <!-- ### vcf2maf
@@ -63,7 +98,7 @@ This step starts from VCF files.
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/VariantAnnotation/VCF2MAF/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/VariantAnnotation/VCF2MAF/<dataset>/<patient>/<sample>/`**
 * `data_vep.maf`
     * annotated MAF file
 
@@ -83,7 +118,7 @@ MAF fields requirements:
 <details markdown="1">
 <summary>Output files for the dataset</summary>
 
-**Output directory: `{publish_dir}/VariantAnnotation/MAFTOOLS/<dataset>/`**
+**Output directory: `{outdir}/VariantAnnotation/MAFTOOLS/<dataset>/`**
 * `maf_merged.rds`
     * summarized MAF object
 * `maf_summary.pdf`
@@ -93,36 +128,110 @@ MAF fields requirements:
 
 </details> -->
 
-<!-- ## Lifter
+## Formatter
+
+The Formatter subworkflow converts files to other formats and standardizes the output of different mutation callers (Mutect2, Strelka) and copy number callers (ASCAT, Sequenza).
+
+### vcf2cnaqc
+
+This parser is designed to process VCF files generated by various variant calling tools and to convert them into a unified RDS file format.
+
+<details markdown="1">
+<summary>Output files for all samples</summary>
+
+**Output directory: `{outdir}/formatter/vcf2cnaqc/<dataset>/<patient>/<sample>/`**
+
+- `<dataset>_<patient>_<sample>_snv.rds`
+  - RDS file containing parsed VCF in table format
+
+</details>
+
+### cna2cnaqc
+
+This parser is designed to standardize copy number calls and purity estimates from various callers into a unified format.
+
+<details markdown="1">
+<summary>Output files for all samples</summary>
+
+**Output directory: `{outdir}/formatter/cna2cnaqc/<dataset>/<patient>/<sample>/`**
+
+- `<dataset>_<patient>_<sample>_cna.rds`
+  - RDS file containing parsed segments and purity estimate output in table format
+
+</details>
+
+### cnaqc2tsv
+
+This parser is designed to convert the mutation data of the joint CNAqc analysis from CNAqc format (RDS file) into a tabular format (TSV file). This step is required by Python-based tools (e.g. PyClone-VI, SigProfiler), so it always runs when `--tools` contains either `pyclone-vi` or `sigprofiler`.
+
+<details markdown="1">
+<summary>Output files for all patients</summary>
+
+**Output directory: `{outdir}/formatter/cnaqc2tsv/<dataset>/<patient>/`**
+
+- `<dataset>_<patient>_joint_table.tsv`
+  - TSV file containing mutations mapped to the corresponding copy number segments.
+
+</details>
 
-The Lifter subworkflow is an optional step and it is run when `--mode multisample` is used. When multiple samples from the same patient are provided, the user can specify either a single joint VCF file, containing variant calls from all tumor samples of the patient, or individual sample specific VCF files. In the latter case, path to tumor BAM files must be provided in order to collect all mutations from the samples and perform pile-up of sample's private mutations in all the other samples. Two intermediate steps, [get_positions](#get_positions) and [mpileup](#mpileup), are performed to identify private mutations in all the samples and retrieve their variant allele frequency. Once private mutations are properly defined, they are merged back into the original VCF file during the [join_positions](#join_positions) step. The updated VCF file is then converted into a `vcfR` RDS object.
+## Lifter
 
-The output files of [get_positions](#get_positions) and [mpileup](#mpileup) are intermediate and by default not kept in the result directory.
+The Lifter subworkflow is an optional step that runs when single-sample VCF files are provided. When multiple samples from the same patient are provided, the user can specify either a single joint VCF file, containing variant calls from all tumor samples of the patient (see [joint variant calling](https://nf-co.re/sarek/3.4.2/parameters/#joint_mutect2)), or individual sample-specific VCF files. In the latter case, paths to the tumor BAM files must be provided in order to collect all mutations from the samples and perform a pile-up of each sample's private mutations in all the other samples. Two intermediate steps, [get_positions](#get_positions) and [mpileup](#mpileup), are performed to identify private mutations in all the samples and retrieve their variant allele frequency. Once private mutations are properly defined, they are merged back into the original VCF file during the [join_positions](#join_positions) step. The updated VCF file is then converted into an RDS object.
 
-### join_positions
-In this step, all retrieved mutations are joined with original mutations present in input VCF, which is in turn converted into an RDS object using [vcfR](https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html).
+### mpileup
+
+At this stage, [bcftools](https://samtools.github.io/bcftools/bcftools.html) mpileup is run to retrieve frequency information of private mutations across all samples.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/lifter/mpileup/<dataset>/<patient>/<sample>/`**
-- `pileup_VCF.rds`
+**Output directory: `{outdir}/lifter/mpileup/<dataset>/<patient>/<sample>/`**
+
+- `<dataset>_<patient>_<sample>.bcftools_stats.txt`
+  - TXT file with statistics on the called mutations
+- `<dataset>_<patient>_<sample>.vcf.gz` and `<dataset>_<patient>_<sample>.vcf.gz.tbi`
+  - VCF file and tabix index with called mutations
+
+</details>
+
+### positions
+
+This step retrieves private and shared mutations across samples originating from the same patient. The previously retrieved mutations are joined with the original mutations present in the input VCF, which is in turn converted into an RDS object using [vcfR](https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html).
+
+<details markdown="1">
+<summary>Output files for all samples</summary>
+
+**Output directory: `{outdir}/lifter/positions/<dataset>/<patient>/<sample>/`**
+
+- `<dataset>_<patient>_<sample>.pileup_VCF.rds`
   - RDS containing shared and private mutations
+- `<dataset>_<patient>_<sample>.positions_missing`
+  - TXT file containing mutations to be retrieved for a given sample
 
-</details> -->
+</details>
 
-## Driver Annotation
+<details markdown="1">
+<summary>Output files for all patients</summary>
+
+**Output directory: `{outdir}/lifter/positions/<dataset>/<patient>/`**
+
+- `<dataset>_<patient>__all_positions.rds`
+  - RDS containing shared and private mutations
+
+</details>
+
+## Catalogue Driver Annotation
 
 This directory contains results from the driver annotation subworkflow.
 
 ### Tumour-type driver annotation
 
-According to the specified tumour type, potential driver mutations are identified and annotated using [IntOGen database](<(https://www.intogen.org/search)>).
+According to the specified tumour type, potential driver mutations are identified and annotated using the [IntOGen database](https://www.intogen.org/search). The user can also provide a custom table of driver genes.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/driver_annotation/annotate_driver/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/driver_annotation/annotate_driver/<dataset>/<patient>/<sample>/`**
 
 - `<dataset>_<patient>_<sample>_driver.rds`
   - RDS with annotated mutations
@@ -136,7 +245,7 @@ Add description
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/driver_annotation/BuildReference/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/driver_annotation/BuildReference/<dataset>/<patient>/<sample>/`**
 * `fit.rds`
     * add description
 * `plot.rds`
@@ -151,7 +260,7 @@ Add description
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/driver_annotation/DNDSCV/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/driver_annotation/DNDSCV/<dataset>/<patient>/<sample>/`**
 * `dnds.rds`
     * DNDSCV object in RDS format
 
@@ -159,17 +268,17 @@ Add description
 
 ## QC
 
-The QC subworkflows requires in input a segmentation file from allele-specific copy number callers (either [Sequenza](https://sequenzatools.bitbucket.io/#/home), [ASCAT](https://github.com/VanLoo-lab/ascat)) and the joint VCF file from [join_positions](#join_positions) subworkflow. The QC sub-workflows first conduct quality control on CNV and somatic mutation data for individual samples in [CNAqc](#cnaqc) step, and subsequently summarize validated information at patient level in [join_CNAqc](#join_cnaqc) step.
+The QC subworkflow requires as input a segmentation file from an allele-specific copy number caller (either [Sequenza](https://sequenzatools.bitbucket.io/#/home) or [ASCAT](https://github.com/VanLoo-lab/ascat)) and the joint VCF file. As a first step, the QC subworkflow estimates the contamination of the normal and tumour samples in the [TINC](#tinc) step, as a measure of experimental quality. It then performs quality control on copy number and somatic mutation data for individual samples in the [CNAqc](#cnaqc) step, and subsequently summarizes the validated information at the patient level in the [join_CNAqc](#join_cnaqc) step.
 The QC subworkflow is a crucial step of the pipeline as it ensures high confidence in identifying clonal and subclonal events while accounting for variations in tumor purity.
 
 ### TINC
 
-[TINC](https://caravagnalab.github.io/TINC/index.html) is a package to calculate the contamination of tumor DNA in a matched normal sample. TINC provides estimates of the proportion of cancer cells, containing the normal sample, and the proportion of cancer cells in the tumor sample (tumor purity).
+[TINC](https://caravagnalab.github.io/TINC/index.html) is a package to calculate the contamination of tumor DNA in a matched normal sample. TINC provides estimates of the proportion of cancer cells contaminating the normal sample (TIN), and the proportion of cancer cells in the tumor sample (TIT).
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/QC/tinc/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/QC/tinc/<dataset>/<patient>/<sample>/`**
 
 - `<dataset>_<patient>_<sample>_fit.rds`
   - TINC fit containing TIN and TIT estimates in RDS;
@@ -182,12 +291,12 @@ The QC subworkflow is a crucial step of the pipeline as it ensures high confiden
 
 ### CNAqc
 
-[CNAqc](https://caravagnalab.github.io/CNAqc/) is a package to quality control (QC) bulk cancer sequencing data for validating copy number segmentations against variant allele frequencies of somatic mutations.
+[CNAqc](https://caravagnalab.github.io/CNAqc/) is a package that performs quality control of bulk cancer sequencing data for validating copy number segmentations against variant allele frequencies of somatic mutations.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/QC/CNAqc/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/QC/CNAqc/<dataset>/<patient>/<sample>/`**
 
 - `<dataset>_<patient>_<sample>_data_plot.rds` and `<dataset>_<patient>_<sample>_data.pdf`
   - CNAqc report with genome wide mutation and allele specific copy number plots in RDS and PDF
@@ -200,17 +309,17 @@ The QC subworkflow is a crucial step of the pipeline as it ensures high confiden
 
 ### join_CNAqc
 
-This module creates a multi-CNAqc object for patient by summarizing the quality check performed at the single sample level. For more information about the strucutre of multi-CNAqc object see [CNAqc documentation](<(https://caravagnalab.github.io/CNAqc/)>).
+This module creates a multi-CNAqc object for each patient by summarizing the quality checks performed at the single-sample level. For more information about the structure of the multi-CNAqc object, see the [CNAqc documentation](https://caravagnalab.github.io/CNAqc/).
 
 <details markdown="1">
 <summary>Output files for all patients</summary>
 
-**Output directory: `{publish_dir}/QC/join_CNAqc/<dataset>/<patient>/`**
+**Output directory: `{outdir}/QC/join_CNAqc/<dataset>/<patient>/`**
 
 - `<dataset>_<patient>_multi_cnaqc_ALL.rds`
-  - unfiltered multi-CNAqc RDS object
+  - unfiltered mCNAqc RDS object
 - `<dataset>_<patient>_multi_cnaqc_PASS.rds`
-  - filtered multi-CNAqc RDS object
+  - filtered mCNAqc RDS object
 
 </details>
 
@@ -218,7 +327,7 @@ This module creates a multi-CNAqc object for patient by summarizing the quality
 
 <!-- Mutational signatures represent characteristic patterns of somatic mutations in cancer genomes, reflecting the underlying mutational processes at the basis of tumor evolution and progression. Mutational signatures are discovered by analyzing ensemble point-mutation counts from a set of individual samples. Validated mutations from [join_CNAqc](#join_cnaqc) step are converted into a TSV joint table in (see [tsvparse](#tsvparse) module), subsequently given as input to signature deconvolution subworkflow, which performs de novo extraction, inference, deciphering or deconvolution of mutational counts.  -->
 
-Mutational signatures are distinctive patterns of somatic mutations in cancer genomes that reveal the underlying mutational processes driving tumor evolution and progression. These signatures are identified by analyzing aggregated point-mutation counts from multiple samples. Validated mutations from the [join_CNAqc](#join_cnaqc) step are converted into a joint TSV table (see [tsvparse](#tsvparse)) and then input into the signature deconvolution subworkflow, which performs de novo extraction, inference, interpretation, or deconvolution of mutational counts.
+Mutational signatures are distinctive patterns of somatic mutations in cancer genomes that reveal the underlying mutational processes driving tumor evolution and progression. These signatures are identified by analyzing aggregated point-mutation counts from multiple samples. Validated mutations from the [join_CNAqc](#join_cnaqc) step are converted into a joint TSV table (see [cnaqc2tsv](#cnaqc2tsv)) and then input into the signature deconvolution subworkflow, which performs de novo extraction, inference, interpretation, or deconvolution of mutational counts.
 
 The results of this step are collected in `{pubslish_dir}/signature_deconvolution/`. Two tools can be specified by using `--tools` parameter: [SparseSignatures](#sparsesignatures) and [SigProfiler](#sigprofiler).
 
@@ -229,32 +338,46 @@ The results of this step are collected in `{pubslish_dir}/signature_deconvolutio
 <details markdown="1">
 <summary>Output files for dataset</summary>
 
-**Output directory: `{publish_dir}/signatures_deconvolution/SparseSig/<dataset>/`**
+**Output directory: `{outdir}/signatures_deconvolution/SparseSig/<dataset>/`**
 
-- `best_params_config.rds`
+- `<dataset>_best_params_config.rds`
   - signatures best configiration object
-- `cv_means_mse.rds`
+- `<dataset>_cv_means_mse.rds`
   - cross validation output RDS
-- `nmf_Lasso_out.rds`
+- `<dataset>_nmf_Lasso_out.rds`
   - NMF Lasso output RDS
-- `plot_signatures.pdf`
-  - exposure PDF plot
-- `plot_signatures.rds`
-  - exposure RDS plot
+- `<dataset>_plot_signatures.pdf` and `<dataset>_plot_signatures.rds`
+  - exposure plot in PDF and RDS
 
 </details>
 
 ### SigProfiler
 
-[SigProfiler](https://osf.io/t6j7u/wiki/home/) is a python framework that allows _de novo_ extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of `SigProfilerMatrixGenerator` and `SigProfilerPlotting`, seamlessly integrating with other `SigProfiler` tools.
+[SigProfiler](https://osf.io/t6j7u/wiki/home/) is a Python framework that allows _de novo_ extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of `SigProfilerMatrixGenerator` and `SigProfilerExtractor`, seamlessly integrating with other `SigProfiler` tools.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/signatures_deconvolution/SigProfiler/<dataset>/`**
-
-- `<dataset>.` _missing_
-  - _missing_
+**Output directory: `{outdir}/signatures_deconvolution/SigProfiler/<dataset>/results`**
+
+- `input/`
+  - folder containing a copy of the user-provided input files for SigProfilerMatrixGenerator step
+- `input_data.txt`
+  - joint table of all mutations in the dataset in TXT format
+- `logs/`
+  - folder containing the error and log files for SigProfilerMatrixGenerator step
+- `output/`
+  - folder containing the DBS, SBS, INDEL nucleotide matrices resulting from SigProfilerMatrixGenerator step
+- `SBS96/`
+  - folder containing the results of SigProfilerExtractor step in the SBS96 mutational context. This directory will contain:
+    - `All_Solutions/`
+      - subdirectory containing the results from running extractions at each rank within the range of the input. For more details visit the [official website](https://osf.io/t6j7u/wiki/5.%20Output%20-%20All%20Solutions/)
+    - `Suggested_Solution/`
+      - subdirectory containing the optimal solution. For more details visit the [official website](https://osf.io/t6j7u/wiki/6.%20Output%20-%20Suggested%20Solution/)
+- `JOB_METADATA.txt`
+  - TXT file containing all the metadata about the system and runtime of the job
+- `Seeds.txt`
+  - TXT file containing the replicate IDs and preset seeds
 
 </details>
 
@@ -272,7 +395,7 @@ The subclonal deconvolution subworkflow requires in input a joint `mCNAqc` objec
 
 <!-- Various tools can be specified using the `--tools` parameter, leading to different methods for performing subclonal deconvolution analysis. Among the available tools, [MOBSTER](https://caravagnalab.github.io/mobster/) operates only on single samples but can still be specified for use in multi-sample mode, while [VIBER](https://caravagnalab.github.io/VIBER/index.html) and [PyClone-VI](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03919-2) can operates in both modalities. In case `--mode multisample` and `--tools mobster,pyclone-vi` are specified, first MOBSTER is run on individual samples to remove tail mutations and then PyClone-VI operates a multivariate subclonal deconvolution on the preprocessed MOBSTER mutations. A similar procedure is perfomed when  `--mode multisample` and `--tools mobster,viber` are defined. More detailed explanation is provided in the following sections.  -->
 
-The results of subclonal decovnultion step are collected in `{publish_dir}/subclonal_deconvolution/` directory.
+The results of the subclonal deconvolution step are collected in the `{outdir}/subclonal_deconvolution/` directory.
 
 <!-- ### Single sample
 
@@ -280,12 +403,12 @@ If `--mode singlesample` is provided, each sample is analysed individually provi
 
 ### MOBSTER
 
-[MOBSTER](https://caravagnalab.github.io/mobster/) processes mutant allelic frequencies to identify and remove neutral tails from the input data, so that subclonal reconstruction algorithms can be applied downstream to find subclones from the processed read counts.
+[MOBSTER](https://caravagnalab.github.io/mobster/) is a package that models mutant allelic frequencies and copy-number status by integrating evolutionary theory and Bayesian probabilistic modelling to identify clusters of variants with similar cellular proportions. Furthermore, MOBSTER models the dynamics of passenger mutations via a Pareto distribution, giving rise to the so-called neutral tail.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/mobster/<dataset>/<patient>/<sample>/`**
+**Output directory: `{outdir}/subclonal_deconvolution/mobster/<dataset>/<patient>/<sample>/`**
 
 - `<dataset>_<patient>_<sample>_mobsterh_st_fit.rds`
   - RDS object contains all fits of subclonal deconvolution
@@ -300,12 +423,12 @@ If `--mode singlesample` is provided, each sample is analysed individually provi
 
 ### PyClone-VI
 
-[PyClone-VI](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03919-2) is a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers, by considering allele fractions and coincident copy number variation using a variational inference approach.
+[PyClone-VI](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03919-2) is a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers, by considering allele fractions and coincident copy number variation using a variational inference approach. It works for patients with both single and multiple samples.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>`**
+**Output directory: `{outdir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>`**
 
 - `<dataset>_<patient>_pyclone_input.tsv`
   - TSV file with Pyclone-VI input table
@@ -314,18 +437,18 @@ If `--mode singlesample` is provided, each sample is analysed individually provi
 - `<dataset>_<patient>_best_fit.txt`
   - TSV file for the best fit
 - `<dataset>_<patient>_cluster_table.csv`
-  - CSV file wtih clone assignment
+  - CSV file with clone assignment
 
 </details>
 
 ### VIBER
 
-[VIBER](https://caravagnalab.github.io/VIBER/index.html) is an R package that implements a variational Bayesian model to fit multi-variate Binomial mixtures. In the context of subclonal deconvolution in singlesample modality, VIBER models read counts that are associated with the most represented karyotype.
+[VIBER](https://caravagnalab.github.io/VIBER/index.html) is an R package that implements a variational Bayesian model to fit multivariate Binomial mixtures. In the context of subclonal deconvolution, VIBER models read counts that are associated with the most represented karyotype. It works for patients with both single and multiple samples.
 
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/viber/<dataset>/<patient>`**
+**Output directory: `{outdir}/subclonal_deconvolution/viber/<dataset>/<patient>`**
 
 - `<dataset>_<patient>_viber_best_st_fit.rds`
   - RDS file for best standard fit
@@ -358,9 +481,9 @@ This folder contains the results of multivariate analysis using Pyclone-VI, whic
 <details markdown="1">
 <summary>Output files for all patients with MOBSTER</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>/`**
+**Output directory: `{outdir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>/`**
 
-<!-- - `<patient>_with_mobster_all_fits.h5`
+- `<patient>_with_mobster_all_fits.h5`
   - HDF5 file for all possible fit
 - `<patient>_with_mobster_best_fit.txt`
   - TSV file for the best fit
@@ -375,7 +498,7 @@ This folder contains the results of multivariate analysis using Pyclone-VI, whic
 <details markdown="1">
 <summary>Output files for all patients without MOBSTER</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>/`**
+**Output directory: `{outdir}/subclonal_deconvolution/pyclonevi/<dataset>/<patient>/`**
 
 - `without_mobster_all_fits.h5`
   - HDF5 file for all possible fit and summary stats
@@ -431,7 +554,7 @@ VIBER and MOBSTER fits are already compatible for ctree analysis.
 <details markdown="1">
 <summary>Output files for all samples</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/ctree/<dataset>/{<patient>,<patient>/<sample>/}`**
+**Output directory: `{outdir}/subclonal_deconvolution/ctree/<dataset>/{<patient>,<patient>/<sample>/}`**
 
 - `{<dataset>_<patient>,<dataset>_<patient>_<sample>}_ctree_<tool>.rds`
   - RDS file containing inferred clone tree
@@ -446,7 +569,7 @@ VIBER and MOBSTER fits are already compatible for ctree analysis.
 <details markdown="1">
 <summary>Output files for all patients</summary>
 
-**Output directory: `{publish_dir}/subclonal_deconvolution/ctree/<patient>/`**
+**Output directory: `{outdir}/subclonal_deconvolution/ctree/<patient>/`**
 
 - `ctree_<tool>.rds`
   - RDS file containing inferred clone tree
@@ -455,148 +578,42 @@ VIBER and MOBSTER fits are already compatible for ctree analysis.
 - `ctree_input_pyclonevi.csv`
   - CSV file required for clone tree inference from pyclone
 
-</details>
-<!--
-## Genome Interpreter
-
-Add description
-
-<details markdown="1">
-<summary>Output files for all samples</summary>
-
-**Output directory: `{publish_dir}/subclonal_deconvolution/ctree/<dataset>/<patient>/<sample>/`**
-
-- `name_of_the_file`
-  - add description on this part
-
-<!-- - `ctree_input_pyclonevi.csv`
-  - CSV file required for clone tree inference from pyclone -->
-
-</details>
-
-## Unpublished results
-
-### Formatter
-
-The Formatter subworkflow is used to convert file to other formats and to standardize the output files resulting from different mutation (Mutect2, Strelka) and cna callers (ASCAT,Sequenza). Output files from this step are not published.
-
-#### cna2CNAqc
-
-This parser aims at standardize into a unique format copy number calls and purity estimate from different callers.
-
-<details markdown="1">
-<summary>Output files for all samples</summary>
-
-**Output directory: `{work_dir}/formatter/cna2cnaqc/<dataset>/<patient>/<sample>/`**
-
-- `<dataset>_<patient>_<sample>_cna.rds`
-  - RDS file containing parsed cna output in table format
-
-</details>
-
-#### vcf2cnaqc
-
-This parser aims at standardize into a unique format single nucleotide variants from different callers.
-
-<details markdown="1">
-<summary>Output files for all samples</summary>
-
-**Output directory: `{work_dir}/formatter/vcf2cnaqc/<dataset>/<patient>/<sample>/`**
-
-- `<dataset>_<patient>_<sample>_snv.rds`
-  - RDS file containing parsed vcf in table format
-
-</details>
-
-#### cnaqc2tsv
-
-This parser aims at converting mutations data of joint CNAqc analysis from CNAqc format (RDS file) into a tabular format (TSV file). This step is mandatory for running python-based tools (e.g. PyClone-VI, SigProfiler).
-
-<details markdown="1">
-<summary>Output files for all patients</summary>
-
-**Output directory: `{work_dir}/formatter/cnaqc2tsv/<dataset>/<patient>/`**
-
-- `<dataset>_<patient>_joint_table.tsv`
-  - TSV file containing cna and variants joint
-    .
-
-</details>
-
-### Lifter
-
-<!-- The Lifter subworkflow is optional in multi-sample mode, when for a patient more samples are provided. The sub-workflow collect all mutations from the samples and perform pile-up of sample's private mutations in all the other samples. -->
-
-The Lifter subworkflow is an optional step and it is run when single sample VCF file are provided. When multiple samples from the same patient are provided, the user can specify either a single joint VCF file, containing variant calls from all tumor samples of the patient (see [joint variant calling](<(https://nf-co.re/sarek/3.4.2/parameters/#joint_mutect2)>)), or individual sample specific VCF files. In the latter case, path to tumor BAM files must be provided in order to collect all mutations from the samples and perform pile-up of sample's private mutations in all the other samples. Two intermediate steps, [get_positions](#get_positions) and [mpileup](#mpileup), are performed to identify private mutations in all the samples and retrieve their variant allele frequency. Once private mutations are properly defined, they are merged back into the original VCF file during the [join_positions](#join_positions) step. The updated VCF file is then converted into a `vcfR` RDS object.
-
-#### mpileup
-
-At this stage, [bcftools](https://samtools.github.io/bcftools/bcftools.html) is used to perform the pileup in order to retrieve frequency information of private mutations across all samples.
-
-<details markdown="1">
-<summary>Output files for all samples</summary>
-
-**Output directory: `{work_dir}/lifter/mpileup/<dataset>/<patient>/<sample>/`**
-
-- `<dataset>_<patient>_<sample>.bcftools_stats.txt`
-  - TXT file with statistics on the called mutations
-- `<dataset>_<patient>_<sample>.vcf.gz` and `<dataset>_<patient>_<sample>.vcf.gz.tbi`:
-  - VCF file and tabix index with called mutations
-
-</details>
-
-#### positions
-
-This intermediate step allows to retrieve private and shared mutations across samples originated from the same patient. Retrieved mutations are joined with original mutations present in input VCF, which is in turn converted into an RDS object using [vcfR](https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html).
-
-<details markdown="1">
-<summary>Output files for all samples</summary>
-
-**Output directory: `{publish_dir}/lifter/positions/<dataset>/<patient>/<sample>/`**
-
-- `<dataset>_<patient>_<sample>.pileup_VCF.rds`
-  - RDS containing shared and private mutations
-- `<dataset>_<patient>_<sample>.positions_missing`
-  - TXT file containing mutations to be retrieved for a given sample
+</details> -->
 
-</details>
+### Pipeline information
 
 <details markdown="1">
-<summary>Output files for all patients</summary>
-
-**Output directory: `{publish_dir}/lifter/positions/<dataset>/<patient>/`**
+<summary>Output files</summary>
 
-- `<dataset>_<patient>__all_positions.rds`
-  - RDS containing shared and private mutations
+- `pipeline_info/`
+  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
+  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
+  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
+  - Parameters used by the pipeline run: `params.json`.
 
 </details>
 
-<!--
+[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
 
-### Driver Annotation
+## Reference files
 
-Variants are annotated according to [IntOGen latest release](https://www.nature.com/articles/s41568-020-0290-x).
+Several tools in the pipeline generate reference files. When reference files for VEP and SigProfiler are not provided, the pipeline stores them in tool-specific folders.
 
-<details markdown="1">
-<summary>Output files for all samples</summary>
+### VEP
 
-**Output directory: `{work_dir}/DriverAnnotation/<dataset>/<patient>/<sample>/`**
+When the VEP cache is not specified, the desired VEP cache is downloaded to `{outdir}/references/VEP/vep_cache/homo_sapiens/{VEP_version}_{ref_genome}`.
 
-* `annotated_drivers.rds`
-    * RDS file containing variants with annotated drivers.
-</details> -->
+### SigProfiler
 
-### Pipeline information
+The reference genome for SigProfiler is stored in the following folder:
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `pipeline_info/`
-  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
-  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline.
-  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
-  - Parameters used by the pipeline run: `params.json`.
+**Output directory: `{outdir}/subclonal_deconvolution/signature_deconvolution/SigProfiler/genome/tsb/{ref_genome}`**
 
-</details>
-
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
+- `{chromosome}.txt`
+  - genome assembly at chromosome level
+- `{ref_genome}_proportions.txt`
+  - genome assembly proportions
+</details>