A pipeline that uses read-level features and extra trees/random forest algorithms for accurate and fast detection of somatic mutations in next generation sequencing data.
We have created a docker image with all dependencies installed:
- If you don't have docker already installed in your system, please follow these instructions.
docker pull labxa/rfcaller:1.0.0The image has the following structure:
databasesdirectory contains dbSNP and ploidy files.RFcallerhas the scripts used by the pipeline.outputis the working directory.
/home
|-- databases
|-- dbSNP
|-- UCSC_dbSNP153Common_hg19_combined.vcf.gz
|-- UCSC_dbSNP153Common_hg19_combined.vcf.gz.tbi
|-- UCSC_dbSNP153Common_hg38_combined.vcf.gz
|-- UCSC_dbSNP153Common_hg38_combined.vcf.gz.tbi
|-- ploidy_files
|-- GRCm38.ploidy.file
|-- GRCm39.ploidy.file
|-- RFcaller
/outputHere is basic configuration, for more information see docker and RFcaller options.
docker run --rm -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38Among all the options offered by docker (docker run --help), we recommend:
--rm: Automatically remove the container when it exits.-v, --volume: Mount local volumes in the container.- With the option
-v $(pwd):/output/, RFcaller results will be in your current directory.
- With the option
-u, --user: Specify the user ID and its group ID. It's useful to not run the pipeline as root.-i, --interactive: Keep STDIN open even if not attached.-t, --tty: Allocate a pseudo-TTY. When combined with-iit allows you to connect your terminal with the container terminal.-e, --env: Set environment variables.
RFcaller has the following required inputs:
-i, --input: The input is a TSV (tab-separated values) file with five required columns and an optional extra one. This last column is used to specify the file with the positions you want to analyze.
| Normal name | Normal BAM (CRAM supported) |
Tumor name | Tumor BAM (CRAM supported) |
Output name | VCF/BED (optional) |
-g, --genome: Reference genome in FASTA format (bgzip supported).-p, --dbSNP: VCF file provided by the image with common SNPs (MAF ≥ 1%) to eliminate these positions from the analysis. In case you want to use your own VCF (bgzip supported), use also the--ploidy_fileargument.- Choices:
[hg19, hg38, your_dbSNP]
- Choices:
RFcaller provides additional configuration through the following optional inputs:
-@: Max number of threads to use.- Default: 20
-b, --bamsDir: Main directory where BAMs are located inside the container. By default, the program will look in the/bams/directory and that's why we set the docker option-v /BAMS_PATH/:/bams/. However, if you mount your BAMs in other directory inside the container, you have to use this option to specify it.- Default:
/bams/
- Default:
-o, --outputDir: Name for the output directory. By default, the results will be saved in your current directory, but with this option a new folder will be created to save them.--positions: This argument refers to the file format of the--inputoptional column.- Default:
VCF - Choices:
[VCF, BED]
- Default:
--ploidy_file:bcftools call --ploidy_fileto use in variant calling. By default, if the user specifies--dbSNP hg19or--dbSNP hg38,--ploidy_fileis GRCh37 or GRCh38 respectively. Otherwise, you must select between the following options:- Choices:
[GRCh37, GRCh38, GRCm38 (mouse), GRCm39 (mouse), None (consider all sites as diploid), your_ploidy_file]
- Choices:
--includePatches: Tag used to analyze both canonical chromosomes and patches.--keep: Tag used to keep intermediate files.
RFcaller has some extra parameters to use as filters:
| Parameter | Description | Default |
|---|---|---|
| contamination | Percentage of tumor contamination in normal sample. For example, if you expect that 10% of reads in normal comes from tumor (10% contamination),you have to set "contamination": 0.1. This argument is used to increase the maximum number of mutated reads allowed in normal sample |
0.05 |
| TD_cov_SNV | Minimum coverage for tumor sample (SNVs) | 7 |
| TD_cov_INDEL | Minimum coverage for tumor sample (INDELs) | 7 |
| ND_cov_SNV | Minimum coverage for normal sample (SNVs) | 7 |
| ND_cov_INDEL | Minimum coverage for normal sample (INDELs) | 7 |
| TD_mut_SNV | Minimum number of mutated reads for a position in tumor sample to not discard it (SNVs) | 3 |
| TD_mut_INDEL | Minimum number of mutated reads for a position in tumor sample to not discard it (INDELs) | 4 |
| ND_mut_SNV | Maximum number of mutated reads allowed for a position in normal sample (SNVs) | 3 |
| ND_mut_INDEL | Maximum number of mutated reads allowed for a position in normal sample (INDELs) | 2 |
| ND_window | Window size around a position to look for mutations in normal (INDELs) | 10 |
| SNV_threshold | Minimum QUAL value to consider a SNV as good | 10.1774 |
| INDEL_threshold | Minimum QUAL value to consider an INDEL as good | 32.1418 |
| polyINDEL_threshold | Minimum QUAL value to consider an homopolymerINDEL (mononucleotide microsatellites) as good | 0.7723 |
Imagine that you are working in the directory /home/example which has the following structure:
/home/example/
|-- example.metadata
|-- bams
|-- normal.bam
|-- normal.bam.bai
|-- tumor.bam
|-- tumor.bam.bai
|-- genome
|-- hg19.fa
|-- hg19.fa.faiIf we use the following command to run RFcaller:
docker run --rm -v /home/example/bams/:/bams/ -v /home/example/genome/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -i example.metadata --genome /genome/hg19.fa --dbSNP hg19The example.metadata file should be:
normal normal.bam tumor tumor.bam exampleThis is because thanks to the -v command, the files inside the container are located in the following way:
| Computer | Docker |
|---|---|
| /home/example/example.metadata | /output/example.metadata |
| /home/example/bams/normal.bam | /bams/normal.bam |
| /home/example/bams/tumor.bam | /bams/tumor.bam |
| /home/example/genome/hg19.fa | /genome/hg19.fa |
A more detailed explanation:
- We use
-i example.metadatabecause inside the container, we are working in the/output/directory, which is where theexample.metadatafile is located. It would also be valid to write-i /output/example.metadata. - We have said before that the default directory for the BAMs inside the container is
/bams/. In this sense, as our BAMs are in this directory, we don't have to specify the full path in theexample.metadatafile, just write where these BAMs are located inside this folder. - In the case of genome, we have to specify the full path inside the container, that is why we use
--genome /genome/hg19.fa.
To make it clear how to use the Docker options together with RFcaller, we have prepared a more complex example. For this time, we start with the following directory structure:
/home/complex_example/
|-- example.metadata
/data/
|-- bams
|-- normal_dir
|-- normal.bam
|-- normal.bam.bai
|-- tumor_dir
|-- tumor.bam
|-- tumor.bam.bai
|-- hg19_genome
|-- hg19.fa
|-- hg19.fa.faiIn the event that we were working in the /home/ directory:
docker run --rm -v /data/bams/:/example_bams/ -v /data/hg19_genome/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -i complex_example/example.metadata --bamsDir /example_bams/ --genome /genome/hg19.fa --dbSNP hg19 The distribution of the files will be:
| Computer | Docker |
|---|---|
| /home/complex_example/example.metadata | /output/complex_example/example.metadata |
| /data/bams/normal_dir/normal.bam | /example_bams/normal_dir/normal.bam |
| /data/bams/tumor_dir/tumor.bam | /example_bams/tumor_dir/tumor.bam |
| /data/hg19_genome/genome/hg19.fa | /genome/hg19.fa |
Thus, the metadata.example file should be:
normal normal_dir/normal.bam tumor tumor_dir/tumor.bam exampleA more detailed explanation:
- As we are using
-v $(pwd):/output/and our working directory outside the container is/home/, theexample.metadatafile in the container is inside the foldercomplex_example. This is why we use-i complex_example/example.metadata. - We have mounted the BAMs directory in
/example_bams/instead of/bams/, so we also have to changed BAMs directory for RFcaller with the-b, --bamsDiroption. Now, for theexample.metadatawe have to write the full BAMs path removing the main directory because it is defined with--bamsDir.
The time zone of the docker image is Europe/Madrid and it's used to set the time for the log file. To change this, add the -e option to set environment variables:
docker run --rm -e "TZ=$(cat /etc/timezone)" -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38Or set it manually (to know your TZ visit: TZ database)
docker run --rm -e "TZ=America/Toronto" -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38For each case three directories and a final file will be created:
- calling: bcftools basic call results
${name}.vcf.gz-> Original calling from bcftools${name}.norm.vcf.gz-> After normalizing the indels${name}.filter.norm.vcf.gz-> After removing SNPs and very low quality mutations${name}.SNVs.filter.norm.vcfand${name}.INDELs.filter.norm.vcf-> Separate SNVs and INDELs into two filespolyIndel.pos-> BED file with those positions which contain an homopolymer INDEL
- somaticSNV: RFcaller results for SNVs
reduced.positions-> All positions to analyze in${name}.SNVs.filter.norm.vcfreduced_*.bam-> A small bam with just the reads overlapping positions inreduced.positions*.mini.pileupand${name}.mini.pileup.vcf-> Filtered mini.pileup and its VCF for the positions inreduced.positionscommon.positions-> File with the positions in${name}.mini.pileup.vcfwith the formatCHR \t POS${tumor}.read.names,${name}.mutations.interval,${name}.sequence,${name}.stats,${name}.cigarand${name}.mapq-> Intermediate files with read-level features${name}.prepare_muts.csv-> File generated from gathering the above features and filtering themregression_results_SNVs_${name}.txt-> Regression algorithm result for SNVsmutations_SNVs_${name}.vcf.gz-> Final set of SNVs
- somaticINDEL: RFcaller results for INDELs
reduced.positions-> All positions to analyze in${name}.INDELs.filter.norm.vcfshort_reduced.positionsandlong_reduced.positions-> Positions separated by the size of the INDEL (short < 7 and long >= 7)reduced_*.bam-> A small bam with just the reads overlapping positions inreduced.positions*.mini.pileupand${name}.mini.pileup.vcf-> Filtered mini.pileup and its VCF for the positions inreduced.positionsshort_common.positions,long_common.positionsandcommon.positions-> File with the positions in${name}.mini.pileup.vcfwith the formatCHR \t POS${name}.features,${name}.mutations.interval,${name}.sequence,${name}.distribution,${name}.stats,${name}.cigarand${name}.mapq-> Intermediate files with read-level features${name}.prepare_muts.csv-> File generated from gathering the above features and filtering themregression_results_INDELs_${name}.txt-> Regression algorithm result for INDELsmutations_INDELs_${name}.vcf.gz-> Final set of INDELs
mutations_${name}.vcf-> Final VCF with the mutations from SNV and INDEL pipelines
If the tag --keep is set, all above directories are removed and it is only conserved the final VCF file mutations_${name}.vcf.
Two additional global files are created:
- RFcaller.log -> Log file with the commands that have been executed for each case and their results
- .stderr.log -> A file with errors and warnings that have occurred during the pipeline
Authors: Ander Díaz-Navarro, Pablo Bousquets-Muñoz & Xose S. Puente -- Universidad de Oviedo