PrecisionProDB parameters

Xiaolong Cao edited this page Jan 18, 2025 · 3 revisions

To get help for the main program, run

python Path_of_PrecisionProDB/src/PrecisionProDB.py -h

The output looks like this:

usage: PrecisionProDB [-h] [-g GENOME] [-f GTF] [-m MUTATIONS] [-p PROTEIN] [-t THREADS]
                      [-o OUT] [-a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}]
                      [-k PROTEIN_KEYWORD] [-F] [-s SAMPLE] [-A]
                      [-D {GENCODE,RefSeq,Ensembl,Uniprot,CHM13,}] [-U UNIPROT]
                      [--uniprot_min_len UNIPROT_MIN_LEN] [--PEFF] [--keep_all] [-S SQLITE]

PrecisionProDB, a personal proteogenomic tool which outputs a new reference protein based
on the variants data. A VCF or /a tsv file can be used as the variant input. If the variant
file is in tsv format, at least four columns are required in the header: chr, pos, ref,
alt. Additional columns will be ignored. Try to Convert the file to proper format if you
have a bed file or other types of variant file. The pos column is 1-based like in the vcf
file. Additionally, a string like "chr1-788418-CAG-C" can used as variant input. It has to
be combined with the --sqlite for quick check of the mutation effects

options:
  -h, --help            show this help message and exit
  -g GENOME, --genome GENOME
                        the reference genome sequence in fasta format. It can be a gzip
                        file
  -f GTF, --gtf GTF     gtf file with CDS and exon annotations. It can be a gzip file
  -m MUTATIONS, --mutations MUTATIONS
                        a file stores the variants. If the file ends with ".vcf" or
                        ".vcf.gz", treat as vcf input. Otherwise, treat as TSV input. A
                        string like "chr1-788418-CAG-C" or
                        "chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G" can used as variant
                        input, too. In this mode, --sample will not be used. If multiple
                        vcf files are provided, use "," to join the file names. For
                        example, "--mutations file1.vcf,file2.vcf". A pattern match is also
                        supported for input vcf, but quote is required to get it work. For
                        example '--mutations "file*.vcf" '
  -p PROTEIN, --protein PROTEIN
                        protein sequences in fasta format. It can be a gzip file. Only
                        proteins in this file will be checked
  -t THREADS, --threads THREADS
                        number of threads/CPUs to run the program. default, use 20 or all
                        CPUs available, whichever is smaller
  -o OUT, --out OUT     output prefix, folder path could be included. Three or five files
                        will be saved depending on the variant file format. Outputs include
                        the annotation for mutated transcripts, the mutated or all protein
                        sequences, two variant files from vcf.
                        {out}.pergeno.aa_mutations.csv, {out}.pergeno.protein_all.fa,
                        {out}.protein_changed.fa, {out}.vcf2mutation_1/2.tsv. default
                        "perGeno"
  -a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}, --datatype {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}
                        input datatype, could be GENCODE_GTF, GENCODE_GFF3, RefSeq,
                        Ensembl_GTF or gtf. default "gtf". Ensembl_GFF3 is not supported.
  -k PROTEIN_KEYWORD, --protein_keyword PROTEIN_KEYWORD
                        field name in attribute column of gtf file to determine ids for
                        proteins. default "auto", determine the protein_keyword based on
                        datatype. "transcript_id" for GENCODE_GTF, "protein_id" for
                        "RefSeq" and "Parent" for gtf and GENCODE_GFF3
  -F, --no_filter       default only keep variant with value "PASS" FILTER column of vcf
                        file. if set, do not filter
  -s SAMPLE, --sample SAMPLE
                        sample name in the vcf to extract the variant information. default:
                        None, extract the first sample. For multiple samples, use "," to
                        join the sample names. For example, "--sample
                        sample1,sample2,sample3". To use all samples, use "--sample
                        ALL_SAMPLES". To use all variants regardless where the variants
                        from, use "--sample ALL_VARIANTS".
  -A, --all_chromosomes
                        default keep variant in chromosomes and ignore those in short
                        fragments of the genome. if set, use all chromosomes including
                        fragments when parsing the vcf file
  -D {GENCODE,RefSeq,Ensembl,Uniprot,CHM13,}, --download {GENCODE,RefSeq,Ensembl,Uniprot,CHM13,}
                        download could be 'GENCODE','RefSeq','Ensembl','Uniprot', 'CHM13'.
                        If set, PrecisonProDB will try to download genome, gtf and protein
                        files from the Internet. Download will be skipped if "--genome,
                        --gtf, --protein, (--uniprot)" were all set. Settings from "--
                        genome, --gtf, --protein, (--uniprot), --datatype" will not be used
                        if the files were downloaded by PrecisonProDB. default "". Note, if
                        --sqlite is set, will not download any files
  -U UNIPROT, --uniprot UNIPROT
                        uniprot protein sequences. If more than one file, use "," to join
                        the files. default "". For example, "UP000005640_9606.fasta.gz", or
                        "UP000005640_9606.fasta.gz,UP000005640_9606_additional.fasta"
  --uniprot_min_len UNIPROT_MIN_LEN
                        minimum length required when matching uniprot sequences to proteins
                        annotated in the genome. default 20
  --PEFF                If set, PEFF format file(s) will be generated. Default: do not
                        generate PEFF file(s).
  --keep_all            If set, do not delete files generated during the run
  -S SQLITE, --sqlite SQLITE
                        A path of sqlite file for re-use of annotation info. default '', do
                        not use sqlite. The program will create a sqlite file if the file
                        does not exist. If the file already exists, the program will use
                        data stored in the file. It will cause error if the content in the
                        sqlite file is not as expected.
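
The description above says that TSV variant input needs at least the header columns chr, pos, ref and alt, with 1-based positions, and that a string such as "chr1-788418-CAG-C" can also be given. The following is a minimal sketch of how such a string could be turned into a TSV file with that header. The helper names `parse_variant_string` and `variants_to_tsv` are hypothetical and are not part of PrecisionProDB. The sketch assumes that ref and alt never contain "-", and it does not reproduce how PrecisionProDB itself parses the string.

```python
import csv
import io

# Column names required in the TSV header, as listed in the help text.
REQUIRED_COLUMNS = ["chr", "pos", "ref", "alt"]


def parse_variant_string(text):
    """Split 'chr1-788418-CAG-C[,...]' into dicts with chr, pos, ref, alt.

    Hypothetical helper. Splitting from the right keeps a chromosome
    name intact even if it contains '-' itself.
    """
    variants = []
    for item in text.split(","):
        chrom, pos, ref, alt = item.strip().rsplit("-", 3)
        variants.append({"chr": chrom, "pos": int(pos), "ref": ref, "alt": alt})
    return variants


def variants_to_tsv(variants):
    """Return tab-separated text with the four required header columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=REQUIRED_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(variants)
    return buf.getvalue()


if __name__ == "__main__":
    parsed = parse_variant_string("chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G")
    print(variants_to_tsv(parsed), end="")
```

The text returned by `variants_to_tsv` can be saved to a file whose name does not end with ".vcf" or ".vcf.gz", so that it is treated as TSV input by `--mutations`.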

Notes

  • -p PROTEIN, --protein PROTEIN must be a file of protein sequences that match the GTF file provided.
  • -k PROTEIN_KEYWORD is the attribute field used to match records in the GTF file to the protein sequences. If it is not provided, the program tries to determine the keyword from the datatype. The program needs this match to locate each protein in the genome and to match codons to the protein sequence, which allows for non-standard codons.
  • -a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}, --datatype {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf} should be set when your annotation is in one of the listed formats. For the generic "gtf" datatype, the values of PROTEIN_KEYWORD in the GTF file must match the sequence IDs in the PROTEIN file.
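
The "auto" behaviour of -k can be pictured as a lookup from datatype to keyword. The sketch below only restates the defaults listed in the help text for --protein_keyword. The function `resolve_protein_keyword` is a hypothetical illustration, not the program's own code, and Ensembl_GTF is left out because the help text does not state its default.

```python
# Defaults as listed in the --protein_keyword help text.
# Ensembl_GTF is omitted: the help text does not state its default.
AUTO_KEYWORDS = {
    "GENCODE_GTF": "transcript_id",
    "RefSeq": "protein_id",
    "gtf": "Parent",
    "GENCODE_GFF3": "Parent",
}


def resolve_protein_keyword(datatype="gtf", protein_keyword="auto"):
    """Hypothetical helper: return the keyword that would apply.

    An explicit keyword always wins; "auto" falls back to the
    documented default for the datatype.
    """
    if protein_keyword != "auto":
        return protein_keyword
    try:
        return AUTO_KEYWORDS[datatype]
    except KeyError:
        raise ValueError(
            f"no documented default keyword for datatype {datatype!r}"
        ) from None


if __name__ == "__main__":
    print(resolve_protein_keyword("RefSeq"))
    print(resolve_protein_keyword("gtf", "gene_id"))
```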

You do not need to set every parameter to run the program. See the examples for different input combinations.
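
As a rough illustration of how the parameters combine, the sketch below assembles command lines for three input combinations using only flags from the help output. The function `build_command` and all file names (sample.vcf.gz, genome.fa.gz, annotation.gtf.gz, proteins.fa.gz, annotation.sqlite) are hypothetical placeholders, and Path_of_PrecisionProDB is the same placeholder path used at the top of this page. The code only builds the argument lists and does not run the program.

```python
import shlex

# Placeholder path, as used in the help command on this page.
SCRIPT = "Path_of_PrecisionProDB/src/PrecisionProDB.py"


def build_command(mutations, out="perGeno", genome=None, gtf=None,
                  protein=None, datatype=None, download=None, sqlite=None):
    """Hypothetical helper: build an argument list, skipping unset options."""
    cmd = ["python", SCRIPT, "-m", mutations, "-o", out]
    optional = [
        ("-g", genome), ("-f", gtf), ("-p", protein),
        ("-a", datatype), ("-D", download), ("-S", sqlite),
    ]
    for flag, value in optional:
        if value:
            cmd += [flag, value]
    return cmd


# Local genome, annotation and protein files with an explicit datatype.
local = build_command(
    "sample.vcf.gz", genome="genome.fa.gz", gtf="annotation.gtf.gz",
    protein="proteins.fa.gz", datatype="GENCODE_GTF",
)
# Let the program download the reference files instead.
downloaded = build_command("sample.vcf.gz", download="GENCODE")
# A variant string, which the help text says must be combined with --sqlite.
quick_check = build_command("chr1-788418-CAG-C", sqlite="annotation.sqlite")

if __name__ == "__main__":
    for cmd in (local, downloaded, quick_check):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run` once the placeholder path and file names are replaced with real ones.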
