Skip to content

Collection of bioinformatic tools to work with fastq sequences, DNA and RNA sequences and protein sequences

Notifications You must be signed in to change notification settings

yvolko/Bioinformatics-Tools-Repository

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 

Repository files navigation

Bioinformatics-Tools-Repository

Collection of bioinformatic tools to work with fastq format, gbk, DNA and RNA sequences and protein sequences.

Working with DNA and RNA sequences

This tool helps to get reverse, complement, reverse-complement and transcribed sequences. As well as define if sequense is DNA or is RNA.

Working with protein sequences

This tool supports standard 20 amino acids. Any modifications of amino acids are not supported. You can write amino acids in any case (lower, upper or mixed).

  1. Alanine (A, Ala)
  2. Arginine (R, Arg)
  3. Asparagine (N, Asn)
  4. Aspartic Acid (D, Asp)
  5. Cysteine (C, Cys)
  6. Glutamine (Q, Gln)
  7. Glutamic Acid (E, Glu)
  8. Glycine (G, Gly)
  9. Histidine (H, His)
  10. Isoleucine (I, Ile)
  11. Leucine (L, Leu)
  12. Lysine (K, Lys)
  13. Methionine (M, Met)
  14. Phenylalanine (F, Phe)
  15. Proline (P, Pro)
  16. Serine (S, Ser)
  17. Threonine (T, Thr)
  18. Tryptophan (W, Trp)
  19. Tyrosine (Y, Tyr)
  20. Valine (V, Val)

This project consists of one function "protein_analysis" that helps user to:

  • predict molecular weight of amino acid (aa) sequences
  • translate aa sequences from one-letter to three-letter code
  • calculate total amount of each amino acid in the sequences
  • make DNA based codon optimization for the introduced amino acid sequences with the support for 3 cell types: Esherichia coli, Pichia pastoris, Mouse
  • calculate length of amino acid sequences
  • count the number of atoms of each type in a sequence (brutto formula)

Working with fastq format

This tools choses sequences that satisfy conditions of sertain length, GC content and sequncing quality (phred33 standard).
Learn more about fastq files here.
Learn more about quality score encoding in fastq files here.

Convert multiline fasta file to oneline fasta file

Converts fasta file where sequences are written on multiple lines to fasta file where each sequence is written as single line.

Select genes from gbk to fasta

Selects neighbouring genes to the genes of interest.

Change fasta start position

Changes start positon of sequence to a values given by user.

How to use:

DNA/RNA tool

run_dna_rna_tools(**args, procedure)
Parametrs:

*args : sequence of str
    Any number of lines with DNA only or RNA only sequences
procedure : str
    The name of the operation you want to perform. The following types of procedures are supported:

  • transcribe: transcribes DNA sequnces
  • reverse: return reverse DNA or RNA sequence
  • complement: return complement DNA sequence
  • reverse_complement: return reverse-complement DNA sequence
  • get_nucl_acid_type: checks wether give sequence is DNA or RNA (ND-not defined)

Call the "run_dna_rna_tools" funcion with following arguments. Requred arguments:

  • sequence of strings representing only DNA or only RNA. Please do not use a mixture of DNA/RNA sequences in the same function call!
  • name of procedure as string (see list of precedures)

Protein tool

protein_analysis(*args, procedure, cell_type=None, letter_format=1)
Parametrs:

*args : sequence of str
    Any number of lines with amino acid sequences
procedure : str
    The name of the operation you want to perform. The following types of procedures are supported:

  • molecular_weight: calculates predicted molecular weight of amino acid sequences in kDa
  • one_letter_to_three: translate aa sequences from one-letter to three-letter code
  • get_amino_acid_sum: calculates total amount of each amino acid in the sequences
  • codon_optimization: makes DNA based codon optimization for the introduced amino acid sequences, support 3 types of cells. Can only be used in conjunction with cell_type: Esherichia coli, Pichia pastoris, Mouse
  • length: calculates length of amino acid sequences
  • brutto_count: counts the number of atoms of each type in a sequence

cell_type : str, defalut None
    The type of cells for which optimization is applied. Cell types supported:

  • Esherichia coli or E.coli
  • Pichia pastoris or P.pastoris
  • Mouse or mouse

letter_format : int, defalut 1
    Specifies the format for receiving amino acid sequences. Either one-letter (letter_format = 1) or three-letter sequences (letter_format = 3)

Call the "protein_analysis" funcion with following arguments. Requred arguments:

  • tuple of protein sequences written one letter or three letter code without stop codos. Please do not use sequences in different formats in the same function call!
  • name of procedure as string (see list of precedures)
  • format of code for the protein sequences as int: 1 for one letter, 3 for three letter code Optional argument:
  • cell type (required only for codon_optimization procedure). Accepted cell types Esherichia coli, Pichia pastoris, Mouse

FASTQ tool

fastq_thresholding(input_path, output_filename = '', gc_bounds = (0, 100), length_bounds = (0, 2 ** 32), quality_threshold = 0)
Parametrs:

*input_path : str
    String, valid path to the input file with the name of file and extension of the file

*output_filename : str
    String, name of the output file. By default - input fasta file name is taken.

gc_bounds : tuple, defalut (0, 100)
    Tuple: first value is lower boundary of GC-content and second value - upper boundry. Function will filter sequences that are in between given values (including given values). If only one number is given it is taken as upper bound and lower bound is taken as 0.

length_bounds : tuple, defalut (0, 2 power(32))
    Tuple: first value is lower boundary of length and second value - upper boundry. Function will filter sequences that are in between given values (including given values). If only one number is given it is taken as upper bound and lower bound is taken as 0.

quality_threshold : int, defalut 1
    If average sequnce quality of given sequence is above or equal to given number sequence will be selected.

Call the "fastq_thresholding" funcion with following arguments.

Requred arguments:

  • input_path

Optional argument:

  • output_filename
  • gc_bounds
  • length_bounds
  • quality_threshold

Convert multiline fasta file to oneline fasta file

convert_multiline_fasta_to_oneline(input_fasta, output_fasta = '')
Parametrs:

*input_fasta : str
    String, valid path to the input file with the name of file and extension of the file

*output_fasta : str
    String, name of the output file. By default - input fasta file name is taken and "_oneline" is added to it.

Call the "convert_multiline_fasta_to_oneline" funcion with following arguments.

Requred arguments:

  • input_fasta

Optional argument:

  • output_fasta

Select genes from gbk to fasta

select_genes_from_gbk_to_fasta(input_gbk, genes, n_before = 1, n_after = 1, output_fasta = '')
Parametrs:

*input_gbk : str
    String, valid path to the input file with the name of file and extension of the file

*genes : str
    String, genes of interest separated with space.

*n_before : int
    Integer, how many genes to select before gene of interest. By default equal to 1.

*n_after : int
    Integer, how many genes to select after gene of interest. By default equal to 1.

*output_fasta : str
    String, name of the output fasta file. By default name of the input gbk file is taken and "_selected_genes" is added.

Call the "select_genes_from_gbk_to_fasta" funcion with following arguments.

Requred arguments:

  • input_gbk
  • genes

Optional argument:

  • n_before
  • n_after
  • output_fasta

Change fasta start position

*input_fasta : str
    String, valid path to the input file with the name of file and extension of the file

*shift : int
    Integer, shift to move start position.

*output_fasta : str
    String, name of the output file.

Call the "change_fasta_start_pos" funcion with following arguments.

Requred arguments:

  • input_fasta
  • shift
  • output_fasta

List of procedures:

  • transcribe — returns list of strings with transcribed sequences (or one string if one sequence is given)
  • reverse — returns list of strings with reversed sequences (or one string if one sequence is given)
  • complement — returns list of strings with complement sequences (or one string if one sequence is given)
  • reverse_complement — returns list of strings with reverse-complement sequences (or one string if one sequence is given)
  • get_nucl_acid_type — returns list of string with DNA, RNA or ND ()not defined in it (or one string if one sequence is given)
  • molecular_weight — returns list of float values, that indicate predicted molecular weights of given aa sequences (in kDa)
  • one_letter_to_three — will return list of strings, containing the same sequences written in three-letter code
  • get_amino_acid_sum — сounts the amount of each amino acid in the injected protein sequences
  • codon_optimization — makes codon-optimized DNA based on the introduced amino acid sequences for 3 types of cells: Esherichia coli, Pichia pastoris, Mouse
  • length — calculates length of amino acid sequences
  • brutto_count — counts the number of atoms of each type in a sequence

Example of use:

run_dna_rna_tools('ATG', 'transcribe') # 'AUG'
run_dna_rna_tools('ATG', 'reverse') # 'GTA'
run_dna_rna_tools('AtG', 'complement') # 'TaC'
run_dna_rna_tools('ATg', 'reverse_complement') # 'cAT'
run_dna_rna_tools('ATG', 'aT', 'reverse') # ['GTA', 'Ta']
run_dna_rna_tools('AUG', 'aT', 'aCg', 'is_dna_is_rna') # ['RNA', 'DNA', 'ND']
protein_analysis("ACD", "AD", procedure="one_letter_to_three", letter_format=1) # ['AlaCysAsp', 'AlaAsp']
protein_analysis("AlaAspLys", "AlaAsp", procedure="molecular_weight", letter_format=3) # [0.37, 0.22]
protein_analysis("ACD", "AD", procedure="get_amino_acid_sum") # [{'A': 1, 'C': 1, 'D': 1, 'E': 0, 'F': 0, 'G': 0, 'H': 0, 'I': 0, 'K': 0, 'L': 0, 'M': 0, 'N': 0, 'P': 0, 'Q': 0, 'R': 0, 'S': 0, 'T': 0, 'V': 0, 'W': 0, 'Y': 0},
                                                                        # {'A': 1, 'C': 0, 'D': 1, 'E': 0, 'F': 0, 'G': 0, 'H': 0, 'I': 0, 'K': 0, 'L': 0, 'M': 0, 'N': 0, 'P': 0, 'Q': 0, 'R': 0, 'S': 0, 'T': 0, 'V': 0, 'W': 0, 'Y': 0}]
protein_analysis("ACD", "AD", procedure="codon_optimization", cell_type = 'E.coli', letter_format=1) # ['GCGTGCGAT', 'GCGGAT']
protein_analysis("acDEFGHIKLMNPQRSTVwy", "ad", procedure="length", letter_format=1) # [20, 2]
protein_analysis("FGHIKLMNPQ", "PQRSTVwy", "adN", procedure="brutto_count", letter_format=1)
# [{'C': 54, 'H': 103, 'N': 15, 'O': 22, 'S': 1}, {'C': 48, 'H': 83, 'N': 23, 'O': 18, 'S': 3}, {'C': 11, 'H': 22, 'N': 4, 'O': 9, 'S': 0}]
fastq_thresholding("C:\\Users\\name\\Desktop\\pythonProject\\example_data.txt", "my_file22", length_bounds=(1,1000)) # in windows OS
fastq_thresholding("/Users/volko/Desktop/pythonProject/example_data.txt", "my_file22", length_bounds=(1,1000)) # in Linux/Mac OS
select_genes_from_gbk_to_fasta('C:\\Users\\name\\Desktop\\pythonProject\\example_gbk.gbk', "rrrD_1 pxpB", n_before=2, n_after=2) # in windows OS
fastq_thresholding("/Users/name/Desktop/pythonProject/example_gbk.gbk", "my_file22", length_bounds=(1,1000)) # in Linux/Mac OS
convert_multiline_fasta_to_oneline('C:\\Users\\name\\Desktop\\pythonProject\\example_multiline_fasta.fasta') # in windows OS
convert_multiline_fasta_to_oneline("/Users/name/Desktop/pythonProject/example_multiline_fasta.fasta", "my_file22", length_bounds=(1,1000)) # in Linux/Mac OS
change_fasta_start_pos("C:\\Users\\name\\Desktop\\pythonProject\\docshift.txt", 2, "shifted_line") # in windows OS
convert_multiline_fasta_to_oneline("/Users/name/Desktop/pythonProject/docshift.txt", 2, "shifted_line") # in Linux/Mac OS

Input requirements and possible errors:

  • It is important to indicate the type of operation. An error occurs when you enter an incorrect operation type
protein_analysis("FGHIKLMNPQ", "PQRSTVwy", "adN", procedure="brutto", letter_format=1)
# ValueError: Requested procedure is not defined
  • To perform the coden_optimization operation, you must enter cell_type (None by default). Otherwise an error message is displayed
protein_analysis('AlaCysAsp', 'AlaAsp', procedure="codon_optimization", cell_type='Rat', letter_format=3) 
# ValueError: Type Rat is not supported. The following types of organisms are available for codon optimization: Esherichia coli, Pichia pastoris, Mouse
  • By default, entering amino acid sequences in a single-letter format in any case is supported. To enter in three-letter format in any case, you need to specify letter_format = 3.
    If an unknown format is entered, an error message is displayed.
protein_analysis("ACD", "AD", procedure="one_letter_to_three", cell_type='E.coli', letter_format=2)
# ValueError: Error unsupported letter_format. Only letter_formats 1 and 3 are supported
  • If letter_format = 1 is specified, but all sequences are similar to the three-letter amino slot encoding, a notification will be displayed warning
protein_analysis("LYSlys", "HishisHis", procedure="get_amino_acid_sum", letter_format=1)
# Warning: all your sequences are similar to three-letter ones. Check the letter_format value
  • If a single-letter amino acid input format is specified, but at least one amino acid slot is not standard or is written incorrectly, an error message is displayed
protein_analysis("BBB", procedure="get_amino_acid_sum", letter_format=1))
# ValueError: Error B is not an amino acid. Correct your input
  • If a three-letter amino acid input format is specified, but at least one amino acid slot is not standard or is written incorrectly, an error message is displayed
protein_analysis("Al", procedure="get_amino_acid_sum", letter_format=3)
# ValueError: Error al is incorrect form of amino acid notation. Correct your input
protein_analysis("AluLysArg", procedure="get_amino_acid_sum", letter_format=3)
# ValueError: Error alu is not an amino acid. Correct your input
  • If file with the name equal to output_filename (in fastq_thresholding function) or equal output_fasta (in convert_multiline_fasta_to_oneline, in select_genes_from_gbk_to_fasta or in change_fasta_start_pos functions) already exists FileExistsError will occure
fastq_thresholding("C:\\Users\\name\\Desktop\\pythonProject\\example_data.txt", "my_file22", length_bounds=(1,1000))
# FileExistsError: File with the provided name already exist. Please use another name.
convert_multiline_fasta_to_oneline("C:\\Users\\name\\Desktop\\pythonProject\\example_multiline_fasta.fasta", "my_file22", )
# FileExistsError: File with the provided name already exist. Please use another name.
select_genes_from_gbk_to_fasta("C:\\Users\\name\\Desktop\\pythonProject\\example_gbk.gbk", "my_file22", "rrrD_1 pxpB", n_before=2, n_after=2)
# FileExistsError: File with the provided name already exist. Please use another name.
  • If at least one of the genes in the parameter genes (function select_genes_from_gbk_to_fasta) is not in the input gbk file ValueError will occure
select_genes_from_gbk_to_fasta('C:\\Users\\name\\Desktop\\pythonProject\\example_gbk.gbk', "AMMMttt", n_before=2, n_after=2)
# ValueError: 'The gene AMMMttt is not in the data.'

Personal contribution

Protein tool was written in team with: Ivan Kozin wrote functions:

  • length
  • brutto_count
  • is_amino_acid
  • name_transform
  • is_length_divisible_by_3
  • is_amino_acid_three_letter
  • managed work with guthub repository

Dasha Sokolova wrote functions:

  • get_amino_acid_sum
  • codon_optimization functions

Everyting else has been written by Yulia Volkova.

All tools are coded with Python.

About

Collection of bioinformatic tools to work with fastq sequences, DNA and RNA sequences and protein sequences

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages