58 commits
299cd03
Create file protein_tools.py
Sep 28, 2023
e5cae11
Add main function
Sep 28, 2023
8e0ed9b
Add optionds placeholder
Sep 28, 2023
3eede33
Add compare function
zmitserbio Sep 28, 2023
1bf4f43
Add function count_length for length option
Sep 28, 2023
f025570
Correct if else structure for find_pattern function
zmitserbio Sep 28, 2023
6917d98
Add find_pattern and supporting compare_pattern functions
zmitserbio Sep 28, 2023
ad7bf5b
Add function transform_to_DNA_code for plasmid_code
Sep 28, 2023
843eb68
Change the key name plasmid_code to DNA_code
Sep 28, 2023
fa17b0d
Merge pull request #1 from zmitserbio/HW4_Bobkov
GlebBobkov Sep 28, 2023
d2b0584
Add count_percentage function
Olga-Bagrova Sep 29, 2023
b7a67dc
Add rename_three_letter_name function
Olga-Bagrova Sep 29, 2023
441b615
Fix rename_three_letter_name function
Olga-Bagrova Sep 29, 2023
6574d2c
Merge pull request #2 from Olga-Bagrova/HW4_Bobkov
GlebBobkov Sep 29, 2023
1e2fd63
Fix function transform_to_DNA_code
Sep 29, 2023
968a091
Fix function rename_three_letter_name
Sep 29, 2023
fea73ed
Fix 2 annotation and key for function DNA_transform
Sep 29, 2023
4ef3eac
Fix 2 last elif call in main function
Sep 29, 2023
80be85d
Fix key name of persantage
Sep 29, 2023
5f2770d
Add verify function and is_protein and strink_check subfunctions
zmitserbio Sep 30, 2023
c54a7ae
Update README.md
GlebBobkov Sep 30, 2023
627721b
Update README.md
GlebBobkov Sep 30, 2023
fec205a
Update README.md
GlebBobkov Sep 30, 2023
26f6b8b
Merge pull request #3 from zmitserbio/HW4_Bobkov
GlebBobkov Sep 30, 2023
61eed0b
Update README.md
GlebBobkov Sep 30, 2023
5891196
Update README.md
GlebBobkov Sep 30, 2023
4a80750
add folder HW4_Bobkvo with files
Sep 30, 2023
6b78362
folder
Sep 30, 2023
b0f927d
add new folder again and remobe old version
Sep 30, 2023
da10f7b
add files to new fold
Sep 30, 2023
612dceb
finally i delited files_copies
Sep 30, 2023
4cda1c1
Update README.md
GlebBobkov Sep 30, 2023
1aa6619
Update README.md
GlebBobkov Sep 30, 2023
763eb0d
Add files via upload
GlebBobkov Sep 30, 2023
d91ccdd
Update README.md
GlebBobkov Sep 30, 2023
fb74369
Update README.md
GlebBobkov Sep 30, 2023
a15a4c3
Update README.md
GlebBobkov Sep 30, 2023
a730b35
Update README.md
GlebBobkov Sep 30, 2023
3a6fbc8
Change main() to protein_tool()
Sep 30, 2023
859fead
Merge branch 'HW4_Bobkov' of github.com:GlebBobkov/HW4_Bobkov into HW…
Sep 30, 2023
1e5c186
Update README.md
GlebBobkov Sep 30, 2023
12fc927
Update README.md
GlebBobkov Sep 30, 2023
84d8b4b
Update README.md
GlebBobkov Sep 30, 2023
28f2075
Update README.md
GlebBobkov Sep 30, 2023
fdd87f5
Update README.md
GlebBobkov Sep 30, 2023
f0cb458
Update README.md
GlebBobkov Sep 30, 2023
b683dd9
Update README.md
GlebBobkov Sep 30, 2023
04de74f
Update README.md
GlebBobkov Sep 30, 2023
ce992ed
Update README.md
GlebBobkov Sep 30, 2023
06ff93d
Update README.md
GlebBobkov Sep 30, 2023
b6bd43c
Correct minor spelling and wording errors
zmitserbio Sep 30, 2023
0f37939
Upload find_pattern() explanation picture
zmitserbio Sep 30, 2023
4b820d7
Merge branch 'GlebBobkov:HW4_Bobkov' into HW4_Bobkov
zmitserbio Sep 30, 2023
af86f0d
Delete function explanation picture
zmitserbio Sep 30, 2023
d40e3e9
Upload find_pattern() explanation picture
zmitserbio Sep 30, 2023
5f83179
Merge pull request #4 from zmitserbio/HW4_Bobkov
GlebBobkov Sep 30, 2023
ab3d60c
Update README.md
GlebBobkov Sep 30, 2023
015d60e
Merge pull request #5 from GlebBobkov/HW4_Bobkov
GlebBobkov Sep 30, 2023
120 changes: 120 additions & 0 deletions HW4_Bobkov/README.md
@@ -0,0 +1,120 @@
# protein_tools.py
> *Description of how protein_tools.py works:*
This program contains the function `protein_tool`. The `protein_tool` function takes as input an arbitrary number of arguments in the form of amino acid (aa)/protein sequences of type *str*, as well as the name for the procedure to be performed. After this, the function performs the specified action on all provided sequences. Carefully read the rules of usage for each option, because they specify correct ways of entering arguments, as well as the output and the type of data in the output.
### :warning: Attention:
### 1)> The program is case-sensitive.
### 2)> Before using some of the options read 'Procedures description' carefully.
### 3)> If you enter sequences or 'options' incorrectly, the program raises a descriptive error message.

**list of options:**

- 'compare' - Compare amino acids between reference sequence and other sequences;
- 'length' - Count the number of amino acids in protein sequence(s);
- 'percentage' - Count percentage of each amino acid in sequence;
- 'pattern' - Find all non-overlapping instances of a given pattern in sequences;
- '3Letter_name' - Transform into a three-letter amino acids entry;
- 'DNA_code' - Transform protein sequence(s) to DNA sequence(s).


# Procedures description
## compare
### Introduction
The **compare** procedure compares the first amino acid sequence provided with the following ones.
### Inputs
To start using the compare procedure, enter several arguments:
- _an arbitrary number_ of sequences, where the first sequence is a reference to which the following sequences are compared; each argument should be of type 'str'.
- _second-to-last_ argument is the number of decimals to round the number to; type 'int'
- _last_ argument determines whether percentages are returned instead of fractions; type 'bool'
### Outputs
It returns a 'dict' object where:
- *keys* are compared-to sequences (type str)
- *values* are either fractions or percentages (type float).
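
The comparison described above can be sketched roughly as follows. This is only an illustration: the name `identity_fraction` is hypothetical and not part of `protein_tools.py`, and dividing by the length of the compared sequence is an assumption taken from the usage examples.

```python
def identity_fraction(reference: str, other: str, round_dec: int = 3,
                      percentages: bool = False) -> float:
    """Share of positions in `other` that match `reference` (hypothetical helper)."""
    # zip stops at the shorter sequence, so extra reference residues are ignored
    matches = sum(ref_aa == aa for ref_aa, aa in zip(reference, other))
    scale = 100 if percentages else 1
    return round(matches * scale / len(other), round_dec)


print(identity_fraction('LAlLAlwWGPdPA', 'LAlLAl'))          # 1.0
print(identity_fraction('LAlLAlwWGPdPA', 'GPdPA', 3, True))  # 20.0
```

Only the last residue of 'GPdPA' matches the reference at the same position, which gives 1 of 5, or 20%.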
### Usage example
```python
protein_tool('LAlLAlwWGPdPA', 'LAlLAl', 3, False, options = 'compare') # {'LAlLAl': 1.0}
protein_tool('LAlLAlwWGPdPA', 'LAlLAl', 'GPdPA', 3, True, options = 'compare') # {'LAlLAl': 100.0, 'GPdPA': 20.0}
```

## length
### Introduction
The **length** procedure calculates the length of protein sequence(s) (equal to the number of amino acids).
### Inputs
To start using the length procedure, enter one or more protein sequences for which you want to get a summary, and at the end add `options = 'length'`.
### Outputs
The result of the procedure is a list with the number of amino acids in each sequence, in the order the sequences were provided.
### Usage example
```python
protein_tool('LAlLAlwWGPdPA', options = 'length') # [13]
protein_tool('RRRrrrR', 'WGPdPA', 'LAlLAlw', options = 'length') # [7, 6, 7]
```

## percentage
### Introduction
The **percentage** procedure calculates the percentage of each of the 20 proteinogenic amino acid residues present in the protein sequence; upper- and lower-case letters are counted separately.
### Inputs
To start using the percentage procedure, enter one or more protein sequences for which you want to get a summary, and at the end add `options = 'percentage'`.
### Outputs
The result of the procedure is a list of dictionaries with the percentages of the corresponding amino acids in each sequence. The dictionary contains only amino acid residues whose percentage in the sequence is not equal to 0 (which are contained in the sequence at all). Also, the dictionary is ordered from the largest percentage of content to the smallest. Cases of amino acid residues are taken into account.
> :warning: Attention: We use rounding to 2 decimal places. In some cases, **the sum of percentages** of all amino acid residues for sequence **may not be exactly 100%** due to rounding.
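
The rounding effect mentioned above can be reproduced with a short sketch. It uses `collections.Counter` rather than the project's own `count_percentage` function, and the helper name `residue_percentages` is hypothetical.

```python
from collections import Counter


def residue_percentages(seq: str) -> dict:
    """Percentage of each residue, largest first (hypothetical helper)."""
    counts = Counter(seq)
    # most_common() yields residues ordered from the highest count to the lowest
    return {aa: round(n * 100 / len(seq), 2) for aa, n in counts.most_common()}


result = residue_percentages('LAlLAlwWGPdPA')
print(result['A'])                     # 23.08
print(round(sum(result.values()), 2))  # 99.98
```

Each of the eight percentages is rounded separately, so here the total comes to 99.98 rather than 100.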
### Usage example
```python
protein_tool('LAlLAlwWGPdPA', options = 'percentage') # [{'A': 23.08, 'L': 15.38, 'l': 15.38, 'P': 15.38, 'w': 7.69, 'W': 7.69, 'G': 7.69, 'd': 7.69}]
protein_tool('RRRrrrR', 'WGPdPA', 'LAlLAlw', options = 'percentage') # [{'R': 57.14, 'r': 42.86}, {'P': 33.33, 'W': 16.67, 'G': 16.67, 'd': 16.67, 'A': 16.67}, {'L': 28.57, 'A': 28.57, 'l': 28.57, 'w': 14.29}]
```

## pattern
### Introduction
The **pattern** procedure finds all non-overlapping occurrences of a given pattern in the amino acid sequence(s) provided.
### Inputs
To start using the pattern procedure, enter two or more protein sequences: the first sequence is the pattern, which is searched for in the following sequences. Each argument should be of type 'str'. At the end add `options = 'pattern'`.
The *find_pattern()* function goes through a sequence in the following way: starting at the current index, it takes a subsequence equal in length to the pattern and compares it to the pattern. If there is no match, the index moves one amino acid towards the end of the sequence. If there is a match, the index is saved and the function jumps to the amino acid right after the end of the matched subsequence; then the algorithm repeats. The comparison is performed by the *compare_pattern* subfunction.
The image below illustrates this procedure.
![The image explanation of that function **pattern**](https://github.com/GlebBobkov/HW4_Bobkov/raw/HW4_Bobkov/HW4_Bobkov/explanation.jpg)
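
The search steps described above can also be sketched for a single sequence. This is a minimal illustration, not the project's code: the helper name `find_non_overlapping` is hypothetical, and it uses a plain slice comparison in place of *compare_pattern*.

```python
def find_non_overlapping(sequence: str, pattern: str) -> list:
    """Return [count, index, index, ...] for non-overlapping matches (hypothetical helper)."""
    if not pattern:
        raise ValueError('pattern must not be empty')
    indexes = []
    i = 0
    # Stop once the remaining fragment is shorter than the pattern
    while i + len(pattern) <= len(sequence):
        if sequence[i:i + len(pattern)] == pattern:
            indexes.append(i)
            i += len(pattern)  # jump past the match so matches cannot overlap
        else:
            i += 1             # no match: move one residue forward
    return [len(indexes)] + indexes


print(find_non_overlapping('LAlLAlwWGPdPA', 'LAl'))  # [2, 0, 3]
print(find_non_overlapping('AAAA', 'AA'))            # [2, 0, 2]
```

The second call shows the non-overlapping rule: 'AA' also occurs at index 1, but that occurrence overlaps the match at index 0 and is skipped.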

### Outputs
The result of this procedure is a 'dict' object where:
- *keys* are amino acid sequences (type 'str')
- _values_ are lists where the first element is the number of pattern instances in a given sequence, and the following elements are the indexes of these occurrences
### Usage example
```python
protein_tool('LAlLAlwWGPdPA', 'LAlLAl', 'GPdPA', options = 'pattern') # {'LAlLAl': [2, 0, 3], 'GPdPA': [0]}
protein_tool('LAlLAlwWGPdPA', 'AlLAl', options = 'pattern') # {'AlLAl': [1, 2]}
```

## 3Letter_name
### Introduction
The **3Letter_name** procedure transforms one-letter amino acid entry sequences to three-letter amino acid sequences, separated by a specified separator. It is a case-sensitive procedure.
### Inputs
To start using the 3Letter_name procedure, enter one or more protein sequences for which you want to get three-letter sequences. After the protein sequences, pass the separator (type 'str'). Finally, specify `options = '3Letter_name'`.
### Outputs
The result of the procedure is a list of three-letter sequences. Each amino acid is separated by the specified separator. The case of the three-letter amino acid coincides with the case of the one-letter designation at the input.
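
The case-preserving conversion described above might look like the sketch below. It is an illustration only: `to_three_letter` is a hypothetical name, and the lookup table is deliberately partial (five residues) to keep the example short.

```python
# Partial table for illustration; a full table would cover all 20 residues
THREE_LETTER = {'G': 'GLY', 'P': 'PRO', 'A': 'ALA', 'W': 'TRP', 'D': 'ASP'}


def to_three_letter(seq: str, sep: str = '') -> str:
    """Convert a one-letter sequence to three-letter codes (hypothetical helper)."""
    names = []
    for aa in seq:
        name = THREE_LETTER[aa.upper()]
        # Keep the case of the one-letter code in the three-letter code
        names.append(name if aa.isupper() else name.lower())
    # str.join puts the separator only between residues, whatever its length
    return sep.join(names)


print(to_three_letter('wWGPdPA', '-'))  # trp-TRP-GLY-PRO-asp-PRO-ALA
print(to_three_letter('GP', ', '))      # GLY, PRO
```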
### Usage example
```python
protein_tool('wWGPdPA', '', options = '3Letter_name') # ['trpTRPGLYPROaspPROALA']
protein_tool('LAlLAlwWGPdPA', '-', options = '3Letter_name') # ['LEU-ALA-leu-LEU-ALA-leu-trp-TRP-GLY-PRO-asp-PRO-ALA']
protein_tool('qwerty', 'G', options = '3Letter_name') # ['glnGtrpGgluGargGthrGtyr']
```

## DNA_code
### Introduction
The **DNA_code** procedure transforms a protein sequence into a DNA sequence that may encode it (this can be used in genetic engineering).
P.S. The codons were chosen at the discretion of the tool authors.
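
Because most amino acids are encoded by several codons, the conversion has to pick one codon per residue. A minimal sketch of the idea follows; `back_translate` is a hypothetical name, and the table is a five-residue subset of the codons used by the tool.

```python
# One chosen codon per residue; subset shown for illustration only
CODONS = {'L': 'TTA', 'A': 'GCA', 'W': 'TGG', 'G': 'GGG', 'P': 'CCC'}


def back_translate(protein: str) -> str:
    """Build a DNA sequence that may encode the protein (hypothetical helper)."""
    dna = []
    for aa in protein:
        codon = CODONS[aa.upper()]
        # Lower-case residues produce lower-case codons
        dna.append(codon if aa.isupper() else codon.lower())
    return ''.join(dna)


print(back_translate('LAlLAlw'))  # TTAGCAttaTTAGCAttatgg
```

The result is always three times as long as the input, since every residue maps to exactly one codon.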
### Inputs
To start using the DNA_code procedure, enter one or more protein sequences for which you want to get a summary, and at the end add `options = 'DNA_code'`.
### Outputs
The result of the procedure is a list of 'str' elements: the nucleotide sequences that correspond to the provided amino acid sequences.
### Usage example
```python
protein_tool('LAlLAlwWGPdPA', options = 'DNA_code') # ['TTAGCAttaTTAGCAttatggTGGGGGCCCgatCCCGCA']
protein_tool('RRRrrrR', 'WGPdPA', 'LAlLAlw', options = 'DNA_code') # ['CGACGACGAcgacgacgaCGA', 'TGGGGGCCCgatCCCGCA', 'TTAGCAttaTTAGCAttatgg']
```

# Contacts
[Gleb Bobkov](https://github.com/GlebBobkov): teamlead, count_length and transform_to_DNA_code functions;
[Dmitry Matach](https://github.com/zmitserbio): compare, find_pattern functions, is_protein, string_check, verify;
[Olga Bagrova](https://github.com/Olga-Bagrova): count_percentage and rename_three_letter_name functions.


![OUR COMMON PHOTO FROM THE GOOGLE MEET ](https://github.com/GlebBobkov/HW4_Bobkov/raw/HW4_Bobkov/HW4_Bobkov/photo_2023-09-28_23-38-46.jpg)

Binary file added HW4_Bobkov/explanation.jpg
Binary file added HW4_Bobkov/photo_2023-09-28_23-38-46.jpg
220 changes: 220 additions & 0 deletions HW4_Bobkov/protein_tools.py
@@ -0,0 +1,220 @@
def compare(sequences: list, round_dec=3, percentages=False) -> dict:
    """
    Compare amino acids between a reference sequence and other sequences
    arguments:
    - sequences (list): reference sequence and other sequences for comparison
    - round_dec (int): a number of decimals to round the number to
    - percentages (bool): whether percentages are returned instead of fractions
    return:
    - comparisons (dict): dictionary with compared sequences as keys and percentages/fractions as their values
    """
    reference = sequences[0]
    comparisons = {}
    for seq in sequences[1:]:
        # Count positions where the compared sequence matches the reference
        matches = sum(reference[j] == seq[j] for j in range(len(seq)))
        if percentages:
            comparisons[seq] = round(matches * 100 / len(seq), round_dec)
        else:
            comparisons[seq] = round(matches / len(seq), round_dec)
    return comparisons


def count_length(protein: str) -> int:
    """
    Count the length of an amino acid sequence/protein in amino acids
    :param protein: sequence of protein
    :return: number of amino acids in an amino acid sequence/protein
    """
    return len(protein)


def count_percentage(seq: str) -> dict:
    """
    Count percentage of each amino acid in sequence
    arguments:
    - seq (str): sequence for counting
    return:
    - dict: dictionary with counted percentage
    """
    length = count_length(seq)
    counts = {}
    for aa in seq:
        counts[aa] = counts.get(aa, 0) + 1
    # Convert counts to percentages rounded to 2 decimal places
    res = {aa: round(count / length * 100, 2) for aa, count in counts.items()}
    # Order from the largest percentage to the smallest
    return dict(sorted(res.items(), key=lambda item: item[1], reverse=True))


def compare_pattern(sequence: str, pattern: str) -> bool:
    """
    Compare a given pattern to a fragment of sequence of the same length
    arguments:
    - sequence (str): sequence fragment to compare with the pattern
    - pattern (str): pattern for comparison
    return:
    - (bool): whether pattern and fragment match
    """
    # A fragment shorter than the pattern (at the end of a sequence) cannot match
    if len(sequence) != len(pattern):
        return False
    for i in range(len(sequence)):
        if sequence[i] != pattern[i]:
            return False
    return True

def find_pattern(sequences: list, pattern: str) -> dict:
    """
    Find all non-overlapping instances of a given pattern in sequences
    arguments:
    - sequences (list): sequences to find the pattern in
    - pattern (str): pattern in question
    return:
    - finds (dict): dictionary with sequences as keys and lists of indexes of patterns and the number of patterns as values
    """
    finds = {}
    step = max(len(pattern), 1)  # always advance, even for an empty pattern
    for seq in sequences:
        find = []
        i = 0
        # A while loop is used because reassigning the index inside a for loop has no effect
        while i + len(pattern) <= len(seq):
            if compare_pattern(seq[i:i + len(pattern)], pattern):
                find.append(i)
                i += step  # jump past the match so that matches do not overlap
            else:
                i += 1
        finds[seq] = [len(find)] + find
    return finds


def transform_to_DNA_code(protein):
    """
    Transform an amino acid sequence/protein to a DNA sequence
    :param protein: amino acid sequence of protein
    :return: sequence of protein in the DNA sequence form
    """
    retranslation_dict = {
        'F': 'TTC', 'f': 'ttc',
        'L': 'TTA', 'l': 'tta',
        'S': 'TCG', 's': 'tcg',
        'Y': 'TAC', 'y': 'tac',
        'C': 'TGC', 'c': 'tgc',
        'W': 'TGG', 'w': 'tgg',
        'P': 'CCC', 'p': 'ccc',
        'H': 'CAT', 'h': 'cat',
        'Q': 'CAA', 'q': 'caa',  # glutamine; GAA would encode glutamate
        'R': 'CGA', 'r': 'cga',
        'I': 'ATT', 'i': 'att',
        'M': 'ATG', 'm': 'atg',
        'T': 'ACC', 't': 'acc',
        'N': 'AAT', 'n': 'aat',
        'K': 'AAA', 'k': 'aaa',
        'V': 'GTT', 'v': 'gtt',
        'A': 'GCA', 'a': 'gca',
        'D': 'GAT', 'd': 'gat',  # aspartate; gca would encode alanine
        'E': 'GAG', 'e': 'gag',
        'G': 'GGG', 'g': 'ggg'
    }

    return ''.join([retranslation_dict[i] for i in protein])


def rename_three_letter_name(seqs: list, sep='') -> list:
    """
    Transform into a three-letter amino acids entry.
    arguments:
    - seqs (list): list of sequences to transform into three-letter entries
    - sep (str): separator between amino acids, default = ''
    return:
    - list: transformed sequences with separators
    """
    res = []
    threel = {'A': 'ALA', 'R': 'ARG', 'N': 'ASN', 'D': 'ASP', 'V': 'VAL',
              'H': 'HIS', 'G': 'GLY', 'Q': 'GLN', 'E': 'GLU', 'I': 'ILE',
              'L': 'LEU', 'K': 'LYS', 'M': 'MET', 'P': 'PRO', 'S': 'SER',
              'Y': 'TYR', 'T': 'THR', 'W': 'TRP', 'F': 'PHE', 'C': 'CYS',
              'a': 'ala', 'r': 'arg', 'n': 'asn', 'd': 'asp', 'v': 'val',
              'h': 'his', 'g': 'gly', 'q': 'gln', 'e': 'glu', 'i': 'ile',
              'l': 'leu', 'k': 'lys', 'm': 'met', 'p': 'pro', 's': 'ser',
              'y': 'tyr', 't': 'thr', 'w': 'trp', 'f': 'phe', 'c': 'cys'}
    for seq in seqs:
        # join places the separator only between residues, for a separator of any length
        res.append(sep.join(threel[aa] for aa in seq))
    return res

def is_protein(seq):
    """
    Check whether a sequence is a protein sequence
    """
    aminoacids = ['F', 'f', 'L', 'l', 'S', 's', 'Y', 'y', 'C', 'c', 'W', 'w', 'P', 'p', 'H', 'h', 'Q', 'q', 'R', 'r', 'I', 'i', 'M', 'm', 'T', 't', 'N', 'n', 'K', 'k', 'V', 'v', 'A', 'a', 'D', 'd', 'E', 'e', 'G', 'g']
    for i in seq:
        if i not in aminoacids:
            raise ValueError('Incorrect input: protein sequences containing 20 common aminoacids in one-letter format were expected. Please try again')

def string_check(sequences):
    """
    Check whether each sequence is of type str and is a protein sequence
    """
    for seq in sequences:
        if type(seq) != str:
            raise ValueError('Incorrect input type: protein sequences of type str were expected. Please try again')
        is_protein(seq)

def verify(sequences, options):
    """
    Argument verification for all options
    """
    if options == 'length' or options == 'percentage' or options == 'DNA_code':
        string_check(sequences)
    elif options == '3Letter_name':
        # The last argument is the separator, not a protein sequence
        string_check(sequences[:-1])
    elif options == 'compare':
        # The last two arguments are the rounding precision and the percentage flag
        string_check(sequences[:-2])
        for seq in sequences[:-2]:
            if len(seq) != len(sequences[0]):
                raise ValueError('Incorrect input: same length protein sequences were expected. Please try again')
        if type(sequences[-2]) != int or sequences[-2] < 0:
            raise ValueError('Incorrect input type: non-negative integer value was expected as the second-to-last argument. Please try again')
        if type(sequences[-1]) != bool:
            raise ValueError('Incorrect input type: bool value was expected as the last argument. Please try again')
    elif options == 'pattern':
        string_check(sequences)
        # The first argument is the pattern; it must fit into every sequence
        for i in range(1, len(sequences)):
            if len(sequences[0]) > len(sequences[i]):
                raise ValueError('Incorrect input: pattern length shorter or equal to protein sequence length was expected. Please try again')

def protein_tool(*proteins, options=None):
    """
    Run the procedure named in options on the provided sequences
    """
    proteins = list(proteins)
    verify(proteins, options)
    operations = {
        'compare': compare,
        'length': count_length,
        'percentage': count_percentage,
        'pattern': find_pattern,
        '3Letter_name': rename_three_letter_name,
        'DNA_code': transform_to_DNA_code
    }

    if options == 'compare':
        # Last two arguments: rounding precision and percentage flag
        return operations[options](proteins[:-2], proteins[-2], proteins[-1])
    elif options == 'pattern':
        # First argument is the pattern, the rest are sequences to search
        return operations[options](proteins[1:], proteins[0])
    elif options == '3Letter_name':
        # Last argument is the separator
        return operations[options](proteins[:-1], proteins[-1])
    elif options == 'length' or options == 'percentage' or options == 'DNA_code':
        return [operations[options](protein) for protein in proteins]
    else:
        raise ValueError('Incorrect options input, please try again')