Hw4 chesnokova #25
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
| ||
# Protein sequence utility | ||
This tool is designed to work with amino acid sequences consisting of the _22 proteinogenic amino acid_ residues (including pyrrolysine and selenocysteine) written in the standard one-letter format. It is not intended to process sequences with post-translational or other amino acid modifications. | ||
|
||
## Usage | ||
Call the `amino_acid_tools` function with an arbitrary number of amino acid sequences (str) followed by the name of the procedure to run (always the last argument, str). The function applies the specified procedure to every given sequence. If one sequence is submitted, its result is returned on its own. If several sequences are submitted, a list of results is returned. | ||
Input sequences can contain both uppercase and lowercase letters, but the last argument must exactly match one of the procedure names listed under Options. | ||
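The calling convention described above (sequences first, procedure name last) can be sketched as follows. This is a cut-down illustration, not the tool itself: `run_tool` and its `procedures` table are hypothetical names, only `show_length` is wired in, and the explicit check for an unknown procedure name is an assumption added here for clarity.

```python
def run_tool(*args):
    """Hypothetical, cut-down dispatcher: sequences first, procedure name last."""
    # Star-unpacking splits the positional arguments into the sequences
    # and the trailing procedure name.
    *seqs, function = args
    procedures = {'show_length': len}  # the real tool registers five procedures
    if function not in procedures:
        # Assumption: fail loudly on an unknown name instead of a bare KeyError.
        raise ValueError(f'Unknown procedure: {function!r}')
    answer = [procedures[function](seq) for seq in seqs]
    # A single sequence yields a bare result, several yield a list.
    if len(answer) == 1:
        return answer[0]
    return answer


print(run_tool('EGVIMSELKLK', 'show_length'))
print(run_tool('EGVIMSELKLK', 'PLPKvelPPDFVD', 'show_length'))
```

The first call returns a single integer and the second a list, which mirrors the single-result versus list behaviour described in the Usage section.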
|
||
### Remark | ||
- if a sequence you pass contains inappropriate characters (anything outside the one-letter amino acid code), that sequence is skipped and the result is returned without it | ||
- the fewer amino acids a sequence contains, the less reliable the 'folding' function is | ||
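The filtering mentioned in the first remark can be sketched as a set comparison. This is a minimal sketch, assuming the 22-letter alphabet used by the tool; `VALID_RESIDUES` and `keep_valid` are hypothetical names chosen for illustration only.

```python
# Hypothetical names; the alphabet is the 22 one-letter codes the tool accepts.
VALID_RESIDUES = frozenset('ARNDCEQGHILKMFPSTWYVUO')


def keep_valid(seqs):
    """Return only the sequences made up entirely of valid one-letter codes."""
    # upper() makes the check case-insensitive; <= is the subset test on sets.
    return [seq for seq in seqs if set(seq.upper()) <= VALID_RESIDUES]


print(keep_valid(['EGVIMSELKLK', 'PLPKvelPPDFVD', 'HELLO WORLD!', 'XYZ']))
```

Sequences containing spaces, punctuation, or letters outside the alphabet (such as `X` or `Z`) are dropped whole rather than cleaned up character by character.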
|
||
## Options | ||
The following options for amino acid sequence processing are currently available: | ||
|
||
- **molecular_weight**: calculates the molecular weight of the amino acid chain in Da, using average amino acid residue masses rounded to 1 or 2 decimal places. | ||
- **three_letter_code**: converts the standard one-letter code to the three-letter code. | ||
- **show_length**: counts the total number of amino acids in the given sequence. | ||
- **folding**: counts the amino acids characteristic of alpha helices and of beta sheets separately and reports which structure predominates. This function has been tested on proteins such as 2M3X, 6DT4 (PDB ID) and MHC, CRP. The obtained results corresponded to reality. | ||
- **seq_charge**: evaluates the overall charge of the amino acid chain in neutral aqueous solution (pH = 7). According to the pKa of the amino acid side chains, lysine, pyrrolysine and arginine contribute +1, while aspartic and glutamic acids contribute -1. The total charge of the protein is reported as positive, negative, or neutral based on the sum of these contributions. | ||
|
||
## Examples | ||
Below is an example of processing an amino acid sequence. | ||
|
||
### Using the function for molecular weight calculation | ||
|
||
```python | ||
amino_acid_tools('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'molecular_weight') | ||
``` | ||
|
||
Input: 'EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'molecular_weight' | ||
Output: '[1228.66, 1447.8400000000001, 1224.6399999999999]' | ||
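The long tails in this output (for example `1447.8400000000001`) are ordinary binary floating-point artifacts from summing decimal masses. A minimal sketch of one way to tidy them for display, assuming two decimal places are enough; `tidy` is a hypothetical helper and not part of the tool.

```python
def tidy(weights, ndigits=2):
    """Round each weight for display; the stored sums are left untouched."""
    return [round(weight, ndigits) for weight in weights]


raw = [1228.66, 1447.8400000000001, 1224.6399999999999]
print(tidy(raw))
```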
|
||
### Using the function to convert one-letter translations to three-letter translations | ||
|
||
```python | ||
amino_acid_tools('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'three_letter_code') | ||
``` | ||
|
||
Input: 'EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'three_letter_code' | ||
Output: '['GluGlyValIleMetSerGluLeuLysLeuLys', 'ProLeuProLysValGluLeuProProAspPheValAsp', 'AspValIleGlyIleSerIleLeuGlyLysGluVal']' | ||
|
||
### Using the function to count the number of amino acids in the given sequence | ||
|
||
```python | ||
amino_acid_tools('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'show_length') | ||
``` | ||
|
||
Input: 'EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'show_length' | ||
Output: '[11, 13, 12]' | ||
|
||
### Using the function to determine the predominant secondary structure | ||
|
||
```python | ||
amino_acid_tools('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'folding') | ||
``` | ||
|
||
Input: 'EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'folding' | ||
Output: '['alfa_helix', 'equally', 'equally']' | ||
|
||
### Using the function to estimate relative charge | ||
|
||
```python | ||
amino_acid_tools('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'seq_charge') | ||
``` | ||
|
||
Input: 'EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV', 'seq_charge' | ||
Output: '['neutral', 'negative', 'negative']' | ||
|
||
|
||
|
||
## Contacts | ||
- [Cesnokova Anna] [email protected] | ||
- [Lukina Maria] | ||
[email protected] | ||
|
||
 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
AMINO_ACIDS = 'ARNDCEQGHILKMFPSTWYVUOarndceqghilkmfpstwyvuo' | ||
SHORT_CODE = list(AMINO_ACIDS) | ||
LONG_CODE = ['Ala', 'Arg', 'Asn', 'Asp', 'Cys', 'Glu', 'Gln', 'Gly', 'His', 'Ile', 'Leu', 'Lys', 'Met', 'Phe', 'Pro', | ||
'Ser', 'Thr', 'Trp', 'Tyr', 'Val', 'Sec', 'Pyl', | ||
'Ala', 'Arg', 'Asn', 'Asp', 'Cys', 'Glu', 'Gln', 'Gly', 'His', 'Ile', 'Leu', 'Lys', 'Met', 'Phe', 'Pro', | ||
'Ser', 'Thr', 'Trp', 'Tyr', 'Val', 'Sec', 'Pyl'] | ||
MASSE = [71.08, 156.2, 114.1, 115.1, 103.1, 129.1, 128.1, 57.05, 137.1, 113.2, 113.2, 128.2, 131.2, 147.2, 97.12, 87.08, | ||
101.1, 186.2, 163.2, 99.13, 168.05, 255.3, | ||
71.08, 156.2, 114.1, 115.1, 103.1, 129.1, 128.1, 57.05, 137.1, 113.2, 113.2, 128.2, 131.2, 147.2, 97.12, 87.08, | ||
101.1, 186.2, 163.2, 99.13, 168.05, 255.3] | ||
|
||
|
||
def molecular_weight(seq: str) -> float: | ||
    """ | ||
    Function calculates molecular weight of the amino acid chain | ||
    Parameters: | ||
        seq (str): each letter refers to one-letter coded proteinogenic amino acids | ||
    Returns: | ||
        (float) Molecular weight of the given amino acid chain in Da | ||
    """ | ||
    d_mass = dict(zip(SHORT_CODE, MASSE)) | ||
Review comment: This looks neat at first, but no. It would be better to store a dictionary in MASS, where the key is the amino acid and the value is its mass. |
||
    m = 0 | ||
    for acid in seq: | ||
        m = m + d_mass[acid] | ||
Review comment: There is such an operator |
||
    return m | ||
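One way to follow the reviewer's suggestion about a mass dictionary is sketched below. It assumes the same mass values as the `MASSE` list and normalizes case on lookup so that lowercase keys are not needed. The names `MASS` and `molecular_weight_sketch` are illustrative and not the author's final code.

```python
# Average masses (Da) copied from the MASSE list, keyed by one-letter code.
MASS = {
    'A': 71.08, 'R': 156.2, 'N': 114.1, 'D': 115.1, 'C': 103.1, 'E': 129.1,
    'Q': 128.1, 'G': 57.05, 'H': 137.1, 'I': 113.2, 'L': 113.2, 'K': 128.2,
    'M': 131.2, 'F': 147.2, 'P': 97.12, 'S': 87.08, 'T': 101.1, 'W': 186.2,
    'Y': 163.2, 'V': 99.13, 'U': 168.05, 'O': 255.3,
}


def molecular_weight_sketch(seq: str) -> float:
    """Sum the per-residue masses; upper() handles lowercase input."""
    return sum(MASS[acid] for acid in seq.upper())


print(round(molecular_weight_sketch('EGVIMSELKLK'), 2))
```

Keeping the letters and masses in one literal removes the risk of the two parallel lists drifting out of step.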
|
||
|
||
def three_letter_code(seq: str) -> str: | ||
    """ | ||
    Function converts the one-letter code to the three-letter code | ||
    Parameters: | ||
        seq (str): each letter refers to one-letter coded proteinogenic amino acids | ||
    Returns: | ||
        (str) sequence translated into the three-letter code | ||
    """ | ||
    d_names = dict(zip(SHORT_CODE, LONG_CODE)) | ||
Review comment: This is neat too, but again a dictionary would be better ))0)0 |
||
    recording = seq.maketrans(d_names) | ||
    return seq.translate(recording) | ||
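The same dictionary idea applies to the three-letter conversion. The sketch below assumes a hypothetical `THREE_LETTER` mapping and uses `str.join` over the uppercased sequence instead of a translation table; it is an illustration, not the code under review.

```python
# Hypothetical mapping from one-letter to three-letter codes.
THREE_LETTER = {
    'A': 'Ala', 'R': 'Arg', 'N': 'Asn', 'D': 'Asp', 'C': 'Cys', 'E': 'Glu',
    'Q': 'Gln', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'L': 'Leu', 'K': 'Lys',
    'M': 'Met', 'F': 'Phe', 'P': 'Pro', 'S': 'Ser', 'T': 'Thr', 'W': 'Trp',
    'Y': 'Tyr', 'V': 'Val', 'U': 'Sec', 'O': 'Pyl',
}


def three_letter_sketch(seq: str) -> str:
    """Translate each residue and concatenate the three-letter codes."""
    return ''.join(THREE_LETTER[acid] for acid in seq.upper())


print(three_letter_sketch('PLPKvelPPDFVD'))
```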
|
||
|
||
def show_length(seq: str) -> int: | ||
Review comment: Why |
||
    """ | ||
    Function counts the number of amino acids in the given sequence | ||
    Parameters: | ||
        seq (str): amino acid sequence | ||
    Returns: | ||
        (int): integer number of amino acid residues | ||
    """ | ||
    return len(seq) | ||
|
||
|
||
def folding(seq: str) -> str: | ||
    """ | ||
    Counts the amino acids characteristic of alpha helices and of beta sheets separately | ||
    and reports which structure predominates in the protein. | ||
    This function has been tested on proteins such as 2M3X, 6DT4 (PDB ID) and MHC, CRP. | ||
    The obtained results corresponded to reality. | ||
    Parameters: | ||
        seq (str): amino acid sequence | ||
    Returns: | ||
        (str): predominant structure ('alfa_helix', 'beta_sheet', 'equally') | ||
    """ | ||
    alfa_helix = ['A', 'E', 'L', 'M', 'G', 'Y', 'S', 'a', 'e', 'l', 'm', 'g', 'y', 's'] | ||
    beta_sheet = ['Y', 'F', 'W', 'T', 'V', 'I', 'y', 'f', 'w', 't', 'v', 'i'] | ||
    alfa_helix_counts = 0 | ||
    beta_sheet_counts = 0 | ||
    for amino_acid in seq: | ||
        if amino_acid in alfa_helix: | ||
            alfa_helix_counts += 1 | ||
        elif amino_acid in beta_sheet: | ||
            beta_sheet_counts += 1 | ||
    if alfa_helix_counts > beta_sheet_counts: | ||
        return 'alfa_helix' | ||
    elif alfa_helix_counts < beta_sheet_counts: | ||
        return 'beta_sheet' | ||
    elif alfa_helix_counts == beta_sheet_counts: | ||
        return 'equally' | ||
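The counting logic of `folding` can also be written with sets and `sum`. This is a sketch under stated assumptions: `ALPHA_FORMERS`, `BETA_FORMERS` and `folding_sketch` are hypothetical names, and `Y` is left out of the beta set because in the function above it appears in both lists and the `if` branch always counts it as an alpha residue first.

```python
# Hypothetical constant names; residue sets follow the lists in folding().
ALPHA_FORMERS = frozenset('AELMGYS')
# 'Y' is omitted here: the original counts it as alpha before reaching the beta check.
BETA_FORMERS = frozenset('FWTVI')


def folding_sketch(seq: str) -> str:
    """Report which residue class predominates in the sequence."""
    seq = seq.upper()
    alpha = sum(acid in ALPHA_FORMERS for acid in seq)
    beta = sum(acid in BETA_FORMERS for acid in seq)
    if alpha > beta:
        return 'alfa_helix'
    if alpha < beta:
        return 'beta_sheet'
    return 'equally'


print([folding_sketch(s) for s in ('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV')])
```

Ending with a plain `return` instead of a third comparison guarantees that the function returns a string on every path.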
|
||
|
||
def seq_charge(seq: str) -> str: | ||
    """ | ||
    Function evaluates the overall charge of the amino acid chain in neutral aqueous solution (pH = 7) | ||
    Parameters: | ||
        seq (str): amino acid sequence of proteinogenic amino acids | ||
    Returns: | ||
        (str): "positive", "negative" or "neutral" | ||
    Function implemented by Anna Chesnokova | ||
    """ | ||
    aminoacid_charge = {'R': 1, 'D': -1, 'E': -1, 'K': 1, 'O': 1, 'r': 1, 'd': -1, 'e': -1, 'k': 1, 'o': 1} | ||
    charge = 0 | ||
    for aminoacid in seq: | ||
        if aminoacid in 'RDEKOrdeko': | ||
Review comment: Better |
||
            charge += aminoacid_charge[aminoacid] | ||
    if charge > 0: | ||
        return 'positive' | ||
    elif charge < 0: | ||
        return 'negative' | ||
    else: | ||
        return 'neutral' | ||
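The charge rule (+1 for R, K, O and -1 for D, E) can be expressed with a single dictionary lookup per residue. A minimal sketch, assuming uncharged residues contribute 0; `CHARGE` and `seq_charge_sketch` are hypothetical names, not the code under review.

```python
# Hypothetical table: side-chain charge at pH 7 for the residues the tool considers.
CHARGE = {'R': 1, 'K': 1, 'O': 1, 'D': -1, 'E': -1}


def seq_charge_sketch(seq: str) -> str:
    """Classify the net charge as 'positive', 'negative' or 'neutral'."""
    # dict.get with a default of 0 skips the separate membership test.
    charge = sum(CHARGE.get(acid, 0) for acid in seq.upper())
    if charge > 0:
        return 'positive'
    if charge < 0:
        return 'negative'
    return 'neutral'


print([seq_charge_sketch(s) for s in ('EGVIMSELKLK', 'PLPKvelPPDFVD', 'DVIGISILGKEV')])
```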
|
||
|
||
def aminoacid_seqs_only(seqs: list) -> list: | ||
    """ | ||
    Keeps only the valid amino acid sequences from those passed to the function. | ||
    Parameters: | ||
        seqs (list): amino acid sequence list | ||
    Returns: | ||
        aminoacid_seqs (list): amino acid sequence list without non-amino-acid sequences | ||
    """ | ||
    aminoacid_seqs = [] | ||
    for seq in seqs: | ||
        unique_chars = set(seq) | ||
        amino_acid = set(AMINO_ACIDS) | ||
        if unique_chars <= amino_acid: | ||
            aminoacid_seqs.append(seq) | ||
    return aminoacid_seqs | ||
|
||
|
||
def amino_acid_tools(*args: str): | ||
    """ | ||
    Performs functions for working with protein sequences. | ||
|
||
    Parameters: | ||
        The function accepts an unlimited number of protein sequences (str) as input, | ||
        the last argument must be the name of the function (str) you want to execute. | ||
        The amino acid sequence can consist of both uppercase and lowercase letters. | ||
    Input example: | ||
        amino_acid_tools('PLPKVEL','VDviRIkLQ','PPDFGKT','folding') | ||
    Function: | ||
        molecular_weight: calculates molecular weight of the amino acid chain | ||
        three_letter_code: converts the one-letter code to the three-letter code | ||
        show_length: counts the number of amino acids in the given sequence | ||
        folding: counts the amino acids characteristic of alpha helices and of beta sheets separately | ||
            and reports which structure predominates in the protein | ||
        seq_charge: evaluates the overall charge of the amino acid chain in neutral aqueous solution (pH = 7) | ||
|
||
    Returns: | ||
        If one sequence is supplied, its result is returned on its own. | ||
        If several are submitted, a list of results is returned. | ||
        Depending on the function performed, the following returns will occur: | ||
            molecular_weight (float) or (list): amino acid sequence molecular weight or list of weights | ||
            three_letter_code (str) or (list): sequence translated from the one-letter into the three-letter code | ||
            show_length (int) or (list): integer number of amino acid residues | ||
            folding (str) or (list): 'alfa_helix', if alpha-helix residues predominate | ||
                                     'beta_sheet', if beta-sheet residues predominate | ||
                                     'equally', if the counts of alpha-helix and beta-sheet residues are the same | ||
            seq_charge (str) or (list): "positive", "negative" or "neutral" | ||
    """ | ||
    *seqs, function = args | ||
    d_of_functions = {'molecular_weight': molecular_weight, | ||
                      'three_letter_code': three_letter_code, | ||
                      'show_length': show_length, | ||
                      'folding': folding, | ||
                      'seq_charge': seq_charge} | ||
    answer = [] | ||
    aminoacid_seqs = aminoacid_seqs_only(seqs) | ||
    for sequence in aminoacid_seqs: | ||
        answer.append(d_of_functions[function](sequence)) | ||
    if len(answer) == 1: | ||
        return answer[0] | ||
    else: | ||
        return answer | ||
Review comment on lines +159 to +162:
if len(answer) == 1:
    return answer[0]
return answer |
Review comment: MASS )))