Hw4_Grigoriants #19

Open
wants to merge 46 commits into
base: main
46 commits
f4861e7
Add folder HW4_Grigoriants, create README.md
VovaGrig Sep 27, 2023
c3b919c
Add protein_tools.py with run_protein_tools and check_for_motif funct…
VovaGrig Sep 29, 2023
bc24a41
Add 'search_for_alt_frames' function
EkaterinShitik Sep 29, 2023
f81d442
Add 'convert_to_nucl_acids' function
EkaterinShitik Sep 29, 2023
cbeb58a
Add conditions in 'main' function
EkaterinShitik Sep 29, 2023
1cd3287
Merge pull request #1 from EkaterinShitik/HW4_Grigoriants
VovaGrig Sep 29, 2023
d91cfd4
Add minor fix to protein_tools.py
VovaGrig Sep 29, 2023
7f54bec
Merge branch 'HW4_Grigoriants' of github.com:VovaGrig/HW4_Functions2 …
VovaGrig Sep 29, 2023
39b8acd
Add check_and_parse_user_input in protein_tools.py, add fixes
VovaGrig Sep 29, 2023
29fd752
Add minor fixes in protein_tools.py
VovaGrig Sep 29, 2023
de4e146
Add check_and_parse_user_input in protein_tools.py, add fixes
VovaGrig Sep 29, 2023
620a551
Add three_one_letter_code and define_molecular_weight functions and f…
Sep 29, 2023
c641b5e
Merge pull request #2 from vladislavi27/HW4_Grigoriants
VovaGrig Sep 30, 2023
93d2d5f
Add minor fixes in protein_tools.py
VovaGrig Sep 30, 2023
ac9a165
Merge branch 'HW4_Grigoriants' of github.com:VovaGrig/HW4_Functions2 …
VovaGrig Sep 30, 2023
d731697
Add minor fixes in protein_tools.py
VovaGrig Sep 30, 2023
e670429
Add minor changes to 'convert_to_nucl_acids' function
EkaterinShitik Sep 30, 2023
fe41d85
Change transcription rule in 'convert_to_nucl_acids' function
EkaterinShitik Sep 30, 2023
c8e9823
Correct inaccuracies in the dockstring of 'convert_to_nucl_acids'
EkaterinShitik Sep 30, 2023
cb03cf4
Change inaccuracies in the dockstring of 'convert_to_nucl_acids'
EkaterinShitik Sep 30, 2023
b193a6b
Change annotation of 'search_for_alt_frames' function
EkaterinShitik Sep 30, 2023
f53914a
Add minor fixes in protein_tools.py
VovaGrig Sep 30, 2023
2ce8ada
Add plan of README.md
EkaterinShitik Sep 30, 2023
18c1a76
Complete 'Usage'
EkaterinShitik Sep 30, 2023
ea3be7e
Add preliminary 'Options'
EkaterinShitik Sep 30, 2023
a1c1c23
Add preliminary 'Examples'
EkaterinShitik Sep 30, 2023
6a4e2b1
Merge branch 'VovaGrig:HW4_Grigoriants' into HW4_Grigoriants
EkaterinShitik Sep 30, 2023
454d703
Complete 'Examples'
EkaterinShitik Sep 30, 2023
e5628a5
Complete four first parts
EkaterinShitik Sep 30, 2023
53a7556
Complete all parts except for contacts
EkaterinShitik Sep 30, 2023
33744ad
Complete all parts
EkaterinShitik Sep 30, 2023
0fbb184
Add minor changes in 'Options'
EkaterinShitik Sep 30, 2023
d9bdb50
Add dockstrings to main function, search_for_motifs function, add min…
VovaGrig Sep 30, 2023
78fc1e0
Add docstrings to three_one_letter_code and define_molecular_weight f…
Sep 30, 2023
fdf4b60
Add minor fixes
VovaGrig Sep 30, 2023
5b30b20
Merge branch 'HW4_Grigoriants' into HW4_Grigoriants
VovaGrig Sep 30, 2023
bb79bb0
Merge pull request #7 from vladislavi27/HW4_Grigoriants
VovaGrig Sep 30, 2023
39db203
Merge branch 'HW4_Grigoriants' into HW4_Grigoriants
VovaGrig Sep 30, 2023
20f86fe
Merge pull request #6 from EkaterinShitik/HW4_Grigoriants
VovaGrig Sep 30, 2023
6794624
Add mifixes to docstrings
VovaGrig Sep 30, 2023
d3b21d1
Add mminor fixes
VovaGrig Sep 30, 2023
7412e71
Update README.md: add information, pictures, team photo
VovaGrig Oct 1, 2023
4d23561
Update README.md
VovaGrig Oct 1, 2023
dd6f4a6
Update README.md
VovaGrig Oct 1, 2023
a3bec1b
Add fixes based on feedback to dictionaries.py and protein_tools.py
VovaGrig Oct 14, 2023
0416e82
Merge branch 'HW4_Grigoriants' of github.com:VovaGrig/HW4_Functions2 …
VovaGrig Oct 14, 2023
223 changes: 223 additions & 0 deletions HW4_Grigoriants/README.md
@@ -0,0 +1,223 @@
# Protein_tools.py
## A tool to work with protein sequences

*Proteins* are under the constant focus of scientists. While there is an enormous number of tools for working with nucleotide sequences, comparable tools for proteins are much rarer.


`protein_tools.py` is an open-source program that facilitates working with protein sequences.

## Usage
The program is built around the `run_protein_tools` function, which takes a list of **one-letter amino acid sequences**, the name of a procedure and any arguments that procedure requires. If your sequences use the three-letter code, convert them with the `three_one_letter_code` procedure before running any other procedure on them.

To start working with the program, run the following command:

`run_protein_tools(sequences, procedure="procedure", ...)`

Where:
- sequences - positional argument, a list of protein sequences
- procedure - keyword argument, the name of the procedure to run, passed as a *string*
- ... - additional keyword arguments required by the chosen procedure (see *Options* for their types)

Before you start, check the *Options* and *Examples*.
Comment on lines +9 to +21

Great!

## Options

The program has five procedures. For more information, please see the provided docstrings:

`three_one_letter_code`

![image](https://drive.google.com/uc?export=view&id=1eACjU_CXFbqeu1iW3ekwcg81n-X3WvTG)

- The main aim - to convert three-letter amino acid sequences to one-letter ones and vice versa
- For three-to-one conversion, the names of amino acids **must be separated with hyphens**
- Additional arguments: none
```
"""
Convert protein sequences from one-letter to three-letter format and vice versa

Case 1: get three-letter sequence\n
Use one-letter amino acid sequences of any letter case

Case 2: get one-letter sequence\n
Use three-letter amino acid sequences separated by "-".
Please note that sequences without "-" are parsed as one-letter code sequences\n
Example: for sequence "Ala" function will return "Ala-leu-ala"

Arguments:
- sequences (tuple[str] or list[str]): protein sequences to convert\n
Example: ["WAG", "MkqRe", "msrlk", "Met-Ala-Gly", "Met-arg-asn-Trp-Ala-Gly", "arg-asn-trp"]

Return:
- list: one-letter/three-letter protein sequences\n
Example: ["Trp-Ala-Gly", "Met-lys-gln-Arg-glu", "met-ser-arg-leu-lys", "MAG", "MrnWAG", "rnw"]
"""
```
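
As an illustration of the rules in the docstring above, here is a minimal sketch of the conversion for a single sequence. The helper names (`to_three_letter`, `to_one_letter`, `convert_sequence`) are hypothetical and are not taken from `protein_tools.py`. The letter-case handling is an assumption inferred from the *Examples* section, where `'mNY'` becomes `'met-Asn-Tyr'`.

```python
# One-letter to three-letter codes, same pairs as AMINO_ACIDS in dictionaries.py
AMINO_ACIDS = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY",
    "Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr".split(),
))
# Reverse lookup keyed by lower-case three-letter code
THREE_TO_ONE = {three.lower(): one for one, three in AMINO_ACIDS.items()}


def to_three_letter(sequence):
    """Convert a one-letter sequence, keeping the case of each residue."""
    parts = []
    for aa in sequence:
        three = AMINO_ACIDS[aa.upper()]
        # Assumption: a lower-case letter maps to a fully lower-case code
        parts.append(three if aa.isupper() else three.lower())
    return "-".join(parts)


def to_one_letter(sequence):
    """Convert a hyphen-separated three-letter sequence to one-letter code."""
    letters = []
    for three in sequence.split("-"):
        one = THREE_TO_ONE[three.lower()]
        # Assumption: the case of the first character decides the output case
        letters.append(one if three[0].isupper() else one.lower())
    return "".join(letters)


def convert_sequence(sequence):
    """Sequences without "-" are treated as one-letter code."""
    return to_one_letter(sequence) if "-" in sequence else to_three_letter(sequence)


print(convert_sequence("met-Asn-Tyr"))  # mNY
print(convert_sequence("Ala"))          # Ala-leu-ala
```

The second call shows why the hyphen matters: `"Ala"` has no `-`, so it is read as the three one-letter residues `A`, `l` and `a`.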

`define_molecular_weight`

![image](https://drive.google.com/uc?export=view&id=1i9_4ys64XsAxnw-08zbgyBQnGzJoGJfr)

- The main aim - to determine the exact molecular weight of protein sequences
- Additional arguments: none
```
"""
Calculate the molecular weight of protein sequences

Use one-letter amino acid sequences of any letter case
The molecular weight is:
- a sum of masses of each atom constituting a molecule
- expressed in units called daltons (Da)
- rounded to hundredths

Arguments:
- sequences (tuple[str] or list[str]): protein sequences to weigh

Return:
- dictionary: protein sequences as keys and molecular masses as values\n
Example: {"WAG": 332.39, "MkqRe": 690.88, "msrlk": 633.86}
"""
```
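
The weights in the example above are consistent with summing the per-residue values from `AMINO_ACID_WEIGHTS` in `dictionaries.py` and subtracting one water molecule (taken as 18 Da) for every peptide bond. The sketch below reproduces that arithmetic. The function name `molecular_weight` and the constant `WATER_WEIGHT` are assumptions made for illustration, not the actual implementation.

```python
# Free amino acid weights in Da, copied from AMINO_ACID_WEIGHTS in dictionaries.py
AMINO_ACID_WEIGHTS = {
    "A": 89.09, "C": 121.16, "D": 133.10, "E": 147.13, "F": 165.19,
    "G": 75.07, "H": 155.16, "I": 131.17, "K": 146.19, "L": 131.17,
    "M": 149.21, "N": 132.12, "P": 115.13, "Q": 146.15, "R": 174.20,
    "S": 105.09, "T": 119.12, "V": 117.15, "W": 204.23, "Y": 181.19,
}
# Assumption: each peptide bond releases one water molecule of 18 Da
WATER_WEIGHT = 18.0


def molecular_weight(sequence):
    """Return the weight of a one-letter protein sequence, rounded to hundredths."""
    if not sequence:
        raise ValueError("Empty sequence")
    # Letter case is ignored, as stated in the docstring
    total = sum(AMINO_ACID_WEIGHTS[aa] for aa in sequence.upper())
    # A chain of n residues has n - 1 peptide bonds
    total -= WATER_WEIGHT * (len(sequence) - 1)
    return round(total, 2)


print(molecular_weight("WAG"))  # 332.39
```

For `"WAG"` the residue weights add up to 204.23 + 89.09 + 75.07 = 368.39 Da, and two peptide bonds remove 36 Da, giving 332.39 Da.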

`search_for_motifs`

![image](https://drive.google.com/uc?export=view&id=1_bVKRn4RblrfukIxoQc0NZ_FXaJliGAH)

- The main aim - to search for the motif of interest in protein sequences
- Additional arguments: motif (*str*), overlapping (*bool*)
```
"""
Search for motifs - conserved amino acid residues in protein sequences

Search for one motif at a time\n
Search is letter case sensitive\n
Use one-letter amino acid code for desired sequences and motifs\n
Positions of AA in sequences are counted from 0\n
By default, overlapping matches are counted

Arguments:
- sequences (tuple[str] or list[str]): sequences to check for given motif within\n
Example: sequences = ["AMGAGW", "GAWSGRAGA"]
- motif (str): desired motif to check for presence in every given sequence\n
Example: motif = "GA"
- overlapping (bool): count (True) or skip (False) overlapping matches. (Optional)\n
Example: overlapping = False
Return:
- dictionary: sequences (str) as keys, starting positions of the motif (list) as values\n
Example: {"AMGAGW": [2], "GAWSGRAGA": [0, 7]}
"""
```
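
To make the difference between overlapping and non-overlapping matches concrete, here is a minimal sketch of a case-sensitive, zero-based motif search for one sequence. The function name `find_motif_positions` is hypothetical, and the loop over `str.find` is one possible approach, not necessarily the one used in `protein_tools.py`.

```python
def find_motif_positions(sequence, motif, overlapping=True):
    """Return zero-based start positions of motif in sequence (case sensitive)."""
    if not motif:
        raise ValueError("Please provide desired motif")
    positions = []
    start = 0
    while True:
        index = sequence.find(motif, start)
        if index == -1:
            break
        positions.append(index)
        # Overlapping search resumes one residue later,
        # non-overlapping search resumes after the whole match
        start = index + 1 if overlapping else index + len(motif)
    return positions


print(find_motif_positions("GAWSGRAGA", "GA"))                 # [0, 7]
print(find_motif_positions("AAAA", "AA"))                      # [0, 1, 2]
print(find_motif_positions("AAAA", "AA", overlapping=False))   # [0, 2]
```

The last two calls differ only in the `overlapping` flag: with overlapping matches the search window moves by one residue, without them it jumps past each match.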
`search_for_alt_frames`

![image](https://drive.google.com/uc?export=view&id=1AdXnkRDIRiC_5yiiI2qiAMSMWbZf1RIm)

- The main aim - to look for alternative frames that start with methionine or other non-canonical start amino acids
- Ignores the last three amino acids due to the insignificance of alternative frames of this length
- An additional argument: alt_start_aa (*str*)
- Use alt_start_aa **only for non-canonical start amino acids**
- Without alt_start_aa the procedure finds alternative frames that start with methionine
```
"""
Search for alternative frames in protein sequences

Search is not letter case sensitive\n
Without an alt_start_aa argument, search for frames that start with methionine ("M")
To search for frames with an alternative start amino acid, add the alt_start_aa argument\n
In the alt_start_aa argument use one-letter code

The function ignores the last three amino acids in sequences

Arguments:
- sequences (tuple[str] or list[str]): sequences to check
- alt_start_aa (str): the one-letter code of the alternative start amino acid (Optional)\n
Example: alt_start_aa = "I"

Return:
- dictionary: sequences (str) as keys, collections of alternative frames (list) as values
"""
```
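
One way to read the rules above is sketched below for a single sequence. The function name `find_alt_frames` is hypothetical. Skipping the first residue (so the full sequence is not reported as its own alternative frame) and the last three residues is an assumption inferred from the README examples, not a copy of the actual code.

```python
def find_alt_frames(sequence, alt_start_aa="M"):
    """Return suffixes of sequence that begin with the start amino acid."""
    if len(alt_start_aa) != 1:
        raise ValueError("Invalid start AA")
    frames = []
    # Assumption: position 0 and the last three positions are not checked
    for index in range(1, len(sequence) - 3):
        # The comparison ignores letter case
        if sequence[index].upper() == alt_start_aa.upper():
            frames.append(sequence[index:])
    return frames


print(find_alt_frames("mNYQTMSPYYDMId"))    # ['MSPYYDMId']
print(find_alt_frames("mNYTQTSP", "T"))     # ['TQTSP']
```

In the first call the methionine at position 11 is not reported because it falls within the last three residues, which is what the README example shows.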
`convert_to_nucl_acids`

![image](https://drive.google.com/uc?export=view&id=1_pZJ0Gc-EVcR1zddpDW4Ok3w8t65fW_z)

- The main aim - to convert protein sequences to DNA, RNA or both nucleic acid sequences
- The program uses the most frequent codons in human, which can be found [here](https://www.genscript.com/tools/codon-frequency-table)
- An additional argument: nucl_acids (*str*)
- Use only DNA, RNA or both as nucl_acids (for more details, check *Examples*)
```
"""
Convert protein sequences to RNA or DNA sequences.

Use the most frequent codons in human. The source - https://www.genscript.com/tools/codon-frequency-table\n
All nucleic acids (DNA and RNA) are shown in 5'-3' direction

Arguments:
- sequences (tuple[str] or list[str]): sequences to convert
- nucl_acids (str): the nucleic acid that is preferred\n
Example: nucl_acids = "RNA" - convert to RNA\n
nucl_acids = "DNA" - convert to DNA\n
nucl_acids = "both" - convert to RNA and DNA
Return:
- dictionary: nucleic acids (str) as keys, collection of sequences (list) as values
"""
```
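
The RNA shown in the README examples matches a codon-by-codon lookup in `TRANSLATION_RULE` from `dictionaries.py`, and the DNA matches the base-by-base complement of that RNA. The sketch below reproduces both steps for one sequence. The function names and the `RNA_TO_DNA_COMPLEMENT` table are assumptions for illustration, and upper-casing the input is a simplification that may differ from the real function.

```python
# Amino acid to codon table, copied from TRANSLATION_RULE in dictionaries.py
TRANSLATION_RULE = {
    "F": "UUU", "L": "CUG", "I": "AUU", "M": "AUG", "V": "GUG",
    "P": "CCG", "T": "ACC", "A": "GCG", "Y": "UAU", "H": "CAU",
    "Q": "CAG", "N": "AAC", "K": "AAA", "D": "GAU", "E": "GAA",
    "C": "UGC", "W": "UGG", "R": "CGU", "S": "AGC", "G": "GGC",
}
# Assumption: DNA is the complement of the RNA (A->T, U->A, G->C, C->G)
RNA_TO_DNA_COMPLEMENT = str.maketrans("AUGC", "TACG")


def protein_to_rna(sequence):
    """Replace every amino acid with its codon."""
    return "".join(TRANSLATION_RULE[aa] for aa in sequence.upper())


def rna_to_complementary_dna(rna):
    """Return the base-by-base DNA complement of an RNA sequence."""
    return rna.translate(RNA_TO_DNA_COMPLEMENT)


rna = protein_to_rna("MNY")
print(rna)                            # AUGAACUAU
print(rna_to_complementary_dna(rna))  # TACTTGATA
```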

## Examples
```python
# three_one_letter_code
run_protein_tools(['met-Asn-Tyr', 'Ile-Ala-Ala'], procedure='three_one_letter_code') # ['mNY', 'IAA']
run_protein_tools(['mNY','IAA'], procedure='three_one_letter_code') # ['met-Asn-Tyr', 'Ile-Ala-Ala']


# define_molecular_weight
run_protein_tools(['MNY','IAA'], procedure='define_molecular_weight') # {'MNY': 426.52, 'IAA': 273.35}


# search_for_motifs
run_protein_tools(['mNY','IAA'], procedure='search_for_motifs', motif='NY')
#Sequence: mNY
#Motif: NY
#Motif is present in protein sequence starting at positions: 1

#Sequence: IAA
#Motif: NY
#Motif is not present in protein sequence

{'mNY': [1], 'IAA': []}


# search_for_alt_frames
run_protein_tools(['mNYQTMSPYYDMId'], procedure='search_for_alt_frames') # {'mNYQTMSPYYDMId': ['MSPYYDMId']}
run_protein_tools(['mNYTQTSP'], procedure='search_for_alt_frames', alt_start_aa='T') # {'mNYTQTSP': ['TQTSP']}


# convert_to_nucl_acids
run_protein_tools(['MNY'], procedure='convert_to_nucl_acids', nucl_acids='RNA') # {'RNA': ['AUGAACUAU']}
run_protein_tools(['MNY'], procedure='convert_to_nucl_acids', nucl_acids='DNA') # {'DNA': ['TACTTGATA']}
run_protein_tools(['MNY'], procedure='convert_to_nucl_acids', nucl_acids='both') # {'RNA': ['AUGAACUAU'], 'DNA': ['TACTTGATA']}

```

## Troubleshooting

| Type of the problem | Probable cause
| ------------------------------------------------------------ |--------------------
| Output does not correspond to the expected results | The name of the procedure is wrong. You see the results of another procedure
| ValueError: No sequences provided | No list of sequences was passed
| ValueError: Wrong procedure | The procedure does not exist in this program
| TypeError: takes from 0 to 1 positional arguments but n were given | Sequences are not collected into a list
| ValueError: Invalid sequence given | The sequences do not correspond to the standard amino acid code
| ValueError: Please provide desired motif | The additional argument *motif* is missing in `search_for_motifs`
| ValueError: Invalid start AA | The additional argument *alt_start_aa* in `search_for_alt_frames` contains more than one letter
| ValueError: Please provide desired type of nucl_acids | The additional argument *nucl_acids* is missing in `convert_to_nucl_acids`
| ValueError: Invalid nucl_acids argument | The additional argument in `convert_to_nucl_acids` is written incorrectly
## Contacts
Vladimir Grigoriants ([email protected])
Team leader. Bioinformatician, immunologist, MiLaboratory inc. TCR-libraries QC developer

Ekaterina Shitik ([email protected])
Doctor of medicine, molecular biologist with main interests in genetic engineering, AAV vectors and CRISPR/Cas9 technologies

Vlada Tuliavko ([email protected])
MiLaboratory inc. manager and designer, immunologist

## Our team
![image](https://drive.google.com/uc?export=view&id=1tdSGpNl6GorFPZIqweB0PaGxQW5wK5Oo)
66 changes: 66 additions & 0 deletions HW4_Grigoriants/dictionaries.py
@@ -0,0 +1,66 @@
AMINO_ACIDS = {
    "A": "Ala",
    "C": "Cys",
    "D": "Asp",
    "E": "Glu",
    "F": "Phe",
    "G": "Gly",
    "H": "His",
    "I": "Ile",
    "K": "Lys",
    "L": "Leu",
    "M": "Met",
    "N": "Asn",
    "P": "Pro",
    "Q": "Gln",
    "R": "Arg",
    "S": "Ser",
    "T": "Thr",
    "V": "Val",
    "W": "Trp",
    "Y": "Tyr",
}
TRANSLATION_RULE = {
    "F": "UUU",
    "L": "CUG",
    "I": "AUU",
    "M": "AUG",
    "V": "GUG",
    "P": "CCG",
    "T": "ACC",
    "A": "GCG",
    "Y": "UAU",
    "H": "CAU",
    "Q": "CAG",
    "N": "AAC",
    "K": "AAA",
    "D": "GAU",
    "E": "GAA",
    "C": "UGC",
    "W": "UGG",
    "R": "CGU",
    "S": "AGC",
    "G": "GGC",
}
AMINO_ACID_WEIGHTS = {
    "A": 89.09,
    "C": 121.16,
    "D": 133.10,
    "E": 147.13,
    "F": 165.19,
    "G": 75.07,
    "H": 155.16,
    "I": 131.17,
    "K": 146.19,
    "L": 131.17,
    "M": 149.21,
    "N": 132.12,
    "P": 115.13,
    "Q": 146.15,
    "R": 174.20,
    "S": 105.09,
    "T": 119.12,
    "V": 117.15,
    "W": 204.23,
    "Y": 181.19,
}