Feature Extraction Developer Guide
Throughout this project, we aim to predict the behavior of proteins from their amino-acid (AA) sequence.
This is known to be extremely difficult: a protein's characteristics are determined by its 3D structure, which cannot be read directly from its 1D AA sequence. Extra effort is therefore needed to predict the secondary and tertiary structure of the protein under study.
We attempted to use several existing tools to obtain these structures and then predict the protein's features. In this project we focus on solubility.
We took our approach from the GraphSol algorithm, available at https://github.com/jcchan23/GraphSol.
Obtaining the 3D structure requires several smaller steps first:
| Name | Tool | Description | Language |
|---|---|---|---|
| PSSM | PSI-blast | scoring matrix obtained from a multiple alignment of the highest scoring hits in an initial BLAST search | User Interface |
| HMM | HH-Suite3 (HHBlits) | Sensitive protein sequence searching based on the pairwise alignment of hidden Markov models | C using CMake |
| Spider3 | SPIDER3 | Prediction on secondary structure and other structural properties of proteins by LSTM | Python Numpy |
| Evolutionary coupling information | DCA | Provides a correlated mutation analysis in an AA and codon level | R |
| Contact Map | CCMPred | Markov Random Field for learning protein residue-residue contacts | C using CMake |
| SPOT-Contact-Helical | SPOT-Contact | Prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with CNNs | Python |
A first challenge for this project is that many different tools have to be set up beforehand. As the following table shows, the outputs of some tools are the inputs of others, so every tool has to work for the whole pipeline to run. This guide describes the configuration of each tool in more detail.
| Software | Version | Input | Output |
|---|---|---|---|
| PSI-BLAST | 2.7.1 | abgB | abgB.bla, abgB.pssm |
| HH-Suite3 | 3.0.3 | abgB | abgB.hhr, abgB.hhm, abgB.a3m |
| SPIDER3 | 1.0 | abgB, abgB.pssm, abgB.hhm | abgB.spd33 |
| DCA | 1.0 | abgB.a3m | abgB.di |
| CCMPRED | 1.0 | abgB.a3m | abgB.mat |
| SPOT | 1.0 | abgB.fasta, abgB.pssm, abgB.hhm, abgB.di, abgB.mat | abgB.spotcon |
abgB is an example protein, and each file extension is specific to the tool that produces it. GraphSol includes a folder with example files for every extension.
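The dependencies in the table above can be checked before launching the last step. The sketch below lists which SPOT inputs are still missing for one protein. It assumes, purely for illustration, that all files sit in one directory and are named `<protein>.<extension>`; the function name `missing_spot_inputs` is our own and not part of any of the tools.

```shell
# Print the SPOT input files (see the table above) that are missing or empty.
# ASSUMPTION: a flat layout "<dir>/<protein>.<ext>"; adapt it to your own folders.
missing_spot_inputs() {
    local protein="$1" dir="${2:-.}" ext
    for ext in fasta pssm hhm di mat; do
        # -s is true only if the file exists and is non-empty
        [ -s "$dir/$protein.$ext" ] || echo "$protein.$ext"
    done
}

# Usage: missing_spot_inputs abgB ./features
```

An empty output means every input listed for SPOT is in place for that protein.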
General note: the versions and steps described here were the latest as of January 2023; later versions may exist.
PSI-BLAST can be run on your local machine or through its web interface. The following link gives a detailed guide for both cases:
https://biochem.slu.edu/bchm628/handouts/2013/PSI-BLAST_tutorial_2011.pdf
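For a local run, a minimal sketch of a command-line call that produces the two PSI-BLAST outputs from the table (`abgB.bla` and `abgB.pssm`) could look like this. The flags are standard BLAST+ `psiblast` options, but the `.fasta` input name, the database path, the number of iterations and the E-value are placeholder assumptions, not settings taken from this guide.

```shell
# Run PSI-BLAST for one protein and save the report and the ASCII PSSM.
# ASSUMPTIONS: the input is "<protein>.fasta"; 3 iterations and an E-value of
# 0.001 are illustrative values only; the database is whatever you formatted.
# Setting PSIBLAST=echo prints the arguments instead of running the tool.
run_psiblast() {
    local protein="$1" db="$2"
    "${PSIBLAST:-psiblast}" \
        -query "$protein.fasta" \
        -db "$db" \
        -num_iterations 3 \
        -evalue 0.001 \
        -out "$protein.bla" \
        -out_ascii_pssm "$protein.pssm"
}

# Usage: run_psiblast abgB /path/to/blast/database
```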
Download the package from https://github.com/soedinglab/hh-suite.git
It needs GCC (the C/C++ compiler) version 4.8 or later and CMake version 2.8.12 or later. CMake is a tool that generates the build files used to compile C/C++ programs. You can install it with conda on Windows or with brew on macOS. The desktop application is not needed, just the binaries or the source code. You can also refer to https://cmake.org/download/
To download the source code and compile HH-suite, execute the following commands:
git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build
cmake -DCMAKE_INSTALL_PREFIX=. ..
make -j 4 && make install
export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
For performing a single search iteration of HHblits, run HHblits with the following command:
hhblits -i <input-file> -o <result-file> -n 1 -d <database-basename>
*If you are on macOS and have problems running HHblits, refer to the original GitHub repository for further tips: https://github.com/soedinglab/hh-suite
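The HH-Suite row of the table lists three outputs (`.hhr`, `.hhm`, `.a3m`), while the command above only writes the result file. A sketch that requests all three is shown below; `-ohhm` and `-oa3m` are existing HHblits options, whereas the `.fasta` input name and the wrapper function `run_hhblits` are assumptions made for illustration.

```shell
# Run HHblits for one protein and write the .hhr, .hhm and .a3m files.
# ASSUMPTIONS: the input is "<protein>.fasta"; the iteration count defaults
# to 1, as in the command above, and can be passed as a third argument.
# Setting HHBLITS=echo prints the arguments instead of running the tool.
run_hhblits() {
    local protein="$1" db="$2" iterations="${3:-1}"
    "${HHBLITS:-hhblits}" \
        -i "$protein.fasta" \
        -d "$db" \
        -n "$iterations" \
        -o "$protein.hhr" \
        -ohhm "$protein.hhm" \
        -oa3m "$protein.a3m"
}

# Usage: run_hhblits abgB /path/to/database-basename
```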
From https://sparks-lab.org/server/spider3/ go to the download section (it redirects to another website). In the 'Protein Local Structural Prediction' section, download 'spider3-Single (Numpy): Single-sequence-based prediction of structural features for proteins'.
Requirements: python==2.7, sqlparse==0.4.1
The brief readme file shows how to list the files needed for the prediction. You can provide only the FASTA file, but in this project we also add the PSI-BLAST file (.pssm) and the HHblits file (.hhm). Let protein_id be the "sequence name", the protein's unique identifier; each line of the input list then looks like:
protein_id ./path_pssm_file ./path_hhm_file
You will also need to set the variables SAVE_DIR and INPUT_LIST as needed in impute_script_np.sh,
and then run ./impute_script_np.sh
You can find more information in the readme file in the downloaded zip.
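When there are many proteins, the input list can be generated rather than typed by hand. The sketch below writes one line per protein in the `protein_id pssm_path hhm_path` layout described above. It assumes the `.pssm` and `.hhm` files share one directory and one base name; the function name `make_spider3_list` is hypothetical, and the exact list format should be checked against the SPIDER3 readme.

```shell
# Print one "protein_id pssm_path hhm_path" line for every protein that has
# both a .pssm and a .hhm file in the given directory.
# ASSUMPTION: both files are named after the protein id, e.g. abgB.pssm, abgB.hhm.
make_spider3_list() {
    local dir="$1" pssm id
    for pssm in "$dir"/*.pssm; do
        [ -e "$pssm" ] || continue            # the directory has no .pssm files
        id=$(basename "$pssm" .pssm)
        if [ -s "$dir/$id.hhm" ]; then
            echo "$id $pssm $dir/$id.hhm"
        else
            echo "skipping $id: $id.hhm not found" >&2
        fi
    done
}

# Usage: make_spider3_list ./features > input_list.txt
```

The resulting file is what INPUT_LIST would point to.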
Download the package from https://github.com/etaijacob/CMA
To install and use the library, in a new R file:
- install.packages("PATH_TO_PACKAGE", repos = NULL, type = "source")
- library("CMA")
- cma <- dca("aas.a3m", seqid_or_excluded_indices = 3, outputfile = "result_file.csv")
*Before running, set the working directory to the folder containing your file: Session / Set Working Directory / Choose Directory
**Make sure the library appears in the environment.
Download it from https://github.com/soedinglab/CCMpred
It also builds with CMake, so you can reuse the previous commands to create its build directory. The readme file explains how best to run the program for your graphics card and lists the memory requirements.
This repository is the faster, more accurate version of https://github.com/susannvorberg/CCMpredPy_master. We did not try that one, but the two projects have one collaborator in common.
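One practical detail for CCMpred: as far as we know, its readme describes the input alignment as one aligned sequence per line with no header lines, so an `.a3m` file from HHblits may need converting first. The sketch below uses the common approach of dropping the `>` header lines and removing lowercase letters (the insert states in A3M). The function name `a3m_to_aln` and the `.aln` extension are our own choices, and the sketch assumes each sequence occupies a single line in the `.a3m` file.

```shell
# Convert an A3M alignment to one aligned sequence per line:
#   - drop the ">" header lines
#   - delete lowercase letters (A3M insert states) so all rows align
# ASSUMPTION: every sequence is on a single line in the .a3m file.
a3m_to_aln() {
    grep -v '^>' "$1" | tr -d 'a-z'
}

# Usage:
#   a3m_to_aln abgB.a3m > abgB.aln
#   ccmpred abgB.aln abgB.mat
```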
Also from https://sparks-lab.org: on the download page select 'SPOT-Contact: Sequence-based contact map prediction'. It requires tensorflow v1.4, pandas, numpy, tqdm, and cPickle.
As the last table shows, SPOT needs all the files generated by the programs above. You need to add these to the "input folder" of the project (per protein). This is very time consuming, but it is the last step before feeding the generated file into the graph ML model of GraphSol.
You can then run it with ./run_spotcontact.sh
**It is possible to run only SPOT, which technically generates any files it needs (such as the .pssm or .spd33) that you do not provide in the input folder. If that works, it can save you a lot of time; nevertheless, it was advised to run all programs independently.