Feature Extraction Developer Guide

Possible future development

Throughout this project, we aspire to predict the behavior of proteins from their amino-acid sequence.

This is a well-known hard problem in biology: a protein's characteristics are determined by its 3D structure, which cannot be read directly from the 1D amino-acid (AA) sequence. Extra effort is therefore needed to predict the secondary and tertiary structure of the protein under study.

We attempted to use several existing tools to obtain these structural features and then predict protein properties from them. In this project we focus on solubility.

Material

We took our approach from the GraphSol algorithm, available at https://github.com/jcchan23/GraphSol.

To obtain the 3D structure information, several smaller steps must be carried out first:

Name	Tool	Description	Language
PSSM	PSI-BLAST	Scoring matrix obtained from a multiple alignment of the highest-scoring hits in an initial BLAST search	User Interface
HMM	HH-Suite3 (HHblits)	Sensitive protein sequence searching based on the pairwise alignment of hidden Markov models	C using CMake
Spider3	SPIDER3	Prediction of secondary structure and other structural properties of proteins by LSTM	Python Numpy
Evolutionary coupling information	DCA	Provides a correlated mutation analysis at the AA and codon level	R
Contact Map	CCMPred	Markov Random Field for learning protein residue-residue contacts	C using CMake
SPOT-Contact-Helical	SPOT-Contact	Prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with CNNs	Python

A first challenge for this project is that many different tools have to be set up beforehand. As the following table shows, the outputs of some tools are the inputs of others, so every tool must work for the pipeline to run. This guide describes the configuration of each tool in more detail.

Software	Version	Input	Output
PSI-BLAST	2.7.1	abgB	abgB.bla, abgB.pssm
HH-Suite3	3.0.3	abgB	abgB.hhr, abgB.hhm, abgB.a3m
SPIDER3	1.0	abgB, abgB.pssm, abgB.hhm	abgB.spd33
DCA	1.0	abgB.a3m	abgB.di
CCMPRED	1.0	abgB.a3m	abgB.mat
SPOT	1.0	abgB.fasta, abgB.pssm, abgB.hhm, abgB.di, abgB.mat	abgB.spotcon

abgB is an example protein, and each file extension is specific to the tool that produces it. GraphSol includes a folder with example files for every extension.

General note: the versions and steps described here were the latest as of January 2023; newer versions may exist.
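Because some tools consume the outputs of others, it can help to derive a valid run order from the table rather than remember it. The following is a minimal sketch, not part of GraphSol: the `TOOLS` dictionary simply transcribes the Input and Output columns above (with "fasta" standing for the raw sequence file), and the helper names `dependencies` and `run_order` are hypothetical. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Inputs and outputs per tool, transcribed from the table above.
# Only file extensions are kept; "fasta" stands for the raw sequence file.
TOOLS = {
    "PSI-BLAST": {"inputs": {"fasta"}, "outputs": {"bla", "pssm"}},
    "HH-Suite3": {"inputs": {"fasta"}, "outputs": {"hhr", "hhm", "a3m"}},
    "SPIDER3": {"inputs": {"fasta", "pssm", "hhm"}, "outputs": {"spd33"}},
    "DCA": {"inputs": {"a3m"}, "outputs": {"di"}},
    "CCMPRED": {"inputs": {"a3m"}, "outputs": {"mat"}},
    "SPOT": {"inputs": {"fasta", "pssm", "hhm", "di", "mat"},
             "outputs": {"spotcon"}},
}


def dependencies(tools):
    """Map each tool to the set of tools that produce one of its inputs."""
    producer = {ext: name for name, io in tools.items() for ext in io["outputs"]}
    return {
        name: {producer[ext] for ext in io["inputs"] if ext in producer}
        for name, io in tools.items()
    }


def run_order(tools):
    """Return one valid execution order (producers before consumers)."""
    return list(TopologicalSorter(dependencies(tools)).static_order())


if __name__ == "__main__":
    print(" -> ".join(run_order(TOOLS)))
```

Several orders are valid (for example, PSI-BLAST and HH-Suite3 do not depend on each other), so the sketch only guarantees that every tool runs after the tools whose files it needs.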

PSI-BLAST

PSI-BLAST can be run on your local machine or through its web interface. The following link gives a detailed guide for both cases.

https://biochem.slu.edu/bchm628/handouts/2013/PSI-BLAST_tutorial_2011.pdf
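For a local run, the command line can be assembled programmatically so that the output names match the table above (abgB.bla, abgB.pssm). This is only a sketch under assumptions: the database path and the number of iterations are placeholders that this guide does not specify, and the helper name `psiblast_command` is hypothetical. The flags used (`-query`, `-db`, `-num_iterations`, `-out`, `-out_ascii_pssm`) are standard BLAST+ `psiblast` options.

```python
import shlex


def psiblast_command(fasta, db, prefix, iterations=3):
    """Build a psiblast call that writes <prefix>.bla and <prefix>.pssm.

    `db` and `iterations` are assumptions: use your own local BLAST database
    and whatever iteration count your project settles on.
    """
    return [
        "psiblast",
        "-query", fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-out", f"{prefix}.bla",              # alignment report
        "-out_ascii_pssm", f"{prefix}.pssm",  # PSSM in text form
    ]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    cmd = psiblast_command("abgB.fasta", "path/to/blast_db", "abgB")
    print(shlex.join(cmd))
```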

HH-Suite3

Download the package from https://github.com/soedinglab/hh-suite.git

It requires GCC (the C/C++ compiler) version 4.8 or later and CMake version 2.8.12 or later. CMake is a build tool that configures and generates the build for C/C++ projects. You can install it with conda on Windows or with brew on macOS. The desktop application is not needed, only the binaries or the source code. See also https://cmake.org/download/

To download the source code and compile the HH-suite execute the following commands:

git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build
cmake -DCMAKE_INSTALL_PREFIX=. ..
make -j 4 && make install
export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
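After the `export PATH` step, a quick check that the executables are actually visible can save debugging time later. A minimal sketch using only the standard library; the helper name `find_tools` and the list of tool names checked are illustrative choices.

```python
import shutil


def find_tools(names):
    """Return {tool name: absolute path, or None if it is not on PATH}."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(["cmake", "hhblits", "hhmake"]).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Note that the `export` only affects the current shell session, so the check has to be run from the same session (or after adding the line to your shell profile).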

To perform a single HHblits search iteration, run:

hhblits -i <input-file> -o <result-file> -n 1 -d <database-basename>

* If you are on macOS and have problems running HHblits, see the original GitHub repository for further tips: https://github.com/soedinglab/hh-suite
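The command above writes only the result file, while the table lists three HH-Suite3 outputs (.hhr, .hhm, .a3m). One way to request all three in a single call is sketched below, using the `-ohhm` and `-oa3m` options of `hhblits`. The helper name `hhblits_command`, the database basename and the CPU count are assumptions for illustration.

```python
import shlex


def hhblits_command(fasta, database, prefix, iterations=1, cpus=4):
    """Build an hhblits call producing <prefix>.hhr, <prefix>.hhm and <prefix>.a3m.

    `database` is the database basename (as in `-d <database-basename>`);
    `cpus` is an arbitrary default, adjust it to your machine.
    """
    return [
        "hhblits",
        "-i", fasta,
        "-d", database,
        "-n", str(iterations),
        "-cpu", str(cpus),
        "-o", f"{prefix}.hhr",     # search results
        "-ohhm", f"{prefix}.hhm",  # profile HMM
        "-oa3m", f"{prefix}.a3m",  # multiple sequence alignment
    ]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    print(shlex.join(hhblits_command("abgB.fasta", "path/to/database", "abgB")))
```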

SPIDER-3

From https://sparks-lab.org/server/spider3/, go to the download section (it redirects to another website). In the 'Protein Local Structural Prediction' section, download 'spider3-Single (Numpy): Single-sequence-based prediction of structural features for proteins'.

Requirements: python==2.7, sqlparse==0.4.1

The brief readme file shows how to list the input files needed for the prediction. You can provide only the FASTA file, but for this project we also add the PSI-BLAST file (.pssm) and the HHblits file (.hhm). Let protein_id be the "sequence name", the protein's unique identifier. Each line of the input list then has the form:

protein_id ./path_pssm_file ./path_hhm_file
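When many proteins are processed, the list in the format shown above can be generated rather than typed by hand. A minimal sketch, assuming the files are named `<protein_id>.pssm` and `<protein_id>.hhm` and kept in two directories of your choosing; the helper names and the directory layout are hypothetical, so check the SPIDER3 readme for the exact format it expects. The sketch is written for Python 3, independently of the Python 2.7 environment SPIDER3 itself requires.

```python
from pathlib import Path


def spider3_list_lines(protein_ids, pssm_dir, hhm_dir):
    """Return one 'protein_id pssm_path hhm_path' line per protein."""
    lines = []
    for pid in protein_ids:
        pssm = (Path(pssm_dir) / f"{pid}.pssm").as_posix()
        hhm = (Path(hhm_dir) / f"{pid}.hhm").as_posix()
        lines.append(f"{pid} {pssm} {hhm}")
    return lines


def write_list(list_path, lines):
    """Write the lines to the file that INPUT_LIST will point to."""
    Path(list_path).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    for line in spider3_list_lines(["abgB"], "features/pssm", "features/hhm"):
        print(line)
```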

You will also need to set the variables SAVE_DIR and INPUT_LIST in impute_script_np.sh as needed.

Then run ./impute_script_np.sh

You can find more information in the readme file included in the downloaded zip.

DCA

Download the package from https://github.com/etaijacob/CMA

To install the library, run the following in a new R file:

install.packages("PATH_TO_PACKAGE", repos = NULL, type = "source")
library("CMA")
cma <- dca("aas.a3m", seqid_or_excluded_indices = 3, outputfile = "result_file.csv")

* Before running, set the working directory to the folder containing your file (in RStudio: Session > Set Working Directory > Choose Directory).

** Make sure the library appears in the environment.

CCMPRED

Download from https://github.com/soedinglab/CCMpred

CCMpred is also built with CMake, so you can reuse the previous commands to create its build directory. Its readme file explains how best to run the program for your graphics card and lists the memory requirements.

This repository is the faster, more accurate version of https://github.com/susannvorberg/CCMpredPy_master. We did not try CCMpredPy, but the two projects have one collaborator in common.

SPOT CONTACT

Also from https://sparks-lab.org. On the download page, select 'SPOT-Contact: Sequence-based contact map prediction'. It requires tensorflow v1.4, pandas, numpy, tqdm, and cPickle.

As the second table above shows, SPOT needs the files generated by all of the programs above. Add these to the "input folder" of the project (per protein). This is very time-consuming, but it is the last step before feeding the generated file into the graph ML model of graph_sol.
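Since collecting the files per protein is tedious, a small check for missing inputs before launching SPOT can help. The sketch below takes the list of extensions from the SPOT row of the table; the flat folder layout (`<input folder>/<protein_id>.<extension>`) and the helper name `missing_spot_inputs` are assumptions, so adapt them to the layout SPOT-Contact actually expects.

```python
from pathlib import Path

# Extensions listed as SPOT inputs in the table above.
SPOT_INPUT_EXTENSIONS = ("fasta", "pssm", "hhm", "di", "mat")


def missing_spot_inputs(input_dir, protein_id):
    """Return the expected file names that are absent from input_dir."""
    folder = Path(input_dir)
    return [
        f"{protein_id}.{ext}"
        for ext in SPOT_INPUT_EXTENSIONS
        if not (folder / f"{protein_id}.{ext}").is_file()
    ]


if __name__ == "__main__":
    missing = missing_spot_inputs("inputs", "abgB")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All SPOT inputs present for abgB")
```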

You can then run it with ./run_spotcontact.sh

** One possibility is to run only SPOT, which technically generates any files it needs (such as the .pssm or .spd33) that you do not provide in the input folder. If that works it can save you a lot of time; nevertheless, it was advised to run all programs independently.
