This repository contains a small end-to-end pipeline to reproduce, at the molecular (protein) level, the RNA-binding protein (RBP) prediction setup described in:
Prediction of RNA-Binding Proteins Using a Random Forest Algorithm Combined with an RNA-Binding Residue Predictor (PRBP)
The implementation uses:
- EIPP: Evolution-Inspired Physicochemical Profiles derived from PSI-BLAST PSSMs
- AAC: Global amino-acid composition features
- Random Forests for binary protein-level RBP vs non-RBP classification
Contributors: Riddhi Mehta and Nirvisha Soni
All files live under the Bioinformatics/ folder (GitHub will show this as the root of the repo):
Bioinformatics/
├── Positive_Final.fasta # Final positive-class proteins (experimentally validated RBPs)
├── Negative_Final.fasta # Final negative-class proteins (no known nucleic-acid binding)
├── all_sequences.fasta # Combined positive + negative FASTA
│
├── split.py # Script: split all_sequences.fasta into one-FASTA-per-protein
├── generate_pssm.bat # Windows batch script to run PSI-BLAST and generate PSSMs
│
├── mini_db.p* # Local PSI-BLAST (BLAST+) protein database files (mini_db)
│
├── split_seqs/ # Folder of per-sequence FASTA files (created by split.py)
└── pssm_out/ # Folder of per-sequence PSSMs (.pssm) from PSI-BLAST
The main modelling code is in:
Bioinformatics/PRBP_Prediction_EIPP_AAC_RF.ipynb
This Jupyter notebook:
- Loads sequences and (optionally) PSSMs
- Builds EIPP + AAC + length features
- Trains a RandomForestClassifier
- Evaluates with train/test split and 5-fold cross-validation
- Prints metrics and example predictions for a few random proteins
- Python 3.8+
- Recommended: create a conda/venv environment
Python packages used in the notebook:
pip install numpy pandas scikit-learn matplotlibTo recompute EIPP features from scratch you need:
- NCBI BLAST+ installed (so the
psiblastcommand is available on your PATH) - The provided
mini_db.*BLAST database files in the same folder as the scripts
On Windows, you can install BLAST+ from the NCBI site and then add the bin folder to your PATH.
If you prefer not to re-run PSI-BLAST, you can:
- Use any pre-computed PSSMs already shipped in
pssm_out/(if present), or - Skip the EIPP step and adapt the notebook to use only AAC features (for quick testing).
-
Clone or download the repository
git clone https://github.com/riddhimehta15/Bioinformatics.git cd Bioinformatics/Bioinformatics -
Create and activate a virtual environment (optional but recommended)
Using
conda:conda create -n rbp-prbp python=3.10 conda activate rbp-prbp pip install numpy pandas scikit-learn matplotlib
Or using
venv:python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate pip install numpy pandas scikit-learn matplotlib
-
Prepare per-sequence FASTA files
The script
split.pytakesall_sequences.fastaand splits it into one FASTA file per protein in thesplit_seqs/folder.From the
Bioinformatics/directory:python split.py
After this step you should have many
*.fastafiles created under:Bioinformatics/split_seqs/ -
Generate PSSMs with PSI-BLAST
This step computes a PSSM for each protein using the local
mini_dbBLAST database.From the
Bioinformatics/directory (or wherevergenerate_pssm.batexpects to be run; typically apssm/subfolder):generate_pssm.bat
The script will:
- Loop over
split_seqs/*.fasta - Run
psiblastfor each sequence - Save an ASCII PSSM to
pssm_out/<sequence_id>.pssm
You can adjust parameters (e.g., number of iterations) inside
generate_pssm.batif needed. - Loop over
-
Run the Random Forest notebook
Start Jupyter:
jupyter notebook
Then open:
Bioinformatics/PRBP_Prediction_EIPP_AAC_RF.ipynband run all cells in order. The notebook will:
- Parse the FASTA sequences (positive + negative)
- Match them with available PSSMs in
pssm_out/ - Compute:
- EIPP features from PSSMs (6 physicochemical groups × 20 amino acids = 120 dims)
- AAC features (20-dimensional amino-acid composition)
- Optional length feature
- Train a Random Forest classifier with stratified train/test split
- Report:
- Accuracy, precision, recall, F1, ROC-AUC on the test set
- 5-fold cross-validation scores
- Top 20 most important features
- Example predictions for 5 random proteins
-
Changing train/test split
Inside the notebook, adjust thetest_sizeandRANDOM_STATEarguments in thetrain_test_splitcall. -
Changing Random Forest hyperparameters
Modify theRandomForestClassifierparameters (e.g.,n_estimators,max_depth,class_weight) to explore different models. -
Using only AAC features
If computing PSSMs is too expensive, you can comment out the EIPP feature extraction and use only AAC + length features. This is useful for quick experimentation or for machines without BLAST+. -
Different datasets
You can swap outPositive_Final.fastaandNegative_Final.fastawith your own datasets:- Make sure FASTA headers have unique IDs.
- Rebuild
all_sequences.fastaand re-runsplit.py+generate_pssm.bat.
The notebook prints:
- Test set metrics (accuracy, precision, recall, F1, ROC-AUC)
- 5-fold cross-validation performance
- Feature importance plot for EIPP + AAC features
- Example predictions (predicted label and probability) for a few randomly chosen proteins.
Exact numbers will depend on:
- The dataset version
- PSSM generation parameters
- Random seed and train/test split
This project is inspired by:
Wang, L., Huang, C., Yang, M.-Q., & Yang, J. Y. (2013). Prediction of RNA-binding proteins using a random forest algorithm combined with a novel RNA-binding residue predictor. Journal of Theoretical Biology, 1–9.
We also acknowledge the authors and maintainers of:
- UniProtKB/Swiss-Prot, for curated protein sequence data
- NCBI BLAST+, for PSI-BLAST and PSSM generation
- scikit-learn, NumPy, pandas, and matplotlib for model implementation and analysis