RNA-Binding Protein Prediction (PRBP-Style Random Forest)

This repository contains a small end-to-end pipeline to reproduce, at the molecular (protein) level, the RNA-binding protein (RBP) prediction setup described in:

Prediction of RNA-Binding Proteins Using a Random Forest Algorithm Combined with an RNA-Binding Residue Predictor (PRBP)

The implementation uses:

- EIPP: Evolution-Inspired Physicochemical Profiles derived from PSI-BLAST PSSMs
- AAC: Global amino-acid composition features
- Random Forests for binary protein-level RBP vs non-RBP classification

Contributors: Riddhi Mehta and Nirvisha Soni

Repository Structure

All files live under the Bioinformatics/ folder (GitHub will show this as the root of the repo):

Bioinformatics/
├── Positive_Final.fasta           # Final positive-class proteins (experimentally validated RBPs)
├── Negative_Final.fasta           # Final negative-class proteins (no known nucleic-acid binding)
├── all_sequences.fasta            # Combined positive + negative FASTA
│
├── split.py                       # Script: split all_sequences.fasta into one-FASTA-per-protein
├── generate_pssm.bat              # Windows batch script to run PSI-BLAST and generate PSSMs
│
├── mini_db.p*                     # Local PSI-BLAST (BLAST+) protein database files (mini_db)
│
├── split_seqs/                    # Folder of per-sequence FASTA files (created by split.py)
└── pssm_out/                      # Folder of per-sequence PSSMs (.pssm) from PSI-BLAST

The main modelling code is in:

Bioinformatics/PRBP_Prediction_EIPP_AAC_RF.ipynb

This Jupyter notebook:

- Loads sequences and (optionally) PSSMs
- Builds EIPP + AAC + length features
- Trains a RandomForestClassifier
- Evaluates with train/test split and 5-fold cross-validation
- Prints metrics and example predictions for a few random proteins

Requirements

Python environment

- Python 3.8+
- Recommended: create a conda/venv environment

Python packages used in the notebook:

pip install numpy pandas scikit-learn matplotlib

BLAST+ / PSI-BLAST (for EIPP/PSSM features)

To recompute EIPP features from scratch you need:

- NCBI BLAST+ installed (so the psiblast command is available on your PATH)
- The provided mini_db.* BLAST database files in the same folder as the scripts

On Windows, you can install BLAST+ from the NCBI site and then add the bin folder to your PATH.

If you prefer not to re-run PSI-BLAST, you can:

- Use any pre-computed PSSMs already shipped in pssm_out/ (if present), or
- Skip the EIPP step and adapt the notebook to use only AAC features (for quick testing).

Getting Started

Clone or download the repository

git clone https://github.com/riddhimehta15/Bioinformatics.git
cd Bioinformatics

Create and activate a virtual environment (optional but recommended)

Using conda:

conda create -n rbp-prbp python=3.10
conda activate rbp-prbp
pip install numpy pandas scikit-learn matplotlib

Or using venv:

python -m venv .venv
source .venv/bin/activate      # On Windows: .venv\Scripts\activate
pip install numpy pandas scikit-learn matplotlib

Prepare per-sequence FASTA files

The script split.py takes all_sequences.fasta and splits it into one FASTA file per protein in the split_seqs/ folder.

From the Bioinformatics/ directory:
```
python split.py
```
After this step you should have many *.fasta files created under:
```
Bioinformatics/split_seqs/
```
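
For readers who want to see what this step involves, here is a minimal sketch of splitting a multi-record FASTA file into one file per protein. It is not the contents of split.py: the function name `split_fasta`, the use of the first header token as the ID, and the file-name sanitising rule are all assumptions made for illustration.

```python
import re
from pathlib import Path


def split_fasta(combined_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(combined_path) as src:
        for line in src:
            if line.startswith(">"):
                if handle:
                    handle.close()
                # Assumption: the first whitespace-delimited token is the ID.
                tokens = line[1:].split()
                seq_id = tokens[0] if tokens else f"seq_{len(written) + 1}"
                # Replace characters that are unsafe in file names (e.g. "|").
                safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
                path = out_dir / f"{safe_id}.fasta"
                # Note: records sharing an ID would overwrite each other here.
                handle = open(path, "w")
                written.append(path)
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return written


# Example call (paths as used in this repository):
# split_fasta("all_sequences.fasta", "split_seqs")
```
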
Generate PSSMs with PSI-BLAST

This step computes a PSSM for each protein using the local mini_db BLAST database.

From the Bioinformatics/ directory (or from whichever directory generate_pssm.bat expects; check the relative paths inside the script, which may assume a pssm/ subfolder):
```
generate_pssm.bat
```
The script will:
- Loop over split_seqs/*.fasta
- Run psiblast for each sequence
- Save an ASCII PSSM to pssm_out/<sequence_id>.pssm

You can adjust parameters (e.g., number of iterations) inside generate_pssm.bat if needed.

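
On machines where a .bat file is inconvenient (Linux, macOS), the same loop can be approximated in Python. The sketch below is not a translation of generate_pssm.bat: the function names and the values `num_iterations=3` and `evalue=0.001` are assumptions, so check the batch script for the settings actually used. The psiblast flags themselves (`-query`, `-db`, `-num_iterations`, `-evalue`, `-out_ascii_pssm`) are standard BLAST+ options.

```python
import subprocess
from pathlib import Path


def psiblast_command(fasta_path, db="mini_db", out_dir="pssm_out",
                     num_iterations=3, evalue=0.001):
    """Build the psiblast argument list for one query FASTA file."""
    fasta_path = Path(fasta_path)
    pssm_path = Path(out_dir) / f"{fasta_path.stem}.pssm"
    return [
        "psiblast",
        "-query", str(fasta_path),
        "-db", db,
        "-num_iterations", str(num_iterations),  # assumed value
        "-evalue", str(evalue),                  # assumed value
        "-out_ascii_pssm", str(pssm_path),
    ]


def run_all(seq_dir="split_seqs", out_dir="pssm_out"):
    """Run psiblast once per FASTA file in seq_dir (requires BLAST+ on PATH)."""
    Path(out_dir).mkdir(exist_ok=True)
    for fasta in sorted(Path(seq_dir).glob("*.fasta")):
        subprocess.run(psiblast_command(fasta, out_dir=out_dir), check=True)
```

Calling `run_all()` from the directory that holds the mini_db files would then process every file in split_seqs/.
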
Run the Random Forest notebook

Start Jupyter:
```
jupyter notebook
```
Then open:
```
Bioinformatics/PRBP_Prediction_EIPP_AAC_RF.ipynb
```
and run all cells in order. The notebook will:
- Parse the FASTA sequences (positive + negative)
- Match them with available PSSMs in pssm_out/
- Compute:
  - EIPP features from PSSMs (6 physicochemical groups × 20 amino acids = 120 dims)
  - AAC features (20-dimensional amino-acid composition)
  - Optional length feature
- Train a Random Forest classifier with stratified train/test split
- Report:
  - Accuracy, precision, recall, F1, ROC-AUC on the test set
  - 5-fold cross-validation scores
  - Top 20 most important features
  - Example predictions for 5 random proteins
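
To make the "6 groups × 20 amino acids = 120" arithmetic concrete, the sketch below parses a BLAST+ ASCII PSSM and averages its rows within residue groups. This is only one plausible reading of that description, not necessarily the EIPP definition used in the notebook or in the PRBP paper: the six groups in `GROUPS` and both function names are hypothetical.

```python
PSSM_COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order used by BLAST+ ASCII PSSMs

# Hypothetical residue groups, chosen only for illustration.
GROUPS = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}


def parse_ascii_pssm(text):
    """Return (residues, rows): the query residues and their 20 log-odds scores."""
    residues, rows = [], []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows look like: <position> <residue> <20 scores> <20 percentages> ...
        if len(tokens) >= 22 and tokens[0].isdigit() and len(tokens[1]) == 1:
            residues.append(tokens[1])
            rows.append([int(t) for t in tokens[2:22]])
    return residues, rows


def grouped_pssm_profile(residues, rows):
    """Average PSSM rows within each residue group and concatenate (6 x 20 = 120)."""
    features = []
    for members in GROUPS.values():
        selected = [row for res, row in zip(residues, rows) if res in members]
        for col in range(len(PSSM_COLUMNS)):
            if selected:
                features.append(sum(row[col] for row in selected) / len(selected))
            else:
                features.append(0.0)  # group absent from this protein
    return features
```

Because every protein maps to the same 120 values regardless of its length, the result can be concatenated directly with the 20 AAC values and the length feature.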

Customization

Changing train/test split

Inside the notebook, adjust the test_size and random_state arguments of the train_test_split call (the notebook refers to the seed as RANDOM_STATE).

Changing Random Forest hyperparameters

Modify the RandomForestClassifier parameters (e.g., n_estimators, max_depth, class_weight) to explore different models.
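
As a rough illustration of where these knobs sit, the following sketch runs a stratified split, a Random Forest, and 5-fold cross-validation on synthetic data. The random matrix only stands in for the real features (141 columns would correspond to 120 EIPP + 20 AAC + 1 length), and every value shown (`RANDOM_STATE = 42`, `test_size=0.2`, `n_estimators=100`, `class_weight="balanced"`) is an assumption, not the notebook's setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

RANDOM_STATE = 42  # assumed seed; use whatever value the notebook defines

# Synthetic stand-in for the real feature matrix and labels.
rng = np.random.default_rng(RANDOM_STATE)
X = rng.normal(size=(200, 141))
y = np.array([0, 1] * 100)
X[y == 1, :5] += 1.0  # give the positive class a weak signal

# stratify=y keeps the RBP / non-RBP ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    class_weight="balanced",
    random_state=RANDOM_STATE,
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"test accuracy: {test_accuracy:.3f}")
print(f"5-fold ROC-AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```
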

Using only AAC features

If computing PSSMs is too expensive, you can comment out the EIPP feature extraction and use only AAC + length features. This is useful for quick experimentation or for machines without BLAST+.
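
A minimal sketch of an AAC + length feature vector is shown below. The function name `aac_features`, the alphabetical residue order, the choice to ignore non-standard residues, and the use of raw (unscaled) length are assumptions; the notebook may differ on any of them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac_features(sequence, include_length=True):
    """Return the 20 residue fractions, optionally followed by sequence length."""
    seq = sequence.upper()
    # Count only the 20 standard residues; X, B, Z, U etc. are ignored.
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    features = [c / total if total else 0.0 for c in counts]
    if include_length:
        features.append(float(len(seq)))
    return features
```

For example, `aac_features("AAKK")` gives 0.5 for both A and K, 0.0 for every other residue, and a final length value of 4.0.
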

Different datasets

You can replace Positive_Final.fasta and Negative_Final.fasta with your own datasets:
- Make sure FASTA headers have unique IDs.
- Rebuild all_sequences.fasta, then re-run split.py and generate_pssm.bat.
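
The two checks above can be scripted. The sketch below is an illustration rather than part of the repository: the function names are hypothetical, and it assumes the ID is the first whitespace-delimited token of each header line.

```python
from collections import Counter
from pathlib import Path


def fasta_ids(path):
    """Return the ID (first token of each header line) for every FASTA record."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                ids.append(tokens[0] if tokens else "")
    return ids


def rebuild_combined(positive, negative, combined="all_sequences.fasta"):
    """Concatenate the two class files after checking that all IDs are unique."""
    all_ids = fasta_ids(positive) + fasta_ids(negative)
    duplicates = sorted(i for i, n in Counter(all_ids).items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate FASTA IDs: {duplicates}")
    with open(combined, "w") as out:
        for path in (positive, negative):
            text = Path(path).read_text()
            # Make sure one file's last sequence line is not glued to the next header.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return len(all_ids)


# Example call (file names as used in this repository):
# rebuild_combined("Positive_Final.fasta", "Negative_Final.fasta")
```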

Results

The notebook prints:

- Test set metrics (accuracy, precision, recall, F1, ROC-AUC)
- 5-fold cross-validation performance
- Feature importance plot for EIPP + AAC features
- Example predictions (predicted label and probability) for a few randomly chosen proteins

Exact numbers will depend on:

- The dataset version
- PSSM generation parameters
- Random seed and train/test split

Contributors

- Riddhi Mehta
- Nirvisha Soni

Acknowledgements

This project is inspired by:

Wang, L., Huang, C., Yang, M.-Q., & Yang, J. Y. (2013). Prediction of RNA-binding proteins using a random forest algorithm combined with a novel RNA-binding residue predictor. Journal of Theoretical Biology, 1–9.

We also acknowledge the authors and maintainers of:

- UniProtKB/Swiss-Prot, for curated protein sequence data
- NCBI BLAST+, for PSI-BLAST and PSSM generation
- scikit-learn, NumPy, pandas, and matplotlib for model implementation and analysis
