Skip to content

imagebind/hackathon_bioactivity_drug_discovery

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🌟 Usefulness and Purpose of ML Implementation for Bioactivity Prediction of Target Proteins

🔹 Drug Discovery Acceleration – ML helps in predicting the bioactivity of compounds against a target protein, significantly reducing time and cost in drug discovery.

🔸 Feature Selection – Variance Thresholding eliminates low-variance features, ensuring the dataset retains only informative features.

🟢 Handling Large Data – ML models efficiently process large-scale bioactivity datasets, identifying patterns humans might miss.

🔵 Regression Analysis – Random Forest Regression predicts bioactivity values (pIC50), providing insights into compound effectiveness.

🟠 Statistical Validation – R-squared value helps evaluate model performance, ensuring reliability in bioactivity predictions.

🔻 Automated Workflow – Streamlit-based UI streamlines dataset selection, preprocessing, and model execution interactively.

🟣 Model Generalization – Train-test split ensures robust performance assessment, avoiding overfitting.

🔵 Compound Filtering – Removing non-informative molecular descriptors improves predictive accuracy.

🟡 Data Visualization – Regression plots (Seaborn) visually depict experimental vs. predicted pIC50 values for better interpretation.

🟠 Biological Relevance – Helps in identifying high-affinity drug candidates for specific target proteins in diseases like Alzheimer's.

🔹 Customizable Analysis – Supports different diseases and descriptor files, making the implementation flexible.

🟢 Non-Experimental Predictions – Reduces the need for extensive lab testing by providing preliminary computational insights.

🔻 Pattern Recognition – Captures nonlinear relationships between molecular descriptors and bioactivity using Random Forest.

🔸 Scalability – ML methods can be adapted to analyze multiple target proteins for different diseases.

🟣 Preprocessing Automation – Eliminates missing values and redundant features, ensuring a cleaner dataset.

🟠 Facilitates Hypothesis Testing – Helps researchers validate drug-target interaction hypotheses before experimental validation.

🟡 Cross-Disciplinary Utility – Useful for computational chemistry, pharmacology, and bioinformatics researchers.

🔵 Interactive Decision-Making – Streamlit UI allows real-time selection and visualization of results.

🔹 Reproducibility – The ML pipeline ensures standardized processing, making results consistent and comparable.

🟢 Future Applications – Can be extended to deep learning models for even more accurate bioactivity predictions.

🌟Implementation Steps:

1️⃣ The ChemBL ID, SMILES and IC50 values fetched automatically from chembl_webresource_client and molecular fingerprints from PubChem API, ensuring the data is stored in a structured format (CSV or DataFrame). The dataset should contain: Molecular Identifiers (e.g., ChemBL ID) Fingerprint Descriptors (Bit-vector/Binary representation) IC50 Values (Bioactivity measurement in nM)

💡 Converted IC50 values to pIC50 for better statistical modeling:

2️⃣ Data Preprocessing 🔹 Load the Data 🔹 Handle Missing Data 🔹 Remove Low-Variance Features

3️⃣ Train-Test Splitting 4️⃣ Train the ML Model 5️⃣ Model Evaluation 6️⃣ Predictions & Visualization 7️⃣ Interactive Streamlit App Future Enhancements:- Use Deep Learning (Neural Networks) for Bioactivity Prediction. Implement Explainable AI (SHAP values) to interpret feature importance. Automate descriptor extraction from PubChem API.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published