This project seeks to develop an in silico pipeline for identifying new SARS-CoV-2 inhibitors by integrating machine learning and molecular modeling techniques.
SARS-CoV-2 is the virus that causes coronavirus disease 2019 (COVID-19), which emerged in late 2019 and became a worldwide pandemic. As of the most recent reporting period, more than 7 million confirmed COVID-19 deaths have been reported to the WHO, and the estimated number of deaths related to the pandemic is much higher. The pandemic has had a profound and multifaceted impact on global public health and the economy. The sheer number of cases overwhelmed healthcare systems worldwide, disrupting essential medical services and driving up COVID-19-related deaths. SARS-CoV-2 vaccines, hailed as a game changer in vaccine development, were developed in less than a year, a fraction of the 10 or more years usually required. Despite this success, effective therapeutic drugs for people who were already infected remained a high priority. Early in the pandemic, dexamethasone and remdesivir were repurposed to treat COVID-19. Dexamethasone, an inexpensive steroid, was used to reduce inflammation in severely ill patients and was shown to significantly decrease mortality, while remdesivir was found to reduce recovery time.

To address this need for effective COVID-19 therapeutics, our approach trains several machine learning models on the SARS-CoV-2 dataset from PubChem to distinguish potential inhibitors of the main protease (Mpro) from inactive compounds. The predicted actives will then be screened against an Mpro structure for downstream analysis.
- Train machine learning models on the SARS-CoV-2 dataset
- Evaluate ML performance
- Predict potential SARS-CoV-2 inhibitors (actives)
- Virtually screen the predicted actives against the identified protein crystal structure and elucidate their interactions.
- Perform pharmacological and physicochemical profiling on the hits
- Conduct MD simulations to determine the stability of the compounds' binding modes, and estimate binding free energies with MM/PBSA computations.
- Compile results and observations.
This figure depicts the proposed sars2drugml pipeline for identifying effective SARS-CoV-2 inhibitors.
- The compound library was obtained from PubChem BioAssay ID: 1919861.
- The dataset was experimentally generated using a confirmatory biochemical high-throughput screening (HTS) assay targeting the SARS-CoV-2 Main Protease (Mpro).
- Compounds were tested at concentrations of 20 µM and 50 µM, and those showing greater than 50% inhibition at 50 µM were advanced to dose–response confirmation. The assay utilized fluorescence-based detection to measure Mpro activity.
- The chemical structure of each compound is represented numerically by molecular fingerprints, which allow ML algorithms to learn from the structures and make predictions. Morgan fingerprints generated from SMILES strings capture key structural features of each compound.
- Data preprocessing also involved removing duplicate compounds based on their fingerprint values. Missing fingerprint values were filled with zero.
- Scikit-learn's VarianceThreshold with a threshold of 0.1 was applied to reduce the number of features by eliminating low-information bits.
- The data were then standardized using each feature's mean and standard deviation and finally split into training (70%) and test (30%) sets, stratified by class to maintain the approximate 50:50 active/inactive ratio.
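
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the `preprocess` function, its parameter names, the random seed, and the synthetic bit matrix standing in for real Morgan fingerprints are all assumptions made for the example. Missing values are filled before deduplication here so that rows can be compared reliably.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(X, y, variance_threshold=0.1, test_size=0.3, seed=0):
    """Fill, deduplicate, filter, scale, and split a fingerprint matrix."""
    # Missing fingerprint values are replaced with zero.
    X = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0)
    y = np.asarray(y)

    # Drop compounds with identical fingerprints, keeping the first occurrence.
    _, keep = np.unique(X, axis=0, return_index=True)
    keep = np.sort(keep)
    X, y = X[keep], y[keep]

    # Remove low-variance (low-information) bits.
    selector = VarianceThreshold(threshold=variance_threshold)
    X = selector.fit_transform(X)

    # Standardize each remaining feature with its mean and standard deviation.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Stratified 70/30 split to preserve the active/inactive ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    return X_train, X_test, y_train, y_test, selector, scaler


# Synthetic stand-in data (assumption): 200 "compounds", 64 fingerprint bits.
rng = np.random.default_rng(0)
densities = rng.uniform(0.0, 0.5, size=64)  # per-bit probability of being set
X_demo = (rng.random((200, 64)) < densities).astype(float)
X_demo[0, 0] = np.nan       # one missing value
X_demo[2] = X_demo[1]       # one duplicated fingerprint
y_demo = np.tile([0, 1], 100)

X_train, X_test, y_train, y_test, selector, scaler = preprocess(X_demo, y_demo)
print(X_train.shape, X_test.shape)
```

The sketch follows the order described above (scaling, then splitting). Fitting the scaler on the training split only is a common alternative that keeps test-set statistics out of the training data. The fitted `selector` and `scaler` are returned so that the same transformations can be reused on new compounds.
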
- Seven classification models (KNN, NB, SVM, AdaBoost, XGBoost, RF, LR) were trained on Morgan fingerprints generated with RDKit.
- The models' predictive ability was assessed via 10-fold cross-validation and on an independent test set.
- Model performance was measured using accuracy, precision, recall, F1 score, and AUC-ROC, with ROC curves plotted for the test data.
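
A rough sketch of this training and evaluation loop is shown below. It is illustrative only: the data come from scikit-learn's `make_classification` rather than the PubChem assay, all hyperparameters are library defaults, and the Gaussian variant of naive Bayes is an assumption since the variant used is not stated. XGBoost is left out to keep the example dependent on scikit-learn alone; where the `xgboost` package is installed, its `XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed fingerprint matrix (assumption).
X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Default hyperparameters throughout (assumption).
models = {
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(probability=True, random_state=0),  # probabilities for AUC-ROC
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation on the training set.
    cv_acc = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")

    # Refit on the full training set and score on the held-out test set.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "cv_accuracy": cv_acc.mean(),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```
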
- The best-performing model was employed for external validation using a curated dataset of known SARS-CoV-2 Mpro inhibitors.
- Compounds such as Nirmatrelvir, Baicalein, Ebselen, and GRL-1720 were included to assess the model's predictive capability.
- SMILES strings were preprocessed and Morgan fingerprints generated following the same procedures used during model training.
- The applicability domain (AD) was determined to establish the chemical space within which QSAR model predictions are considered reliable.
- A leverage-based method was used to estimate the AD and assess interpolation regions in the QSAR analysis.
- We computed the leverage threshold to identify structural and response outliers within the dataset.
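
For illustration, a leverage-based applicability domain check can be sketched as below. The leverage of a compound with descriptor vector $x_i$ is $h_i = x_i^\top (X^\top X)^{-1} x_i$, where $X$ is the training descriptor matrix. The warning threshold $h^* = 3(p+1)/n$, with $p$ descriptors and $n$ training compounds, is the convention commonly used in QSAR work and is assumed here; the project's exact threshold is not stated. The function names and the random data are hypothetical.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (X^T X)^-1 x_i for each row of X_query."""
    # The pseudo-inverse keeps the computation stable if X^T X is near-singular.
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def leverage_threshold(X_train):
    """Commonly used warning leverage h* = 3(p + 1) / n (assumed convention)."""
    n, p = X_train.shape
    return 3.0 * (p + 1) / n


# Random stand-in descriptors (assumption): 100 training compounds, 5 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))

# Query set: three training compounds plus one point far from the training data.
X_query = np.vstack([X_train[:3], 10.0 * np.ones((1, 5))])

h = leverages(X_train, X_query)
h_star = leverage_threshold(X_train)
inside_ad = h <= h_star  # True where a prediction would be considered reliable

print("threshold:", h_star)
print("leverages:", np.round(h, 3))
print("inside AD:", inside_ad)
```

Compounds with leverage above the threshold lie outside the chemical space covered by the training set, so predictions for them are extrapolations.
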
- The XGBoost model was deployed as the publicly accessible web server SARS2Pred to enable broad usage.
- SMILES strings, provided individually, in lists, or via uploaded files, are processed using Morgan fingerprints (2048-bit, radius 2) and the same preprocessing pipeline as in training, then passed to the model for prediction.
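
The featurization step described above could look roughly like the following sketch, which turns one SMILES string or a list of them into 2048-bit, radius-2 Morgan fingerprints with RDKit. The `featurize` helper and its handling of invalid input (silently skipping it) are assumptions made for the example and are not taken from the SARS2Pred code.

```python
import numpy as np
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse messages for bad SMILES

# Morgan fingerprint generator: radius 2, 2048 bits, as stated above.
_GENERATOR = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def featurize(smiles):
    """Return (valid_smiles, fingerprint_matrix) for a SMILES string or list."""
    if isinstance(smiles, str):
        smiles = [smiles]

    kept, rows = [], []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # unparsable SMILES: skip it (assumed behavior)
            continue
        fp = _GENERATOR.GetFingerprint(mol)
        arr = np.zeros((0,), dtype=np.uint8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to 2048 bits
        rows.append(arr)
        kept.append(smi)

    X = np.vstack(rows) if rows else np.empty((0, 2048), dtype=np.uint8)
    return kept, X


# Aspirin, ethanol, and one invalid string.
valid, X_new = featurize(["CC(=O)Oc1ccccc1C(=O)O", "CCO", "not_a_smiles"])
print(valid, X_new.shape)
```

The resulting matrix would then go through the same feature selection and scaling fitted during training before being passed to the model's `predict` method.
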
From the various analyses and computations performed throughout this in silico exploration, the following results were obtained and observations made.
The team members are:
| Number | Name | Email address | ORCID ID |
|---|---|---|---|
| 1 | George Hanson | [email protected] | 0009-0007-2720-9102 |
| 2 | Anirudh Bhatia | [email protected] | 0009-0009-8626-5651 |
| 3 | Emmanuel Ankomah Kissi | [email protected] | 0009-0003-4506-6230 |
| 4 | Daniella Ene Obong | [email protected] | 0009-0002-2901-5997 |
| 5 | Nathalie Ouaré | [email protected] | 0009-0001-1348-2665 |
| 6 | Olaitan Igbagbo Awe, Ph.D. | [email protected] | 0000-0002-4257-3611 |
