Introducing MolVault from VaultChem
MolVault is an application that enables predicting properties of confidential molecules using an untrusted server without sharing them publicly.
Machine learning (ML) has become a cornerstone of modern drug discovery. However, the data used to evaluate ML models is often confidential. This is especially true in the pharmaceutical industry, where new drug candidates 💊 are considered the most valuable asset. Chemical companies are therefore reluctant to share their data with third parties, for instance to use ML services provided by other companies.
With MolVault, only an encrypted version of the molecular data is sent to a server. The data remains encrypted even while the prediction is computed, and only the party that holds the private key can decrypt the result.
🪄 The magic? 🪄
The server on which the prediction is computed never sees the molecule in clear text, yet it still computes an encrypted prediction. Why is this magic? Because this is equivalent to computing the prediction on the molecule in clear text, but without sharing the molecule with the server. Even if the server operator (organization "B") or any other party tried to steal the data, they would only see the encrypted molecular data. Only the party that has the private key (organization "A") can decrypt the prediction. This is possible thanks to a method called fully homomorphic encryption (FHE), a special encryption scheme that allows computations to be performed directly on encrypted data.
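To build some intuition for computing on encrypted data, here is a minimal sketch in plain Python. It is not FHE and not what Concrete-ml does internally: it uses a toy Paillier scheme, which is only additively homomorphic, with small fixed primes that offer no real security. All names (encrypt, decrypt, server_predict) and the integer weights are made up for illustration. Party "A" encrypts integer features, party "B" evaluates a linear model on the ciphertexts, and only "A" can decrypt the result.

```python
import math
import secrets

# Toy key material: two well-known Mersenne primes. NOT secure, illustration only.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                      # public key
N2 = N * N
LAM = math.lcm(P - 1, Q - 1)   # private key, held by party "A" only
MU = pow(LAM, -1, N)           # private key (this shortcut assumes generator g = N + 1)


def encrypt(m: int) -> int:
    """Encrypt a non-negative integer m < N using only the public key."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(c: int) -> int:
    """Decrypt a ciphertext with the private key (LAM, MU)."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def server_predict(enc_features, weights, bias):
    """Runs on party "B": returns Enc(sum(w * x) + bias) without ever decrypting.

    Multiplying ciphertexts adds the plaintexts; raising a ciphertext to the
    power w multiplies its plaintext by w.
    """
    acc = encrypt(bias)
    for c, w in zip(enc_features, weights):
        acc = acc * pow(c, w, N2) % N2
    return acc


# Party "A": encrypt a (hypothetical) integer feature vector and send it to "B".
features = [3, 14, 7]
enc_features = [encrypt(x) for x in features]

# Party "B": evaluate a (hypothetical) linear model on the ciphertexts.
enc_prediction = server_predict(enc_features, weights=[2, 5, 1], bias=4)

# Party "A": decrypt the prediction.
print(decrypt(enc_prediction))  # 87 = 2*3 + 5*14 + 1*7 + 4
```

A fully homomorphic scheme goes further than this sketch: it also supports the non-linear operations that models such as SVR or XGBoost need, which is why MolVault relies on Concrete-ml rather than on a scheme like the one above.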
What are the steps involved?
Find out below! 👇 Or try it out yourself on huggingface:
MolVault is built on Zama's Concrete-ml library for fully homomorphic encryption (FHE). To learn more about FHE, see the Concrete-ml documentation.
Consider a pharma company "A" that is interested in the properties of drug candidate molecules; for instance, understanding pharmacokinetic properties is crucial for determining a drug candidate's effectiveness [1].
Organization "A" may not have sufficient data available for reliable ML predictions. Instead, "A" can securely obtain predictions on its molecular data from an untrusted party "B", which owns a secret database and an ML model with sufficient training data. FHE makes this possible because it guarantees that "A" does not reveal the secret query to party "B". Here this scenario is simulated with open-source chemistry datasets and tools: rdkit for cheminformatics and Concrete-ml for FHE. We give an end-to-end solution to the problem of privacy-preserving prediction for molecules.
In this tutorial we will walk you through the details of training a machine learning model on a chemistry dataset, deploying the model with fully homomorphic encryption, and using the model to make predictions on new molecules.
First clone the repository:
git clone https://github.com/vaultchem/molvault.git
Navigate to the molvault folder.
Create a new Python 3.10 environment, for instance using miniconda:
conda create -n molvault python=3.10
Now activate the environment:
conda activate molvault
and install the requirements using
pip install -r requirements.txt
and then run
pip install -e .
to install the package.
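As an alternative to Conda, you can run MolVault in a Docker environment using the commands below. In the docker run command, replace /media/vaultchem/molvault with the path to your local clone of the repository.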
docker build -t molvault .
docker run -it -v /media/vaultchem/molvault:/app -p 8501:8501 molvault
export PYTHONPATH=$PYTHONPATH:/app/examples/huggingface/app/
How to use MolVault to fit a fully homomorphic encrypted (FHE) model to your data and make predictions on new molecules
Your CSV file data.csv with the chemistry data should have the following format:
SMILES, target_name
CC, 2.312
...
Note that the name of the column containing the SMILES strings must be "SMILES" and cannot be changed. The name of the target column can be set with the --target option. To perform regression, including hyperparameter optimization, and deploy the FHE model into a subfolder called "deploy", run the following command:
python regress.py --data example_data.csv --target "example_target"
The default regression model is support vector regression ("SVR"), selected with the --regtype option. To switch to XGBoost, use "XGB".
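Before starting a long hyperparameter search, it can help to check that the CSV matches the layout described above. The following is a minimal sketch, not part of MolVault: the helper names check_molvault_csv and regress_command are hypothetical, and the checker tolerates a space after the comma as in the example header, which may or may not match how regress.py parses the file. The command builder only uses the options named in this README (--data, --target, --regtype).

```python
import csv
import io


def check_molvault_csv(text: str, target: str) -> int:
    """Return the number of data rows; raise ValueError if the layout is wrong."""
    # skipinitialspace tolerates headers written as "SMILES, target_name".
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    fields = reader.fieldnames or []
    if "SMILES" not in fields:
        raise ValueError('missing required column "SMILES"')
    if target not in fields:
        raise ValueError(f"missing target column {target!r}")
    n_rows = 0
    for row in reader:
        if not row["SMILES"]:
            raise ValueError(f"line {reader.line_num}: empty SMILES string")
        try:
            float(row[target])
        except (TypeError, ValueError):
            raise ValueError(f"line {reader.line_num}: target is not numeric") from None
        n_rows += 1
    return n_rows


def regress_command(data: str, target: str, regtype: str = "SVR") -> list[str]:
    """Build the argument list for regress.py from the options named in this README."""
    if regtype not in ("SVR", "XGB"):
        raise ValueError('regtype must be "SVR" or "XGB"')
    return ["python", "regress.py", "--data", data, "--target", target, "--regtype", regtype]


sample = "SMILES, target_name\nCC, 2.312\nCCO, 0.5\n"  # made-up values
print(check_molvault_csv(sample, "target_name"))
print(" ".join(regress_command("example_data.csv", "example_target", "XGB")))
```

The resulting list can be passed to subprocess.run from the repository folder, or simply printed and pasted into a shell.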
As output, you will first get the hyperparameters of the best model. If needed, you can change the hyperparameter grid in the regress_utils.py file.
Predictions on the same points using the sklearn model and its FHE counterpart will be printed, along with their Pearson correlation coefficient. The error and Pearson correlation of the FHE model with respect to the true values will also be printed. Finally, the model is saved in the deploy folder. You may change the folder name using the --deploy option.
The tutorial.ipynb notebook for MolVault provides a guide on using the application to predict properties of molecules using machine learning. The notebook includes the following sections:
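To make the reported numbers easier to interpret, here is a small dependency-free sketch of the two kinds of metrics mentioned above. It is an illustration only: the README does not say which error measure regress.py prints, so mean absolute error is an assumption here, and the prediction values are made up.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up numbers standing in for true values, sklearn and FHE predictions.
y_true = [2.31, 1.10, 0.45, 3.02]
y_sklearn = [2.20, 1.25, 0.60, 2.90]
y_fhe = [2.18, 1.27, 0.58, 2.93]

print("sklearn vs FHE, Pearson r:", pearson_r(y_sklearn, y_fhe))
print("FHE vs true, Pearson r:", pearson_r(y_fhe, y_true))
print("FHE vs true, MAE:", mean_absolute_error(y_true, y_fhe))
```

A correlation close to 1 between the sklearn and FHE predictions indicates that the encrypted model reproduces the plaintext model closely on those points.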
- How to prepare the chemistry dataset for machine learning: load the data, preprocess it, and split it into training and testing sets.
- The model training process: selecting a model (either Support Vector Regression or XGBoost), performing hyperparameter optimization, and evaluating the model's performance.
- How to deploy the model with fully homomorphic encryption: encrypt the model, evaluate it using encrypted data, and compare the predictions of the encrypted model with those of the plaintext model.
- How to use the model to make predictions on new molecules, including examples with molecules like Vitamin D, Ethanol, and Ibuprofen. This section highlights the application's capability to make predictions without revealing the molecular structures to the server.
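As a rough illustration of the data preparation step, the sketch below splits a list of (SMILES, target) rows into training and testing sets with a fixed seed. It is a dependency-free stand-in, not the notebook's code: the function name, the 80/20 ratio, and the example rows are assumptions, and the notebook may well use a library routine instead.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up (SMILES, target) pairs for illustration.
rows = [("CC", 2.3), ("CCO", 0.5), ("CCC", 1.9), ("CCN", 0.8), ("CCCl", 1.2),
        ("C", 3.1), ("CO", 0.2), ("CCCC", 1.7), ("CN", 0.9), ("CCOC", 1.1)]
train, test = split_rows(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Keeping the test rows out of hyperparameter optimization is what makes the later comparison between plaintext and FHE predictions meaningful.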
The easiest way to run the app is via the huggingface space (see above). Alternatively, after following the installation instructions, you can launch the app as follows:
- Navigate to the examples/huggingface/app folder
- To use the SVR model, go to the SVR folder; to use the XGB model, go to the XGB folder
- Move the deployment.zip file to the app folder
- Unzip the deployment.zip file using unzip deployment.zip
- Run bash run_app.sh to start the app
- Open your browser and go to http://localhost:8501/
Note that the huggingface space is already set up to use the SVR model.
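The file-handling steps above can also be scripted. The sketch below is a hypothetical helper, not part of MolVault: prepare_app is an invented name, both folder paths are passed in by the caller, and it copies deployment.zip instead of moving it so that the original archive stays in the model folder. Starting the app with bash run_app.sh remains a separate step.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_app(model_dir, app_dir):
    """Copy deployment.zip from model_dir into app_dir and unpack it there.

    Returns the list of archive members that were extracted.
    """
    model_dir, app_dir = Path(model_dir), Path(app_dir)
    archive = model_dir / "deployment.zip"
    if not archive.is_file():
        raise FileNotFoundError(f"no deployment.zip found in {model_dir}")
    target = app_dir / "deployment.zip"
    shutil.copy2(archive, target)          # copy, so the source archive is kept
    with zipfile.ZipFile(target) as zf:
        zf.extractall(app_dir)             # same effect as: unzip deployment.zip
        return zf.namelist()


# Example call, assuming it is run from the repository root and that the
# SVR folder sits inside the app folder:
# prepare_app("examples/huggingface/app/SVR", "examples/huggingface/app")
```

Passing the XGB folder as the first argument selects the XGBoost model instead.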
To fit FHE models with SVR and XGB to all properties contained in the ADME dataset [1], run:
python fit_all.py
This will create a subfolder models containing the fitted models, which can readily be used for the app, and a file FHE_timings.json containing the timings for the FHE models.
Note: By default, this involves extensive hyperparameter search. In particular, computing the XGB models can be computationally demanding. You can also download the pretrained models from the huggingface space.
[1] Fang, C., Wang, Y., Grater, R., Kapadnis, S., Black, C., Trapa, P., & Sciabola, S. (2023). Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective. Journal of Chemical Information and Modeling, 63(11), 3263-3274. https://doi.org/10.1021/acs.jcim.3c00160
The dataset used to train the ML models for the app can be found at:
https://github.com/molecularinformatics/Computational-ADME
Visit our website: VaultChem
[Copyright notice]: Server_rack icon by DBCLS https://togotv.dbcls.jp/en/pics.html is licensed CC-BY 4.0 Unported, rat-adult icon by Servier https://smart.servier.com/ is licensed under CC-BY 3.0 Unported https://creativecommons.org/licenses/by/3.0/