Built around the AttentiveFP model provided by PyG (PyTorch Geometric), this project trains variants of the model to predict the UV-Vis absorption spectra of organic molecules. The dataset is provided by Oak Ridge National Laboratory (ORNL) and covers a wide range of organic molecules and their corresponding UV-Vis absorption spectra.
This project seeks to leverage graph neural networks to understand and predict how organic molecules absorb UV light, which is crucial for various applications in chemistry and materials science. By accurately predicting these spectra, we can facilitate the design of new materials and molecules with desired optical properties.
The prediction of absorption spectra using deep learning models has become a significant topic in spectroscopy. Traditional methods for predicting UV-Vis spectra often involve complex quantum mechanical calculations, which are computationally expensive and time-consuming. Deep learning models, especially graph neural networks like AttentiveFP, offer a promising alternative by learning directly from data.
AttentiveFP, a model based on attention mechanisms within graph neural networks, has shown promising results in various molecular property prediction tasks. In this project, we adapt AttentiveFP and explore different graph attention methods such as GAT, GATv2, MoGAT(v2), and DenseGAT to enhance the prediction accuracy of UV-Vis spectra.
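For reference, below is a minimal sketch of instantiating the stock PyG AttentiveFP model with this project's default hyperparameters. The input, output, and edge dimensions are placeholder assumptions, and the project replaces the internal attention layers with the variants listed above rather than using the class as-is.

```python
# Minimal sketch: stock AttentiveFP from PyTorch Geometric, configured with
# this project's default hyperparameters. The in/out/edge dimensions are
# placeholder assumptions; the project swaps the internal attention layers
# (GAT, GATv2, MoGATv2, DenseGAT) rather than using this class unchanged.
from torch_geometric.nn.models import AttentiveFP

model = AttentiveFP(
    in_channels=39,       # assumed atom-feature size
    hidden_channels=250,  # default_config['hidden_channels']
    out_channels=600,     # assumed number of points in the predicted spectrum
    edge_dim=10,          # assumed bond-feature size
    num_layers=4,         # default_config['num_layers']
    num_timesteps=2,      # default_config['num_timesteps']
    dropout=0.025,        # default_config['dropout']
)
```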
The main objectives of this project are:
- Data Preparation: Prepare a dataset of molecules with corresponding UV-Vis spectra.
- Model Training: Train AttentiveFP with different Graph Attention methods, such as GAT, GATv2, MoGAT(v2), and DenseGAT.
- Prediction and Evaluation: Predict UV-Vis spectra for unseen molecules and evaluate the performance of the models.
Prerequisites:
- Python 3.11+
- Anaconda
Follow these steps to set up your environment and install all necessary dependencies:
- Clone the Repository

  ```bash
  git clone https://github.com/williamnyren/AttentiveFP-UV.git
  cd AttentiveFP-UV
  ```
- Create and Activate the Conda Environment

  This will create the environment and install all conda and pip dependencies specified in the `environment.yml` file.

  ```bash
  conda env create -f environment.yml
  conda activate attentive_fp
  ```
- Make the `postBuild` Script Executable

  The `postBuild` script is used to install PyTorch, torchvision, and torchaudio with the specified CUDA version.

  ```bash
  chmod +x postBuild
  ```
- Run the `postBuild` Script

  Execute the script to complete the installation of the required packages.

  ```bash
  ./postBuild
  ```
By following these steps, you will set up your development environment with all the necessary dependencies to run the AttentiveFP-UV project.
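To verify that the installation succeeded, a quick sanity check along these lines can help (a minimal sketch; it only confirms that the core packages import and that CUDA is visible):

```python
# Sanity check: confirm PyTorch and PyG import cleanly and the CUDA build is usable.
import torch
import torch_geometric

print('torch:', torch.__version__)
print('torch_geometric:', torch_geometric.__version__)
print('CUDA available:', torch.cuda.is_available())
```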
To retrieve the dataset required for this project, follow these detailed steps:
This dataset is based on the work described in the article *Two excited-state datasets for quantum chemical UV-vis spectra of organic molecules*; refer to it for details.
- Go to the ORNL Data Transfer Guide
  - Follow the instructions on how to install and use Globus from the ORNL Data Transfer Guide.
- Install Globus
  - Download and install the Globus app from the official Globus site.
- Connect to Globus
  - Open the Globus app and sign in using your credentials. Follow the on-screen instructions to connect to the Globus file transfer service.
- Locate the Files on Globus
- Transfer the Files
  - Ensure that you are connected to Globus in the installed app.
  - Navigate to the file you want to download from Globus via the references mentioned above.
  - Select a directory on your local system where you want to transfer the files.
  - Follow the instructions in the Globus app to initiate and complete the file transfer.
- Extract Data Files and Directories
  - Extract the two files transferred from Globus and put them into the directory `ORNL_data`. The expected layout is:

    ```
    ORNL_data
    └── extracted
        ├── gdb9_ex.csv
        ├── ornl_aisd_ex_1.csv
        ├── ornl_aisd_ex_2.csv
        ├── ...
        └── ornl_aisd_ex_1000.csv
    ```
By following these steps, you will be able to download the dataset required for this project.
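To sanity-check the extraction, you can count the shard files against the layout above (a small sketch; paths follow the tree shown earlier):

```python
# Check that the extracted ORNL dataset matches the expected layout.
from pathlib import Path

data_dir = Path('ORNL_data/extracted')
shards = sorted(data_dir.glob('ornl_aisd_ex_*.csv'))
print(f'Found {len(shards)} shard files (expecting 1000)')
print('gdb9_ex.csv present:', (data_dir / 'gdb9_ex.csv').exists())
```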
Before training and running the model, the data has to be prepared. Do so by running `preprocess.py`:

```bash
python -m src.processing.preprocess
```
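If you want to inspect the raw data before preprocessing, a quick peek at one shard is harmless (a sketch; no particular column layout is assumed here, it just prints whatever the CSV contains):

```python
# Peek at the first few rows of one raw shard.
import pandas as pd

df = pd.read_csv('ORNL_data/extracted/ornl_aisd_ex_1.csv', nrows=5)
print(df.columns.tolist())
print(df.head())
```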
Train the model using command-line arguments or a WandB configuration file. You can override the default parameters directly through command-line arguments. Defaults:
```python
default_config = {
    'lr': 5e-4,
    'hidden_channels': 250,
    'num_layers': 4,
    'num_timesteps': 2,
    'dropout': 0.025,
    'seed': None,
    'num_workers': 4,
    'total_epochs': 10,
    'warmup_epochs': 1,
    'run_id': None,
    'batch_size': 0,
    'Attention_mode': 'MoGATv2',
    'heads': 2,
    'loss_function': 'mse_loss',
    'metric': 'srmse',
    'savitzkey_golay': [],
    'window_length': 5,
    'polyorder': 3,
    'padding': True,
    'lr_ddp_scaling': 0,
    'batch_ddp_scaling': 1,
    'with_fake_edges': 0,
    'LOSS_FUNCTION': '',
    'METRIC_FUNCTION': 'srmse',
    'NUM_PROCESSES': 1,
    'DATA_DIRECTORY': 'data'
}
```
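The `savitzkey_golay`, `window_length`, and `polyorder` options point to Savitzky-Golay smoothing of the spectra. As an illustration under that assumption (the exact place where the repo applies the filter may differ), the defaults correspond to something like:

```python
# Assumed usage of the Savitzky-Golay options: smoothing a spectrum with
# scipy.signal.savgol_filter and the defaults window_length=5, polyorder=3.
# Illustration only; where and how the repo applies this may differ.
import numpy as np
from scipy.signal import savgol_filter

spectrum = np.random.rand(600)  # hypothetical spectrum on a 600-point grid
smoothed = savgol_filter(spectrum, window_length=5, polyorder=3)
```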
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=default_config['lr'], help='Learning rate')
parser.add_argument('--hidden_channels', type=int, default=default_config['hidden_channels'], help='Hidden channels')
parser.add_argument('--num_layers', type=int, default=default_config['num_layers'], help='Number of layers')
parser.add_argument('--num_timesteps', type=int, default=default_config['num_timesteps'], help='Number of timesteps')
parser.add_argument('--dropout', type=float, default=default_config['dropout'], help='Dropout')
parser.add_argument('--seed', type=int, default=default_config['seed'], help='Seed')
parser.add_argument('--num_workers', type=int, default=default_config['num_workers'], help='Number of workers')
parser.add_argument('--run_id', type=str, default=default_config['run_id'], help='Run ID for resuming training')
parser.add_argument('--total_epochs', type=int, default=default_config['total_epochs'], help='Number of epochs to train')
parser.add_argument('--batch_size', type=int, default=default_config['batch_size'], help='Batch size')
parser.add_argument('--Attention_mode', type=str, default=default_config['Attention_mode'], help='Attention mode')
parser.add_argument('--heads', type=int, default=default_config['heads'], help='Number of heads')
parser.add_argument('--loss_function', type=str, default=default_config['loss_function'], help='Loss function')
parser.add_argument('--metric', type=str, default=default_config['metric'], help='Metric')
parser.add_argument('--savitzkey_golay', type=list, default=default_config['savitzkey_golay'], help='Savitzkey Golay filter')
parser.add_argument('--with_fake_edges', type=int, default=default_config['with_fake_edges'], help='Data with fake edges')
parser.add_argument('--DATA_DIRECTORY', type=str, default=default_config['DATA_DIRECTORY'], help='Data directory')
parser.add_argument('--lr_ddp_scaling', type=int, default=default_config['lr_ddp_scaling'], help='Scale learning rate with number of GPUs')
parser.add_argument('--batch_ddp_scaling', type=int, default=default_config['batch_ddp_scaling'], help='Scale batch size with number of GPUs')
parser.add_argument('--warmup_epochs', type=int, default=default_config['warmup_epochs'], help='Warmup epochs')
```
Run the script with your desired parameters:

```bash
python train_ddp.py --lr 0.001 --hidden_channels 256 --num_layers 6
```
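Note that `train_ddp.py` is a distributed (DDP) trainer; if it picks up every visible GPU on the machine, you can restrict it with the standard `CUDA_VISIBLE_DEVICES` environment variable. The `--lr_ddp_scaling` and `--batch_ddp_scaling` flags control whether the learning rate and batch size are scaled with the number of GPUs.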
You can also use a WandB configuration file to set the parameters. Create a `config.yaml` file with the following content:
```yaml
# Program to run
program: 'train_ddp.py'
# Sweep search method: random, grid or bayes
method: 'random'
# Project this sweep is part of
project: 'example_project'
entity: <WANDB_USER>
# Metric to optimize
metric:
  name: 'val_srmse'
  goal: 'minimize'
# Parameter search space
parameters:
  lr:
    values: [0.0001]
  hidden_channels:
    values: [600]
  num_layers:
    values: [8]
  num_timesteps:
    values: [2]
  dropout:
    values: [0.05]
  num_workers:
    value: 3
  total_epochs:
    value: 50
  warmup_epochs:
    value: 2
  batch_size:
    values: [0]
  Attention_mode:
    values: ['GATv2']
  heads:
    values: [3]
  loss_function:
    values: ['mse_loss']
  with_fake_edges:
    value: 0
  lr_ddp_scaling:
    value: 0
  batch_ddp_scaling:
    value: 1
  savitzkey_golay:
    values: [0]
  # seed:
  #   values: [42, 13, 7]
```
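Note the W&B sweep convention used above: `value` pins a parameter to a single setting, while `values` lists the candidates the sweep may sample from.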
With the configuration file set up, we can now use the W&B sweep functionality:

```bash
wandb sweep config.yaml
```

You should now receive a sweep ID in the terminal. The project and sweep should also be present on your W&B page.
We are now able to start a new run in the sweep:

```bash
wandb agent <WANDB_USER>/example_project/<sweep-ID>
```
The final step to get everything operational is to edit the path variables and any variables related to the wandb setup. These changes are made in `src/config/params.py`.
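As an illustration only, the kind of settings one would expect to adjust there might look like the following; the variable names here are hypothetical, and the actual contents of `src/config/params.py` may differ.

```python
# Hypothetical sketch of src/config/params.py settings; the real variable
# names and values in the repository may differ.
DATA_DIRECTORY = 'data'            # where preprocessed data lives
WANDB_ENTITY = '<WANDB_USER>'      # your W&B user or team name
WANDB_PROJECT = 'example_project'  # project used by the sweep config
```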