An auto-regressive causal language model for molecule (SMILES) and reaction template (SMARTS) generation. Based on the Hugging Face implementation of OpenAI’s GPT-2 transformer decoder model.
This work focuses on chemistry, with the goal of supporting the discovery of drugs to cure diseases or of sustainable materials for cleaner energy. The research explores the potential of a transformer decoder model to generate chemically feasible molecules and reaction templates. We begin by comparing a transformer decoder architecture against the GuacaMol benchmark for molecule generation, and assess the influence of various tokenizers on performance. The study also fine-tunes a pre-trained language model and compares its outcomes with those of a model trained from scratch. Multiple metrics, including the Fréchet ChemNet Distance, are used to evaluate the model's ability to generate new, valid molecules similar to the training data. The results indicate that the transformer decoder model outperforms the GuacaMol model on this metric and also succeeds in generating known reaction templates.
- How well does this model perform for molecule generation, using the GuacaMol paper as a benchmark?
- What is the effect of different tokenization approaches (different regular expressions as pre-tokenizers, different tokenization algorithms such as BPE or WordPiece)?
- Can we use a model pre-trained on natural language as a basis for fine-tuning a “molecule language” model?
- Can we use this approach/model to generate reaction templates?
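To make the tokenization question more concrete, here is a minimal sketch of a regular-expression pre-tokenizer for SMILES, contrasted with plain character-level splitting. The pattern and the function name `pretokenize_smiles` are assumptions made for illustration; they are not necessarily any of the pre-tokenizers used in this project, and the pattern is deliberately permissive.

```python
import re

# Illustrative pattern only; NOT necessarily one of the project's pre-tokenizers.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [C@H] or [NH3+]
    r"|Br|Cl"               # two-letter halogens, tried before single letters
    r"|%\d{2}"              # two-digit ring-closure labels such as %12
    r"|[A-Za-z]"            # remaining single-letter atoms (permissive)
    r"|\d"                  # single-digit ring-closure labels
    r"|[()=#+\-\\/.:~@*$]"  # branches, bonds, charges, stereo marks, wildcards
)


def pretokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful chunks.

    Raises ValueError if some character is not covered by the pattern,
    so that nothing is dropped silently.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not pre-tokenize {smiles!r} without loss")
    return tokens


if __name__ == "__main__":
    alanine = "C[C@H](N)C(=O)O"
    print(list(alanine))                # character-level: '[', 'C', '@', ... are separate
    print(pretokenize_smiles(alanine))  # regex-level: '[C@H]' stays one token
```

A subword algorithm such as BPE or WordPiece would then merge these pre-tokens further, which is one reason the choice of pre-tokenizer can influence the resulting vocabulary.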
A local (editable package) installation requires python ≥ 3.9, poetry ≥ 1.0.0 and pip ≥ 22.3. Experiment results are logged to Weights & Biases.
git clone --recurse-submodules https://github.com/hogru/molreactgen
cd molreactgen
python -m pip install -e .
- `prepare_data.py` downloads and prepares the datasets
- `train.py` trains the model on a given dataset, configured via (optionally multiple) `.args` file(s) in the `conf` directory (see example files)
- `generate.py` generates molecules (SMILES) or reaction templates (SMARTS)
- `assess.py` (for molecules only) calculates the Fréchet ChemNet Distance (FCD) between the generated molecules and a reference set of molecules (e.g. the GuacaMol dataset) along with some other metrics
- `molecule.py` covers helpers for the chemical domain of the task
- `tokenizer.py` provides the various tokenizers
- `helpers.py` is a set of misc helpers/utils (logging etc.)

- `train_tokenizers.py` pre-trains the tokenizers on a given dataset for later use during model training
- `check_tokenizer.py` can be used to check if a tokenizer can successfully encode and decode a dataset
- `compute_fcd_stats.py` computes the model activations that are needed to calculate the FCD. This is a separate script because it is computationally expensive and the results can be reused for later model comparisons.
- `collect_metrics.py` collects metrics from various files and `wandb` and provides them in several formats (`csv`, `json`, `md`); used during experiments
- `statistical_tests.ipynb` is a Jupyter notebook that performs statistical tests on the results; used for experiment results evaluation
- `create_plots.ipynb` is a Jupyter notebook that creates plots from the datasets; used for presentation purposes
- `merge_items.py` and `rename_files.py` are one-offs for file manipulation

- `*.sh` are sample shell scripts to show potential uses of the main `.py` scripts
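For orientation, the Fréchet ChemNet Distance reported by `assess.py` is, as defined in the FCD literature, the Fréchet distance between two Gaussians fitted to ChemNet activations of the generated molecules and of the reference molecules. The symbols below (mean $\mu$ and covariance $\Sigma$, with subscripts $g$ for generated and $r$ for reference) are notation chosen for this sketch, not names taken from the code:

```latex
\mathrm{FCD}(g, r) = \lVert \mu_g - \mu_r \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2 \left( \Sigma_g \Sigma_r \right)^{1/2} \right)
```

Because $\mu_r$ and $\Sigma_r$ depend only on the reference set, they can be computed once and reused across model comparisons, which matches the reason given for keeping `compute_fcd_stats.py` as a separate script.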
- the (default) directory `prepare_data.py` downloads the datasets to; a subdirectory is created for each dataset, containing the raw data files
- the (default) directory `prepare_data.py` prepares the datasets in; a subdirectory is created for each dataset, containing the prepared data files
- the (default) directory `generate.py` saves the generated items into
- `checkpoints`: the (default) directory `train.py` saves the models into
- `logs`: the (default) directory `train.py` saves the logs into, including the `wandb` logs
- `presentations`: presentation, poster, master thesis
- `results`: sample results
- `src/molreactgen/conf`: the (default) directory `train.py` reads the configuration files from
- `tokenizers`: the pre-trained tokenizers
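As a rough illustration of how several `.args` files could be layered, the sketch below reads command-line style text, drops `#` comments, and lets later sources override earlier ones. This is an assumption about the general idea, not the logic of `train.py`: the file format, the override order, the helper names `tokens_from_args_text` and `parse_layered`, and the options `--learning_rate` and `--num_train_epochs` are all hypothetical.

```python
import argparse
import shlex


def tokens_from_args_text(text: str) -> list[str]:
    """Turn the text of an args file into command-line tokens, ignoring # comments."""
    tokens: list[str] = []
    for line in text.splitlines():
        tokens.extend(shlex.split(line, comments=True))
    return tokens


def parse_layered(parser, args_texts, cli_args=()):
    """Parse several args texts in order, followed by explicit command-line arguments.

    argparse keeps the last value it sees for a plain option, so later
    sources override earlier ones.
    """
    tokens: list[str] = []
    for text in args_texts:
        tokens.extend(tokens_from_args_text(text))
    tokens.extend(cli_args)
    return parser.parse_args(tokens)


# Hypothetical options, chosen only for this example.
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=1e-3)
parser.add_argument("--num_train_epochs", type=int, default=1)
parser.add_argument("--fp16", type=lambda s: s.lower() == "true", default=True)

default_args = "--learning_rate 5e-4  # shared default\n--num_train_epochs 10\n"
dataset_args = "--num_train_epochs 50\n"

if __name__ == "__main__":
    args = parse_layered(parser, [default_args, dataset_args], ["--fp16", "false"])
    print(args)
```

In this sketch the dataset-specific text overrides the shared default for the number of epochs, and the explicit `--fp16 false` overrides both.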
- Local repository installation (see above)
- Python 3.9 - it should work with ≥ 3.10 as well, but I haven't tested it
- `poetry` installed (see the poetry documentation)
- `poetry shell` in directory `molreactgen/src/molreactgen` to activate the virtual environment
- Optional: `wandb` account and API key (see the Weights & Biases documentation); should work with an anonymous account, but I haven't tested it
Note: the Hugging Face `trainer` uses its own `accelerate` library under the hood. This library is supposed to support a number of distributed training backends. It should work with its default values for a simple setup, but you might want or need to change the `accelerate` parameters. You can do this by issuing the `accelerate config` command. This is my current setup:

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
`cd` into the `molreactgen/src/molreactgen` directory and run the following commands:
# Download and prepare dataset
python prepare_data.py guacamol
# Train the model
# add --fp16 false if your GPU does not support fp16 or you run it on a CPU (not recommended)
python train.py --config_file conf/guacamol.args # this also reads the default train.args file
# Generate ≥ 10000 molecules
python generate.py smiles \
--model "../../checkpoints/<your_model>" \
--known "../../data/prep/guacamol/csv/guacamol_v1_train.csv" \
--num 10000
# Calculate the stats of GuacaMol training set (needed for FCD calculation)
# This is computationally expensive and can be reused for model comparison
python utils/compute_fcd_stats.py \
"../../data/prep/guacamol/csv/guacamol_v1_train.csv" \
--output "../../data/prep/guacamol/fcd_stats/guacamol_train.pkl"
# Evaluate the generated molecules
python assess.py smiles \
--mode stats \
--generated "../../data/generated/<generation directory>/generated_smiles.csv" \
--reference "../../data/prep/guacamol/csv/guacamol_v1_train.csv" \
--stats "../../data/prep/guacamol/fcd_stats/guacamol_train.pkl" \
--num_molecules 10000

# Download and prepare dataset
python prepare_data.py uspto50k
# Train the model
# add --fp16 false if your GPU does not support fp16 or you run it on a CPU (not recommended)
python train.py --config_file conf/uspto50k.args # this also reads the default train.args file
# Generate ≥ 10000 reaction templates
# In this case the evaluation is done during generation
python generate.py smarts \
--model "../../checkpoints/<your_model>" \
--known "../../data/prep/uspto50k/csv/USPTO_50k_known.csv" \
--num 10000
# Evaluate the generated reaction templates
# At the moment, the assessment is fully done during the generation already

Alternatively, you can inspect and adapt the shell scripts provided in the scripts directory.
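The commands above pass a file of known items via `--known`. One plausible use of such a file is to check how many generated items are distinct and how many are new relative to the known set. The sketch below shows that idea with plain string comparison; whether `generate.py` or `assess.py` compute their metrics exactly this way is an assumption, the function name `uniqueness_and_novelty` is hypothetical, and a real pipeline would normally validate and canonicalize the strings with a chemistry toolkit first.

```python
def uniqueness_and_novelty(generated, known):
    """Return (unique_fraction, novel_fraction) for generated strings.

    unique_fraction: distinct generated items / all generated items
    novel_fraction:  distinct generated items not in `known` / distinct generated items

    These definitions are chosen for this sketch; items are compared as raw
    strings, so two different spellings of one molecule count as different.
    """
    generated = list(generated)
    if not generated:
        raise ValueError("no generated items to assess")
    unique = set(generated)
    novel = unique - set(known)
    return len(unique) / len(generated), len(novel) / len(unique)


if __name__ == "__main__":
    generated = ["CCO", "CCO", "c1ccccc1", "CCN"]
    known = {"CCO"}
    unique_fraction, novel_fraction = uniqueness_and_novelty(generated, known)
    print(f"unique: {unique_fraction:.2f}, novel: {novel_fraction:.2f}")
```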
Pre-trained models are available on Hugging Face, both for molecules (SMILES) and reaction templates (SMARTS).
- 1.0: First release along with the master thesis submission
- Ran only on a local GPU, not configured/tested for distributed training
- Starting with `transformers` v5 (not out as of this writing)...
  - the optimizer must be instantiated manually; this requires a code change in `train.py`
  - the `oauth_token` usage in `train.py` must be replaced
- Does not detect Apple devices automatically; you can use the command line argument `--use_mps_device true` to take advantage of Apple Silicon (assuming `pytorch` is configured correctly)
- The current `pyproject.toml` does not update to the following versions due to required testing and, in some cases, their potential breaking changes:
  - python 3.10 (not tested)
  - pandas ≥ 2.0 (not tested)
  - transformers 5.0 (not tested, breaking change, see above)
- Generally, all known open issues are also tagged with `TODO` in the code
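As a possible workaround for the Apple Silicon item above, a small wrapper could ask PyTorch which backend is usable and add `--use_mps_device true` only when appropriate. The sketch below is an assumption about how one might do this, not code from `train.py`; `pick_device` and `mps_cli_args` are hypothetical helper names, and the function falls back to `"cpu"` when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return "cuda", "mps" or "cpu", depending on what PyTorch reports as usable."""
    try:
        import torch
    except ImportError:
        # Without PyTorch nothing can be detected; fall back to the CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch versions have no MPS backend module at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def mps_cli_args(device: str) -> list[str]:
    """Return the extra command-line arguments to use for the given device."""
    return ["--use_mps_device", "true"] if device == "mps" else []


if __name__ == "__main__":
    device = pick_device()
    print(device, mps_cli_args(device))
```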
- Stephan Holzgruber - [email protected]
- Distributed under the MIT license. See `LICENSE` for more information.
- https://github.com/hogru/MolReactGen