Run `./server-setup.sh` to install the required packages and download the data.
This repository contains the code for an exploration of molecular multimodal foundation models for molecule generation from natural language.
```
MoMu
--base-contrast            # original MoMu pre-training implementation
--base-downstream          # original MoMu downstream tasks
  --graph-retrieval        # graph retrieval task
  --molecule-caption       # molecular captioning task
  --molecule-generation    # molecular generation task
  --molecule-prediction    # property prediction task
--data                     # datasets to be downloaded
  --contrast-pretrain
    --S                    # small dataset: 89 molecules (included in repo)
    --XL                   # full-size base dataset: 15,613 molecules
      --text               # text corpus from S2ORC
      --graph              # molecule graphs from PubChem
  --graph-retrieval        # graph retrieval datasets
  --molecule-caption       # molecular captioning datasets
  --molecule-generation    # molecular generation datasets
--new-contrast             # new MoMu contrastive pre-training (WIP)
--new-downstream           # new MoMu downstream tasks benchmarking (WIP)
--text-preprocess          # relevance scoring to improve text retrieval (WIP)
```
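For orientation, here is a small sketch of how the contrast-pretrain text/graph pairs might be enumerated from this layout. The file naming (one text file per molecule, with a same-named `.graph` file in `graph/`) is an assumption for illustration; adjust it to the actual downloaded data.

```python
from pathlib import Path

def list_pretrain_pairs(root="data/contrast-pretrain/XL"):
    """Pair each text file with its molecule graph file.

    ASSUMPTION: one file per molecule under text/, and a matching
    file (same stem, .graph suffix) under graph/. The real layout
    of the downloaded data may differ.
    """
    base = Path(root)
    pairs = []
    for text_file in sorted((base / "text").iterdir()):
        graph_file = base / "graph" / text_file.with_suffix(".graph").name
        if graph_file.exists():
            pairs.append((text_file, graph_file))
    return pairs
```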
base-MoMu
: original model trained on the XL contrast-pretrain base dataset.

base-contrast
: code for contrastive pre-training.

base-downstream
: code for fine-tuning on downstream tasks.

new-MoMu [WIP]
: lightweight model family trained for experimental purposes.

new-contrast [WIP]
: code to perform contrastive pre-training with mini-MoMu on the smaller contrast-pretrain datasets.

new-downstream [WIP]
: code to evaluate performance on downstream tasks.
The multimodal models are pre-trained on a joint molecular graph-text corpus through contrastive learning. We then benchmark these pre-trained models by fine-tuning them on the downstream tasks listed below.
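As a rough illustration of the training objective (not the repository's actual code), a symmetric InfoNCE-style graph-text contrastive loss can be sketched as follows, assuming each encoder outputs one fixed-size embedding per example. The function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def graph_text_contrastive_loss(graph_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired
    graph/text embeddings, each of shape (batch, dim).

    Matched pairs sit on the diagonal of the similarity matrix;
    all other entries in the same row/column act as negatives.
    """
    graph_emb = F.normalize(graph_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = graph_emb @ text_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_g2t = F.cross_entropy(logits, targets)       # graph -> text
    loss_t2g = F.cross_entropy(logits.t(), targets)   # text -> graph
    return (loss_g2t + loss_t2g) / 2
```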
graph-retrieval
: given the name of a molecule, retrieve the corresponding molecular graph (a minimal scoring sketch follows this list).

molecule-caption
: given a molecular graph, generate natural language describing the molecule.

molecule-prediction
: given a molecule, predict its properties.

molecule-generation
: given a natural-language description, generate a molecule that fits it.
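To make the retrieval task concrete, here is a minimal, hypothetical sketch of ranking candidate molecule graphs against a text query using embeddings from the two pre-trained encoders. All names and shapes here are illustrative assumptions, not the repository's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_graphs(query_text_emb, graph_embs, top_k=5):
    """Rank candidate molecule graphs by cosine similarity to a text query.

    query_text_emb: (dim,) embedding of the query name/description.
    graph_embs:     (num_graphs, dim) pre-computed graph embeddings.
    Returns the indices of the top_k most similar graphs.
    """
    query = F.normalize(query_text_emb, dim=-1)
    graphs = F.normalize(graph_embs, dim=-1)
    scores = graphs @ query            # (num_graphs,)
    return scores.topk(top_k).indices
```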
We are working on providing online access to the various datasets on HuggingFace.
This repository builds on the original MoMu implementation from Su et al. (2022), available on GitHub. Thanks to the original authors for their work!
The original implementation also uses some code from Graphormer.