Run `./server-setup.sh` to install the required packages and download the data.
This repository contains the code for an exploration of molecular multimodal foundation models for molecule generation from natural language.
```
MoMu
--base-contrast            # original MoMu pre-training implementation
--base-downstream          # original MoMu downstream tasks
  --graph-retrieval        # graph retrieval task
  --molecule-caption       # molecular captioning task
  --molecule-generation    # molecular generation task
  --molecule-prediction    # property prediction task
--data                     # datasets to be downloaded
  --contrast-pretrain
    --S                    # small dataset: 89 molecules (included in repo)
    --XL                   # full-size base dataset: 15,613 molecules
      --text               # text corpus from S2ORC
      --graph              # molecule graphs from PubChem
  --graph-retrieval        # graph retrieval datasets
  --molecule-caption       # molecular captioning datasets
  --molecule-generation    # molecular generation datasets
--new-contrast             # new MoMu contrastive pre-training (WIP)
--new-downstream           # new MoMu downstream tasks benchmarking (WIP)
--text-preprocess          # relevance scoring to improve text retrieval (WIP)
```
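For orientation, here is a small sketch of how the contrast-pretrain text/graph pairs might be enumerated from this layout. The file naming (one text file per molecule, with a same-named `.graph` file in `graph/`) is an assumption for illustration; adjust it to the actual downloaded data.

```python
from pathlib import Path

def list_pretrain_pairs(root="data/contrast-pretrain/XL"):
    """Pair each text file with its molecule graph file.

    ASSUMPTION: one file per molecule under text/, and a matching
    file (same stem, .graph suffix) under graph/. The real layout
    of the downloaded data may differ.
    """
    base = Path(root)
    pairs = []
    for text_file in sorted((base / "text").iterdir()):
        graph_file = base / "graph" / text_file.with_suffix(".graph").name
        if graph_file.exists():
            pairs.append((text_file, graph_file))
    return pairs
```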
base-MoMu
: original model trained on the XL contrast-pretrain base dataset.

base-contrast
: code for contrastive pre-training.

base-downstream
: code for fine-tuning on downstream tasks.

new-MoMu [WIP]
: lightweight model family trained for experimental purposes.

new-contrast [WIP]
: code to perform contrastive pre-training with mini-MoMu on the smaller contrast-pretrain datasets.

new-downstream [WIP]
: code to evaluate performance on downstream tasks.
The multimodal models are pre-trained on a joint molecular graph-text corpus through contrastive learning. We then benchmark these pre-trained models by fine-tuning them on the downstream tasks listed below.
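As a rough illustration of the training objective (not the repository's actual code), a symmetric InfoNCE-style graph-text contrastive loss can be sketched as follows, assuming each encoder outputs one fixed-size embedding per example. The function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def graph_text_contrastive_loss(graph_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired
    graph/text embeddings, each of shape (batch, dim).

    Matched pairs sit on the diagonal of the similarity matrix;
    all other entries in the same row/column act as negatives.
    """
    graph_emb = F.normalize(graph_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = graph_emb @ text_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_g2t = F.cross_entropy(logits, targets)       # graph -> text
    loss_t2g = F.cross_entropy(logits.t(), targets)   # text -> graph
    return (loss_g2t + loss_t2g) / 2
```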
graph-retrieval
: given the name of a molecule, retrieve the corresponding molecular graph (a minimal scoring sketch follows this list).

molecule-caption
: given a molecular graph, generate natural language describing the molecule.

molecule-prediction
: given a molecule, predict its properties.

molecule-generation
: given a natural-language description, generate a molecule that fits it.
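To make the retrieval task concrete, here is a minimal, hypothetical sketch of ranking candidate molecule graphs against a text query using embeddings from the two pre-trained encoders. All names and shapes here are illustrative assumptions, not the repository's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_graphs(query_text_emb, graph_embs, top_k=5):
    """Rank candidate molecule graphs by cosine similarity to a text query.

    query_text_emb: (dim,) embedding of the query name/description.
    graph_embs:     (num_graphs, dim) pre-computed graph embeddings.
    Returns the indices of the top_k most similar graphs.
    """
    query = F.normalize(query_text_emb, dim=-1)
    graphs = F.normalize(graph_embs, dim=-1)
    scores = graphs @ query            # (num_graphs,)
    return scores.topk(top_k).indices
```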
We are working on providing online access to the various datasets on HuggingFace.
This repository builds on the original MoMu implementation from Su et al. (2022), available on GitHub. Thanks to the original authors for their work!
The original implementation also uses some code from Graphormer.