GitHub - Gleghorn-Lab/Protify: Low code molecular property prediction

Protify

A low code solution for computationally predicting the properties of chemicals.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents

About The Project
Getting Started
- Installation
Usage
Contributing
Built With
License
Contact
Cite

About The Project

Protify is an open source platform designed to simplify and democratize workflows for chemical language models. With Protify, deep learning models can be trained to predict chemical properties at the click of a button, without requiring extensive coding knowledge or computational resources.

Why Protify?

Benchmark multiple models efficiently: Need to evaluate 10 different protein language models against 15 diverse datasets with publication-ready figures? Protify makes this possible without writing a single line of code.
Flexible for all skill levels: Build custom pipelines with code or use our no-code interface depending on your needs and expertise.
Accessible computing: No GPU? No problem. Synthyra offers precomputed embeddings for many popular datasets, which Protify can download for analysis with scikit-learn on your laptop.
Cost-effective solutions: The upcoming Synthyra API integration will offer affordable GPU training options, while our Colab notebook provides an accessible entry point for GPU-reliant analysis.

Protify is currently in beta. We're actively working to enhance features and documentation to meet our ambitious goals.

Currently Supported Models


pLM - Protein Language Model

Model Name	Description	Size (parameters)	Type
ESM2-8	Very small pLM from Meta AI that learns evolutionary information from millions of protein sequences.	8M	pLM
ESM2-35	Small-sized pLM trained on evolutionary data.	35M	pLM
ESM2-150	Medium-sized pLM with improved protein structure prediction capabilities.	150M	pLM
ESM2-650	Large pLM offering state-of-the-art performance on many protein prediction tasks.	650M	pLM
ESM2-3B	Largest ESM2 pLM with exceptional capability for protein structure and function prediction.	3B	pLM
ESMC-300	pLM optimized for representation learning.	300M	pLM
ESMC-600	Larger pLM for representations.	600M	pLM
ProtBert	BERT-based pLM trained on protein sequences from UniRef.	420M	pLM
ProtBert-BFD	BERT-based pLM trained on BFD database with improved performance.	420M	pLM
ProtT5	T5-based pLM capable of both encoding and generation tasks.	3B	pLM
ANKH-Base	Base version of the ANKH pLM focused on protein structure understanding.	400M	pLM
ANKH-Large	Large version of the ANKH pLM with improved structural predictions.	1.2B	pLM
ANKH2-Large	Improved second generation ANKH pLM.	1.2B	pLM
GLM2-150	Medium-sized general language model adapted for protein sequences.	150M	pLM
GLM2-650	Large general language model adapted for protein sequences.	650M	pLM
GLM2-GAIA	Specialized GLM pLM fine-tuned with contrastive learning.	650M	pLM
DPLM-150	Diffusion pLM focused on protein structure.	150M	pLM
DPLM-650	Larger diffusion pLM focused on protein structure.	650M	pLM
DPLM-3B	Largest diffusion pLM in the DPLM family.	3B	pLM
DSM-150	Diffusion sequence model, 150M-parameter version.	150M	pLM
DSM-650	Diffusion sequence model, 650M-parameter version.	650M	pLM
DSM-PPI	DSM model optimized for protein-protein interactions.	Varies	pLM
ProtCLM-1b	Causal (autoregressive) pLM.	1B	pLM
OneHot-Protein	One-hot encoding baseline for protein sequences.	N/A	Baseline
OneHot-DNA	One-hot encoding baseline for DNA sequences.	N/A	Baseline
OneHot-RNA	One-hot encoding baseline for RNA sequences.	N/A	Baseline
OneHot-Codon	One-hot encoding baseline for codon sequences.	N/A	Baseline
Random	Baseline model with randomly initialized weights, serving as a negative control.	Varies	Negative control
Random-Transformer	Randomly initialized transformer model serving as a homology-based control.	Varies	Homology control
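
The OneHot baselines replace a learned embedding with a fixed indicator vector per residue. A minimal sketch of the idea follows, assuming the 20 canonical amino acids as the alphabet and an all-zero row for unknown characters. The alphabet ordering, the handling of unknowns, and the function name `one_hot_protein` are illustrative assumptions, not Protify's actual implementation.

```python
from typing import List

# Illustrative only: alphabet order and unknown-residue handling are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_protein(sequence: str) -> List[List[int]]:
    """Return one row per residue; each row has a single 1 at the residue's index."""
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # unknown residues (e.g. X) stay all-zero
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows


encoded = one_hot_protein("MKV")
print(len(encoded), len(encoded[0]))  # 3 residues, 20 columns each
```

Because such a baseline carries no learned information, any gain a pLM shows over it can be attributed to pretraining rather than to the probe.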

Currently Supported Datasets


BC - Binary Classification | SLC - Single-Label Classification | MCC - Multi-Class Classification | MLC - Multi-Label Classification | R - Regression

TC - Tokenwise classification | TR - Tokenwise regression

Dataset Name	Description	Type	Task	Tokenwise	Multiple inputs
EC	Enzyme Commission numbers dataset for predicting enzyme function classification.	MLC	Protein function prediction	No	No
GO-CC	Gene Ontology Cellular Component dataset for predicting protein localization in cells.	MLC	Protein localization prediction	No	No
GO-BP	Gene Ontology Biological Process dataset for predicting protein involvement in biological processes.	MLC	Protein function prediction	No	No
GO-MF	Gene Ontology Molecular Function dataset for predicting protein molecular functions.	MLC	Protein function prediction	No	No
MB	Metal ion binding dataset for predicting protein-metal interactions.	BC	Protein-metal binding prediction	No	No
DeepLoc-2	Binary classification dataset for predicting protein localization in 2 categories.	BC	Protein localization prediction	No	No
DeepLoc-10	Multi-class classification dataset for predicting protein localization in 10 categories.	MCC	Protein localization prediction	No	No
Subcellular	Dataset for predicting subcellular localization of proteins.	MCC	Protein localization prediction	No	No
enzyme-kcat	Dataset for predicting enzyme catalytic rate constants (kcat).	R	Enzyme kinetics prediction	No	No
solubility	Dataset for predicting protein solubility properties.	BC	Protein solubility prediction	No	No
localization	Dataset for predicting subcellular localization of proteins.	MCC	Protein localization prediction	No	No
temperature-stability	Dataset for predicting protein stability at different temperatures.	BC	Protein stability prediction	No	No
optimal-temperature	Dataset for predicting the optimal temperature for protein function.	R	Protein property prediction	No	No
optimal-ph	Dataset for predicting the optimal pH for protein function.	R	Protein property prediction	No	No
material-production	Dataset for predicting protein suitability for material production.	BC	Protein application prediction	No	No
fitness-prediction	Dataset for predicting protein fitness in various environments.	BC	Protein fitness prediction	No	No
number-of-folds	Dataset for predicting the number of structural folds in proteins.	BC	Protein structure prediction	No	No
cloning-clf	Dataset for predicting protein suitability for cloning operations.	BC	Protein engineering prediction	No	No
stability-prediction	Dataset for predicting overall protein stability.	BC	Protein stability prediction	No	No
SecondaryStructure-3	Dataset for predicting protein secondary structure in 3 classes.	MCC	Protein structure prediction	Yes	No
SecondaryStructure-8	Dataset for predicting protein secondary structure in 8 classes.	MCC	Protein structure prediction	Yes	No
fluorescence-prediction	Dataset for predicting protein fluorescence properties.	R	Protein property prediction	Yes	No
plastic	Dataset for predicting protein capability for plastic degradation.	BC	Enzyme function prediction	No	No
gold-ppi	Gold standard dataset for protein-protein interaction prediction.	SLC	PPI prediction	No	Yes
human-ppi-saprot	Human protein-protein interaction dataset from SAProt paper.	SLC	PPI prediction	No	Yes
human-ppi-pinui	Human protein-protein interaction dataset from PiNUI.	SLC	PPI prediction	No	Yes
yeast-ppi-pinui	Yeast protein-protein interaction dataset from PiNUI.	SLC	PPI prediction	No	Yes
peptide-HLA-MHC-affinity	Dataset for predicting peptide binding affinity to HLA/MHC complexes.	SLC	Binding affinity prediction	No	Yes
shs27-ppi-raw	Raw SHS27k with single-label labels.	SLC	PPI type prediction	No	Yes
shs148-ppi-raw	Raw SHS148k with single-label labels.	SLC	PPI type prediction	No	Yes
shs27-ppi-random	SHS27k CD-Hit 40%, multi-label labels, randomized data splits.	MLC	PPI type prediction	No	Yes
shs148-ppi-random	SHS148k CD-Hit 40%, multi-label labels, randomized data splits.	MLC	PPI type prediction	No	Yes
shs27-ppi-dfs	SHS27k CD-Hit 40%, multi-label labels, data splits via depth first search.	MLC	PPI type prediction	No	Yes
shs148-ppi-dfs	SHS148k CD-Hit 40%, multi-label labels, data splits via depth first search.	MLC	PPI type prediction	No	Yes
shs27-ppi-bfs	SHS27k CD-Hit 40%, multi-label labels, data splits via breadth first search.	MLC	PPI type prediction	No	Yes
shs148-ppi-bfs	SHS148k CD-Hit 40%, multi-label labels, data splits via breadth first search.	MLC	PPI type prediction	No	Yes
string-ppi-random	STRING CD-Hit 40%, multi-label labels, randomized data splits.	MLC	PPI type prediction	No	Yes
string-ppi-dfs	STRING CD-Hit 40%, multi-label labels, data splits via depth first search.	MLC	PPI type prediction	No	Yes
string-ppi-bfs	STRING CD-Hit 40%, multi-label labels, data splits via breadth first search.	MLC	PPI type prediction	No	Yes
ppi-mutation-effect	Compare wild type, mutated, and target sequence to determine if PPI is stronger or not.	SLC	PPI effect prediction	No	Yes
PPA-ppi	Protein-Protein Affinity dataset from Bindwell.	R	Protein-protein affinity prediction	No	Yes
foldseek-fold	Dataset for protein fold classification using Foldseek.	MCC	Protein structure prediction	No	No
foldseek-inverse	Inverse protein fold prediction dataset.	MCC	Protein structure prediction	No	No
ec-active	Dataset for predicting active enzyme classes.	MCC	Enzyme function prediction	No	No
taxon_domain	Taxonomic classification at domain level.	MCC	Taxonomic prediction	No	No
taxon_kingdom	Taxonomic classification at kingdom level.	MCC	Taxonomic prediction	No	No
taxon_phylum	Taxonomic classification at phylum level.	MCC	Taxonomic prediction	No	No
taxon_class	Taxonomic classification at class level.	MCC	Taxonomic prediction	No	No
taxon_order	Taxonomic classification at order level.	MCC	Taxonomic prediction	No	No
taxon_family	Taxonomic classification at family level.	MCC	Taxonomic prediction	No	No
taxon_genus	Taxonomic classification at genus level.	MCC	Taxonomic prediction	No	No
taxon_species	Taxonomic classification at species level.	MCC	Taxonomic prediction	No	No
diff_phylogeny	Differential phylogeny dataset.	Various	Phylogeny prediction	No	No
plddt	AlphaFold pLDDT confidence score prediction.	TR	Confidence prediction	Yes	No
realness	Protein realness dataset.	BC	Authenticity prediction	No	No
million_full	Large-scale enzyme variant dataset, from Millionfull preprint October 2025	R	Protein fitness prediction	No	No
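
The `-bfs` and `-dfs` PPI datasets above are split by walking the interaction graph rather than by sampling pairs at random. A minimal sketch of a breadth-first split follows, assuming interactions are given as undirected pairs and that any pair touching a selected test protein goes to the test set. The function name `bfs_split`, the choice of root, and the `test_size` parameter are hypothetical; the listed datasets already come with their splits, and the exact procedure used to build them may differ.

```python
from collections import defaultdict, deque
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def bfs_split(pairs: List[Pair], root: str, test_size: int) -> Tuple[List[Pair], List[Pair]]:
    """Split interaction pairs into (train, test) via BFS over proteins."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Visit proteins in breadth-first order from the root until enough are selected.
    test_proteins: Set[str] = set()
    queue = deque([root])
    while queue and len(test_proteins) < test_size:
        protein = queue.popleft()
        if protein in test_proteins:
            continue
        test_proteins.add(protein)
        queue.extend(sorted(neighbors[protein] - test_proteins))

    # A pair is held out if either partner was selected for the test set.
    train = [p for p in pairs if p[0] not in test_proteins and p[1] not in test_proteins]
    test = [p for p in pairs if p[0] in test_proteins or p[1] in test_proteins]
    return train, test


pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
train_pairs, test_pairs = bfs_split(pairs, root="A", test_size=2)
print(train_pairs)
print(test_pairs)
```

Swapping `queue.popleft()` for `queue.pop()` turns the same sketch into a depth-first walk.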

For more details about supported models and datasets, including programmatic access and command-line utilities, see the Resource Listing Documentation.

Current Key Features

Multiple interfaces: Run experiments via an intuitive GUI, CLI, or prepared YAML files
Efficient embeddings: Leverage fast and efficient embeddings from ESM2 and ESMC via FastPLMs
- Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide language models
Flexible model probing: Use efficient MLPs for sequence-wise tasks or transformer probes for token-wise tasks
- Coming soon: Full model fine-tuning, hybrid probing, and LoRA
Automated model selection: Find optimal scikit-learn models for your data with LazyPredict, enhanced by automatic hyperparameter optimization
- Coming soon: GPU acceleration
Complete reproducibility: Every session generates a detailed log that can be used to reproduce your entire workflow
Publication-ready visualizations: Generate cross-model and dataset comparisons with radar and bar plots, embedding analysis with PCA, t-SNE, and UMAP, and statistically sound confidence interval plots
Extensive dataset support: Access 46+ protein datasets by default, or easily integrate your own local or private datasets
- Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide property datasets
Advanced interaction modeling: Support for protein-protein interaction datasets
- Coming soon: Protein-small molecule interaction capabilities
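
The confidence-interval plots summarize how much a metric might vary across resamples of the test set. One common way to obtain such an interval is the percentile bootstrap, sketched below for accuracy. This is an assumption for illustration: the interval method Protify uses is not stated here, and `bootstrap_ci` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(y_true: List[int], y_pred: List[int], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample test examples with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
low, high = bootstrap_ci(y_true, y_pred)
print(f"accuracy 95% CI: [{low:.2f}, {high:.2f}]")
```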

Support Protify's Development

Help us grow by sharing online, starring our repository, or contributing through our bounty program.


Getting Started

Installation

From pip

pip install Protify

To get started locally

git clone https://github.com/Gleghorn-Lab/Protify.git
cd Protify
git submodule update --init --remote --recursive
python -m pip install -r requirements.txt
cd src/protify

With a Python virtual environment (Linux)

git clone https://github.com/Gleghorn-Lab/Protify.git
cd Protify
git submodule update --init --remote --recursive
chmod +x setup_protify.sh
./setup_protify.sh
source ~/protify_venv/bin/activate
cd src/protify

With Docker

git clone https://github.com/Gleghorn-Lab/Protify.git
cd Protify
git submodule update --init --remote --recursive
docker build -t protify-env:latest .
docker run --rm --gpus all -v ${PWD}:/workspace protify-env:latest python -m main

Note: You may need to include sudo before the docker commands.


Usage


To launch the GUI, run

python -m gui

It's recommended to use the user interface alongside an open terminal, as helpful messages and progress bars will show in the terminal while you press the GUI buttons.

An example workflow

Here, we will compare various protein models against a random vector baseline (negative control) and a random transformer (homology-based control).

1.) Start the session

2.) Select the models you would like to benchmark

3.) Select the datasets you are interested in. Here we chose Enzyme Commission numbers (multi-label classification), metal-ion binding (binary classification), solubility (binary classification), catalytic rate (kcat, regression), and protein localization (DeepLoc-2, binary classification).

4.) Embed the proteins in the selected datasets. If your machine does not have a GPU, you can download precomputed embeddings for many common sequences. Note: If you download embeddings, it will be faster to use the scikit model tab than the probe tab.
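
Embedding turns each protein into a fixed-length vector that downstream models can consume. A language model emits one vector per residue, so sequence-wise tasks need a pooling step. Mean pooling over non-padding positions is a common choice and is sketched below. Whether Protify pools this way by default is an assumption, and `mean_pool` is a hypothetical helper.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average per-residue vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every position")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three residues plus one padding position, embedding dimension 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```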

5.) Select which probe and configuration you would like. Here, we will use a simple linear probe, a type of neural network. It is the fastest option (by a large margin) but usually the worst performing (by a small margin).
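
A linear probe keeps the language model frozen and fits a single linear layer on top of its embeddings. A minimal sketch for binary classification follows, written as logistic regression trained with plain gradient descent on toy two-dimensional "embeddings". The data, learning rate, and function names are illustrative assumptions; Protify's own probes are configured through the GUI or CLI.

```python
import math
from typing import List, Tuple


def train_linear_probe(X: List[List[float]], y: List[int], lr: float = 0.5,
                       epochs: int = 200) -> Tuple[List[float], float]:
    """Fit weights and bias of a logistic-regression probe by gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # sigmoid(z) - label
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    """Predict class 1 when the linear score is positive."""
    return int(sum(wj * xj for wj, xj in zip(w, x)) + b > 0)


# Toy frozen embeddings: class 1 has a larger first coordinate.
X = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.3]]
y = [0, 0, 1, 1]
w, b = train_linear_probe(X, y)
preds = [predict(w, b, x) for x in X]
print(preds)
```

Only the weights `w` and bias `b` are learned, which is why this kind of probe trains so quickly compared with larger probes.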

6.) Select your settings for training. Like most of the tabs, the defaults are pretty good. If you need information about what a setting does, the ? button provides a helpful note. The documentation has more extensive information.

This will train your models!

7.) After training, you can render helpful visualizations by passing the log ID from before. If you forget it, you can look for the file generated in the logs folder.
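
If the log ID is lost, the most recent file in the logs folder is usually the one wanted. A small sketch that picks the newest log by modification time follows, assuming logs are stored as `.txt` files named after the log ID. The helper name `latest_log_id` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_log_id(log_dir: str = "logs") -> Optional[str]:
    """Return the stem of the most recently modified .txt file, or None if there is none."""
    root = Path(log_dir)
    if not root.is_dir():
        return None
    logs = sorted(root.glob("*.txt"), key=lambda p: p.stat().st_mtime)
    return logs[-1].stem if logs else None


print(latest_log_id())
```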

Here's a sample of the many plots produced. You can find them all inside plots/your_log_id/*

8.) Need to replicate your findings for a report or paper? Just input the generated log into the replay tab.

To run the same session from the command line instead, you would simply execute

python -m main --model_names ESM2-8 ESM2-35 ESMC-300 ProtBert ANKH-Base Random Random-Transformer --data_names EC DeepLoc-2 enzyme-kcat MB solubility --patience 3
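
When the same benchmark is launched repeatedly, for example from a scheduler or a notebook, it can help to assemble the command programmatically. The sketch below builds the argument list for the CLI call above using only the flags shown there (`--model_names`, `--data_names`, `--patience`). `build_protify_command` is a hypothetical helper, and nothing is executed unless the `subprocess.run` line is uncommented and the script is run from `src/protify`.

```python
import subprocess
import sys
from typing import List


def build_protify_command(model_names: List[str], data_names: List[str], patience: int) -> List[str]:
    """Assemble the argument list for `python -m main` with the flags shown above."""
    return [
        sys.executable, "-m", "main",
        "--model_names", *model_names,
        "--data_names", *data_names,
        "--patience", str(patience),
    ]


cmd = build_protify_command(
    model_names=["ESM2-8", "ESM2-35", "ESMC-300", "ProtBert", "ANKH-Base", "Random", "Random-Transformer"],
    data_names=["EC", "DeepLoc-2", "enzyme-kcat", "MB", "solubility"],
    patience=3,
)
print(" ".join(cmd[1:]))
# subprocess.run(cmd, check=True)  # uncomment to launch; run from src/protify
```

Passing the arguments as a list avoids shell quoting issues with dataset or model names.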

Or, set up a YAML file with your desired settings (so you don't have to type out everything in the CLI)

python -m main --yaml_path yamls/your_custom_yaml_path.yaml

Replaying from the CLI is just as simple

python -m main --replay_path logs/your_log_id.txt

ProteinGym Benchmarking

Protify includes a zero-shot pipeline for the ProteinGym DMS benchmark with a standardized performance summary.

Run zero-shot scoring on ProteinGym substitutions
```
python -m main --proteingym \
  --model_names ESM2-8 ESM2-35 ProtBert \
  --dms_ids all \
  --scoring_method masked_marginal \
  --scoring_window optimal
```
- Outputs per-assay CSVs at results/proteingym/*__zs_masked_marginal.csv
- After scoring, a standardized performance summary is written to results/proteingym/benchmark_performance/
  - This summary exactly matches the format expected by the ProteinGym repository for adding scores for a new model (ready to use in a PR)
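
To post-process the per-assay outputs, the CSVs can be gathered by the filename pattern above. A minimal sketch follows, assuming each file is named `<assay_id>__zs_<scoring_method>.csv` as that pattern suggests. The helper name `collect_assay_files` is hypothetical, and nothing is assumed about the columns inside the files.

```python
from pathlib import Path
from typing import Dict


def collect_assay_files(results_dir: str = "results/proteingym",
                        scoring_method: str = "masked_marginal") -> Dict[str, Path]:
    """Map each assay ID (filename prefix) to its zero-shot CSV path."""
    suffix = f"__zs_{scoring_method}.csv"
    root = Path(results_dir)
    if not root.is_dir():
        return {}
    return {
        path.name[: -len(suffix)]: path
        for path in sorted(root.glob(f"*{suffix}"))
    }


for assay_id, path in collect_assay_files().items():
    print(assay_id, path)
```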

Available options
- dms_ids: By default, all 217 substitution assays are used. Pass one or more DMS IDs by name to score only a subset.
- scoring_method
  - masked_marginal (default): Mask mutated positions in the wild-type window; score Δ = log p(mutant) − log p(wildtype) at those positions
  - wildtype_marginal: Unmasked wild-type context; score Δ = log p(mutant) − log p(wildtype) at mutated positions
  - mutant_marginal: Unmasked mutant context; score Δ = log p(mutant) − log p(wildtype) at mutated positions
  - pll: Pseudo log-likelihood obtained by masking each position and summing true-token log-likelihoods (indels are scored with a length-normalized PLL across windows)
  - global_log_prob: Unmasked log-probability of the entire mutated sequence/window
- scoring_window
  - optimal (default): Single window (≤ model context) centered around the mutation barycenter
  - sliding: Non-overlapping contiguous windows across the full sequence (default for indels)
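
The marginal scores and the `optimal` window can be made concrete with a small sketch. It assumes the model's per-position log-probabilities are already available as dictionaries, that mutations are 0-indexed `(position, wildtype, mutant)` tuples, that multi-mutant scores are summed over positions, and that the barycenter is the mean mutated position. All function names are hypothetical, and the real pipeline may differ in details such as rounding and special tokens.

```python
from typing import Dict, List, Tuple

Mutation = Tuple[int, str, str]  # (0-indexed position, wild-type residue, mutant residue)


def marginal_score(log_probs: List[Dict[str, float]], mutations: List[Mutation]) -> float:
    """Sum of log p(mutant) - log p(wildtype) over the mutated positions."""
    return sum(log_probs[pos][mut] - log_probs[pos][wt] for pos, wt, mut in mutations)


def optimal_window(seq_len: int, mutations: List[Mutation], max_len: int) -> Tuple[int, int]:
    """Return [start, end) of one window of at most max_len centered on the mutation barycenter."""
    if seq_len <= max_len:
        return 0, seq_len
    center = round(sum(pos for pos, _, _ in mutations) / len(mutations))
    # Clamp so the window stays inside the sequence.
    start = max(0, min(center - max_len // 2, seq_len - max_len))
    return start, start + max_len


# Toy log-probabilities for a 3-residue window (values are made up).
log_probs = [
    {"M": -0.1, "K": -3.0},
    {"K": -0.5, "R": -1.5},
    {"V": -0.2, "A": -2.2},
]
score = marginal_score(log_probs, [(1, "K", "R"), (2, "V", "A")])
print(score)
print(optimal_window(seq_len=1000, mutations=[(10, "A", "G")], max_len=512))
print(optimal_window(seq_len=1000, mutations=[(900, "A", "G")], max_len=512))
```

The three marginal methods share this arithmetic and differ only in the input used to produce `log_probs`: the masked wild type, the unmasked wild type, or the unmasked mutant.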

Compare performance & time for each scoring method for one or more models

python -m main --proteingym --compare_scoring_methods \
  --model_names ESM2-650 \
  --dms_ids AACC1_PSEAI_Dandage_2018 A4_HUMAN_Seuma_2022 \
  --results_dir results

This saves a summary CSV to results/scoring_methods_comparison.csv.


Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

We work with a bounty system. You can find bounties on this page. Completing a bounty gets you listed in the Protify consortium and may earn coauthorship on published papers involving the framework.

To claim a bounty, open a pull request with the bounty ID in the title. For features not on the bounty list, use a descriptive title instead.

For bugs and general suggestions, please use GitHub issues.


Built With


License

Distributed under the Protify License. See LICENSE.md for more information.


Contact

Collaborations

[email protected]

Gleghorn Lab

Business / Licensing

[email protected]

Synthyra


Cite

If you use this package, please cite the following papers. (Coming soon)

