A low-code solution for computationally predicting the properties of chemicals.
Protify is an open source platform designed to simplify and democratize workflows for chemical language models. With Protify, deep learning models can be trained to predict chemical properties at the click of a button, without requiring extensive coding knowledge or computational resources.
- Benchmark multiple models efficiently: Need to evaluate 10 different protein language models against 15 diverse datasets with publication-ready figures? Protify makes this possible without writing a single line of code.
- Flexible for all skill levels: Build custom pipelines with code or use our no-code interface depending on your needs and expertise.
- Accessible computing: No GPU? No problem. Synthyra offers precomputed embeddings for many popular datasets, which Protify can download for analysis with scikit-learn on your laptop.
- Cost-effective solutions: The upcoming Synthyra API integration will offer affordable GPU training options, while our Colab notebook provides an accessible entry point for GPU-reliant analysis.
Protify is currently in beta. We're actively working to enhance features and documentation to meet our ambitious goals.
Model list
pLM - Protein Language Model
| Model Name | Description | Size (parameters) | Type |
|---|---|---|---|
| ESM2-8 | Very small pLM from Meta AI that learns evolutionary information from millions of protein sequences. | 8M | pLM |
| ESM2-35 | Small-sized pLM trained on evolutionary data. | 35M | pLM |
| ESM2-150 | Medium-sized pLM with improved protein structure prediction capabilities. | 150M | pLM |
| ESM2-650 | Large pLM offering state-of-the-art performance on many protein prediction tasks. | 650M | pLM |
| ESM2-3B | Largest ESM2 pLM with exceptional capability for protein structure and function prediction. | 3B | pLM |
| ESMC-300 | pLM optimized for representation learning. | 300M | pLM |
| ESMC-600 | Larger pLM for representations. | 600M | pLM |
| ProtBert | BERT-based pLM trained on protein sequences from UniRef. | 420M | pLM |
| ProtBert-BFD | BERT-based pLM trained on BFD database with improved performance. | 420M | pLM |
| ProtT5 | T5-based pLM capable of both encoding and generation tasks. | 3B | pLM |
| ANKH-Base | Base version of the ANKH pLM focused on protein structure understanding. | 400M | pLM |
| ANKH-Large | Large version of the ANKH pLM with improved structural predictions. | 1.2B | pLM |
| ANKH2-Large | Improved second generation ANKH pLM. | 1.2B | pLM |
| GLM2-150 | Medium-sized general language model adapted for protein sequences. | 150M | pLM |
| GLM2-650 | Large general language model adapted for protein sequences. | 650M | pLM |
| GLM2-GAIA | Specialized GLM pLM fine-tuned with contrastive learning. | 650M | pLM |
| DPLM-150 | Diffusion pLM focused on protein structure. | 150M | pLM |
| DPLM-650 | Larger diffusion pLM focused on protein structure. | 650M | pLM |
| DPLM-3B | Largest diffusion pLM in the DPLM family. | 3B | pLM |
| DSM-150 | Diffusion sequence model, 150M-parameter version. | 150M | pLM |
| DSM-650 | Diffusion sequence model, 650M-parameter version. | 650M | pLM |
| DSM-PPI | DSM model optimized for protein-protein interactions. | Varies | pLM |
| ProtCLM-1b | Causal (autoregressive) pLM. | 1B | pLM |
| OneHot-Protein | One-hot encoding baseline for protein sequences. | N/A | Baseline |
| OneHot-DNA | One-hot encoding baseline for DNA sequences. | N/A | Baseline |
| OneHot-RNA | One-hot encoding baseline for RNA sequences. | N/A | Baseline |
| OneHot-Codon | One-hot encoding baseline for codon sequences. | N/A | Baseline |
| Random | Baseline model with randomly initialized weights, serving as a negative control. | Varies | Negative control |
| Random-Transformer | Randomly initialized transformer model serving as a homology-based control. | Varies | Homology control |
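The OneHot baselines in the table replace a learned embedding with a fixed indicator vector per residue. A minimal sketch of that idea for protein sequences is shown here. The alphabet ordering, the function name `one_hot_encode`, and the all-zero handling of unknown residues are assumptions for illustration, not Protify's actual implementation.

```python
from __future__ import annotations

# Illustrative sketch only; not Protify's implementation.
# Assumed alphabet: the 20 canonical amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one row per residue; each row has a single 1 at the residue's index.

    Residues outside the assumed alphabet (e.g. 'X') map to an all-zero row.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            row[index] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    encoded = one_hot_encode("ACX")
    print(len(encoded), len(encoded[0]))  # 3 20
    print(encoded[0].index(1), encoded[1].index(1), sum(encoded[2]))  # 0 1 0
```

Because such an encoding carries no learned information, it gives a floor against which the pLM rows above can be compared.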
Dataset list
BC - Binary Classification | SLC - Single-Label Classification | MCC - Multi-Class Classification | MLC - Multi-Label Classification | R - Regression
TC - Tokenwise Classification | TR - Tokenwise Regression
| Dataset Name | Description | Type | Task | Tokenwise | Multiple inputs |
|---|---|---|---|---|---|
| EC | Enzyme Commission numbers dataset for predicting enzyme function classification. | MLC | Protein function prediction | No | No |
| GO-CC | Gene Ontology Cellular Component dataset for predicting protein localization in cells. | MLC | Protein localization prediction | No | No |
| GO-BP | Gene Ontology Biological Process dataset for predicting protein involvement in biological processes. | MLC | Protein function prediction | No | No |
| GO-MF | Gene Ontology Molecular Function dataset for predicting protein molecular functions. | MLC | Protein function prediction | No | No |
| MB | Metal ion binding dataset for predicting protein-metal interactions. | BC | Protein-metal binding prediction | No | No |
| DeepLoc-2 | Binary classification dataset for predicting protein localization in 2 categories. | BC | Protein localization prediction | No | No |
| DeepLoc-10 | Multi-class classification dataset for predicting protein localization in 10 categories. | MCC | Protein localization prediction | No | No |
| Subcellular | Dataset for predicting subcellular localization of proteins. | MCC | Protein localization prediction | No | No |
| enzyme-kcat | Dataset for predicting enzyme catalytic rate constants (kcat). | R | Enzyme kinetics prediction | No | No |
| solubility | Dataset for predicting protein solubility properties. | BC | Protein solubility prediction | No | No |
| localization | Dataset for predicting subcellular localization of proteins. | MCC | Protein localization prediction | No | No |
| temperature-stability | Dataset for predicting protein stability at different temperatures. | BC | Protein stability prediction | No | No |
| optimal-temperature | Dataset for predicting the optimal temperature for protein function. | R | Protein property prediction | No | No |
| optimal-ph | Dataset for predicting the optimal pH for protein function. | R | Protein property prediction | No | No |
| material-production | Dataset for predicting protein suitability for material production. | BC | Protein application prediction | No | No |
| fitness-prediction | Dataset for predicting protein fitness in various environments. | BC | Protein fitness prediction | No | No |
| number-of-folds | Dataset for predicting the number of structural folds in proteins. | BC | Protein structure prediction | No | No |
| cloning-clf | Dataset for predicting protein suitability for cloning operations. | BC | Protein engineering prediction | No | No |
| stability-prediction | Dataset for predicting overall protein stability. | BC | Protein stability prediction | No | No |
| SecondaryStructure-3 | Dataset for predicting protein secondary structure in 3 classes. | MCC | Protein structure prediction | Yes | No |
| SecondaryStructure-8 | Dataset for predicting protein secondary structure in 8 classes. | MCC | Protein structure prediction | Yes | No |
| fluorescence-prediction | Dataset for predicting protein fluorescence properties. | R | Protein property prediction | Yes | No |
| plastic | Dataset for predicting protein capability for plastic degradation. | BC | Enzyme function prediction | No | No |
| gold-ppi | Gold standard dataset for protein-protein interaction prediction. | SLC | PPI prediction | No | Yes |
| human-ppi-saprot | Human protein-protein interaction dataset from SAProt paper. | SLC | PPI prediction | No | Yes |
| human-ppi-pinui | Human protein-protein interaction dataset from PiNUI. | SLC | PPI prediction | No | Yes |
| yeast-ppi-pinui | Yeast protein-protein interaction dataset from PiNUI. | SLC | PPI prediction | No | Yes |
| peptide-HLA-MHC-affinity | Dataset for predicting peptide binding affinity to HLA/MHC complexes. | SLC | Binding affinity prediction | No | Yes |
| shs27-ppi-raw | Raw SHS27k with single-label labels. | SLC | PPI type prediction | No | Yes |
| shs148-ppi-raw | Raw SHS148k with single-label labels. | SLC | PPI type prediction | No | Yes |
| shs27-ppi-random | SHS27k CD-Hit 40%, multi-label labels, randomized data splits. | MLC | PPI prediction | No | Yes |
| shs148-ppi-random | SHS148k CD-Hit 40%, multi-label labels, randomized data splits. | MLC | PPI type prediction | No | Yes |
| shs27-ppi-dfs | SHS27k CD-Hit 40%, multi-label labels, data splits via depth-first search. | MLC | PPI type prediction | No | Yes |
| shs148-ppi-dfs | SHS148k CD-Hit 40%, multi-label labels, data splits via depth-first search. | MLC | PPI type prediction | No | Yes |
| shs27-ppi-bfs | SHS27k CD-Hit 40%, multi-label labels, data splits via breadth-first search. | MLC | PPI type prediction | No | Yes |
| shs148-ppi-bfs | SHS148k CD-Hit 40%, multi-label labels, data splits via breadth-first search. | MLC | PPI type prediction | No | Yes |
| string-ppi-random | STRING CD-Hit 40%, multi-label labels, randomized data splits. | MLC | PPI type prediction | No | Yes |
| string-ppi-dfs | STRING CD-Hit 40%, multi-label labels, data splits via depth-first search. | MLC | PPI type prediction | No | Yes |
| string-ppi-bfs | STRING CD-Hit 40%, multi-label labels, data splits via breadth-first search. | MLC | PPI type prediction | No | Yes |
| ppi-mutation-effect | Compare wild type, mutated, and target sequence to determine if PPI is stronger or not. | SLC | PPI effect prediction | No | Yes |
| PPA-ppi | Protein-Protein Affinity dataset from Bindwell. | R | Protein-protein affinity prediction | No | Yes |
| foldseek-fold | Dataset for protein fold classification using Foldseek. | MCC | Protein structure prediction | No | No |
| foldseek-inverse | Inverse protein fold prediction dataset. | MCC | Protein structure prediction | No | No |
| ec-active | Dataset for predicting active enzyme classes. | MCC | Enzyme function prediction | No | No |
| taxon_domain | Taxonomic classification at domain level. | MCC | Taxonomic prediction | No | No |
| taxon_kingdom | Taxonomic classification at kingdom level. | MCC | Taxonomic prediction | No | No |
| taxon_phylum | Taxonomic classification at phylum level. | MCC | Taxonomic prediction | No | No |
| taxon_class | Taxonomic classification at class level. | MCC | Taxonomic prediction | No | No |
| taxon_order | Taxonomic classification at order level. | MCC | Taxonomic prediction | No | No |
| taxon_family | Taxonomic classification at family level. | MCC | Taxonomic prediction | No | No |
| taxon_genus | Taxonomic classification at genus level. | MCC | Taxonomic prediction | No | No |
| taxon_species | Taxonomic classification at species level. | MCC | Taxonomic prediction | No | No |
| diff_phylogeny | Differential phylogeny dataset. | Various | Phylogeny prediction | No | No |
| plddt | AlphaFold pLDDT confidence score prediction. | TR | Confidence prediction | Yes | No |
| realness | Protein realness dataset. | BC | Authenticity prediction | No | No |
| million_full | Large-scale enzyme variant dataset from the Millionfull preprint (October 2025). | R | Protein fitness prediction | No | No |
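The task-type abbreviations in the table differ mainly in how the labels are laid out. The sketch below contrasts them using toy values. The class names and the helper `multi_hot` are hypothetical and are not taken from any Protify dataset.

```python
from __future__ import annotations

# Toy label layouts for the task types in the legend (all values are made up).
binary_label = 1                     # BC: one of two classes
single_label = 3                     # SLC / MCC: one class index out of many
regression_target = 0.42             # R: a real-valued target
tokenwise_labels = [0, 0, 2, 2, 1]   # TC: one class index per residue


def multi_hot(active: set[str], classes: list[str]) -> list[int]:
    """MLC: encode any number of active classes as a 0/1 vector."""
    return [1 if name in active else 0 for name in classes]


if __name__ == "__main__":
    classes = ["nucleus", "cytoplasm", "membrane"]  # hypothetical class names
    print(multi_hot({"nucleus", "membrane"}, classes))  # [1, 0, 1]
```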
For more details about supported models and datasets, including programmatic access and command-line utilities, see the Resource Listing Documentation.
- Multiple interfaces: Run experiments via an intuitive GUI, CLI, or prepared YAML files
- Efficient embeddings: Leverage fast and efficient embeddings from ESM2 and ESMC via FastPLMs
  - Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide language models
- Flexible model probing: Use efficient MLPs for sequence-wise tasks or transformer probes for token-wise tasks
  - Coming soon: Full model fine-tuning, hybrid probing, and LoRA
- Automated model selection: Find optimal scikit-learn models for your data with LazyPredict, enhanced by automatic hyperparameter optimization
  - Coming soon: GPU acceleration
- Complete reproducibility: Every session generates a detailed log that can be used to reproduce your entire workflow
- Publication-ready visualizations: Generate cross-model and dataset comparisons with radar and bar plots, embedding analysis with PCA, t-SNE, and UMAP, and statistically sound confidence interval plots
- Extensive dataset support: Access 46+ protein datasets by default, or easily integrate your own local or private datasets
  - Coming soon: Additional protein, SMILES, SELFIES, codon, and nucleotide property datasets
- Advanced interaction modeling: Support for protein-protein interaction datasets
  - Coming soon: Protein-small molecule interaction capabilities
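To make "probing" concrete: a probe is a small model trained on fixed embeddings while the language model itself stays frozen. The sketch below trains a logistic-regression linear probe with plain gradient descent on a hand-written toy set of 2-D "embeddings". It is a dependency-free illustration under those assumptions, not Protify's probe code. Real embeddings have hundreds of dimensions, and the names `train_linear_probe` and `predict` are hypothetical.

```python
from __future__ import annotations

import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(
    embeddings: list[list[float]],
    labels: list[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> tuple[list[float], float]:
    """Fit the weights and bias of a binary logistic-regression probe by batch gradient descent."""
    dim = len(embeddings[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(embeddings)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            # Gradient of the cross-entropy loss w.r.t. the logit is (p - y).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        # Average the accumulated gradients over the batch before stepping.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: list[float], bias: float, x: list[float]) -> int:
    """Return 1 if the logit is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0.0)


if __name__ == "__main__":
    # Toy, linearly separable "embeddings": class 0 near (-1, -1), class 1 near (1, 1).
    toy_x = [[-1.0, -0.8], [-0.9, -1.2], [-1.1, -1.0], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
    toy_y = [0, 0, 0, 1, 1, 1]
    w, b = train_linear_probe(toy_x, toy_y)
    correct = sum(predict(w, b, x) == y for x, y in zip(toy_x, toy_y))
    print(f"training accuracy: {correct}/{len(toy_y)}")
```

Only the probe's few parameters are updated, which is why probing precomputed embeddings is feasible on a laptop without a GPU.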
Help us grow by sharing online, starring our repository, or contributing through our bounty program.
From pip
pip install Protify
To get started locally
git clone https://github.com/Gleghorn-Lab/Protify.git
cd Protify
git submodule update --init --remote --recursive
python -m pip install -r requirements.txt
cd src/protify
With a Python virtual environment (Linux)
git clone https://github.com/Gleghorn-Lab/Protify.git
cd Protify
git submodule update --init --remote --recursive
chmod +x setup_protify.sh
./setup_protify.sh
source ~/protify_venv/bin/activate
cd src/protify
With Docker
git clone https://github.com/Gleghorn-Lab/Protify.git
cd Protify
git submodule update --init --remote --recursive
docker build -t protify-env:latest .
docker run --rm --gpus all -v ${PWD}:/workspace protify-env:latest python -m main
Note: You may need to prefix the docker commands with sudo.
To launch the GUI, run
python -m gui
We recommend keeping a terminal open alongside the user interface, because helpful messages and progress bars appear in the terminal as you press the GUI buttons.
Here, we will compare various protein models against a random vector baseline (negative control) and a random transformer (homology-based control).
1.) Start the session
2.) Select the models you would like to benchmark
3.) Select the datasets you are interested in. Here we chose Enzyme Commission numbers (multi-label classification), metal-ion binding (binary classification), solubility (binary classification), catalytic rate (kcat, regression), and protein localization (DeepLoc-2, binary classification).
4.) Embed the proteins in the selected datasets. If your machine does not have a GPU, you can download precomputed embeddings for many common sequences.
Note: If you download embeddings, it will be faster to use the scikit model tab than the probe tab.
5.) Select which probe and configuration you would like. Here, we will use a simple linear probe, a type of neural network. It is the fastest option (by a large margin) but usually the worst performing (by a small margin).
6.) Select your settings for training. Like most of the tabs, the defaults are pretty good. If you need information about what a setting does, the ? button provides a helpful note. The documentation has more extensive information.
This will train your models!
7.) After training, you can render helpful visualizations by passing the log ID from before. If you forget it, you can look for the file generated in the logs folder.
Here's a sample of the many plots produced. You can find them all inside plots/your_log_id/*
8.) Need to replicate your findings for a report or paper? Just input the generated log into the replay tab
To run the same session from the command line instead, you would simply execute
python -m main --model_names ESM2-8 ESM2-35 ESMC-300 ProtBert ANKH-Base Random Random-Transformer --data_names EC DeepLoc-2 enzyme-kcat MB solubility --patience 3
Or set up a YAML file with your desired settings, so you don't have to type everything out in the CLI
python -m main --yaml_path yamls/your_custom_yaml_path.yaml
Replaying from the CLI is just as simple
python -m main --replay_path logs/your_log_id.txt
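When sweeping many model and dataset combinations, it can help to assemble the CLI call programmatically. The sketch below builds the same argument list as the benchmark command shown above, using only flags that appear in this README (`--model_names`, `--data_names`, `--patience`). The helper name `build_benchmark_command` is hypothetical, and the sketch assumes it is run from the directory where `python -m main` works.

```python
from __future__ import annotations

import sys


def build_benchmark_command(
    models: list[str],
    datasets: list[str],
    patience: int = 3,
) -> list[str]:
    """Assemble the argument list for a benchmark run (hypothetical helper)."""
    return [
        sys.executable, "-m", "main",
        "--model_names", *models,
        "--data_names", *datasets,
        "--patience", str(patience),
    ]


if __name__ == "__main__":
    command = build_benchmark_command(
        ["ESM2-8", "ESM2-35", "Random"],
        ["EC", "DeepLoc-2", "solubility"],
    )
    print(" ".join(command))
    # To launch the run (requires a working Protify install):
    #   import subprocess
    #   subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting issues when model or dataset names are generated in a loop.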
Protify includes a zero-shot pipeline for the ProteinGym DMS benchmark with a standardized performance summary.
- Run zero-shot scoring on ProteinGym substitutions

  python -m main --proteingym --model_names ESM2-8 ESM2-35 ProtBert --dms_ids all --scoring_method masked_marginal --scoring_window optimal

  - Outputs per-assay CSVs at `results/proteingym/*__zs_masked_marginal.csv`
  - After scoring, a standardized performance summary is written to `results/proteingym/benchmark_performance/`
    - This summary exactly matches the format expected by the ProteinGym repository for adding scores for a new model (ready to use in a PR)

  Available options
  - dms_ids: By default, all 217 substitution assays are used. Specify DMS IDs by name to use only a subset.
  - scoring_method
    - masked_marginal (default): Mask mutated positions in the wild-type window; score Δ = log p(mutant) − log p(wildtype) at those positions
    - wildtype_marginal: Unmasked wild-type context; score Δ = log p(mutant) − log p(wildtype) at mutated positions
    - mutant_marginal: Unmasked mutant context; score Δ = log p(mutant) − log p(wildtype) at mutated positions
    - pll: Pseudo log-likelihood obtained by masking each position and summing true-token log-likelihoods (indels are scored with a length-normalized PLL across windows)
    - global_log_prob: Unmasked log-probability of the entire mutated sequence/window
  - scoring_window
    - optimal (default): Single window (≤ model context) centered around the mutation barycenter
    - sliding: Non-overlapping contiguous windows across the full sequence (default for indels)
- Compare performance and time for each scoring method for one or more models

  python -m main --proteingym --compare_scoring_methods --model_names ESM2-650 --dms_ids AACC1_PSEAI_Dandage_2018 A4_HUMAN_Seuma_2022 --results_dir results

  - Saves a summary CSV to `results/scoring_methods_comparison.csv`
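To illustrate the marginal scoring options, the sketch below computes the log-ratio score from a table of per-position log-probabilities and picks a single window centered on the mean mutated position. The toy probabilities, the `A2G:L4P`-style mutation strings (wild-type residue, 1-based position, mutant residue, joined by `:`), and the function names are assumptions for illustration. In practice the log-probabilities come from a language model, and Protify's own windowing may differ in details such as rounding.

```python
from __future__ import annotations

import math


def parse_mutations(mutant: str) -> list[tuple[str, int, str]]:
    """Split e.g. 'A2G:L4P' into (wild_type, zero_based_position, mutant) triples."""
    return [(m[0], int(m[1:-1]) - 1, m[-1]) for m in mutant.split(":")]


def marginal_score(log_probs: list[dict[str, float]], mutant: str) -> float:
    """Sum log p(mutant) - log p(wildtype) over the mutated positions.

    log_probs[i] maps each residue to its log-probability at position i.
    Whether those values come from a masked or an unmasked forward pass is
    what distinguishes the masked, wild-type, and mutant marginal variants.
    """
    return sum(
        log_probs[pos][mt] - log_probs[pos][wt]
        for wt, pos, mt in parse_mutations(mutant)
    )


def centered_window(seq_len: int, positions: list[int], max_len: int) -> tuple[int, int]:
    """Return [start, end) of one window of at most max_len around the mean mutated position."""
    if seq_len <= max_len:
        return 0, seq_len
    center = round(sum(positions) / len(positions))
    # Clamp so the window never runs past either end of the sequence.
    start = min(max(center - max_len // 2, 0), seq_len - max_len)
    return start, start + max_len


if __name__ == "__main__":
    # Toy two-residue "protein" with made-up probabilities at each position.
    toy_log_probs = [
        {"M": math.log(0.9), "A": math.log(0.05), "G": math.log(0.05)},
        {"M": math.log(0.0001), "A": math.log(0.8), "G": math.log(0.2)},
    ]
    print(round(marginal_score(toy_log_probs, "A2G"), 3))  # -1.386
    print(centered_window(1000, [500], 100))  # (450, 550)
```

A negative score means the model finds the mutant residue less likely than the wild-type residue at that position.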
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
We work with a bounty system. You can find bounties on this page. Contributing to bounties will get you listed in the Protify consortium and may earn you coauthorship on published papers involving the framework.
To claim a bounty, open a pull request with the bounty ID in the title. For additional features not on the bounty list, use a descriptive title.
For bugs and general suggestions please use GitHub issues.
Distributed under the Protify License. See LICENSE.md for more information.
If you use this package, please cite the following papers. (Coming soon)